Weekly Study Report, 2.3-2.9

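Contents

Abstract
I. Related Concepts
- 1. Document Information Extraction (DIE)
- 2. Sample-Centric In-Context Learning (SAIL)
  - 2.1 Problem Formulation
  - 2.2 Document-Level Text Similarity
  - 2.3 Entity-Level Text Similarity
  - 2.4 Layout Similarity
  - 2.5 Sample-Centric ICL Prompt Template
II. Experiments
- 1. Datasets and Metrics
- 2. Experiment Code
- 3. Experimental Results
Conclusion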

Paper link

Abstract

This blog introduces the paper "SAIL: Sample-Centric In-Context Learning for Document Information Extraction", which proposes a sample-centric in-context learning (SAIL) method for document information extraction (DIE). Training-free DIE faces two major challenges: understanding the relationship between document layout and text entities, and providing effective guidance to pre-trained models. To address them, SAIL introduces entity-level text similarity and layout similarity and formulates a unified ICL prompt template. Experiments on multiple benchmarks and with different base models show that SAIL outperforms training-free baselines and approaches fully trained methods, demonstrating its effectiveness and generalization ability.

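I. Related Concepts

1. Document Information Extraction (DIE)

Document information extraction (DIE) focuses on extracting structured information from visually rich documents (VRDs) such as receipts, forms, and invoices. One of the main challenges of training-free DIE is understanding the complex relationship between a document's layout and its text entities from only a few examples. VRDs combine discrete text elements with flexible, inherently structured layouts, which complicates both establishing relationships between text entities and extracting implicit layout information. Even the multimodal large model GPT-4 shows limited effectiveness on DIE tasks, as shown in the figure below:
[Figure: example of GPT-4o errors on a DIE task]
Even the powerful GPT-4o misidentifies three entities and mislabels three entities.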

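2. Sample-Centric In-Context Learning (SAIL)

DIE currently faces two main challenges: 1) understanding the complex relationship between layout and text elements in VRDs, and 2) providing accurate guidance to pre-trained models. The sample-centric in-context learning (SAIL) method proposed in the paper targets both. It introduces fine-grained entity-level text similarity so that the LLM can analyze the text in depth, and combines it with layout similarity to strengthen layout analysis in VRDs. SAIL also formulates a unified in-context learning (ICL) prompt template for the various sample-centric examples, so that each test sample gets a customized prompt that guides the pre-trained model precisely.

SAIL follows two principles: 1) to strengthen the LLM's understanding of the complex interplay between layout and text in VRDs, the prompt must analyze the problem in depth from several angles; 2) to ensure precise guidance, a customized prompt must be built for each test sample.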

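2.1 Problem Formulation

To implement SAIL, the DIE problem is first formalized. Given a document image I, an OCR system recognizes the entity texts T = {t_1, t_2, ..., t_ne} and their corresponding boxes B = {b_1, b_2, ..., b_ne}, where ne is the total number of entities in the document image. To use the LLM effectively, an in-context prompt C is designed to convey the extraction intent. For ICL-based DIE, C is built by selecting a few examples that demonstrate how to solve the DIE task. Given these in-context prompts, the LLM's task is to generate the labels Y_pred for all detected entities. This is done by maximizing the conditional probability P(Y | T, B), with the prompt C as an additional condition:
$$Y_{pred} = \arg\max_{Y} P_{LM}(Y \mid C, \phi(T, B))$$
where P_LM is the conditional probability predicted by the LLM, and φ denotes the operation that converts the entity texts and boxes into a textual format suitable as LLM input. In training-free DIE, constructing an effective in-context prompt C is crucial, and it is the main focus of this work. Finally, the predicted labels Y_pred are evaluated against the ground-truth labels Y_gt using the F1 score.

To maximize P(Y | T, B) given the in-context prompt C, SAIL focuses on designing C for each individual sample: based on the test sample, it automatically selects customized layout examples, document-level text similarity examples, and entity-level text similarity examples, and then generates C from them.

The overall SAIL architecture consists of five steps, as shown in the figure below:
[Figure: overall architecture of SAIL]
First, the test document image and m training document images are processed by OCR to extract the entity texts T and boxes B. Second, T is converted into entity-level text embeddings E_ent and document-level text embeddings E_doc, and B is used to construct layout images. Third, E_ent, the layout images, and E_doc are used to select textually similar entities, layout-similar documents, and textually similar documents for the test sample. Fourth, these selections are substituted into the prompt template to form the customized in-context prompt C. Finally, the LLM performs inference with C and the question φ(T, B) to generate the predicted labels Y_pred.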

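2.2 Document-Level Text Similarity

To improve ICL, the researchers use text semantic search to select the nearest training document examples for a given test sample. The entity texts T extracted from a document image are concatenated into a single sentence and encoded with Sentence-BERT, giving a text semantic embedding E_doc for the document. The nearest training examples are determined by computing the document-level text similarity between the test embedding and each of the m training embeddings with the cosine similarity score:
$$T^{doc}_{sim} = \frac{E^{test}_{doc} \cdot E^{train}_{doc}}{\lVert E^{test}_{doc} \rVert \, \lVert E^{train}_{doc} \rVert}$$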
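
As a rough illustration of the selection step described above, the following sketch ranks training documents by cosine similarity to a test document and keeps the top k. It is a minimal sketch, not the paper's code: the helper names `cosine_similarity` and `nearest_documents` and the three-dimensional toy vectors are assumptions made for this example, and real Sentence-BERT embeddings have several hundred dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_documents(test_emb, train_embs, k=4):
    """Return the indices of the k training embeddings most similar to test_emb."""
    scored = [(cosine_similarity(test_emb, e), i) for i, e in enumerate(train_embs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (hypothetical values, not real Sentence-BERT output)
test_emb = [1.0, 0.0, 0.0]
train_embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0], [-1.0, 0.0, 0.0]]
top2 = nearest_documents(test_emb, train_embs, k=2)
print(top2)  # [0, 2]
```

Sorting the full list is fine for a few hundred training documents; for larger pools a heap or a vector index would avoid the full sort.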

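2.3 Entity-Level Text Similarity

Because the document-level text similarity between a long-text document and the textually similar documents found for it is noticeably low, the researchers propose entity-level text similarity. The entity texts T = {t_1, t_2, ..., t_ne} recognized by OCR are filtered to exclude texts consisting only of numbers, which carry minimal semantic content. The m_f filtered training entity texts and n_f filtered test entity texts are then encoded with Sentence-BERT to obtain the semantic embeddings E_ent. The entity-level text similarity is computed from E_ent with the cosine similarity score:
$$T^{ent}_{sim} = \frac{E^{test}_{ent} \cdot E^{train}_{ent}}{\lVert E^{test}_{ent} \rVert \, \lVert E^{train}_{ent} \rVert}$$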
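
The filtering step can be sketched as follows. This is an illustration under an assumption: "numeric-only" is taken here to mean "contains no alphabetic character", which mirrors the `check` function in the experiment code later in this post; the paper may define the filter differently. The names `is_numeric_only` and `filter_entities` and the sample strings are hypothetical.

```python
def is_numeric_only(text):
    """True if the text has no alphabetic character (e.g. '12.50' or '03/02/2018')."""
    return not any(ch.isalpha() for ch in text)

def filter_entities(entity_texts):
    """Keep only the entity texts that carry some alphabetic content."""
    return [t for t in entity_texts if not is_numeric_only(t)]

# Hypothetical OCR output for one receipt
ocr_texts = ["TOTAL", "12.50", "03/02/2018", "ABC STORE SDN BHD", "GST 6%"]
kept = filter_entities(ocr_texts)
print(kept)  # ['TOTAL', 'ABC STORE SDN BHD', 'GST 6%']
```

Only the kept texts would then be embedded and compared, which also reduces the number of entity-level demonstrations added to the prompt.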

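2.4 Layout Similarity

To identify documents with similar layouts, the researchers introduce a layout similarity measure, as shown in the figure below:
[Figure: layout similarity evaluation]
First, the boxes B = {b_1, b_2, ..., b_ne} are drawn as black rectangles on a blank image. The information region is defined as the smallest region containing all entity texts, and the layout image is cropped so that a 10-pixel margin remains between the information region and the image border. Next, the layout images are standardized to a common size (the experiment code later in this post resizes every layout image to 512 x 512). Finally, n_s layout-similar documents are selected by computing the layout similarity L_sim between each training layout image I_train and the test layout image I_test with the mean squared error (MSE) loss:
$$L_{sim} = \frac{1}{U V} \sum_{u=1}^{U} \sum_{v=1}^{V} \left( I_{train}(u, v) - I_{test}(u, v) \right)^2$$
where U and V are the pixel dimensions of the layout image, so that U x V is its total number of pixels.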
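
To make the layout comparison concrete, here is a small pure-Python sketch that rasterizes boxes onto a grid and compares two grids with the mean squared error. It is a simplified illustration, not the paper's implementation: it skips the cropping and resizing steps, marks boxes with 1 on a background of 0, and uses a tiny 4 x 4 grid; `render_layout`, `layout_mse`, and all box coordinates are hypothetical.

```python
def render_layout(boxes, width, height):
    """Rasterize boxes (x0, y0, x1, y1) as 1s on a height x width grid of 0s."""
    grid = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = 1
    return grid

def layout_mse(a, b):
    """Mean squared error between two grids of identical size."""
    total = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))

test_layout = render_layout([(0, 0, 4, 1), (0, 2, 2, 3)], width=4, height=4)
train_a = render_layout([(0, 0, 4, 1), (0, 2, 2, 3)], width=4, height=4)  # same layout
train_b = render_layout([(2, 1, 4, 4)], width=4, height=4)                # different layout

print(layout_mse(test_layout, train_a))  # 0.0
print(layout_mse(test_layout, train_b))  # 0.75
```

A lower value means a more similar layout, so the training documents with the smallest values would be chosen as layout demonstrations.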

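2.5 Sample-Centric ICL Prompt Template

To build an effective in-context prompt C for each test sample, the researchers propose an adaptive, sample-specific ICL prompt template. It has five parts: the candidate label illustration C_cl, the entity-level text demonstrations C_et, the layout demonstrations C_l, the document-level text demonstrations C_dt, and the test question φ(T, B), as shown in the figure below:
[Figure: sample-centric ICL prompt template]

The candidate label illustration C_cl lists all possible labels of the DIE task; abbreviated labels are accompanied by natural-language descriptions. The entity-level text demonstrations C_et present textually similar entities. The layout demonstrations C_l help the LLM analyze the layout of the test document. The document-level text demonstrations C_dt present textually similar documents in question-answer form, guiding the LLM to produce answers in a specific format.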
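
The five parts can be assembled into one prompt string roughly as follows. This is a sketch under assumptions: the section markers ("Label:", "Context:", "Q:") follow the experiment code later in this post, while the function name `build_prompt`, its parameters, and every example string are hypothetical.

```python
def build_prompt(candidate_labels, entity_demos, layout_demo, doc_demos, question):
    """Join the five template parts in order, skipping empty entity demonstrations."""
    parts = ["Label:\n" + candidate_labels]
    if entity_demos:
        parts.append("Context:\n" + entity_demos)
    parts.append(layout_demo)
    parts.append(doc_demos)
    parts.append("Q:" + question)
    return "\n\n".join(parts)

prompt = build_prompt(
    candidate_labels="company: name of the store\ntotal: amount paid",
    entity_demos='{text:"ABC STORE",Box:[10 5 90 20],entity:company}',
    layout_demo="Document:...\nThe company name is usually at the top.",
    doc_demos="Q:... ?\nA:...",
    question='{text:"XYZ MART",Box:[12 6 88 22]}, return the company and the total?',
)
print(prompt)
```

Keeping the test question last means the model reads all demonstrations before it sees the sample it has to label.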

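II. Experiments

1. Datasets and Metrics

1) FUNSD: the candidate labels are "header", "question", "answer", and "other". The training set contains 149 forms and 7,411 entities; the test set contains 50 forms and 2,332 entities.

2) SROIE is a scanned receipt understanding dataset with 626 receipts in the training set and 347 receipts in the test set. The DIE task requires extracting the "company", "date", "address", and "total" fields.
3) CORD is a receipt understanding dataset with 800 training samples, 100 test samples, and 100 validation samples. It has 30 detailed hierarchical labels, far more than the two datasets above.
The researchers report entity-level F1, precision, and recall on these three datasets.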
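
For reference, entity-level precision, recall, and F1 can be computed as in the sketch below. It assumes that a prediction counts as correct only when its (label, text) pair exactly matches a ground-truth pair; the exact matching protocol used for each dataset is not described in this post, and the function name `entity_f1` and the sample values are hypothetical.

```python
def entity_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (label, text) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # pairs that match exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("company", "ABC STORE"), ("date", "03/02/2018"), ("total", "12.50")}
pred = {("company", "ABC STORE"), ("date", "03/02/2018"),
        ("total", "15.00"), ("address", "KL")}
p, r, f = entity_f1(pred, gold)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```

Here two of the four predictions are correct (precision 0.5) and two of the three gold entities are found (recall 2/3), giving F1 = 4/7.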

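The researchers evaluate the method with three LLMs: the open-source ChatGLM3 and the closed-source GPT-3.5 and GPT-4. The temperature is set to 0 to improve reproducibility. For GPT-4o, only the text prompt is provided as input by default; its multimodal capability is also tested separately by providing the document image together with clear task instructions. Because of the limit on prompt length, four textually similar documents and four layout-similar documents are selected as examples. In addition, four textually similar entity examples are selected for each filtered test entity.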

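2. Experiment Code

The core experiment code follows. The same pipeline is applied to FUNSD, SROIE, and CORD; the excerpt shown here is the SROIE version, so its file paths and labels are SROIE-specific.

from openai import OpenAI

# Assumption: the original excerpt does not show how `client` is created.
# This is the standard setup of the OpenAI Python SDK (v1+), which reads the
# API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def gpt_call(prompt_text):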
    message = [
        {"role": "system", "content": prompt_text},
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=message,
        temperature=0,
        max_tokens=1100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content

gpt_call returns the LLM's reply, with the response capped at 1,100 tokens (max_tokens).

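import os
import numpy as np
from PIL import Image

# Normalized Euclidean distance between two flattened 512x512 binary images.
# Dividing by 512 = sqrt(512 * 512) makes this the root mean squared error,
# which ranks images in the same order as the mean squared error.
def MSE(vector1, vector2):
    return np.sqrt(np.sum(np.square(vector1 - vector2)))/512

# Image binarization: grayscale, resize to hash_size x hash_size, threshold at 122
def binarynp(image, hash_size=512):
    image = image.convert('L').resize((hash_size, hash_size), Image.LANCZOS)
    np_image = np.array(image)
    # Dark pixels (the black box rectangles) become 1, the background becomes 0
    binary_image = (np_image < 122).astype(int)
    return binary_image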

# Generate ground truth in prompt
def generategt(text):
    entities=text.split('}{')
    companytext='"company":'
    addresstext='"address":'
    datetext='"date":'
    totaltext='"total":'
    for en in entities:
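        # Drop surrounding whitespace and a trailing '}' so the last entity's label still matches
        label=en.split('entity:')[-1].strip().rstrip('}')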
        entext=en.split('text:')[1].split(',Box')[0]
        if label=='company':
            companytext=companytext+'{'+entext+'}'
        if label=='address':
            addresstext=addresstext+'{'+entext+'}'
        if label=='date':
            datetext=datetext+'{'+entext+'}'
        if label=='total':
            if entext not in totaltext:
                totaltext=totaltext+'{'+entext+'}'
    res=companytext+'\n'+addresstext+'\n'+datetext+'\n'+totaltext+'\n'
    return res

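Despite its name, MSE computes the Euclidean distance between the two flattened 512 x 512 binary images and divides it by 512, which equals the root mean squared error. The result lies between 0 and 1; the smaller the value, the more similar the layouts.
binarynp produces a binarized 512 x 512 representation of a document layout image so that layouts can be compared pixel by pixel.
generategt converts the annotation text into the structured answer used in the prompt.
Processing logic:
1. Split entities: split the annotation into entities on the }{ separator
2. Parse fields: extract each entity's text content (text) and type (entity)
3. Group by type: append each entity to the field of its type (company/address/date/total)
4. Deduplicate: the total field skips values that already appear in it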
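
The same split, parse, group, and deduplicate logic can also be written with a regular expression, as sketched below. This is an alternative illustration, not the author's code. The record format `{text:"...",Box:...,entity:...}` is inferred from the strings built in find_similar_entity, and the actual annotation files may differ; `RECORD`, `group_by_label`, and the sample values are hypothetical. Unlike generategt, this sketch deduplicates every field, not only total.

```python
import re

# One record looks like {text:"...",Box:...,entity:label} (assumed format)
RECORD = re.compile(r'\{text:"(?P<text>.*?)",Box:(?P<box>.*?),entity:(?P<label>[^}]*)\}')

def group_by_label(annotation, labels=("company", "address", "date", "total")):
    """Group entity texts by label, skipping texts already stored for that label."""
    grouped = {label: [] for label in labels}
    for m in RECORD.finditer(annotation):
        label, text = m.group("label"), m.group("text")
        if label in grouped and text not in grouped[label]:
            grouped[label].append(text)
    return grouped

sample = ('{text:"ABC STORE",Box:[10 5 90 20],entity:company}'
          '{text:"12.50",Box:[60 200 90 215],entity:total}'
          '{text:"12.50",Box:[60 230 90 245],entity:total}')
grouped = group_by_label(sample)
print(grouped)
# {'company': ['ABC STORE'], 'address': [], 'date': [], 'total': ['12.50']}
```

Matching whole records avoids the edge cases of splitting on `}{`, where the first and last pieces keep a stray brace.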

def find_similar_images(test_file_path, train_path, num_images=4, reverse=False):
    image1 = Image.open(test_file_path)
    hash1 = binarynp(image1)
    hash1 = hash1.flatten().T
    train_file = os.listdir(train_path)
    similarity_dict = {}
    for j in range(len(train_file)):
        image = Image.open(os.path.join(train_path, train_file[j]))
        hash2 = binarynp(image)
        hash2 = hash2.flatten().T
        similarity = MSE(hash1, hash2)
        similarity_dict[train_file[j]] = similarity

    # Choose the most similar images
    sorted_similarities = sorted(similarity_dict.items(), key=lambda x: x[1], reverse=False)
    sorted_similarities = sorted_similarities[:num_images]
    # Order in the prompt
    sorted_similarities = sorted(sorted_similarities, key=lambda x: x[1], reverse=reverse)
    print(sorted_similarities)
    return sorted_similarities[:num_images]


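from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Sentence-BERT encoder shared by the document-level and entity-level searches
model = SentenceTransformer('all-MiniLM-L6-v2')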


def check(num):
    # Return 1 if the string contains at least one ASCII letter, otherwise 0
    for ch in num:
        if 'a' <= ch <= 'z' or 'A' <= ch <= 'Z':
            return 1
    return 0

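find_similar_images ranks the training documents by layout similarity to the test document, computed on their layout images, and returns the num_images closest ones. check tests whether a string contains at least one ASCII letter.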

def find_similar_text(k, result_num=4, reverse=True):
    # Read the processed document text
    with open('../processfiles/ptext_sroie_test.txt', 'r', encoding='utf-8') as f:
        result_test_text=f.read().split('\n')[:-1]
    with open('../processfiles/ptext_sroie_train.txt', 'r', encoding='utf-8') as f:
        result_train_text=f.read().split('\n')[:-1]

    sentences = [result_test_text[k]]
    for i in range(len(result_train_text)):
        sentences.append(result_train_text[i])

    # encode the text
    embeddings = model.encode(sentences)
    
    # Choose the most similar fragment texts
    maxlist=[]
    maxidx=[]
    for i in range(1, len(result_train_text) + 1):
        cos_sim = cosine_similarity(embeddings[0].reshape(1, -1), embeddings[i].reshape(1, -1))
        if len(maxlist)< result_num:
            maxlist.append(cos_sim[0][0])
            maxidx.append(i)
        elif cos_sim[0][0]>min(maxlist):
            # Locate the current minimum once, so the score and its index stay aligned
            j=maxlist.index(min(maxlist))
            maxlist[j]=cos_sim[0][0]
            maxidx[j]=i
    choosefile=[]
    for i in range(len(maxlist)):
        choosefile.append([maxidx[i]-1,maxlist[i]])
    # Order in the prompt
    choosefile=sorted(choosefile, key=lambda x: x[1], reverse=reverse) # True: descending order
    choosefile=[x[0] for x in choosefile]

    return choosefile

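find_similar_text retrieves the training texts most similar to the test text in the following steps:
Read the text data:
Read the test texts and training texts from the given paths into result_test_text and result_train_text.
Build the text list:
Put the test text and all training texts into one list, sentences, for embedding.
Embed the texts:
Encode all texts with the pre-trained model (model) to obtain embedding vectors.
Compute similarity:
Iterate over the training texts and compute the cosine similarity between each training text and the test text.
Maintain a running list of the best matches so that it always holds the result_num most similar texts.
Sort and return:
Sort by similarity (ascending or descending) and return the indices of the most similar training texts.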
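
The "maintain a running top-k list, then sort" step can be expressed more compactly with the standard library's `heapq.nlargest`. The sketch below is an alternative formulation, not the author's code; `top_k_indices` and the similarity scores are made up for illustration.

```python
import heapq

def top_k_indices(similarities, k=4, descending=True):
    """Indices of the k largest scores, ordered by score for use in the prompt."""
    top = heapq.nlargest(k, enumerate(similarities), key=lambda pair: pair[1])
    top.sort(key=lambda pair: pair[1], reverse=descending)
    return [idx for idx, _ in top]

# Hypothetical cosine similarities between one test text and six training texts
sims = [0.12, 0.80, 0.45, 0.91, 0.33, 0.67]
print(top_k_indices(sims, k=4))                    # [3, 1, 5, 2]
print(top_k_indices(sims, k=4, descending=False))  # [2, 5, 1, 3]
```

Because each score travels together with its index as one pair, the two can never fall out of sync, which is the main risk of keeping separate score and index lists.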

with open('../processfiles/pentitytext_sroie_train2.txt', 'r', encoding='utf-8') as f:
    result_train_text=f.read().split('\n')[:-1]
train_text = [a.split('|')[0] for a in result_train_text]
train_label = [a.split('|')[1] for a in result_train_text]
train_box = [a.split('|')[2] for a in result_train_text]
train_embeddings = model.encode(train_text)


# Find textually similar entities
def find_similar_entity(entities, result_num=4, reverse=True):  # also returns box information
    restext=''
    # encode the entity texts
    sentences = entities
    embeddings = model.encode(sentences) 
    embeddings=np.concatenate((embeddings,train_embeddings),axis=0)
    
    # Choose the most similar entity texts
    l=len(entities)
    cos_sims = cosine_similarity(embeddings[:l], embeddings[l:len(result_train_text) + l])
    for k in range(l):
        maxlist=[]
        maxidx=[]   
        for i in range(l, len(result_train_text) + l):
            cos_sim = cos_sims[k][i-l]
            if len(maxlist)< result_num:
                maxlist.append(cos_sim)
                maxidx.append(i)
            elif cos_sim>min(maxlist):
                # Locate the current minimum once, so the score and its index stay aligned
                j=maxlist.index(min(maxlist))
                maxlist[j]=cos_sim
                maxidx[j]=i
        choosefile=[]
        for i in range(len(maxlist)):
            choosefile.append([maxidx[i]-l,maxlist[i]])
        # Order in the prompt
        choosefile=sorted(choosefile, key=lambda x: x[1], reverse=reverse) # True: descending order
        # write the prompt
        for i in range(len(choosefile)):
            restext+='{text:\"'+train_text[choosefile[i][0]]+'\",Box:'+train_box[choosefile[i][0]]+',entity:'+train_label[choosefile[i][0]]+'}\n'
    
    return restext

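find_similar_entity takes the input entity texts (entities), finds the most similar text fragments in the training data, and builds a string containing the similar texts, their bounding boxes, and their entity labels. It works in four steps:
Read the training data:
Read the training texts from a file in which each line holds the text, label, and bounding box, separated by |.
Extract the texts, labels, and bounding boxes into separate lists.
Embed the texts:
Encode the training texts with the pre-trained model to obtain embedding vectors.
Find similar entities:
Encode the input entity texts and concatenate them with the training embeddings.
Compute the cosine similarity between the input texts and the training texts.
Maintain a running list of the best matches so that it always holds the result_num most similar texts.
Sort by similarity and build the output string with the similar texts, bounding boxes, and entity labels.
Return the result:
Return a string in which each line describes one of the most similar text fragments and its associated information.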

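import random

# Note: `map_text` (the candidate label descriptions) and `enc` (the tokenizer used
# to count prompt tokens) are defined elsewhere in the original script and are not
# shown in this excerpt.
def predict(idx,num=4,lreverse=True,treverse=True):  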
    test_file_path = "../../SROIE2019/test/experience/layoutimage"
    # Change to the address where the test layout images are stored
    test_files=os.listdir(test_file_path)
    test_file=test_file_path+'/'+test_files[idx]
    train_path = '../../SROIE2019/train/experience/layoutimage'
    # Change to the address where the training layout images are stored
    train_files = os.listdir('../../SROIE2019/train/gpt3_train_cut_gt')
    # Find the layout examples
    similarimages=find_similar_images(test_file, train_path,num,lreverse)
    # Find the textually similar document examples
    dtexample = find_similar_text(idx, num, treverse)
    
    # Read the test data
    file=test_files[idx].replace('jpg','txt')
    with open(os.path.join('../../SROIE2019/test/gpt3_test_cut', file), 'r', encoding='utf-8') as ft:
        data = ft.read()

    # Layout demonstration
    ld = ''
    hd=''
    for hf in similarimages:
        sifile=hf[0].replace('jpg','txt')
        with open(os.path.join('../../SROIE2019/train/gpt3_train_cut_gt', sifile),'r',encoding='utf-8') as ft:
            h=ft.read()
            ld+='Document:'+h+'\n\n'
    # Ask the LLM to analyze the layout of the document
    task='These are the information extracted from the document through OCR, and the Box is the position of the text in the document. Please analyze where each label is generally located in the document.\n'
    prompt_text='Label:\n'+map_text+'\n\n'+ld+task
    # complete layout demonstration
    la_text=ld+task+gpt_call(prompt_text)        
    
    # Document-level text demonstrations
    for hf in dtexample:
        f=train_files[int(hf)]
        with open(os.path.join('../../SROIE2019/train/gpt3_train_cut', f), 'r', encoding='utf-8') as ft:
            h = ft.read()
            hd += 'Q:' + h + ', return one company and its original address, one total, and one date?\n'
        with open(os.path.join('../../SROIE2019/train/gpt3_train_cut_gt', f), 'r', encoding='utf-8') as ft:
            h = ft.read()
            h=generategt(h)
            hd += 'A:' + h + '\n\n'

    # Entity-level text demonstrations
    data2=data.split('}{')
    similarentities=''
    entities=[]
    for da in data2:
        t = da.split('text:')[1].split(',Box')[0].strip('"')
        if check(t) == 0:
            continue
        entities.append(t)
    if len(entities)>0:
        similarentities=find_similar_entity(entities,4) 
    if similarentities!='':
        similarentities='Context:\n'+similarentities

    # generate prompt
    res = ''
    prompt_text = ''
    temp_str = data + ', return one company and its original address, one total, and one date?'
    prompt_text = 'Label:\n' + map_text + '\n\n\n' + similarentities + '\n\n\n' +la_text+'\n\n\n' +hd + '\nQ:' + temp_str
    prompt_text2 = 'Label:\n' + map_text + '\n\n\n' +la_text+'\n\n\n' +hd + '\nQ:' + temp_str
    
    # Handle prompts that are too long: retry with fewer examples, keeping the ordering flags
    if (len(enc.encode(prompt_text2)) > 15000):
        print('2')
        return predict(idx,2,lreverse,treverse)
    while(len(enc.encode(prompt_text))>15000):
        print(idx)
        sen=similarentities.split('\n')
        idxs=sorted(random.sample(range(1,len(sen)), int(len(sen)*0.7)))
        similarentities='Context:\n'
        for i in idxs:
            similarentities+=sen[i]+'\n'
        prompt_text = 'Label:\n' + map_text + '\n\n\n' + similarentities + '\n\n\n' +la_text+'\n\n\n' +hd + '\nQ:' + temp_str
    # Input it into the LLM for inference
    res_text = gpt_call(prompt_text)
    res += res_text
    return res

# Main Program
wpath = '../../SROIE2019/test/result/'
test_file_path = "../../SROIE2019/test/experience/layoutimage"
test_files=os.listdir(test_file_path)
print(len(test_files))
for idx in range(len(test_files)):
    res_file = test_files[idx].replace('.jpg', '.txt')
    print(res_file)
    print(idx)
    res1 = predict(idx,4)
    with open(os.path.join(wpath, res_file), 'w', encoding='utf-8') as fl:
        fl.write(res1)

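For a given test document, predict builds the prompt for LLM inference from layout similarity, text similarity, and contextual information, and then calls the LLM to extract the information.

The overall code structure can be divided into the following parts:
Layout similarity:
find_similar_images finds the training images most similar to the test image.
Text similarity:
find_similar_text finds the training texts most similar to the test text.
Entity-level text similarity:
The entity texts are extracted from the test text, and find_similar_entity finds the training texts most similar to them.
Prompt construction:
The layout demonstrations, document-level text demonstrations, and entity-level text demonstrations are combined into the complete prompt string.
LLM call:
The prompt is sent to the language model for information extraction.
Saving results:
The model output is written to the specified path.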

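3. Experimental Results

The quantitative F1 results are shown below. SAIL consistently outperforms the baselines across the different base LLMs.
[Figure: quantitative F1 results]
Qualitative results are shown below. ICL-D3IE wrongly labels an entity as an answer, whereas SAIL correctly labels it as a question. This suggests that the fixed examples in ICL-D3IE are not enough to guide the LLM in learning the relationships between discrete texts, and it highlights the importance of selecting diverse examples for each test sample.
[Figure: qualitative comparison of ICL-D3IE and SAIL]
The researchers also compare SAIL with multimodal LLMs (MLLMs). The open-source LLaVA shows limited DIE ability and a low F1 score. GPT-4o significantly outperforms LLaVA but still falls short of dedicated DIE methods. Despite their rapid progress, MLLMs therefore still perform poorly on DIE tasks, as shown in the figure below:
[Figure: comparison of SAIL with MLLMs]

Panel (a) of the figure below compares prompts with and without layout demonstrations on the CORD dataset. Without layout-similar demonstrations, the LLM predicts both occurrences of "13.000" as "MENU.PRICE", whereas SAIL labels the left "13.000" as "MENU.UNITPRICE" and the right "13.000" as "MENU.PRICE". This result underlines the need for layout demonstrations if the LLM is to grasp the document structure. Panel (b) compares prompts with and without entity-level text demonstrations on the FUNSD dataset. Without these demonstrations, the LLM wrongly predicts "sensitive to compounds" as a "question" and wrongly classifies the following four entities as "answer". Although this prediction is plausible in terms of layout, it does not fit the textual context, which highlights the key role of entity-level text similarity examples.

[Figure: ablation examples, (a) layout demonstrations on CORD and (b) entity-level text demonstrations on FUNSD]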

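Conclusion

1. The sample-centric in-context learning (SAIL) method proposed in the paper rests on three contributions: layout similarity, entity-level text similarity, and a unified sample-centric ICL prompt template. On three DIE benchmarks and with different large language models (LLMs), SAIL outperforms the baseline methods.
2. Limitations: first, the search process adds time cost (13.3% when using GPT-4); second, using several kinds of examples increases the total number of tokens.
3. Future work: research could focus on improving the search method and reducing the number of tokens, for example by exploring more efficient search algorithms to cut the time cost, or by optimizing the example selection strategy to use fewer tokens while keeping performance. Another direction is making better use of LLM capabilities to further improve the accuracy and efficiency of DIE.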