Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

Paper Title

Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

Authors

Zhou, Yichao; Wendt, James B.; Potti, Navneet; Xie, Jing; Tata, Sandeep

Abstract

A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. We propose Selective Labeling to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by $10\times$ with a negligible loss in accuracy.
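
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes plain uncertainty sampling (picking the candidates whose predicted score is closest to 0.5), which stands in for the paper's custom active learning strategy; the abstract does not specify that strategy. The names `Candidate`, `uncertainty`, `select_for_review`, and `record_answers` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One candidate extraction predicted by a partially trained model (hypothetical type)."""
    doc_id: str
    field: str
    text: str
    score: float  # model's estimated probability that the extraction is correct


def uncertainty(c: Candidate) -> float:
    # Assumption: generic uncertainty sampling, not the paper's custom strategy.
    # Returns 1.0 when score is 0.5 and 0.0 when score is 0.0 or 1.0.
    return 1.0 - 2.0 * abs(c.score - 0.5)


def select_for_review(candidates: List[Candidate], budget: int) -> List[Candidate]:
    # Send only the `budget` most uncertain candidates to the human annotator.
    return sorted(candidates, key=uncertainty, reverse=True)[:budget]


def record_answers(selected: List[Candidate], answers: List[bool]) -> List[Tuple[Candidate, bool]]:
    # Pair each reviewed candidate with the annotator's yes/no answer.
    return list(zip(selected, answers))


if __name__ == "__main__":
    pool = [
        Candidate("d1", "total", "$10.00", 0.95),
        Candidate("d2", "total", "$12.50", 0.52),
        Candidate("d3", "due_date", "2021-01-01", 0.10),
        Candidate("d4", "due_date", "2021-03-15", 0.45),
    ]
    for cand, ok in record_answers(select_for_review(pool, budget=2), [True, False]):
        print(cand.doc_id, cand.field, cand.text, "yes" if ok else "no")
```

In this sketch the annotator answers a yes/no question for a small number of candidates instead of labeling every field in every document, which is the source of the cost reduction the abstract claims.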

Paper Title