Paper title
Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods
Authors
Abstract
Text line segmentation is one of the key steps in historical document understanding. It is challenging because of the variety of fonts, contents and writing styles, and because document quality has degraded over the years. In this paper, we address the limitations that currently prevent building line segmentation models with a high generalization capacity. We present a study conducted with three state-of-the-art systems, Doc-UFCN, dhSegment and ARU-Net, and show that it is possible to build generic models, trained on a wide variety of historical document datasets, that correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. To this end, we present a complete evaluation strategy that uses standard pixel-level metrics and object-level metrics, and introduces goal-oriented metrics.
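The abstract distinguishes pixel-level from object-level metrics without defining them. As a rough illustration only, the sketch below computes one common choice for each: Intersection over Union (IoU) on binary masks, and precision/recall over detected lines matched at an IoU threshold. The function names, the box representation `(x0, y0, x1, y1)` and the 0.5 threshold are assumptions made for this example, not the paper's definitions, and the goal-oriented (recognition-based) metrics are not covered.

```python
def pixel_iou(pred, truth):
    """IoU between two binary masks given as equally sized lists of 0/1 rows."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def box_iou(a, b):
    """IoU between two axis-aligned boxes (x0, y0, x1, y1), x1 and y1 exclusive."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(width, 0) * max(height, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def object_precision_recall(pred_boxes, truth_boxes, threshold=0.5):
    """Greedy one-to-one matching of predicted lines to ground-truth lines.

    A prediction is a true positive if its best still-unmatched ground-truth
    line overlaps it with IoU >= threshold (hypothetical default of 0.5).
    """
    unmatched = list(truth_boxes)
    true_pos = 0
    for box in pred_boxes:
        best = max(unmatched, key=lambda t: box_iou(box, t), default=None)
        if best is not None and box_iou(box, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(pred_boxes) if pred_boxes else 0.0
    recall = true_pos / len(truth_boxes) if truth_boxes else 0.0
    return precision, recall
```

One reason to report both kinds of metric: a prediction that merges two touching text lines into a single region can cover exactly the right pixels, yet it still misses one line at the object level. For instance, `object_precision_recall([(0, 0, 10, 4)], [(0, 0, 10, 2), (0, 2, 10, 4)])` matches only one of the two ground-truth lines.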