DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Paper title

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Authors

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, Peter W. J. Staar

Abstract

Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
