Paper Title

Handwriting Classification for the Analysis of Art-Historical Documents

Authors

Christian Bartz, Hendrik Rätz, Christoph Meinel

Abstract

Digitized archives contain and preserve the knowledge of generations of scholars in millions of documents. The size of these archives calls for automatic analysis, since manual analysis by specialists is often too expensive. In this paper, we focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI. Since the archive consists of documents written in several languages and lacks annotated training data for the creation of recognition models, we propose the task of handwriting classification as a new step for a handwriting OCR pipeline. We propose a handwriting classification model that labels extracted text fragments, e.g., numbers, dates, or words, based on their visual structure. Such a classification supports historians by highlighting documents that contain a specific class of text without the need to read the entire content. To this end, we develop and compare several deep learning-based models for text classification. In extensive experiments, we show the advantages and disadvantages of our proposed approach and discuss possible usage scenarios on a real-world dataset.
