Paper Title

The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America

Authors

Lee, Benjamin Charles Germain; Mears, Jaime; Jakeway, Eileen; Ferriter, Meghan; Adams, Chris; Yarasavage, Nathan; Thomas, Deborah; Zwaard, Kate; Weld, Daniel S.

Abstract

Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress's Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast image similarity querying. We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus and describe the resulting Newspaper Navigator dataset, the largest dataset of extracted visual content from historic newspapers ever produced. The Newspaper Navigator dataset, fine-tuned visual content recognition model, and all source code are placed in the public domain for unrestricted re-use.
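
The abstract mentions image embeddings for fast image similarity querying but does not say how such queries are run. As a rough illustration only, the sketch below ranks items by brute-force cosine similarity between embedding vectors. The function names, item identifiers, and three-dimensional toy vectors are all hypothetical; they are not the dataset's actual format or the authors' method, and a corpus of this size would likely need an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=3):
    """Return the k (item_id, score) pairs most similar to the query vector."""
    scored = [
        (item_id, cosine_similarity(query, vector))
        for item_id, vector in embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: invented identifiers and vectors, for illustration only.
toy_embeddings = {
    "photo_a": [0.9, 0.1, 0.0],
    "map_b": [0.0, 1.0, 0.0],
    "photo_c": [0.8, 0.2, 0.1],
}

results = most_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

With real embeddings, the same ranking logic applies unchanged; only the vector length and the source of the `embeddings` mapping would differ.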
