Paper title
Optical character recognition quality affects perceived usefulness of historical newspaper clippings
Paper authors
Paper abstract
Introduction. We study the effect of optical character recognition quality on interactive information retrieval in a collection of one digitized historical Finnish newspaper. Method. This study is based on the simulated interactive information retrieval work task model. Thirty-two users searched an article collection of the Finnish newspaper Uusi Suometar 1869-1918, containing ca. 1.45 million automatically segmented articles. Our article search database held two versions of each article, differing in optical character recognition quality. Each user performed six pre-formulated and six self-formulated short queries and subjectively evaluated the top-10 results on a graded relevance scale of 0-3, without knowing about the optical character recognition quality differences between the otherwise identical articles. Analysis. The user evaluations were analysed by comparing the mean evaluation scores across user sessions. Differences in query results were detected by analysing the lengths of the articles returned by pre-formulated and self-formulated queries and the number of different documents retrieved overall in these two sessions. Results. The main result of the study is that improved optical character recognition quality positively affects the perceived usefulness of historical newspaper articles. Conclusions. We were able to show that improving the optical character recognition quality of documents leads to higher mean relevance evaluation scores for query results in our historical newspaper collection. To the best of our knowledge, this simulated interactive user task is the first to show empirically that users' subjective relevance assessments are affected by a change in the quality of optically read text.