Paper Title
A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries
Paper Authors
Abstract
Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), pharmacy, and political sciences. We report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.