Paper Title

Extraction and Evaluation of Formulaic Expressions Used in Scholarly Papers

Authors

Kenichi Iwatsuki, Florian Boudin, Akiko Aizawa

Abstract

Formulaic expressions, such as 'in this paper we propose', are helpful for authors of scholarly papers because they convey communicative functions; in the example above, the function is 'showing the aim of this paper'. Thus, resources of formulaic expressions that can be looked up easily, such as a dictionary, would be useful. However, the forms of formulaic expressions often vary to a great extent. For example, 'in this paper we propose', 'in this study we propose' and 'in this paper we propose a new method to' are all regarded as formulaic expressions. Such diversity of spans and forms causes problems in both the extraction and the evaluation of formulaic expressions. In this paper, we propose a new approach that is robust to variation in the spans and forms of formulaic expressions. Our approach regards a sentence as consisting of a formulaic part and a non-formulaic part. Then, by extracting formulaic expressions from each sentence instead of from the corpus as a whole, different forms can be dealt with at once. Based on this formulation, to avoid the diversity problem, we propose evaluating extraction methods by how well the extracted expressions convey specific communicative functions rather than by comparing them to an existing lexicon. We also propose a new extraction method that utilises named entities and dependency structures to remove the non-formulaic part from a sentence. Experimental results show that the proposed extraction method achieved the best performance among the existing methods compared.
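
The sentence-level formulation can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract only says that named entities and dependency structures are used to remove the non-formulaic part. The `Token` class, the `is_entity` flag, the rule "drop each entity together with its dependency subtree", and the hand-written annotation below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    head: int        # index of the syntactic head (the root points to itself)
    is_entity: bool  # hypothetical flag: token belongs to a named entity


def descendants(tokens, root):
    """Return the indices of `root` and of every token that depends on it."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in found and tok.head in found:
                found.add(i)
                changed = True
    return found


def formulaic_part(tokens):
    """Assumed rule: remove each entity token and its dependency subtree."""
    removed = set()
    for i, tok in enumerate(tokens):
        if tok.is_entity:
            removed |= descendants(tokens, i)
    return " ".join(t.text for i, t in enumerate(tokens) if i not in removed)


# Hand-annotated example (hypothetical parse, not produced by a real parser).
example = [
    Token("in", 2, False),
    Token("this", 2, False),
    Token("paper", 4, False),
    Token("we", 4, False),
    Token("propose", 4, False),   # root
    Token("a", 6, False),
    Token("method", 4, False),
    Token("based", 6, False),
    Token("on", 9, False),
    Token("WordNet", 7, True),    # named entity; "on" depends on it
]

print(formulaic_part(example))
```

Because the rule operates on one sentence at a time, variants such as 'in this study we propose' would be handled by the same procedure without enumerating them in advance, which is the property the abstract emphasises.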
