Chinese Idiom Paraphrasing

Paper Title

Chinese Idiom Paraphrasing

Authors

Qiang, Jipeng; Li, Yang; Zhang, Chaowei; Li, Yun; Yuan, Yunhao; Zhu, Yi; Wu, Xindong

Abstract

Idioms are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters. Because they are non-compositional and carry metaphorical meaning, Chinese idioms are hard for children and non-native speakers to understand. This study proposes a novel task, denoted Chinese Idiom Paraphrasing (CIP). CIP aims to rephrase sentences that contain idioms into non-idiomatic ones while preserving the original sentence's meaning. Since sentences without idioms are easier for Chinese NLP systems to handle, CIP can be used to pre-process Chinese datasets, thereby facilitating and improving the performance of Chinese NLP tasks, e.g., machine translation, Chinese idiom cloze, and Chinese idiom embeddings. In this study, the CIP task is treated as a special paraphrase generation task. To circumvent the difficulty of acquiring annotations, we first establish a large-scale CIP dataset through human and machine collaboration, consisting of 115,530 sentence pairs. We then deploy three baselines and two novel CIP approaches to address the CIP problem. The results show that, on the established CIP dataset, the proposed methods outperform the baselines.
