Paper Title
A Joint Approach to Compound Splitting and Idiomatic Compound Detection
Paper Authors
Paper Abstract
Applications such as machine translation, speech recognition, and information retrieval require efficient handling of noun compounds, as they are one of the possible sources of out-of-vocabulary (OOV) words. In-depth processing of noun compounds requires not only splitting them into smaller components (or even roots) but also identifying instances that should remain unsplit because they are idiomatic. We develop a two-fold deep learning-based approach to noun compound splitting and idiomatic compound detection for German, which we train on a newly collected corpus of annotated German compounds. Our neural noun compound splitter operates at the sub-word level and outperforms the current state of the art by about 5%.
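To make the two tasks concrete, the following is a minimal sketch of a dictionary-based splitter with an idiomatic-compound check. It is not the neural sub-word model described in the abstract; it is a simple baseline of the kind such a model would replace. The lexicon, the list of idiomatic compounds, and the function names are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Set

# Hypothetical toy lexicon of known components (assumption, not from the paper).
LEXICON: Set[str] = {"hand", "schuh", "haus", "tür", "arbeit", "zeit", "löwe", "zahn"}

# Common German linking elements that may appear between components.
LINKERS = ("", "s", "es", "n", "en", "er", "e")

# Hypothetical list of idiomatic compounds that should remain unsplit.
IDIOMATIC: Set[str] = {"löwenzahn"}  # "dandelion", literally "lion's tooth"


def _split(word: str) -> Optional[List[str]]:
    """Recursively split `word` into lexicon entries, or return None."""
    if word in LEXICON:
        return [word]
    # Try the longest possible first component before shorter ones.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head not in LEXICON:
            continue
        rest = word[i:]
        for linker in LINKERS:
            # The linker must be followed by at least one more character.
            if rest.startswith(linker) and len(rest) > len(linker):
                tail = _split(rest[len(linker):])
                if tail is not None:
                    return [head] + tail
    return None


def split_compound(word: str) -> List[str]:
    """Return the components of a compound, or the word itself if it is
    idiomatic or cannot be split with the toy lexicon."""
    w = word.lower()
    if w in IDIOMATIC:
        return [w]
    parts = _split(w)
    return parts if parts is not None else [w]


if __name__ == "__main__":
    for example in ["Arbeitszeit", "Haustür", "Löwenzahn"]:
        print(example, split_compound(example))
```

In this sketch the idiomatic check is a plain lookup that runs before splitting; in the approach described above, both decisions are learned from annotated data rather than hard-coded.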