Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings for Complex Word Identification

Paper Title

Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings for Complex Word Identification

Authors

George-Eduard Zaharia, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel, Mihai Dascalu

Abstract

Complex word identification (CWI) is a cornerstone process towards proper text simplification. CWI is highly dependent on context, and its difficulty is compounded by the scarcity of available datasets, which vary greatly in terms of domains and languages. As such, it becomes increasingly difficult to develop a robust model that generalizes across a wide array of input examples. In this paper, we propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations. This technique addresses the problem of working with multiple domains, inasmuch as it creates a way of smoothing the differences between the explored datasets. Moreover, we also propose a similar auxiliary task, namely text simplification, that can be used to complement lexical complexity prediction. Our model obtains a boost of up to 2.42% in terms of Pearson Correlation Coefficients in contrast to vanilla training techniques, when considering the CompLex dataset from Lexical Complexity Prediction 2021. At the same time, we obtain an increase of 3% in Pearson scores when considering a cross-lingual setup relying on the Complex Word Identification 2018 dataset. In addition, our model yields state-of-the-art results in terms of Mean Absolute Error.
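
The abstract reports results with two metrics, the Pearson correlation coefficient and Mean Absolute Error. As a rough illustration of what these measure, the sketch below computes both for a list of gold complexity scores and a list of predicted scores. It is not the authors' evaluation code; the function names `pearson` and `mean_absolute_error` and the sample values in the usage note are assumptions made for this example.

```python
import math


def pearson(gold, pred):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(gold)
    if n != len(pred) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_g * var_p)


def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)
```

For example, with hypothetical gold scores `[0.10, 0.35, 0.60]` and predictions `[0.15, 0.30, 0.70]`, `pearson` measures how well the predictions track the ranking and spacing of the gold scores, while `mean_absolute_error` measures how far off they are on average. A higher Pearson value and a lower Mean Absolute Error both indicate better predictions.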
