Paper Title

Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models

Paper Authors

Suhyune Son, Chanjun Park, Jungseob Lee, Midan Shim, Chanhee Lee, Yoonna Jang, Jaehyung Seo, Heuiseok Lim

Paper Abstract

As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows a power-law distribution, and most languages belong to the long tail of the distribution. Some research areas attempt to mitigate this problem. For example, in cross-lingual transfer learning and multilingual training, the goal is to benefit long-tail languages via the knowledge acquired from resource-rich languages. Although successful, existing work has mainly focused on experimenting with as many languages as possible. As a result, targeted in-depth analysis is mostly absent. In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT). To make the transfer scenario challenging, we choose Korean as the target language, as it is a language isolate and thus shares almost no typology with English. Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data but also is highly efficient in the transfer process.
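The abstract describes cross-lingual post-training only at a high level. As a rough illustration of what transferring an English-pretrained model to Korean could look like in practice, the sketch below continues masked-language-model training with a Korean vocabulary while initially updating only the embedding layer. This is a minimal sketch under assumed choices (the checkpoints "roberta-base" and "klue/roberta-base", the freezing schedule, and all hyperparameters are illustrative), not the paper's exact XPT recipe.

```python
# Minimal sketch of cross-lingual post-training (illustrative, not the authors'
# exact method): continue masked-LM training of an English-pretrained
# Transformer on Korean text, learning new Korean embeddings first.
import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# English-pretrained source model and a Korean tokenizer (illustrative choices).
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
ko_tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")

# Resize to the Korean vocabulary and re-initialize the word embeddings so
# nothing remains tied to the English vocabulary; these rows must be learned
# from target-language data.
model.resize_token_embeddings(len(ko_tokenizer))
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)

# Phase 1: freeze the Transformer body so only the embedding layer adapts.
for name, param in model.named_parameters():
    param.requires_grad = "embeddings" in name

collator = DataCollatorForLanguageModeling(tokenizer=ko_tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5
)

# Tiny toy batch of Korean sentences; in practice this would be a large corpus.
korean_sentences = ["한국어 문장 예시입니다.", "사전학습 모델을 한국어로 이식합니다."]
batch = collator(
    [ko_tokenizer(s, truncation=True, max_length=128) for s in korean_sentences]
)

model.train()
loss = model(**batch).loss  # masked-LM loss on the Korean batch
loss.backward()
optimizer.step()

# A later phase (not shown) could unfreeze all parameters and continue
# training end to end on the target language.
```

The two-phase idea (adapt new embeddings with the body frozen, then train everything) is one common way to reuse an English-pretrained body for a typologically distant language; the paper's actual transfer procedure and data scale should be taken from the paper itself.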
