Paper Title

CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP

Paper Authors

Libo Qin, Minheng Ni, Yue Zhang, Wanxiang Che

Paper Abstract

Multi-lingual contextualized embeddings, such as multilingual BERT (mBERT), have shown success in a variety of zero-shot cross-lingual tasks. However, these models are limited by having inconsistent contextualized representations of subwords across different languages. Existing work addresses this issue with bilingual projection and fine-tuning techniques. We propose a data augmentation framework that generates multi-lingual code-switching data to fine-tune mBERT, encouraging the model to align representations of the source and multiple target languages at once by mixing their context information. Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages. Experimental results on five tasks covering 19 languages show that our method leads to significantly improved performance on all tasks compared with mBERT.
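
The augmentation idea described in the abstract can be sketched in a few lines. Below is a minimal Python illustration of dictionary-based code-switching, assuming word-level replacement via bilingual dictionaries; the function name `code_switch`, the dictionary format, and the replacement ratio are illustrative assumptions, not the paper's exact implementation.

```python
import random

def code_switch(tokens, dictionaries, replace_ratio=0.5):
    """Randomly replace source-language tokens with translations
    drawn from randomly chosen target languages.

    tokens:       list of source-language words
    dictionaries: {lang: {src_word: [translations]}} (assumed format)
    """
    languages = list(dictionaries)
    switched = []
    for token in tokens:
        if random.random() < replace_ratio:
            lang = random.choice(languages)               # pick a target language at random
            translations = dictionaries[lang].get(token.lower())
            if translations:                              # replace only if a translation exists
                switched.append(random.choice(translations))
                continue
        switched.append(token)                            # otherwise keep the source word
    return switched

# Example: mix German and Spanish translations into an English sentence.
dicts = {
    "de": {"how": ["wie"], "weather": ["wetter"]},
    "es": {"is": ["es"], "the": ["el", "la"]},
}
print(code_switch("how is the weather today".split(), dicts))
```

Because each sentence mixes several target languages in context, a single fine-tuning pass can align the source representation with many target languages simultaneously, which is what removes the need for one training run per language pair.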
