Paper Title
Style Variation as a Vantage Point for Code-Switching
Paper Authors
Paper Abstract
Code-Switching (CS) is a common phenomenon in many bilingual and multilingual communities, and it has consequently become prevalent on digital and social media platforms. This growing prominence creates a need to model CS languages for critical downstream tasks. A major problem in this domain is the dearth of annotated data and of substantial corpora for training large-scale neural models. Generating large amounts of quality text assists several downstream tasks that rely heavily on language modeling, such as speech recognition and text-to-speech synthesis. We present a novel vantage point on CS: treating it as style variation between the two participating languages. Our approach does not need any external annotations such as lexical language IDs. It relies mainly on easily obtainable monolingual corpora, without any parallel alignment, and on a limited set of naturally code-switched sentences. We propose a two-stage generative adversarial training approach in which the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences. We present experiments on the following language pairs: Spanish-English, Mandarin-English, Hindi-English and Arabic-French. We show that, through the dual-stage training process, the trends in metrics for generated CS move closer to those of real CS data in each of these language pairs. We believe that viewing CS as style variation opens new perspectives for modeling various tasks in CS text.