Paper Title

Learning to Recognize Dialect Features

Authors

Dorottya Demszky, Devyani Sharma, Jonathan H. Clark, Vinodkumar Prabhakaran, Jacob Eisenstein

Abstract

Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in "He {} running". In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.
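
As a rough illustration of the ideas in the abstract (not the authors' implementation), the sketch below shows how a minimal pair could be turned into labeled training examples and how per-utterance feature predictions could be summarized as a density score. The names `MinimalPair`, `pairs_to_examples`, and `dialect_density` are hypothetical, and defining density as the mean number of detected features per utterance is an assumption made for this example. Only the copula-deletion pair comes from the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class MinimalPair:
    """Two utterances that differ only in whether one dialect feature is present."""
    feature: str          # hypothetical feature identifier
    with_feature: str     # utterance that exhibits the feature
    without_feature: str  # matched utterance that does not


def pairs_to_examples(pairs: Iterable[MinimalPair]) -> List[Tuple[str, str, int]]:
    """Expand each minimal pair into one positive and one negative example.

    Each example is (utterance, feature, label), with label 1 when the
    feature is present and 0 when it is absent.
    """
    examples = []
    for pair in pairs:
        examples.append((pair.with_feature, pair.feature, 1))
        examples.append((pair.without_feature, pair.feature, 0))
    return examples


def dialect_density(predictions: List[Set[str]]) -> float:
    """Mean number of detected features per utterance (assumed definition).

    `predictions` holds, for each utterance, the set of feature names that
    some detector flagged as present.
    """
    if not predictions:
        return 0.0
    return sum(len(features) for features in predictions) / len(predictions)


# The copula-deletion example from the abstract, written as a minimal pair.
PAIRS = [MinimalPair("copula_deletion", "He running", "He is running")]
EXAMPLES = pairs_to_examples(PAIRS)
```

In this sketch a trained detector would supply the per-utterance feature sets passed to `dialect_density`; here they would have to be provided by hand.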
