Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi

Paper Title

Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi

Authors

Mehta, Devansh; Santy, Sebastin; Mothilal, Ramaravind Kommiya; Srivastava, Brij Mohan Lal; Sharma, Alok; Shukla, Anurag; Prasad, Vishnu; U, Venkanna; Sharma, Amit; Bali, Kalika

Abstract

The primary obstacle to developing technologies for low-resource languages is the lack of usable data. In this paper, we report the adoption and deployment of 4 technology-driven methods of data collection for Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. In the process of data collection, we also help in its revival by expanding access to information in Gondi through the creation of linguistic resources that can be used by the community, such as a dictionary, children's stories, an app with Gondi content from multiple sources and an Interactive Voice Response (IVR) based mass awareness platform. At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences and identified more than 650 community members whose help can be solicited for future translation efforts. The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies like machine translation and speech to text systems that can help take the language onto the internet.
