Unsupervised Separation of Native and Loanwords for Malayalam and Telugu

Paper title

Unsupervised Separation of Native and Loanwords for Malayalam and Telugu

Authors

Sridhama Prakhya, Deepak P

Abstract

Quite often, words from one language are adopted within a different language without translation; these words appear in transliterated form in text written in the latter language. This phenomenon is particularly widespread within Indian languages where many words are loaned from English. In this paper, we address the task of identifying loanwords automatically and in an unsupervised manner, from large datasets of words from agglutinative Dravidian languages. We target two specific languages from the Dravidian family, viz., Malayalam and Telugu. Based on familiarity with the languages, we outline an observation that native words in both these languages tend to be characterized by a much more versatile stem - stem being a shorthand to denote the subword sequence formed by the first few characters of the word - than words that are loaned from other languages. We harness this observation to build an objective function and an iterative optimization formulation to optimize for it, yielding a scoring of each word's nativeness in the process. Through an extensive empirical analysis over real-world datasets from both Malayalam and Telugu, we illustrate the effectiveness of our method in quantifying nativeness effectively over available baselines for the task.
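The stem-versatility observation in the abstract can be illustrated with a rough sketch. This is not the paper's objective function or its iterative optimization, neither of which is given here. It is a single-pass count that assumes a stem is the first `stem_len` characters of a word and that "versatility" means the number of distinct suffixes seen after that stem. The function names, the `stem_len` parameter, and the toy word list are all hypothetical.

```python
from collections import defaultdict


def stem_versatility(words, stem_len=3):
    """Count the distinct suffixes observed after each stem.

    Assumption: a stem is simply the first `stem_len` characters of a word.
    """
    suffixes = defaultdict(set)
    for word in set(words):
        if len(word) > stem_len:
            suffixes[word[:stem_len]].add(word[stem_len:])
    return {stem: len(sfx) for stem, sfx in suffixes.items()}


def score_words(words, stem_len=3):
    """Give each word the versatility of its stem as a crude nativeness proxy."""
    versatility = stem_versatility(words, stem_len)
    return {w: versatility.get(w[:stem_len], 0) for w in set(words)}


# Synthetic tokens, not real Malayalam or Telugu words.
toy_words = ["natiaa", "natibb", "naticc", "natidd", "loanaa", "loanbb"]
scores = score_words(toy_words, stem_len=4)
for word, score in sorted(scores.items()):
    print(word, score)
```

In this toy list the stem `nati` is followed by four distinct suffixes and `loan` by two, so words sharing the more versatile stem score higher. A real system would need the paper's full formulation to turn such counts into nativeness scores.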
