Paper title
Feature Selection on Noisy Twitter Short Text Messages for Language Identification
Paper authors
Paper abstract
The task of written language identification typically involves detecting the languages present in a sample of text. A sequence of text need not belong to a single language; it may be a mixture of text written in multiple languages. Such text is generated in large volumes on social media platforms because of their flexible and user-friendly environment. It contains a very large number of features, which are essential for developing statistical, probabilistic, and other kinds of language models. This large feature set includes informative features as well as irrelevant and redundant ones, which have diverse effects on the performance of the learning model. Feature selection methods are therefore important for choosing the features most relevant to an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are two of the most widely spoken languages of India. We apply different feature selection algorithms across various learning algorithms to analyze the effect of both the algorithm and the number of features on task performance. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over many classifiers. Finally, an exhaustive comparative analysis of the overall experiments conducted for the task is presented.
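To make the idea of n-gram profiles combined with feature selection concrete, the following is a minimal sketch and not the authors' pipeline. It assumes character n-grams as word-level features and a chi-square presence score as the selection criterion; the abstract does not name the specific selection algorithms used, so chi-square is only an illustrative choice. The function names (`char_ngrams`, `chi_square_scores`, `select_top_k`) and the four toy words are hypothetical and are not drawn from the paper's dataset.

```python
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(word: str, n_min: int = 1, n_max: int = 3) -> List[str]:
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def chi_square_scores(samples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score each n-gram with the chi-square statistic of its 2x2
    (n-gram present/absent) x (label A/label B) contingency table."""
    labels = sorted({label for _, label in samples})
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two labels")
    first = labels[0]
    total = len(samples)
    n_first = sum(1 for _, label in samples if label == first)
    n_second = total - n_first

    present_first: Counter = Counter()
    present_second: Counter = Counter()
    for word, label in samples:
        target = present_first if label == first else present_second
        target.update(set(char_ngrams(word)))  # count presence, not frequency

    scores: Dict[str, float] = {}
    for gram in set(present_first) | set(present_second):
        a = present_first[gram]   # present, first label
        b = present_second[gram]  # present, second label
        c = n_first - a           # absent, first label
        d = n_second - b          # absent, second label
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


# Hypothetical toy words (romanized Hindi vs. English), for illustration only.
TOY_SAMPLES = [("kya", "hi"), ("hai", "hi"), ("the", "en"), ("this", "en")]

if __name__ == "__main__":
    toy_scores = chi_square_scores(TOY_SAMPLES)
    for gram in select_top_k(toy_scores, 5):
        print(gram, round(toy_scores[gram], 3))
```

In a setup like this, the selected n-grams would define the feature vector passed to a classifier, and varying `k` is one way to study how the number of retained features affects performance.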