Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches

Paper Title

Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches

Authors

Dalai, Tusarkanta; Mishra, Tapas Kumar; Sa, Pankaj K

Abstract

Automatic part-of-speech (POS) tagging is a preprocessing step for many natural language processing (NLP) tasks, such as named entity recognition (NER), speech processing, information extraction, word sense disambiguation, and machine translation. It has achieved promising results for English and European languages, but for Indian languages, and Odia in particular, it remains underexplored because of the lack of supporting tools and resources and the morphological richness of the language. We were unable to locate an open-source POS tagger for Odia, and only a handful of attempts have been made to develop one. The main contribution of this work is to present a conditional random field (CRF) approach and deep learning-based approaches (a convolutional neural network (CNN) and a bidirectional long short-term memory (Bi-LSTM) network) for building an Odia POS tagger. We used a publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset. Most languages around the globe, however, use datasets annotated with the Universal Dependencies (UD) tagset, so to maintain uniformity the Odia dataset should use the same tagset. We therefore constructed a simple mapping from the BIS tagset to the UD tagset. We experimented with various feature sets as input to the CRF model and observed the impact of each constructed feature set. The deep learning-based model comprises a Bi-LSTM network, a CNN network, a CRF layer, character sequence information, and pre-trained word vectors. Character sequence information was extracted using a CNN and a Bi-LSTM network. Six different combinations of neural sequence labelling models were implemented and their performance measures investigated. The Bi-LSTM model with character sequence features and pre-trained word vectors achieved state-of-the-art results.
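
The abstract mentions two preprocessing ideas that are easy to picture in code: mapping BIS tags to UD tags, and building per-token feature sets for a CRF. The Python sketch below is only an illustration under stated assumptions. The mapping table, the feature names, the helper functions (`bis_to_ud`, `token_features`, `sentence_to_training_pair`), and the sample sentence are all hypothetical and are not taken from the paper, whose actual mapping and feature sets are not given in this abstract.

```python
# Hypothetical BIS -> UD (UPOS) mapping for a handful of common tags.
# Assumption: this is NOT the paper's mapping table, only a plausible subset.
BIS_TO_UD = {
    "N_NN": "NOUN",      # common noun
    "N_NNP": "PROPN",    # proper noun
    "PR_PRP": "PRON",    # personal pronoun
    "DM_DMD": "DET",     # demonstrative
    "V_VM": "VERB",      # main verb
    "V_VAUX": "AUX",     # auxiliary verb
    "JJ": "ADJ",         # adjective
    "RB": "ADV",         # adverb
    "PSP": "ADP",        # postposition
    "CC_CCD": "CCONJ",   # coordinating conjunction
    "CC_CCS": "SCONJ",   # subordinating conjunction
    "QT_QTC": "NUM",     # cardinal number
    "RD_PUNC": "PUNCT",  # punctuation
    "RD_SYM": "SYM",     # symbol
}


def bis_to_ud(tag):
    """Map a BIS tag to a UD tag, falling back to 'X' for unmapped tags."""
    return BIS_TO_UD.get(tag, "X")


def token_features(tokens, i):
    """Build a feature dict for token i (illustrative feature set, assumed)."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],    # affixes help with morphologically rich languages
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "length": len(word),
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
    else:
        feats["BOS"] = True     # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
    else:
        feats["EOS"] = True     # end of sentence
    return feats


def sentence_to_training_pair(tagged_sentence):
    """Turn [(word, bis_tag), ...] into (feature dicts, UD labels)."""
    tokens = [word for word, _ in tagged_sentence]
    features = [token_features(tokens, i) for i in range(len(tokens))]
    labels = [bis_to_ud(tag) for _, tag in tagged_sentence]
    return features, labels


# Illustrative sentence with assumed BIS tags.
sample = [("ମୁଁ", "PR_PRP"), ("ଭାତ", "N_NN"), ("ଖାଏ", "V_VM"), ("।", "RD_PUNC")]
X, y = sentence_to_training_pair(sample)
```

Per-token feature dictionaries of this shape are a common input format for CRF toolkits, so swapping in a different feature set only requires editing `token_features`. Unmapped BIS tags fall back to the UD tag `X` here, which is one possible design choice rather than something the abstract specifies.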
