Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories

Paper title

Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories

Authors

Pal, Aditya; Karn, Bhaskar

Abstract

Thousands of short stories and articles are being written in many different languages all around the world today. Bengali, or Bangla, is the second most spoken language in India after Hindi and is the national language of Bangladesh. This work reports in detail the creation of Anubhuti -- the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories. We explain the data collection methods, the manual annotation process, and the resulting high inter-annotator agreement of the dataset, which we attribute to the linguistic expertise of the annotators and the clear labelling methodology followed. We also address some of the challenges faced in collecting raw data and annotating a low-resource language like Bengali. We have verified the performance of our dataset with baseline Machine Learning models as well as a Deep Learning model for emotion classification and have found that these standard models achieve high accuracy and relevant feature selection on Anubhuti. In addition, we explain how this dataset can be of interest to linguists and data analysts studying the flow of emotions as expressed by writers of Bengali literature.
