Paper Title
Quality of Word Embeddings on Sentiment Analysis Tasks
Paper Authors
Paper Abstract
Word embeddings, or distributed representations of words, are used in various applications such as machine translation, sentiment analysis, and topic identification. The quality of word embeddings and the performance of their applications depend on several factors, such as the training method, corpus size, and corpus relevance. In this study we compare the performance of a dozen pretrained word embedding models on a lyrics sentiment analysis task and a movie review polarity task. According to our results, embeddings trained on Twitter tweets perform best on lyrics sentiment analysis, whereas those trained on Google News and Common Crawl are the top performers on movie review polarity analysis. Models trained with GloVe slightly outperform those trained with Skip-gram. Factors such as topic relevance and corpus size also significantly impact the quality of the models. When a medium-sized or large text collection is available, obtaining word embeddings from the same training dataset is usually the best choice.
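To make the comparison concrete, the sketch below shows one common way such an evaluation can be set up: load a pretrained embedding model (here GloVe vectors trained on Twitter, fetched via gensim's downloader), average the word vectors of each document, and score a linear classifier on top. The model name, the toy documents, and the averaging-plus-logistic-regression setup are illustrative assumptions for this sketch, not the authors' exact pipeline.

```python
# Minimal sketch of evaluating a pretrained word embedding model on a
# sentiment polarity task: average each document's word vectors and
# cross-validate a linear classifier on the resulting features.
# Assumes gensim and scikit-learn are installed; the documents below are a
# toy placeholder, not the lyrics or movie-review corpora used in the paper.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One of the pretrained models available through gensim's downloader
# (GloVe vectors trained on Twitter tweets, 25 dimensions).
model = api.load("glove-twitter-25")

def doc_vector(text, model):
    """Average the embeddings of in-vocabulary tokens; zeros if none found."""
    tokens = [t for t in text.lower().split() if t in model.key_to_index]
    if not tokens:
        return np.zeros(model.vector_size)
    return np.mean([model[t] for t in tokens], axis=0)

# Toy labelled documents (1 = positive, 0 = negative) standing in for real data.
texts = ["i love this song", "what a wonderful melody",
         "this movie was terrible", "boring and disappointing plot"]
labels = [1, 1, 0, 0]

X = np.vstack([doc_vector(t, model) for t in texts])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=2)
print("mean accuracy:", scores.mean())
```

Swapping in a different downloader model name (for example, vectors trained on Google News or Common Crawl) while keeping the classifier fixed is one way to attribute performance differences to the embeddings themselves.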