Paper title
CS-Embed at SemEval-2020 Task 9: The effectiveness of code-switched word embeddings for sentiment analysis
Paper authors
Paper abstract
The growing popularity and range of applications of sentiment analysis of social media posts have naturally led to sentiment analysis of posts written in multiple languages, a practice known as code-switching. While recent research into code-switched posts has focused on the use of multilingual word embeddings, these embeddings were not trained on code-switched data. In this work, we present word embeddings trained on code-switched tweets, specifically those that mix Spanish and English, known as Spanglish. We explore the embedding space to discover how it captures the meanings of words in both languages. We test the effectiveness of these embeddings by participating in SemEval-2020 Task 9: Sentiment Analysis on Code-Mixed Social Media Text. We used them to train a sentiment classifier that achieves an F1 score of 0.722. This is higher than the competition baseline of 0.656, and our team (CodaLab username francesita) ranked 14th out of 29 participating teams.
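One common way to explore an embedding space like the one the abstract describes is to look up a word's nearest neighbours by cosine similarity and check whether translations in the other language appear among them. The sketch below is only an illustration of that idea, not the authors' procedure: the function names `cosine` and `nearest` are hypothetical, and the three-dimensional vectors in `TOY_VECTORS` are invented for the example rather than taken from the trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words most similar to `word`, best match first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # a word is trivially its own nearest neighbour
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, invented for illustration only.
# Real embeddings would have hundreds of dimensions and be learned from tweets.
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "feliz": [0.8, 0.2, 0.1],
    "sad": [-0.7, 0.3, 0.1],
    "triste": [-0.8, 0.2, 0.0],
}

if __name__ == "__main__":
    for query in ("happy", "triste"):
        print(query, nearest(query, TOY_VECTORS, k=1))
```

In an embedding space trained on code-switched text, one would hope that an English word and its Spanish counterpart end up close together, as the toy vectors are constructed to do here.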