Paper Title
Covid-Transformer: Detecting COVID-19 Trending Topics on Twitter Using Universal Sentence Encoder
Paper Authors
Paper Abstract
The novel coronavirus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe. Given its global impact, COVID-19 has become a major concern of people almost everywhere, and a large number of tweets about COVID-19-related topics are therefore posted from every corner of the world. In this work, we analyze these tweets to detect the trending topics and major concerns of people on Twitter, which can help us better understand the situation and devise better plans. More specifically, we propose a model based on the Universal Sentence Encoder to detect the main topics of tweets in recent months. We use the Universal Sentence Encoder to derive semantic representations of tweets and the similarity between them. We then feed the sentence embeddings and their similarities to the K-means clustering algorithm to group semantically similar tweets. After that, a summary of each cluster is obtained using a deep-learning-based text summarization algorithm, which can uncover the underlying topic of each cluster. Through experimental results, we show that our model can detect very informative topics by processing a large number of tweets at the sentence level (which preserves the overall meaning of the tweets). Since this framework places no restriction on the data distribution, it can be used to detect trending topics on any other social media platform and in contexts other than COVID-19. Experimental results show the superiority of our proposed approach over other baselines, including TF-IDF and latent Dirichlet allocation (LDA).
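The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the tweet embeddings have already been computed (in the paper they would come from the Universal Sentence Encoder; here they are small hand-written vectors), and the function names, the farthest-point initialization, and the iteration limit are all illustrative choices. The summarization step is not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(embeddings, k, iterations=50):
    """Lloyd's K-means on unit-normalized embeddings.

    Initialization is deterministic (an assumption of this sketch): the
    first point is the first centroid, and each further centroid is the
    point farthest from all centroids chosen so far.
    Returns one cluster label per input embedding.
    """
    points = [normalize(e) for e in embeddings]
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 3-dimensional "embeddings" for six tweets, two topics.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 1.0],
    [0.1, 0.2, 0.8],
]
toy_labels = kmeans(toy_embeddings, k=2)
```

Because the vectors are normalized first, Euclidean K-means groups tweets by cosine similarity, which is one common way to cluster sentence embeddings. In practice a library implementation and real encoder outputs would replace both the toy vectors and this hand-written loop.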