Paper Title
EDNA-Covid: A Large-Scale Covid-19 Tweets Dataset Collected with the EDNA Streaming Toolkit
Authors
Abstract
The Covid-19 pandemic has fundamentally altered many facets of our lives. With nationwide lockdowns and stay-at-home advisories, conversations about the pandemic have naturally moved to social networks such as Twitter. The resulting high-volume, high-velocity, high-noise Covid-19 Twitter feed affords unprecedented insight into how social discourse evolves in the presence of a long-running destabilizing factor such as a pandemic. However, real-time information extraction from such a data stream requires a fault-tolerant streaming infrastructure to perform the non-trivial integration of heterogeneous data sources from news organizations, social feeds, and authoritative medical organizations like the CDC. To address this, we present (i) the EDNA streaming toolkit for consuming and processing streaming data, and (ii) EDNA-Covid, a multilingual, large-scale dataset of coronavirus-related tweets collected with EDNA since January 25, 2020. At the time of publication, EDNA-Covid includes over 600M tweets from around the world in over 10 languages. We release both the EDNA toolkit and the EDNA-Covid dataset to the public so that they can be used to extract valuable insights on this extraordinary social event.