Paper Title

Decay No More: A Persistent Twitter Dataset for Learning Social Meaning

Authors

Chiyu Zhang, Muhammad Abdul-Mageed, El Moatez Billah Nagoudi

Abstract

With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of $17$ social meaning datasets in $10$ categories of tasks. We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
