Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

Paper title

Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

Authors

Asahi Ushio, Leonardo Neves, Vitor Silva, Francesco Barbieri, Jose Camacho-Collados

Abstract

Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested on well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different: its noisy and dynamic nature adds another layer of complexity. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and analyze language model performance on the task, especially the impact of different time periods. In particular, we focus on three important temporal aspects: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative when recently labeled data is lacking. TweetNER7 is released publicly (https://huggingface.co/datasets/tner/tweetner7) along with the models fine-tuned on it.
