Paper title

Where did you tweet from? Inferring the origin locations of tweets based on contextual information

Authors

Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read

Abstract

Public conversations on Twitter comprise many pertinent topics, including disasters, protests, politics, propaganda, sports, climate change, and epidemic/pandemic outbreaks, that can have both regional and global aspects. Spatial discourse analysis relies on geographical data. However, today less than 1% of tweets are geotagged, counting both point location and bounding place information. A major issue with tweets is that Twitter users can be at location A and exchange conversations specific to location B, which we call the Location A/B problem. The problem is considered solved if location entities can be classified as either origin locations (Location As) or non-origin locations (Location Bs). In this work, we propose a simple yet effective framework, the True Origin Model, to address the problem; it uses machine-level natural language understanding to identify tweets that conceivably contain their origin location information. The model achieves promising accuracy at the country (80%), state (67%), city (58%), county (56%), and district (64%) levels with support from a Location Extraction Model as basic as the CoNLL-2003-based RoBERTa. We employ a tweet contextualizer (locBERT), one of the core components of the proposed model, to investigate multiple tweet distributions and understand Twitter users' tweeting behavior in terms of mentioning origin and non-origin locations. We also highlight a major concern with the methodology behind the currently regarded gold standard test set (ground truth), introduce a new data set, and identify further research avenues for advancing the area.
