Twitter-Demographer: A Flow-based Tool to Enrich Twitter Data

Paper Title

Twitter-Demographer: A Flow-based Tool to Enrich Twitter Data

Authors

Federico Bianchi, Vincenzo Cutrona, Dirk Hovy

Abstract

Twitter data have become essential to Natural Language Processing (NLP) and social science research, driving various scientific discoveries in recent years. However, the textual data alone are often not enough to conduct studies: especially social scientists need more variables to perform their analysis and control for various factors. How we augment this information, such as users' location, age, or tweet sentiment, has ramifications for anonymity and reproducibility, and requires dedicated effort. This paper describes Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with additional information about tweets and users. Twitter-Demographer is aimed at NLP practitioners and (computational) social scientists who want to enrich their datasets with aggregated information, facilitating reproducibility, and providing algorithmic privacy-by-design measures for pseudo-anonymity. We discuss our design choices, inspired by the flow-based programming paradigm, to use black-box components that can easily be chained together and extended. We also analyze the ethical issues related to the use of this tool, and the built-in measures to facilitate pseudo-anonymity.
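The flow-based design mentioned in the abstract, where black-box components are chained and each one enriches the data, can be illustrated with a minimal sketch. This is not the Twitter-Demographer API: the `Pipeline` class, the `add_text_length` and `pseudonymize_users` components, and the record fields `user_id` and `text` are all hypothetical names chosen for illustration. The hashing step only stands in for the idea of a pseudo-anonymity measure; the tool's actual privacy mechanism may differ.

```python
import hashlib
from typing import Callable, Dict, List

# A record is one tweet with its metadata (hypothetical field names).
Record = Dict[str, object]
# A component is a black box: it takes records and returns enriched records.
Component = Callable[[List[Record]], List[Record]]


class Pipeline:
    """Hypothetical flow-based pipeline: components run in the order they were added."""

    def __init__(self) -> None:
        self.components: List[Component] = []

    def add(self, component: Component) -> "Pipeline":
        # Return self so that calls can be chained.
        self.components.append(component)
        return self

    def run(self, records: List[Record]) -> List[Record]:
        # The output of each component is the input of the next one.
        for component in self.components:
            records = component(records)
        return records


def add_text_length(records: List[Record]) -> List[Record]:
    # Toy enrichment step: add a new variable derived from the tweet text.
    return [{**r, "text_length": len(str(r["text"]))} for r in records]


def pseudonymize_users(records: List[Record]) -> List[Record]:
    # Assumption: replace the raw user ID with its SHA-256 digest.
    # Unsalted hashing alone is a weak safeguard; it is shown only as a sketch.
    out = []
    for r in records:
        digest = hashlib.sha256(str(r["user_id"]).encode("utf-8")).hexdigest()
        out.append({**r, "user_id": digest})
    return out


def demo() -> List[Record]:
    records = [{"user_id": "12345", "text": "hello world"}]
    return Pipeline().add(add_text_length).add(pseudonymize_users).run(records)
```

Because every component shares the same input and output type, a new enrichment step can be added without touching the existing ones, which is the extensibility property the abstract attributes to the flow-based paradigm.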
