Variation across Scales: Measurement Fidelity under Twitter Data Sampling

Authors

Siqi Wu, Marian-Andrei Rizoiu, Lexing Xie

Abstract

A comprehensive understanding of data quality is the cornerstone of measurement studies in social media research. This paper presents in-depth measurements of the effects of Twitter data sampling across different timescales and different subjects (entities, networks, and cascades). By constructing complete tweet streams, we show that Twitter rate limit messages are an accurate indicator of the volume of missing tweets. Sampling also differs significantly across timescales. While the hourly sampling rate is influenced by the diurnal rhythm in different time zones, millisecond-level sampling is heavily affected by implementation choices. For Twitter entities such as users, we find that a Bernoulli process with a uniform rate approximates the empirical distributions well. It also allows us to estimate the true ranking from the observed sample data. For networks on Twitter, the structures are altered significantly, and some components are more likely to be preserved than others. For retweet cascades, we observe changes in the distributions of tweet inter-arrival time and user influence, which will affect models that rely on these features. This work calls attention to noise and potential biases in social data, and provides a few tools for measuring Twitter sampling effects.
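The uniform-rate Bernoulli model mentioned in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' code. It assumes that the overall sampling rate can be taken as collected / (collected + missing), with the missing volume read from rate limit messages, and that each tweet is kept independently with that probability. The user names, tweet counts, and helper function names are all hypothetical.

```python
import random


def sampling_rate(collected, missing):
    """Assumed estimate of the sampling rate: the fraction of all tweets
    that was collected, with `missing` taken from rate limit messages."""
    return collected / (collected + missing)


def bernoulli_sample(true_counts, rate, seed=0):
    """Keep each tweet independently with probability `rate`
    (uniform-rate Bernoulli sampling) and return per-user observed counts."""
    rng = random.Random(seed)
    return {
        user: sum(1 for _ in range(n) if rng.random() < rate)
        for user, n in true_counts.items()
    }


def estimate_true_counts(observed_counts, rate):
    """Scale observed counts by 1/rate. Under uniform Bernoulli sampling
    the observed count is Binomial(n, rate), so this estimator is unbiased."""
    return {user: n / rate for user, n in observed_counts.items()}


# Hypothetical numbers, for illustration only.
true_counts = {"user_a": 5000, "user_b": 1200, "user_c": 300}
rate = sampling_rate(collected=6000, missing=4000)

observed = bernoulli_sample(true_counts, rate)
estimated = estimate_true_counts(observed, rate)

# Rank users by estimated volume, largest first.
ranking = sorted(estimated, key=estimated.get, reverse=True)

for user in ranking:
    print(user, true_counts[user], observed[user], round(estimated[user]))
```

With well-separated volumes like these, the ranking recovered from the sample matches the true ranking. Entities with similar true volumes can swap places in the sample, which is why a ranking estimated from sampled data carries uncertainty.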
