Paper title
Incorrect Data in the Widely Used Inside Airbnb Dataset
Paper authors
Paper abstract
Several recently published papers in Decision Support Systems discussed issues related to data quality in Information Systems research. In this short research note, I build on the work introduced in these papers and document two data quality issues discovered in a large open dataset commonly used in research. Inside Airbnb (IA) collects data on places and reviews posted by users of Airbnb.com. Visitors can effortlessly download the data collected by IA for several locations around the globe. While the dataset is widely used in academic research, no thorough investigation of the dataset and its validity has been conducted. This note examines the dataset and explains an issue of incorrect data being added to it. The findings suggest that this issue can be attributed to systemic errors in the data collection process. The results suggest that the use of unverified open datasets can be problematic, although the discoveries presented in this work may not be significant enough to challenge all published research that used the IA dataset. Additionally, the findings indicate that the incorrect data results from a new feature implemented by Airbnb. Thus, unless changes are made, the consequences of this issue are likely to become more severe. Finally, this note explores why reproducibility is a problem when two different releases of the dataset are compared.