Paper Title

Accessible Data Curation and Analytics for International-Scale Citizen Science Datasets

Authors

Murray, Benjamin; Kerfoot, Eric; Graham, Mark S.; Sudre, Carole H.; Molteni, Erika; Canas, Liane S.; Antonelli, Michela; Klaser, Kerstin; Visconti, Alessia; Chan, Andrew T.; Franks, Paul W.; Davies, Richard; Wolf, Jonathan; Spector, Tim; Steves, Claire J.; Modat, Marc; Ourselin, Sebastien

Abstract

The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. Over 4.7 million participants and 189 million unique assessments have been logged since its introduction in March 2020. The success of the Covid Symptom Study creates technical challenges around effective data curation for two reasons. Firstly, the scale of the dataset means that it can no longer be easily processed using standard software on commodity hardware. Secondly, the size of the research group means that replicability and consistency of key analytics used across multiple publications become an issue. We present ExeTera, open-source data curation software designed to address scalability challenges and to enable reproducible research across an international research group for datasets such as the Covid Symptom Study dataset.
