Paper title
CosmoHub: Interactive exploration and distribution of astronomical data on Hadoop
Paper authors
Abstract
We present CosmoHub (https://cosmohub.pic.es), a web application based on Hadoop for the interactive exploration and distribution of massive cosmological datasets. Modern cosmology seeks to unveil the nature of both dark matter and dark energy by mapping the large-scale structure of the Universe through the analysis of massive amounts of astronomical data, whose volume has grown steadily over the last decades, and will keep growing in the coming ones, with the digitization and automation of experimental techniques. CosmoHub, hosted and developed at the Port d'Informació Científica (PIC), supports a worldwide community of scientists without requiring the end user to know any Structured Query Language (SQL). It serves data from several large international collaborations, such as the Euclid space mission, the Dark Energy Survey (DES), the Physics of the Accelerating Universe Survey (PAUS) and the Marenostrum Institut de Ciències de l'Espai (MICE) numerical simulations. While CosmoHub was originally developed as a web frontend to a PostgreSQL relational database, this work describes its current version, built on top of Apache Hive, which facilitates reading, writing and managing huge datasets at scale. As CosmoHub's datasets are seldom modified, Hive is a better fit. Over 60 TiB of catalogued information and $50 \times 10^9$ astronomical objects can be interactively explored using an integrated visualization tool that includes 1D histogram and 2D heatmap plots. In our current implementation, online exploration of datasets of $10^9$ objects takes tens of seconds. Users can also download customized subsets of data in standard formats, generated in a few minutes.