Elements of effective machine learning datasets in astronomy

Paper title

Elements of effective machine learning datasets in astronomy

Authors

Bernie Boscoe, Tuan Do, Evan Jones, Yunqi Li, Kevin Alfaro, Christy Ma

Abstract

In this work, we identify elements of effective machine learning datasets in astronomy and present suggestions for their design and creation. Machine learning has become an increasingly important tool for analyzing and understanding the large-scale flood of data in astronomy. To take advantage of these tools, datasets are required for training and testing. However, building machine learning datasets for astronomy can be challenging. Astronomical data are collected from instruments built to explore science questions in a traditional fashion rather than to conduct machine learning. Thus, it is often the case that raw data, or even downstream processed data, are not in a form amenable to machine learning. We explore the construction of machine learning datasets and ask: what elements define effective machine learning datasets? We define effective machine learning datasets in astronomy as those formed with well-defined data points, structure, and metadata. We discuss why these elements are important for astronomical applications and ways to put them into practice. We posit that these qualities not only make the data suitable for machine learning but also help to foster usable, reusable, and replicable science practices.
