Paper Title

A Case for Dataset Specific Profiling

Authors

Seth Ockerman, John Wu, Christopher Stewart

Abstract

Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets. With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications. For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost-prohibitive in terms of cloud resources. Benchmarking approaches used in practice use representative datasets to infer performance without actually executing models. While practicable, these approaches limit extensive dataset profiling to a few datasets and introduce bias that favors models suited for representative datasets. As a result, each dataset's unique characteristics are left unexplored and subpar models are selected based on inference from generalized datasets. This necessitates a new paradigm that introduces dataset profiling into the model selection process. To demonstrate the need for dataset-specific profiling, we answer two questions: (1) Can scientific datasets significantly permute the rank order of computational models compared to widely used representative datasets? (2) If so, could lightweight model execution improve benchmarking accuracy? Taken together, the answers to these questions lay the foundation for a new dataset-aware benchmarking paradigm.
