Automatically Debugging AutoML Pipelines using Maro: ML Automated Remediation Oracle (Extended Version)

Paper Title

Automatically Debugging AutoML Pipelines using Maro: ML Automated Remediation Oracle (Extended Version)

Authors

Julian Dolby, Jason Tsay, Martin Hirzel

Abstract

Machine learning in practice often involves complex pipelines for data cleansing, feature engineering, preprocessing, and prediction. These pipelines are composed of operators, which have to be correctly connected and whose hyperparameters must be correctly configured. Unfortunately, it is quite common for certain combinations of datasets, operators, or hyperparameters to cause failures. Diagnosing and fixing those failures is tedious and error-prone and can seriously derail a data scientist's workflow. This paper describes an approach for automatically debugging an ML pipeline, explaining the failures, and producing a remediation. We implemented our approach, which builds on a combination of AutoML and SMT, in a tool called Maro. Maro works seamlessly with the familiar data science ecosystem including Python, Jupyter notebooks, scikit-learn, and AutoML tools such as Hyperopt. We empirically evaluate our tool and find that for most cases, a single remediation automatically fixes errors, produces no additional faults, and does not significantly impact optimal accuracy or time to convergence.
