Paper Title

Trying to Outrun Causality with Machine Learning: Limitations of Model Explainability Techniques for Identifying Predictive Variables

Paper Authors

Vowels, Matthew J.

Paper Abstract

Machine Learning explainability techniques have been proposed as a means of `explaining' or interrogating a model in order to understand why a particular decision or prediction has been made. Such an ability is especially important at a time when machine learning is being used to automate decision processes which concern sensitive factors and legal outcomes. Indeed, it is even a requirement according to EU law. Furthermore, researchers concerned with imposing an overly restrictive functional form (e.g., as would be the case in a linear regression) may be motivated to use machine learning algorithms in conjunction with explainability techniques, as part of exploratory research, with the goal of identifying important variables which are associated with an outcome of interest. For example, epidemiologists might be interested in identifying `risk factors' - i.e. factors which affect recovery from disease - by using random forests and assessing variable relevance using importance measures. However, and as we demonstrate, machine learning algorithms are not as flexible as they might seem, and are instead incredibly sensitive to the underlying causal structure in the data. The consequences of this are that predictors which are, in fact, critical to a causal system and highly correlated with the outcome, may nonetheless be deemed by explainability techniques to be unrelated/unimportant/unpredictive of the outcome. Rather than this being a limitation of explainability techniques per se, we show that it is rather a consequence of the mathematical implications of regression, and the interaction of these implications with the associated conditional independencies of the underlying causal structure. We provide some alternative recommendations for researchers wanting to explore the data for important variables.
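
To make the abstract's central claim concrete, the following is a minimal illustrative sketch (not the paper's own experiments), assuming a simple mediation chain X -> M -> Y and using scikit-learn's random forest importance measures: X is strongly correlated with Y, yet because Y is conditionally independent of X given the mediator M, both impurity-based and permutation importances attribute essentially all relevance to M and almost none to X, even though X drives the whole causal system.

```python
# Illustrative sketch only: a variable that causes the outcome (via a mediator)
# is assigned near-zero importance once the mediator is included as a feature.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                   # upstream cause
m = 2.0 * x + 0.5 * rng.normal(size=n)   # mediator, caused by x
y = 3.0 * m + 0.5 * rng.normal(size=n)   # outcome, caused directly only by m

# x is highly correlated with y marginally (~0.97 in this setup)
print("corr(x, y) =", np.corrcoef(x, y)[0, 1])

features = np.column_stack([x, m])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(features, y)

# Because y is independent of x given m, importance measures concentrate on m.
print("impurity importances [x, m]:", rf.feature_importances_)
perm = permutation_importance(rf, features, y, n_repeats=10, random_state=0)
print("permutation importances [x, m]:", perm.importances_mean)
```

Under these assumptions the explainability output would suggest x is "unimportant", which is exactly the kind of misleading conclusion the paper warns about when importance measures are read causally.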
