Paper Title
To Split or Not to Split: The Impact of Disparate Treatment in Classification
Paper Authors
Abstract
Disparate treatment occurs when a machine learning model yields different decisions for individuals based on a sensitive attribute (e.g., age, sex). In domains where prediction accuracy is paramount, it could potentially be acceptable to fit a model which exhibits disparate treatment. To evaluate the effect of disparate treatment, we compare the performance of split classifiers (i.e., classifiers trained and deployed separately on each group) with group-blind classifiers (i.e., classifiers which do not use a sensitive attribute). We introduce the benefit-of-splitting, which quantifies the performance improvement achieved by splitting classifiers. Computing the benefit-of-splitting directly from its definition can be intractable since it involves solving optimization problems over an infinite-dimensional functional space. Under different performance measures, we (i) prove an equivalent expression for the benefit-of-splitting which can be efficiently computed by solving small-scale convex programs; (ii) provide sharp upper and lower bounds for the benefit-of-splitting which reveal precise conditions under which a group-blind classifier will always suffer a non-trivial performance gap relative to the split classifiers. In the finite sample regime, splitting is not necessarily beneficial, and we provide data-dependent bounds to understand this effect. Finally, we validate our theoretical results through numerical experiments on both synthetic and real-world datasets.
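To make the split-versus-group-blind comparison concrete, here is a minimal hypothetical sketch (not the paper's method) of an *empirical* benefit-of-splitting. It uses a toy one-dimensional model class (single-threshold classifiers) and synthetic data in which the two groups have different label-generating thresholds, so a single group-blind threshold cannot fit both groups well. All names, the data-generating process, and the choice of 0/1 training error are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: one feature x, binary label y, binary
# sensitive attribute a. The true decision boundary depends on the group,
# so splitting by group should help.
n = 2000
a = rng.integers(0, 2, n)                  # sensitive attribute (group id)
x = rng.normal(0.0, 1.0, n)
true_thresh = np.where(a == 0, -0.5, 0.5)  # group-dependent boundary
y = (x > true_thresh).astype(int)

def best_threshold_error(x, y):
    """0/1 training error of the best single-threshold classifier on (x, y)."""
    # Candidate thresholds: below all points (predict all 1) up to the
    # maximum point (predict all 0); the optimum lies at a data point.
    candidates = np.concatenate(([-np.inf], np.sort(x)))
    errors = [np.mean((x > t).astype(int) != y) for t in candidates]
    return min(errors)

# Group-blind classifier: one threshold fit on the pooled data,
# ignoring the sensitive attribute a.
blind_err = best_threshold_error(x, y)

# Split classifiers: a separate threshold per group; overall error is the
# group-size-weighted average of the per-group errors.
split_err = sum(
    np.mean(a == g) * best_threshold_error(x[a == g], y[a == g])
    for g in (0, 1)
)

# Empirical benefit-of-splitting: how much training error drops by
# fitting each group separately instead of a single group-blind model.
benefit = blind_err - split_err
print(f"group-blind error: {blind_err:.4f}")
print(f"split error:       {split_err:.4f}")
print(f"benefit:           {benefit:.4f}")
```

On training data this empirical benefit is always nonnegative, since the per-group optimal threshold is at least as good on its own group as any shared threshold; the paper's finite-sample analysis is precisely about when this advantage does or does not survive out of sample.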