Paper Title

Mitigating Class Boundary Label Uncertainty to Reduce Both Model Bias and Variance

Authors

Matthew Almeida, Wei Ding, Scott Crouter, Ping Chen

Abstract

The study of model bias and variance with respect to decision boundaries is critically important in supervised classification. There is generally a tradeoff between the two, as fine-tuning of the decision boundary of a classification model to accommodate more boundary training samples (i.e., higher model complexity) may improve training accuracy (i.e., lower bias) but hurt generalization against unseen data (i.e., higher variance). By focusing on just classification boundary fine-tuning and model complexity, it is difficult to reduce both bias and variance. To overcome this dilemma, we take a different perspective and investigate a new approach to handle inaccuracy and uncertainty in the training data labels, which are inevitable in many applications where labels are conceptual and labeling is performed by human annotators. The process of classification can be undermined by uncertainty in the labels of the training data; extending a boundary to accommodate an inaccurately labeled point will increase both bias and variance. Our novel method can reduce both bias and variance by estimating the pointwise label uncertainty of the training set and accordingly adjusting the training sample weights such that those samples with high uncertainty are weighted down and those with low uncertainty are weighted up. In this way, uncertain samples have a smaller contribution to the objective function of the model's learning algorithm and exert less pull on the decision boundary. In a real-world physical activity recognition case study, the data presents many labeling challenges, and we show that this new approach improves model performance and reduces model variance.
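
The reweighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how pointwise label uncertainty is estimated, so the k-nearest-neighbour disagreement score below is a stand-in chosen for illustration, not the authors' estimator. The function names (`knn_label_uncertainty`, `uncertainty_to_weights`, `weighted_log_loss`), the linear mapping from uncertainty to weight, and the toy data are all assumptions.

```python
import math


def knn_label_uncertainty(X, y, k=3):
    """Hypothetical uncertainty score (an assumption, not the paper's estimator):
    the fraction of a point's k nearest neighbours that carry a different label."""
    n = len(X)
    scores = []
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        neighbours = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbours if y[j] != y[i])
        scores.append(disagree / k)
    return scores


def uncertainty_to_weights(uncertainty):
    """Map uncertainty in [0, 1] to sample weights with mean 1.
    Low-uncertainty samples are weighted up, high-uncertainty samples down.
    The linear form (1 - u) is an illustrative choice."""
    raw = [1.0 - u for u in uncertainty]
    mean = sum(raw) / len(raw)
    if mean == 0.0:
        # Every sample is maximally uncertain: fall back to uniform weights.
        return [1.0] * len(raw)
    return [r / mean for r in raw]


def weighted_log_loss(probs, y, weights):
    """Weighted binary cross-entropy: each sample's contribution to the
    objective is scaled by its weight."""
    eps = 1e-12
    total = 0.0
    for p, t, w in zip(probs, y, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / sum(weights)


# Toy 1-D data: two clean clusters plus one point (x = 0.15) that sits in
# the class-0 cluster but is labelled 1, mimicking an annotation error.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (1.0,), (1.1,), (1.2,), (1.3,), (0.15,)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

uncertainty = knn_label_uncertainty(X, y, k=3)
weights = uncertainty_to_weights(uncertainty)

# A fixed "model" that predicts P(class 1) = 0.1 left of 0.5 and 0.9 right of it.
probs = [0.1 if x[0] < 0.5 else 0.9 for x in X]
plain_loss = weighted_log_loss(probs, y, [1.0] * len(X))
reweighted_loss = weighted_log_loss(probs, y, weights)
```

In this toy setup the suspicious point receives the lowest weight, so it contributes least to the objective and would exert correspondingly less pull on a decision boundary fitted by minimizing that objective. In practice the weights would typically be passed to a learner that accepts per-sample weights rather than to a hand-written loss.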
