Paper Title
Regularized Data Programming with Automated Bayesian Prior Selection
Paper Authors
Paper Abstract
The cost of manual data labeling can be a significant obstacle in supervised learning. Data programming (DP) offers a weakly supervised solution for training dataset creation, wherein the outputs of user-defined programmatic labeling functions (LFs) are reconciled through unsupervised learning. However, DP can fail to outperform an unweighted majority vote in some scenarios, including low-data contexts. This work introduces a Bayesian extension of classical DP that mitigates failures of unsupervised learning by augmenting the DP objective with regularization terms. Regularized learning is achieved through maximum a posteriori estimation with informative priors. Majority vote is proposed as a proxy signal for automated prior parameter selection. Results suggest that regularized DP improves performance relative to maximum likelihood and majority voting, confers greater interpretability, and bolsters performance in low-data regimes.
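To make the abstract's description concrete, the following is a minimal sketch of the kind of objective it describes, under assumed notation (the paper's own symbols may differ): let \Lambda denote the matrix of labeling-function outputs and \theta the label model parameters. Classical DP fits \theta by maximum likelihood over \Lambda alone; the Bayesian extension adds a log-prior term, so that MAP estimation regularizes the unsupervised objective:

% Assumed notation, for illustration only:
%   \log p(\Lambda \mid \theta): the classical (unsupervised) DP likelihood
%   \log p(\theta): informative prior acting as the regularization term
\[
\hat{\theta}_{\mathrm{MAP}}
  \;=\; \arg\max_{\theta} \; \log p(\Lambda \mid \theta) \;+\; \log p(\theta)
\]

Consistent with the abstract, the parameters of the informative prior p(\theta) would be selected automatically, with the unweighted majority vote over the labeling functions serving as the proxy signal for that selection.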