Paper Title
Sampling Bias Correction for Supervised Machine Learning: A Bayesian Inference Approach with Practical Applications
Paper Authors
Paper Abstract
Given a supervised machine learning problem where the training set has been subject to a known sampling bias, how can a model be trained to fit the original dataset? We achieve this through the Bayesian inference framework by altering the posterior distribution to account for the sampling function. We then apply this solution to binary logistic regression, and discuss scenarios where a dataset might be subject to intentional sample bias such as label imbalance. This technique is widely applicable for statistical inference on big data, from the medical sciences to image recognition to marketing. Familiarity with it will give the practitioner tools to improve their inference pipeline from data collection to model selection.