最近的邻居分类器超出了不完整的信息：从某些答案到某些预测

论文标题

最近的邻居分类器超出了不完整的信息：从某些答案到某些预测

Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions

论文作者

Karlaš, Bojan, Li, Peng, Wu, Renzhi, Gürel, Nezihe Merve, Chu, Xu, Wu, Wentao, Zhang, Ce

论文摘要

机器学习（ML）应用程序最近一直在蓬勃发展，这在很大程度上归因于数据的增加。但是，不一致和不完整的信息在现实数据集中无处不在，并且它们对ML应用程序的影响仍然难以捉摸。在本文中，我们通过扩展了CODD表的某些答案的概念，对这一影响进行了正式研究，该数据库研究社区已经探索了数十年来探讨的，该概念已探讨了机器学习领域。具体而言，我们专注于分类问题，并提出“某些预测”（CP）的概念 - 如果在所有可能因数据不完整而引起的所有可能的世界中训练的所有可能的分类器都会产生相同的预测，则可以肯定地预测（CP'ED）。我们研究了两个基本的CP查询：（Q1）检查确定数据示例是否可以CP的查询；（Q2）计算计算支持特定预测的分类器数（即标签）的查询。鉴于CP查询的一般解决方案毫不奇怪，如果没有对分类器的类型的假设，我们进一步在最近的邻居（NN）分类器的背景下进一步介绍了一个案例研究，在这种情况下，可以开发有效的CP查询解决方案 - 我们表明，可以在实用方面或多项态度以实现数量的时间来回答这两个查询，这是可能的。我们演示了CP的一个示例用例使用“机器学习数据清洁（ML的DC）”。我们表明，我们提出的基于CP构建的CPCLEAN方法通常可以通过轻度的手动清洁工作在分类准确性方面大大优于现有技术。

Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) -- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed -- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques in terms of classification accuracy with mild manual cleaning effort.

下载PDF全文

下载文献需遵守相关版权规定

论文标题