Approximate Selection with Guarantees using Proxies

Paper Title

Approximate Selection with Guarantees using Proxies

Authors

Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, Matei Zaharia

Abstract

Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle, often a human or an expensive machine learning model, our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can catastrophically fail in satisfying these recall and precision targets. We show that our algorithms can improve query result quality by up to 30x for both the precision and recall targets in both real and synthetic datasets.
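
To make the recall-target setting concrete, here is a minimal sketch of one simple way to pick a proxy-score threshold from a limited oracle budget. It is not the paper's algorithm, which the abstract does not describe in detail. It assumes plain uniform sampling with replacement and a distribution-free order-statistic bound. The names `recall_threshold`, `binom_cdf`, and the `oracle` callable are hypothetical, introduced only for illustration.

```python
import math
import random


def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k) for 0 < p < 1, with each term computed in log space."""
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log1p(-p))
    return sum(math.exp(log_pmf(i)) for i in range(k + 1))


def recall_threshold(proxy_scores, oracle, budget, target_recall, delta, rng):
    """Return tau such that {i : proxy_scores[i] >= tau} has recall >= target_recall
    with probability >= 1 - delta (assumption: 0 < target_recall < 1).
    """
    # Uniform sampling with replacement, so the sampled positives are i.i.d.
    # draws from the positive records of the dataset.
    sample = rng.choices(range(len(proxy_scores)), k=budget)
    # One oracle call per distinct sampled record (at most `budget` calls).
    labels = {i: oracle(i) for i in set(sample)}
    pos = sorted(proxy_scores[i] for i in sample if labels[i])
    m = len(pos)
    # Find the largest k with P(Binomial(m, 1 - target_recall) <= k - 1) <= delta.
    # Using the k-th smallest sampled positive score as the threshold then
    # misses the recall target with probability at most delta.
    k = 0
    while k < m and binom_cdf(k, m, 1.0 - target_recall) <= delta:
        k += 1
    if k == 0:
        return -math.inf  # too few sampled positives: return every record
    return pos[k - 1]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (illustrative only): about 5% positives with higher scores.
    truth = [rng.random() < 0.05 for _ in range(20000)]
    scores = [rng.uniform(0.3, 1.0) if t else rng.uniform(0.0, 0.7) for t in truth]
    tau = recall_threshold(scores, lambda i: truth[i], 1000, 0.9, 0.05, rng)
    selected = [i for i, s in enumerate(scores) if s >= tau]
    hits = sum(1 for i in selected if truth[i])
    print("threshold:", tau, "selected:", len(selected), "recall:", hits / sum(truth))
```

When too few positives are sampled to certify any threshold, the sketch falls back to returning the whole dataset, which meets the recall target trivially but with poor precision. Uniform sampling finds few positives when events are rare, so most of the oracle budget is spent on negatives.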
