Learned k-NN Distance Estimation

Paper Title

Learned k-NN Distance Estimation

Authors

Daichi Amagata, Yusuke Arai, Sumio Fujita, Takahiro Hara

Abstract

Big data mining is well known to be an important task in data science because it can reveal useful observations and new knowledge hidden in large datasets. Proximity-based data analysis is widely used in real-life applications. Such analysis usually relies on the distances to the k nearest neighbors, so its main bottleneck is data retrieval. Much effort has been made to improve the efficiency of these analyses. However, existing approaches still incur large costs because they inherently require many data accesses. To avoid this issue, we propose a machine-learning technique that quickly and accurately estimates the k-NN distances (i.e., the distances to the k nearest neighbors) of a given query. We train a fully connected neural network model and utilize pivots to achieve accurate estimation. Our model is designed to have useful advantages: it infers the distances to the k-NNs at once, and its inference time is O(1) (no data accesses are incurred), yet it maintains high accuracy. Our experimental results and case studies on real datasets demonstrate the efficiency and effectiveness of our solution.
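
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' architecture: the feature layout (raw query concatenated with query-to-pivot distances), the layer sizes, the random pivot selection, and the softplus-plus-cumulative-sum output layer are all assumptions made for illustration, and the network is left untrained. What it shows is that a forward pass touches only the pivots and the weights, so its cost does not depend on the dataset size, while the exact brute-force computation (used here only to produce a training label) scans every point.

```python
import numpy as np

def knn_distances(data, query, k):
    """Exact k-NN distances by brute force (scans the whole dataset)."""
    d = np.linalg.norm(data - query, axis=1)
    return np.sort(d)[:k]

def pivot_features(query, pivots):
    """Distances from the query to a fixed pivot set; size is independent of the dataset."""
    return np.linalg.norm(pivots - query, axis=1)

def init_mlp(rng, sizes):
    """Random weights for a fully connected network with the given layer sizes."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def estimate(params, query, pivots):
    """One forward pass that outputs estimates for all k distances at once."""
    h = np.concatenate([query, pivot_features(query, pivots)])
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU hidden layers
    w, b = params[-1]
    out = h @ w + b
    # Assumption (not from the paper): softplus followed by a cumulative sum
    # keeps the k outputs positive and non-decreasing, like real k-NN distances.
    return np.cumsum(np.logaddexp(0.0, out))

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))                      # hypothetical dataset
pivots = data[rng.choice(len(data), size=16, replace=False)]
k = 5
params = init_mlp(rng, [8 + 16, 32, k])                # untrained weights
query = rng.normal(size=8)

target = knn_distances(data, query, k)   # label a training loop would regress onto
pred = estimate(params, query, pivots)   # estimate without accessing `data`
```

In a full pipeline, `target` vectors computed offline for many sample queries would serve as regression labels, and only `pivots` and `params` would be needed at query time.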
