Learning similarity measures from data

Paper title

Learning similarity measures from data

Paper authors

Mathisen, Bjørn Magnus; Aamodt, Agnar; Bach, Kerstin; Langseth, Helge

Paper abstract

Defining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR), where the similarity measure is used to retrieve the stored case or set of cases most similar to the query case. Describing a similarity measure analytically is challenging, even for domain experts working with CBR experts. However, datasets are typically gathered as part of constructing a CBR or machine learning system. These datasets are assumed to contain the features that correctly identify the solution from the problem features; thus they may also contain the knowledge needed to construct or learn such a similarity measure. The main motivation for this work is to automate the construction of similarity measures using machine learning, while keeping training time as low as possible. Our objective is to investigate how to apply machine learning to effectively learn a similarity measure. Such a learned similarity measure could be used for CBR systems, but also for clustering data in semi-supervised learning, or for one-shot learning tasks. Recent work has advanced towards this goal, but relies on either very long training times or manually modeling parts of the similarity measure. We created a framework to help us analyze current methods for learning similarity measures. This analysis resulted in two novel similarity measure designs. The first design uses a pre-trained classifier as the basis for a similarity measure. The second design uses as little modeling as possible while learning the similarity measure from data and keeping training time low. Both similarity measures were evaluated on 14 different datasets. The evaluation shows that using a classifier as the basis for a similarity measure gives state-of-the-art performance. Finally, the evaluation shows that our fully data-driven similarity measure design outperforms state-of-the-art methods while keeping training time low.
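
To illustrate the retrieval step the abstract describes for CBR, here is a minimal Python sketch that returns the stored case most similar to a query case. The toy case base, the hand-written `similarity` function (a simple inverse-distance score), and the `retrieve` helper are hypothetical placeholders chosen for illustration only; they are not the learned similarity measures proposed in the paper.

```python
import math


def similarity(a, b):
    """Hypothetical hand-written similarity: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(a, b))


def retrieve(query, case_base, k=1):
    """Return the k (features, solution) cases most similar to the query."""
    ranked = sorted(
        case_base,
        key=lambda case: similarity(query, case[0]),
        reverse=True,
    )
    return ranked[:k]


# Toy case base (made-up data): problem features paired with a solution label.
case_base = [
    ((1.0, 1.0), "A"),
    ((5.0, 5.0), "B"),
    ((1.2, 0.9), "A"),
]

best = retrieve((1.1, 1.0), case_base, k=1)
print(best[0][1])
```

In a setting like the one the abstract describes, the hand-written `similarity` function would be replaced by a measure learned from the data, while the retrieval step itself could stay the same.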
