SimEx: Express Prediction of Inter-dataset Similarity by a Fleet of Autoencoders

Paper title

SimEx: Express Prediction of Inter-dataset Similarity by a Fleet of Autoencoders

Authors

Inseok Hwang, Jinho Lee, Frank Liu, Minsik Cho

Abstract

Knowing the similarity between sets of data has a number of positive implications for training an effective model, such as assisting an informed selection, out of known datasets, of those favorable to model transfer or data augmentation for an unknown dataset. Common practices for estimating the similarity between data include comparing in the original sample space, comparing in the embedding space of a model performing a certain task, or fine-tuning a pretrained model with different datasets and evaluating the resulting performance changes. However, these practices suffer from shallow comparisons, task-specific biases, or the extensive time and computation required to perform comparisons. We present SimEx, a new method for early prediction of inter-dataset similarity using a set of pretrained autoencoders, each of which is dedicated to reconstructing a specific part of known data. Specifically, our method takes unknown data samples as input to those pretrained autoencoders and evaluates the difference between the reconstructed output samples and their original input samples. Our intuition is that the more similarity exists between the unknown data samples and the part of known data that an autoencoder was trained on, the better the chance that this autoencoder can use its trained knowledge to reconstruct output samples closer to the originals. We demonstrate that our method achieves more than 10x speed-up in predicting inter-dataset similarity compared to common similarity-estimating practices. We also demonstrate that the inter-dataset similarity estimated by our method is well correlated with common practices and outperforms the baseline approaches of comparing in sample or embedding spaces, without newly training anything at comparison time.
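
The core idea in the abstract (feed unknown samples to each pretrained autoencoder and compare reconstructions with the inputs) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the "autoencoders" are one-dimensional linear ones fitted by power iteration (a stand-in for neural autoencoders), the discrepancy measure is mean squared error, and the names `fit_linear_autoencoder`, `reconstruction_error`, `rank_known_parts`, and `make_line_data` are hypothetical.

```python
import random

def fit_linear_autoencoder(samples, iters=100, seed=0):
    """Fit a toy linear autoencoder with a 1-D bottleneck.

    The model is (mean, w): encoding projects a centered sample onto the
    unit vector w, decoding maps the code back along w. w is the top
    principal direction, found by power iteration. This is a stand-in
    for a trained neural autoencoder, not the paper's architecture.
    """
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(c * c for c in w) ** 0.5
    w = [c / norm for c in w]
    for _ in range(iters):
        # v = (X^T X) w, computed without forming the covariance matrix.
        proj = [sum(x[j] * w[j] for j in range(d)) for x in centered]
        v = [sum(p * x[j] for p, x in zip(proj, centered)) for j in range(d)]
        norm = sum(c * c for c in v) ** 0.5
        if norm == 0.0:
            break
        w = [c / norm for c in v]
    return mean, w

def reconstruction_error(model, samples):
    """Mean squared error between samples and their reconstructions."""
    mean, w = model
    total = 0.0
    for s in samples:
        centered = [a - m for a, m in zip(s, mean)]
        code = sum(a * b for a, b in zip(centered, w))   # encode
        recon = [code * b for b in w]                    # decode
        total += sum((a - r) ** 2 for a, r in zip(centered, recon))
    return total / len(samples)

def rank_known_parts(fleet, unknown_samples):
    """Rank known data parts by how well their autoencoder reconstructs
    the unknown samples (lower error is read as higher similarity)."""
    errors = {name: reconstruction_error(model, unknown_samples)
              for name, model in fleet.items()}
    return sorted(errors.items(), key=lambda kv: kv[1])

def make_line_data(direction, n, rng, noise=0.05):
    """Synthetic samples scattered around a line through the origin."""
    return [[t * dc + rng.gauss(0.0, noise) for dc in direction]
            for t in (rng.uniform(-1.0, 1.0) for _ in range(n))]

rng = random.Random(42)
# One autoencoder per known data part; no training happens after this point.
fleet = {
    "horizontal": fit_linear_autoencoder(make_line_data((1.0, 0.0), 200, rng)),
    "vertical": fit_linear_autoencoder(make_line_data((0.0, 1.0), 200, rng)),
}
# Unknown data that lies close to the "horizontal" part.
unknown = make_line_data((1.0, 0.1), 50, rng)
ranking = rank_known_parts(fleet, unknown)
for name, err in ranking:
    print(f"{name}: reconstruction MSE = {err:.4f}")
```

In this sketch the comparison step is only a forward pass through each already-fitted model, which mirrors the abstract's point that nothing is newly trained at comparison time.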
