Paper Title
On Sampling Collaborative Filtering Datasets
Paper Authors
Paper Abstract
We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad hoc fashion, e.g., by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly used data sampling schemes can have significant consequences for algorithm performance. Following this observation, this paper makes three main contributions: (1) characterizing the effect of sampling on algorithm performance in terms of algorithm and dataset characteristics (e.g., sparsity characteristics, sequential dynamics); (2) designing SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling and is especially suited to long-tailed interaction data; and (3) developing an oracle, Data-Genie, which can suggest the sampling scheme most likely to preserve model performance for a given dataset. The main benefit of Data-Genie is that it allows recommender system practitioners to quickly prototype and compare various approaches while remaining confident that algorithm performance will be preserved once the algorithm is retrained and deployed on the complete data. Detailed experiments show that, using Data-Genie, we can discard up to 5x more data than any sampling strategy at the same level of performance.
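The two naive sampling schemes the abstract mentions (uniform random sampling and keeping the users with many interactions) can be illustrated with a minimal Python sketch. This is not the paper's SVP-CF or Data-Genie. The function names, the `(user, item)` tuple layout, and the toy data are assumptions made purely for illustration.

```python
import random
from collections import Counter


def random_interaction_sample(interactions, fraction, seed=0):
    """Keep a uniformly random fraction of the (user, item) interactions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = int(len(interactions) * fraction)
    return rng.sample(interactions, k)


def head_user_sample(interactions, fraction):
    """Keep every interaction of the most active `fraction` of users."""
    counts = Counter(user for user, _ in interactions)
    n_keep = max(1, int(len(counts) * fraction))
    # Rank users by descending interaction count; ties broken by user id.
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    keep = set(ranked[:n_keep])
    return [(u, i) for u, i in interactions if u in keep]


# Hypothetical toy interaction log: (user id, item id) pairs.
interactions = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "c"),
    ("u4", "d"),
]

random_subset = random_interaction_sample(interactions, 0.5)
head_subset = head_user_sample(interactions, 0.5)
```

On this toy log, the head-user scheme drops the two single-interaction users entirely, while the random scheme thins all users alike. That difference in what survives sampling is the kind of dataset-characteristic shift the abstract says can change algorithm performance.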