Paper Title

eTREE: Learning Tree-structured Embeddings

Authors

Almutairi, Faisal M., Wang, Yunlong, Wang, Dong, Zhao, Emily, Sidiropoulos, Nicholas D.

Abstract

Matrix factorization (MF) plays an important role in a wide range of machine learning and data mining models. MF is commonly used to obtain item embeddings and feature representations due to its ability to capture correlations and higher-order statistical dependencies across dimensions. In many applications, the categories of items exhibit a hierarchical tree structure. For instance, human diseases can be divided into coarse categories, e.g., bacterial and viral. These categories can be further divided into finer categories, e.g., viral infections can be respiratory, gastrointestinal, or exanthematous viral diseases. In e-commerce, products, movies, books, etc., are grouped into hierarchical categories, e.g., clothing items are divided by gender, then by type (formal, casual, etc.). While the tree structure and the categories of the different items may be known in some applications, they have to be learned together with the embeddings in many others. In this work, we propose eTREE, a model that incorporates the (usually ignored) tree structure to enhance the quality of the embeddings. We leverage the special uniqueness properties of Nonnegative MF (NMF) to prove identifiability of eTREE. The proposed model not only exploits the tree structure prior, but also learns the hierarchical clustering in an unsupervised data-driven fashion. We derive an efficient algorithmic solution and a scalable implementation of eTREE that exploits parallel computing, computation caching, and warm start strategies. We showcase the effectiveness of eTREE on real data from various application domains: healthcare, recommender systems, and education. We also demonstrate the meaningfulness of the tree obtained from eTREE by means of domain experts' interpretation.
