Paper Title

Federated Non-negative Matrix Factorization for Short Texts Topic Modeling with Mutual Information

Paper Authors

Shijing Si, Jianzong Wang, Ruiyi Zhang, Qinliang Su, Jing Xiao

Paper Abstract

Non-negative matrix factorization (NMF)-based topic modeling is widely used in natural language processing (NLP) to uncover hidden topics in short text documents. Usually, training a high-quality topic model requires a large amount of textual data. In many real-world scenarios, customer textual data is private and sensitive, which precludes uploading it to data centers. This paper proposes a Federated NMF (FedNMF) framework, which allows multiple clients to collaboratively train a high-quality NMF-based topic model on locally stored data. However, standard federated learning significantly undermines the performance of topic models in downstream tasks (e.g., text classification) when the data distribution over clients is heterogeneous. To alleviate this issue, we further propose FedNMF+MI, which simultaneously maximizes the mutual information (MI) between the count features of local texts and their topic weight vectors to mitigate the performance degradation. Experimental results show that our FedNMF+MI method outperforms both Federated Latent Dirichlet Allocation (FedLDA) and FedNMF without MI on short texts by a significant margin in both coherence score and classification F1 score.
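To make the setup concrete, below is a minimal sketch of what FedAvg-style federated NMF could look like: each client runs standard multiplicative NMF updates on its local term-document count matrix, and a server aggregates the resulting topic-word matrices by a document-count-weighted average. Every name here (nmf_local_update, federated_nmf, the averaging rule) is our own illustration under assumed conventions; the abstract does not specify the paper's actual aggregation scheme, and the MI regularizer of FedNMF+MI is omitted because no estimator is given here.

```python
import numpy as np

def nmf_local_update(V, W, H, n_iter=50, eps=1e-10):
    """Multiplicative updates for V ~= W @ H on one client's
    term-document count matrix V (documents x vocabulary)."""
    for _ in range(n_iter):
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update doc-topic weights
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update topic-word matrix
    return W, H

def federated_nmf(client_counts, n_topics=20, n_rounds=10, seed=0):
    """Illustrative FedAvg-style loop (not the paper's exact algorithm):
    clients refine a shared topic-word matrix H locally; the server
    averages the local copies, weighted by each client's document count."""
    rng = np.random.default_rng(seed)
    n_vocab = client_counts[0].shape[1]
    H_global = rng.random((n_topics, n_vocab))
    for _ in range(n_rounds):
        local_Hs, sizes = [], []
        for V in client_counts:
            W = rng.random((V.shape[0], n_topics))  # fresh per-client weights
            _, H_local = nmf_local_update(V, W, H_global.copy())
            local_Hs.append(H_local)
            sizes.append(V.shape[0])
        weights = np.asarray(sizes, dtype=float) / sum(sizes)
        H_global = sum(w * H for w, H in zip(weights, local_Hs))
    return H_global

# Toy usage: three clients with synthetic sparse count matrices.
clients = [np.random.default_rng(i).poisson(0.3, (40, 200)).astype(float)
           for i in range(3)]
H = federated_nmf(clients, n_topics=5, n_rounds=3)
print(H.shape)  # (5, 200): topic-word matrix shared across clients
```

In FedNMF+MI, each client would additionally add a regularizer to its local objective that keeps the topic weight vectors W informative about the raw count features V; since the abstract does not name a specific MI estimator, that term is left out of this sketch.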
