An Intelligent CNN-VAE Text Representation Technology Based on Text Semantics for Comprehensive Big Data

Paper Title

An Intelligent CNN-VAE Text Representation Technology Based on Text Semantics for Comprehensive Big Data

Authors

Liu, Genggeng; Guo, Canyang; Xie, Lin; Liu, Wenxi; Xiong, Naixue; Chen, Guolong

Abstract

In the era of big data, the large volume of text data generated on the Internet has given rise to a variety of text representation methods. In natural language processing (NLP), text representation transforms text into vectors that a computer can process without losing the original semantic information. However, existing methods struggle to extract semantic features among words effectively and to distinguish polysemy in language. Therefore, a text feature representation model based on a convolutional neural network (CNN) and a variational autoencoder (VAE) is proposed to extract text features, and the resulting text feature representation is applied to text classification tasks. The CNN extracts features from the text vectors to capture the semantics among words, and the VAE is introduced to make the text feature space more consistent with a Gaussian distribution. In addition, the output of an improved word2vec model is used as the input of the proposed model to distinguish different meanings of the same word in different contexts. The experimental results show that the proposed model achieves superior performance with k-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM) classification algorithms.
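
The two components named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the single-filter convolution, the ReLU activation, and the max-over-time pooling are all assumptions chosen for illustration. The first function shows how a convolution over consecutive word vectors can capture local semantics among words; the other two show the standard VAE ingredients (the KL term toward a standard Gaussian and the reparameterized sample) that push a feature space toward a Gaussian distribution.

```python
import math
import random


def conv1d_max_pool(word_vectors, kernel, bias=0.0):
    """Slide one filter over windows of consecutive word vectors,
    apply ReLU, then keep the largest activation (max-over-time pooling).

    word_vectors: list of n vectors of length d (one per word)
    kernel:       list of k vectors of length d (a window of k words)
    Assumes the sentence has at least k words.
    """
    k = len(kernel)
    activations = []
    for start in range(len(word_vectors) - k + 1):
        total = bias
        for offset in range(k):
            window_word = word_vectors[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], window_word))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I),
    summed over latent dimensions. Minimizing it pulls the latent
    features toward a standard Gaussian."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


if __name__ == "__main__":
    # Toy 2-dimensional "word vectors" for a three-word sentence (hypothetical values).
    sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter spanning two words
    print(conv1d_max_pool(sentence, kernel))

    rng = random.Random(0)
    print(kl_to_standard_normal([1.0], [0.0]))
    print(reparameterize([0.0, 0.0], [0.0, 0.0], rng))
```

In a full model the encoder would produce `mu` and `log_var` from many such convolutional features, and the KL term would be added to a reconstruction loss during training; those parts are omitted here.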
