Paper Title
Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation
Paper Authors
Paper Abstract
We propose a new character-based text classification framework for non-alphabetic languages, such as Chinese and Japanese. Our framework consists of a variational character encoder (VCE) and a character-level text classifier. The VCE is composed of a $β$-variational auto-encoder ($β$-VAE) that learns the proposed glyph-aware disentangled character embedding (GDCE). Because GDCE provides zero-mean, unit-variance character embeddings whose dimensions are independent, it is well suited to our interpretable data augmentation method, semantic sub-character augmentation (SSA). We evaluated the framework on Japanese text classification tasks at the document and sentence levels. We confirmed that GDCE and SSA not only provide embedding interpretability but also improve classification performance. Our proposal achieves results competitive with the state-of-the-art model while also providing model interpretability. Our code is available at https://github.com/IyatomiLab/GDCE-SSA
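The abstract attributes the zero-mean, unit-variance, dimensionally independent embeddings to the $β$-VAE. As a minimal sketch of why that follows, the standard $β$-VAE objective adds a $β$-weighted KL divergence between the encoder's diagonal Gaussian posterior and a standard normal prior to the reconstruction error. The function names, the scalar `recon_error` argument, and the default `beta=4.0` are illustrative assumptions, not taken from the paper or its repository.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


def beta_vae_loss(recon_error, mu, log_var, beta=4.0):
    """Loss to minimize: reconstruction error plus the beta-weighted KL term.

    beta > 1 pulls each latent dimension harder toward N(0, 1), which is the
    pressure toward zero-mean, unit-variance, factorized codes.
    NOTE: beta=4.0 is an assumed placeholder, not the paper's setting.
    """
    return recon_error + beta * kl_to_standard_normal(mu, log_var)
```

The KL term vanishes only when every latent dimension has mean 0 and variance 1, so a larger `beta` trades reconstruction quality for embeddings closer to that prior.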