Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation

Paper Title

Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation

Authors

Takumi Aoki, Shunsuke Kitada, Hitoshi Iyatomi

Abstract

We propose a new character-based text classification framework for non-alphabetic languages such as Chinese and Japanese. The framework consists of a variational character encoder (VCE) and a character-level text classifier. The VCE is composed of a $β$-variational auto-encoder ($β$-VAE) that learns the proposed glyph-aware disentangled character embedding (GDCE). Because GDCE provides zero-mean, unit-variance character embeddings whose dimensions are independent, it is well suited to our interpretable data augmentation method, semantic sub-character augmentation (SSA). We evaluated the framework on Japanese text classification tasks at the document and sentence levels. We confirmed that GDCE and SSA not only make the embeddings interpretable but also improve classification performance. Our proposal achieved results competitive with the state-of-the-art model while also providing model interpretability. Our code is available at https://github.com/IyatomiLab/GDCE-SSA
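For readers unfamiliar with the $β$-VAE mentioned in the abstract, the following is a minimal sketch of the standard $β$-VAE objective: a reconstruction term plus a $β$-weighted KL divergence that pulls each latent dimension toward a zero-mean, unit-variance Gaussian. It is not taken from the authors' repository. The function names, the squared-error reconstruction term, the flattened-vector input, and the default `beta=4.0` are all assumptions made for illustration.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Per-example beta-VAE loss: reconstruction error + beta * KL term.

    x, x_recon : arrays of shape (batch, features), e.g. flattened glyph images
                 (assumed input format, not confirmed by the abstract).
    mu, log_var: encoder outputs of shape (batch, latent_dim).
    beta       : weight on the KL term; the value here is hypothetical.
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # squared error is an assumed choice
    return recon + beta * kl_to_standard_normal(mu, log_var)


# Example: one input, a 2-dimensional latent code, perfect reconstruction.
x = np.zeros((1, 4))
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))
loss = beta_vae_loss(x, x, mu, log_var, beta=2.0)
```

A larger `beta` penalizes deviation from the standard normal prior more strongly, which is the usual mechanism by which a $β$-VAE encourages independent latent dimensions.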

