Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction

Paper Title

Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction

Authors

Wenniger, Gideon Maillette de Buy; van Dongen, Thomas; Aedmaa, Eleri; Kruitbosch, Herbert Teun; Valentijn, Edwin A.; Schomaker, Lambert

Abstract

Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To tackle these problems, we propose the use of HANs combined with structure-tags which mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model and a joint textual+visual model, as well as against plain HANs. Compared to plain HANs, accuracy increases on all three domains. On the computation and language domain our new model works best overall, and increases accuracy by 4.7% over the best literature result. We also obtain improvements when introducing the tags for prediction of the number of citations for 88k scientific publications that we compiled from the Allen AI S2ORC dataset. For our HAN-system with structure-tags we reach 28.5% explained variance, an improvement of 1.8% over our reimplementation of the BiLSTM-based model as well as a 1.0% improvement over plain HANs.
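
The idea of marking each sentence with its role in the document can be illustrated with a minimal sketch. The tag tokens (`<TITLE>`, `<ABSTRACT>`, `<BODY>`), the choice to prefix the tag to the sentence, and the function name `tag_sentences` are assumptions made for illustration; the abstract does not specify how the tags are encoded or fed to the HAN.

```python
from typing import Dict, List

# Hypothetical tag tokens: the abstract names the three roles
# (title, abstract, main body text) but not the token format.
TAGS = {"title": "<TITLE>", "abstract": "<ABSTRACT>", "body": "<BODY>"}


def tag_sentences(sections: Dict[str, List[str]]) -> List[str]:
    """Prefix each sentence with a tag naming the section it came from.

    Sections are emitted in the fixed order title, abstract, body;
    a missing section is simply skipped.
    """
    tagged = []
    for role in ("title", "abstract", "body"):
        for sentence in sections.get(role, []):
            tagged.append(f"{TAGS[role]} {sentence}")
    return tagged


# Toy document, already split into sentences per section.
doc = {
    "title": ["Structure-Tags Improve Text Classification"],
    "abstract": [
        "Training RNNs on long texts is hard.",
        "We propose structure-tags.",
    ],
    "body": ["Hierarchical attention networks encode sentences first."],
}

tagged_doc = tag_sentences(doc)
for line in tagged_doc:
    print(line)
```

In a setup like this, the tagged sentences would then be tokenized and passed to the sentence-level encoder, so the model can condition on whether a sentence belongs to the title, the abstract or the body.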

