Paper Title
SkIn: Skimming-Intensive Long-Text Classification Using BERT for Medical Corpus
Paper Authors
Paper Abstract
BERT is a widely used pre-trained model in natural language processing. However, since BERT's computational cost grows quadratically with the text length, it is difficult to apply the BERT model directly to a long-text corpus. In some fields, such as health care, the collected text data can be quite long. Therefore, to apply BERT's pre-trained language knowledge to long texts, this paper proposes the Skimming-Intensive Model (SkIn), which imitates the skimming-intensive reading strategy humans use when reading a long passage. SkIn dynamically selects the critical information in the text, so that the input to the BERT-Base model is significantly shortened, effectively reducing the cost of the classification algorithm. Experiments show that SkIn achieves higher accuracy than the baselines on long-text classification datasets in the medical field, while its time and space requirements grow linearly with the text length, alleviating the time and space overflow problems of basic BERT on long-text data.
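To make the skimming-intensive idea concrete, below is a minimal Python sketch of one way such a two-phase pipeline could look. It is not the paper's actual method: the segment length, the keyword-overlap scoring used in the skim phase, the `top_k` parameter, and the `bert-base-uncased` checkpoint are all illustrative assumptions, since the abstract does not detail the paper's learned selection mechanism. The sketch only shows why the approach scales: the cheap skim pass is linear in the number of segments, and the quadratic attention cost is paid only on the shortened selection.

```python
# A minimal sketch of a skim-then-read pipeline, under the assumptions
# stated above. The scoring function is a hypothetical stand-in for the
# paper's dynamic selection mechanism.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # assumption: any BERT-Base checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def split_into_segments(text: str, seg_len: int = 64) -> list[str]:
    """Skim phase, step 1: cut the long text into short segments."""
    words = text.split()
    return [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]

def skim_scores(segments: list[str], keywords: list[str]) -> list[int]:
    """Skim phase, step 2 (hypothetical): score each segment cheaply.
    Keyword overlap stands in for the paper's learned selection; the
    cost is linear in the number of segments."""
    return [sum(k in seg.lower() for k in keywords) for seg in segments]

def classify_long_text(text: str, keywords: list[str], top_k: int = 4) -> int:
    """Intensive phase: keep only the top-k segments and run BERT-Base
    on the shortened input, so the quadratic attention cost applies to
    a bounded length instead of the full document."""
    segments = split_into_segments(text)
    scores = skim_scores(segments, keywords)
    top = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)[:top_k]
    selected = " ".join(segments[i] for i in sorted(top))  # keep original order
    inputs = tokenizer(selected, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

Because only `top_k` short segments ever reach the BERT-Base encoder, doubling the document length roughly doubles the skim cost but leaves the intensive-read cost unchanged, which matches the linear time and space growth claimed in the abstract.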