Title
Sectioning of Biomedical Abstracts: A Sequence of Sequence Classification Task
Authors
Abstract
The rapid growth of the biomedical literature has driven many advances in biomedical text mining. Among the vast amount of available information, the abstracts of biomedical articles are an easily accessible source. However, the number of structured abstracts, whose rhetorical sections are labeled with one of the Background, Objective, Method, Result, and Conclusion categories, remains small. Improvements in the sequential sentence classification task can expedite the exploration of valuable information in biomedical abstracts. Deep learning based models perform well and show great potential on this task. However, they can often be overly complex and overfit to specific data. In this project, we study a state-of-the-art deep learning model, which we call the SSN-4 model. We investigate different components of the SSN-4 model to study the trade-off between performance and complexity. We explore how well this model generalizes to a new dataset beyond the Randomized Controlled Trials (RCT) dataset. We address the question of whether word embeddings can be adjusted to the task to improve performance. Furthermore, we develop a second model that addresses the confusion pairs in the first model. Results show that the SSN-4 model does not appear to generalize well beyond the RCT dataset.