Paper Title
Can Embeddings Adequately Represent Medical Terminology? New Large-Scale Medical Term Similarity Datasets Have the Answer!
Paper Authors
Abstract
A large number of embeddings trained on medical data have emerged, but it remains unclear how well they represent medical terminology, in particular whether the close relationship of semantically similar medical terms is encoded in these embeddings. To date, only small datasets for testing medical term similarity are available, which does not allow conclusions to be drawn about how well embeddings generalise to the enormous number of medical terms used by doctors. We present multiple automatically created large-scale medical term similarity datasets and confirm their high quality in an annotation study with doctors. We evaluate state-of-the-art word and contextual embeddings on our new datasets, comparing multiple vector similarity metrics and word vector aggregation techniques. Our results show that current embeddings are limited in their ability to adequately encode medical terms. The novel datasets thus form a challenging new benchmark for the development of medical embeddings able to accurately represent the whole medical terminology.
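The abstract mentions comparing vector similarity metrics and word vector aggregation techniques, but it does not say which ones the paper uses. As a rough illustration only, the sketch below shows one common way to score the similarity of multi-word terms: aggregate the word vectors of each term (mean, sum, or element-wise max) and compare the results with cosine similarity. The 3-dimensional vectors, the example terms, and the function names `aggregate`, `cosine_similarity`, and `term_similarity` are all made up for this example and are not taken from the paper or its datasets.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented purely for illustration.
# Real medical embeddings would have hundreds of dimensions.
TOY_VECTORS = {
    "heart": np.array([0.9, 0.1, 0.0]),
    "attack": np.array([0.2, 0.8, 0.1]),
    "myocardial": np.array([0.8, 0.2, 0.1]),
    "infarction": np.array([0.3, 0.7, 0.2]),
    "bone": np.array([0.0, 0.1, 0.9]),
    "fracture": np.array([0.1, 0.2, 0.8]),
}


def aggregate(term, vectors, how="mean"):
    """Combine the word vectors of a (possibly multi-word) term into one vector."""
    stacked = np.stack([vectors[word] for word in term.lower().split()])
    if how == "mean":
        return stacked.mean(axis=0)
    if how == "sum":
        return stacked.sum(axis=0)
    if how == "max":
        return stacked.max(axis=0)  # element-wise maximum over the words
    raise ValueError(f"unknown aggregation: {how}")


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def term_similarity(term_a, term_b, vectors, how="mean"):
    """Similarity of two terms: aggregate each term, then compare with cosine."""
    return cosine_similarity(
        aggregate(term_a, vectors, how), aggregate(term_b, vectors, how)
    )


if __name__ == "__main__":
    for other in ("myocardial infarction", "bone fracture"):
        score = term_similarity("heart attack", other, TOY_VECTORS)
        print(f"heart attack vs {other}: {score:.3f}")
```

With these toy vectors, the synonym pair scores higher than the unrelated pair by construction. An evaluation on a term similarity dataset would ask whether real embeddings show the same pattern across many thousands of term pairs. Note that mean and sum aggregation give identical cosine scores, since cosine similarity ignores vector length.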