Paper Title
Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations
Authors
Abstract
Term clustering is important in biomedical knowledge graph construction, and similarities between term embeddings are helpful for it. State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonym and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings place terms belonging to the same concept close together. However, our probing experiments show that they are not sensitive to minor textual differences, which causes biomedical term clustering to fail. To alleviate this problem, we adjust the sampling strategy used in pretraining term embeddings by providing dynamic hard positive and negative samples during contrastive learning. This yields fine-grained representations, which in turn result in better biomedical term clustering. We name our proposed method CODER++, and it has been applied to clustering biomedical concepts in the newly released biomedical knowledge graph named BIOS.
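The abstract does not spell out how hard samples are selected, so the following is only an illustrative sketch of one common reading, not the CODER++ implementation. It assumes that a hard positive is the least similar term sharing a concept, that hard negatives are the most similar terms from other concepts, and that "dynamic" means the selection is recomputed from the current embeddings as training proceeds. The function name `mine_hard_samples`, the parameter `k`, and the toy data are hypothetical.

```python
import numpy as np

def mine_hard_samples(embeddings, concept_ids, k=1):
    """Select hard positives and hard negatives by cosine similarity.

    Assumption (not taken from the paper): a hard positive is the least
    similar same-concept term, and hard negatives are the k most similar
    terms from other concepts. Assumes every term has at least k negatives.
    """
    # Normalize rows so that dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T

    same = concept_ids[:, None] == concept_ids[None, :]
    not_self = ~np.eye(len(concept_ids), dtype=bool)

    # Hard positive: same concept, lowest similarity, excluding the term itself.
    pos_sim = np.where(same & not_self, sim, np.inf)
    has_pos = np.isfinite(pos_sim).any(axis=1)
    hard_pos = np.where(has_pos, pos_sim.argmin(axis=1), -1)  # -1: no synonym

    # Hard negatives: different concept, highest similarity first.
    neg_sim = np.where(~same, sim, -np.inf)
    hard_neg = np.argsort(-neg_sim, axis=1)[:, :k]
    return hard_pos, hard_neg

# Toy data: terms 0 and 1 share concept 0, terms 2 and 3 share concept 1.
toy_embeddings = np.array([[1.0, 0.0], [0.6, 0.8], [0.9, 0.1], [0.0, 1.0]])
toy_concepts = np.array([0, 0, 1, 1])
hard_pos, hard_neg = mine_hard_samples(toy_embeddings, toy_concepts, k=1)
```

In a training loop under these assumptions, the function would be called again every few steps with freshly encoded terms, so that the selected samples track the current state of the encoder rather than a fixed, precomputed list.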