Paper Title
Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content
Paper Authors
Paper Abstract
Online health communities (OHCs) are the primary channel through which laypeople share health information. A critical challenge in analyzing the health consumer-generated content (HCGC) from OHCs is identifying the colloquial medical expressions that laypeople use. The open-access and collaborative consumer health vocabulary (OAC CHV) is a controlled vocabulary designed to address this challenge. However, OAC CHV is available only in English, which limits its applicability to other languages. This research proposes a cross-lingual automatic term recognition framework that extends the English CHV into a cross-lingual one. The framework takes an English HCGC corpus and a non-English HCGC corpus (Chinese in this study) as inputs. Two monolingual word vector spaces are learned with the skip-gram algorithm, so that each space encodes the common word associations of laypeople within one language. Based on the isometry assumption, the framework aligns the two monolingual spaces into a single bilingual word vector space, in which cosine similarity serves as the metric for identifying semantically similar words across languages. The experimental results demonstrate that the framework outperforms two large language models in identifying CHV across languages. The framework requires only raw HCGC corpora and a limited set of medical translations, which reduces the human effort needed to compile a cross-lingual CHV.
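The alignment and retrieval steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state which alignment method is used; the sketch assumes orthogonal Procrustes, a common choice under the isometry assumption, and replaces real skip-gram embeddings with random toy vectors. All function and variable names (`learn_orthogonal_map`, `nearest_targets`, `english_vecs`, `chinese_vecs`, `seed_idx`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def learn_orthogonal_map(src_seed, tgt_seed):
    """Orthogonal Procrustes: find orthogonal W minimizing ||src_seed @ W - tgt_seed||_F.

    Assumption: this is one standard way to align spaces under the isometry
    assumption; the abstract does not confirm it is the paper's method.
    """
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt


def unit_rows(m):
    """Scale every row to unit length so dot products equal cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def nearest_targets(src_vecs, tgt_vecs, w, k=1):
    """Map source vectors with w, then return indices of the k most
    cosine-similar target vectors for each source row."""
    sims = unit_rows(src_vecs @ w) @ unit_rows(tgt_vecs).T
    return np.argsort(-sims, axis=1)[:, :k]


# Toy data standing in for two monolingual skip-gram spaces (hypothetical).
rng = np.random.default_rng(0)
english_vecs = rng.normal(size=(50, 8))

# Build a "Chinese" space that is an exact rotation of the English one,
# i.e. a case where the isometry assumption holds perfectly.
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
chinese_vecs = english_vecs @ true_rotation

# The limited set of known medical translations acts as the seed dictionary:
# here, row i in one space translates to row i in the other.
seed_idx = np.arange(20)
w = learn_orthogonal_map(english_vecs[seed_idx], chinese_vecs[seed_idx])

# Retrieve the closest Chinese vector for every English vector.
matches = nearest_targets(english_vecs, chinese_vecs, w, k=1)
```

In this toy setting the two spaces differ only by a rotation, so the map learned from the 20 seed pairs also recovers the correct match for the 30 unseen words. Real HCGC embeddings are only approximately isometric, so retrieval there would be noisier.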