Paper Title

DrugDBEmbed: Semantic Queries on Relational Database using Supervised Column Encodings

Authors

Bortik Bandyopadhyay, Pranav Maneriker, Vedang Patel, Saumya Yashmohini Sahai, Ping Zhang, Srinivasan Parthasarathy

Abstract

Traditional relational databases contain a great deal of latent semantic information that has largely remained untapped because it is difficult to extract automatically. Recent work has proposed unsupervised machine learning approaches that extract such hidden information by textifying the database columns and then projecting the text tokens onto a fixed-dimensional semantic vector space. However, some databases provide task-specific class labels, which unsupervised approaches cannot leverage in a principled manner. In addition, when embeddings are generated at the level of individual tokens, the encoding of a multi-token text column must be computed, for any given row, by averaging the vectors of the tokens present in that column. Such an averaging approach may not produce the best semantic vector representation of a multi-token text column, as has been observed when encoding paragraphs or documents in natural language processing. With these shortcomings in mind, we propose a supervised machine learning approach that uses a Bi-LSTM-based sequence encoder to directly generate column encodings for multi-token text columns of the DrugBank database, which contains gold-standard drug-drug interaction (DDI) labels. Our text-data-driven encoding approach achieves very high accuracy on the supervised DDI prediction task for some columns, and we use those supervised column encodings to simulate and evaluate Analogy SQL queries on relational data to demonstrate the efficacy of our technique.
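To make the token-averaging baseline described in the abstract concrete, the following is a minimal sketch, not the authors' code. The toy two-dimensional vectors, the `TOY_VECTORS` table, and the function name `mean_column_encoding` are all assumptions invented for illustration. The sketch shows one reason averaging can be a weak representation of a multi-token cell: it ignores token order.

```python
from typing import Dict, List, Sequence

# Toy 2-dimensional token vectors, invented purely for illustration.
# Real token embeddings would be learned and have many more dimensions.
TOY_VECTORS: Dict[str, List[float]] = {
    "inhibits": [1.0, 0.0],
    "enzyme": [0.0, 1.0],
    "binds": [1.0, 1.0],
}


def mean_column_encoding(
    tokens: Sequence[str], vectors: Dict[str, List[float]]
) -> List[float]:
    """Average the vectors of the known tokens in one multi-token cell."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens to average")
    dim = len(known[0])
    # Component-wise mean over all known token vectors.
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Two cells with the same tokens in a different order.
a = mean_column_encoding("inhibits enzyme".split(), TOY_VECTORS)
b = mean_column_encoding("enzyme inhibits".split(), TOY_VECTORS)
print(a, b, a == b)
```

Because the mean is order-invariant, both cells receive the same encoding. A sequence encoder such as the Bi-LSTM proposed in the abstract reads the tokens in order and can, in principle, distinguish them.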
