论文标题
回答具有完美隐私的基因组数据的计数查询
Answering Count Queries for Genomic Data with Perfect Privacy
论文作者
论文摘要
在本文中,我们考虑了回答基因组数据计数查询的问题,这些查询受到完美的隐私约束。计数查询通常用于从生物医学数据库(DBS)中收集骨料(人口范围)信息的应用中进行分析,例如全基因组关联研究。我们的目标是设计用于回答以下形式的计数查询的机制:\ textIt {数据库中有多少用户在其基因组中的某些位置具有特定的基因型?}同时,我们旨在实现预先确定的分支机构位置的敏感基因型的完美隐私(零信息泄漏)。敏感的基因型可能表明罕见的疾病和/或其他人可能希望保持私人的健康特征。我们为上述问题提供了局部和中央计数机制,这些机制可为敏感的基因型获得完美的信息理论隐私,同时最大程度地减少查询答案的预期绝对误差(或根据设置)的预期绝对误差(或每个用户误差概率)。我们还得出了满足完美隐私的任意查询机制的误差概率的下限。我们表明,我们的机制达到了接近下限的误差,并在某些特殊情况下匹配下限。我们从数值上表明,每种机制的性能取决于数据先前分布,查询和敏感基因型之间的相交以及基因组数据序列中相关性的强度。
In this paper, we consider the problem of answering count queries for genomic data subject to perfect privacy constraints. Count queries are often used in applications that collect aggregate (population-wide) information from biomedical Databases (DBs) for analysis, such as Genome-wide association studies. Our goal is to design mechanisms for answering count queries of the following form: \textit{How many users in the database have a specific set of genotypes at certain locations in their genome?} At the same time, we aim to achieve perfect privacy (zero information leakage) of the sensitive genotypes at a pre-specified set of secret locations. The sensitive genotypes could indicate rare diseases and/or other health traits one may want to keep private. We present both local and central count-query mechanisms for the above problem that achieves perfect information-theoretic privacy for sensitive genotypes while minimizing the expected absolute error (or per-user error probability, depending on the setting) of the query answer. We also derived a lower bound of the per-user probability of error for an arbitrary query-answering mechanism that satisfies perfect privacy. We show that our mechanisms achieve error close to the lower bound, and match the lower bound for some special cases. We numerically show that the performance of each mechanism depends on the data prior distribution, the intersection between the queried and sensitive genotypes, and the strength of the correlation in the genomic data sequence.