Paper title

In Search of Lost Utility: Private Location Data

Authors

Szilvia Lestyán, Gergely Ács, Gergely Biczók

Abstract

The unavailability of training data is a permanent source of much frustration in research, especially when it is due to privacy concerns. This is particularly true for location data, since previous techniques all suffer from the inherent sparsity and high dimensionality of location trajectories, which render most of them impractical and result in unrealistic traces and unscalable methods. Moreover, the time information of location visits is usually dropped, or its resolution is drastically reduced. In this paper, we present a novel technique for privately releasing a composite generative model and whole high-dimensional location datasets with detailed time information. To generate high-fidelity synthetic data, we leverage several peculiarities of vehicular mobility, such as its language-like characteristics ("you should know a location by the company it keeps") and how humans plan their trips from one point to another. We model the generator distribution of the dataset by first constructing a variational autoencoder to generate the source and destination locations and the corresponding timing of trajectories. Next, we compute transition probabilities between locations with a feedforward network and build a transition graph from the output of this model, which approximates the distribution of all paths between the source and destination (at a given time). Finally, a path is sampled from this distribution with a Markov Chain Monte Carlo method. The generated synthetic dataset is highly realistic and provides good utility while remaining provably private, and the approach is scalable. We evaluate our model against two state-of-the-art methods on three real-life datasets, demonstrating the benefits of our approach.
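
The final step described in the abstract, sampling a path between a source and a destination from a transition graph, could be sketched roughly as follows. This is not the authors' implementation. The toy transition table `P`, the function names, and the choice of an independence Metropolis-Hastings sampler with a uniform random-walk proposal are all assumptions made for illustration; the abstract only states that a Markov Chain Monte Carlo method is used. The sketch also omits the variational autoencoder, the feedforward network, and any privacy mechanism.

```python
import math
import random

# Hypothetical toy transition probabilities P[u][v]. In the pipeline described
# in the abstract these would be derived from a feedforward network's output;
# the numbers here are made up for illustration.
P = {
    "A": {"B": 0.7, "C": 0.3},
    "B": {"C": 0.4, "D": 0.6},
    "C": {"B": 0.2, "D": 0.8},
    "D": {},
}

def propose(src, dst, rng, max_hops=10):
    """Uniform random walk from src, retried until it hits dst within max_hops.

    Assumes dst is reachable from src; otherwise this loop never ends.
    """
    while True:
        path = [src]
        while path[-1] != dst and len(path) <= max_hops and P[path[-1]]:
            path.append(rng.choice(sorted(P[path[-1]])))
        if path[-1] == dst:
            return path

def log_target(path):
    """Unnormalized log-probability of a path: sum of log transition probabilities."""
    return sum(math.log(P[u][v]) for u, v in zip(path, path[1:]))

def log_proposal(path):
    """Log-probability of the walk itself (uniform choice among neighbors).

    The retry loop in propose() only rescales this by a constant, which
    cancels in the acceptance ratio.
    """
    return -sum(math.log(len(P[u])) for u in path[:-1])

def sample_path(src, dst, steps=200, seed=0):
    """Independence Metropolis-Hastings over src-to-dst paths of bounded length."""
    rng = random.Random(seed)
    cur = propose(src, dst, rng)
    for _ in range(steps):
        cand = propose(src, dst, rng)
        log_alpha = (log_target(cand) - log_target(cur)) + (
            log_proposal(cur) - log_proposal(cand)
        )
        # Accept the candidate with probability min(1, exp(log_alpha)).
        if rng.random() < math.exp(min(0.0, log_alpha)):
            cur = cand
    return cur

if __name__ == "__main__":
    print(sample_path("A", "D"))
```

Because each proposal is drawn independently of the current path, the acceptance ratio has to correct for the proposal probability as well as the target probability. After enough steps the returned path is approximately distributed in proportion to the product of its transition probabilities, restricted to paths of at most `max_hops` hops.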
