Paper Title

Scalable Data Classification for Security and Privacy

Authors

Paulo Tanaka, Sameet Sapra, Nikolay Laptev

Abstract

Content-based data classification is an open challenge. Traditional Data Loss Prevention (DLP)-like systems solve this problem by fingerprinting the data in question and monitoring endpoints for the fingerprinted data. With a large number of constantly changing data assets in Facebook, this approach is neither scalable nor effective in discovering what data is where. This paper describes an end-to-end system built to detect sensitive semantic types within Facebook at scale and enforce data retention and access controls automatically. The approach described here is our first end-to-end privacy system that attempts to solve this problem by incorporating data signals, machine learning, and traditional fingerprinting techniques to map out and classify all data within Facebook. The described system is in production, achieving an average F2 score of 0.9+ across various privacy classes while handling a large number of data assets across dozens of data stores.
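
For readers unfamiliar with the F2 score mentioned in the abstract: it is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below shows the standard formula only. The function name `f_beta` and the sample precision and recall values are illustrative assumptions, not figures or code from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 favors recall over precision.

    Hypothetical helper for illustration, not the paper's evaluation code.
    """
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when the classifier gets nothing right.
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (assumed, not reported in the paper).
score = f_beta(precision=0.80, recall=0.95)
print(f"F2 = {score:.3f}")
```

Because beta = 2 emphasizes recall, a classifier that misses few sensitive items can reach a high F2 score even with moderate precision, which fits a setting where undetected sensitive data is the costlier error.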
