Paper Title
Privacy-Preserving Clustering of Unstructured Big Data for Cloud-Based Enterprise Search Solutions
Paper Authors
Abstract
Cloud-based enterprise search services (e.g., Amazon Kendra) appeal to big data owners because they provide convenient search solutions over enterprise big datasets. However, individuals and businesses that deal with confidential big data (e.g., credential documents) are reluctant to fully embrace such services, owing to valid concerns about data privacy. Solutions based on client-side encryption have been explored to mitigate these concerns. Nonetheless, such solutions hinder data processing, specifically clustering, which is pivotal in dealing with different forms of big data. For instance, clustering is critical for limiting the search space and performing real-time search operations on big datasets. To overcome the obstacles to clustering encrypted big data, we propose privacy-preserving clustering schemes for three forms of unstructured encrypted big datasets, namely static, semi-dynamic, and dynamic datasets. To preserve data privacy, the proposed clustering schemes operate on statistical characteristics of the data and determine (a) a suitable number of clusters and (b) the appropriate content for each cluster. Experimental results obtained by evaluating the clustering schemes on three different datasets demonstrate a 30% to 60% improvement in cluster coherency compared to other clustering schemes for encrypted data. Employing the clustering schemes in a privacy-preserving enterprise search system decreases its search time by up to 78%, while increasing search accuracy by up to 35%.
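To make the idea of clustering on statistical characteristics alone more concrete, the following is a minimal illustrative sketch and not the schemes proposed in the paper. It assumes that tokens are deterministically encrypted on the client (simulated here with an HMAC), that the server sees only opaque tokens and the documents they occur in, and that tokens are grouped by the Jaccard similarity of their document sets. The function names, the greedy center selection, and the 0.5 threshold are all hypothetical choices made for illustration. The number of clusters is not fixed in advance; it emerges from how many tokens fail to match an existing center.

```python
import hashlib
import hmac
from collections import defaultdict


def encrypt_token(key: bytes, word: str) -> str:
    """Deterministic keyed hash, standing in for client-side token encryption (assumption)."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, docs: dict) -> dict:
    """Client side: map each encrypted token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[encrypt_token(key, word)].add(doc_id)
    return dict(index)


def jaccard(a: set, b: set) -> float:
    """Overlap of two non-empty document sets, in [0, 1]."""
    return len(a & b) / len(a | b)


def cluster_tokens(index: dict, threshold: float = 0.5) -> dict:
    """Server side: group opaque tokens using only document co-occurrence statistics.

    Tokens are visited from most to least widespread. A token joins the most
    similar existing center if the similarity reaches the threshold; otherwise
    it becomes a new center, so the cluster count is driven by the data.
    """
    ordered = sorted(index, key=lambda t: (-len(index[t]), t))
    centers = []
    clusters = {}
    for token in ordered:
        best, best_sim = None, 0.0
        for center in centers:
            sim = jaccard(index[token], index[center])
            if sim > best_sim:
                best, best_sim = center, sim
        if best is not None and best_sim >= threshold:
            clusters[best].append(token)
        else:
            centers.append(token)
            clusters[token] = [token]
    return clusters


def find_cluster(clusters: dict, token: str) -> list:
    """Return the members of the cluster holding the token, or an empty list."""
    for members in clusters.values():
        if token in members:
            return members
    return []


if __name__ == "__main__":
    key = b"client-secret-key"  # hypothetical key; never leaves the client
    docs = {
        "d1": "cloud storage",
        "d2": "cloud storage",
        "d3": "tax invoice",
        "d4": "tax invoice",
    }
    index = build_index(key, docs)
    clusters = cluster_tokens(index)
    # A query is encrypted on the client; the server only inspects one cluster.
    members = find_cluster(clusters, encrypt_token(key, "cloud"))
    print(len(clusters), "clusters;", len(members), "tokens in the query's cluster")
```

In this sketch the server never sees plaintext words, yet a query token narrows the search to a single cluster instead of the whole index, which is the search-space reduction the abstract refers to. A deterministic keyed hash still leaks token frequency and co-occurrence patterns, so this is a teaching aid under the stated assumptions rather than a security design.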