Paper title
A machine learning approach for detecting CNAME cloaking-based tracking on the Web
Paper authors
Paper abstract
Various in-browser privacy protection techniques have been designed to protect end users from third-party tracking. In an arms race against these countermeasures, tracking providers developed a new technique, called CNAME cloaking-based tracking, to avoid issues with browsers that block third-party cookies and requests. To detect this tracking technique, browser extensions require an on-demand DNS lookup API. However, this feature is supported only by the Firefox browser. In this paper, we propose a supervised machine learning-based method to detect CNAME cloaking-based tracking without on-demand DNS lookup. Our goal is to detect both sites and requests linked to CNAME cloaking-related tracking. We crawl a list of target sites and store all HTTP/HTTPS requests with their attributes. We then label all instances automatically by looking up the CNAME record of each subdomain and applying wildcard matching based on well-known tracking filter lists. After extracting features, we build a supervised classification model to distinguish sites and requests related to CNAME cloaking-based tracking. Our evaluation shows that the proposed approach outperforms well-known tracking filter lists, with F1 scores of 0.790 for sites and 0.885 for requests. By analyzing the feature permutation importance, we demonstrate that the number of scripts and the proportion of XMLHttpRequests are discriminative for detecting sites, and that the length of the request URL is helpful in detecting requests. Finally, we analyze concept drift by training a model on the 2018 dataset and obtain reasonable performance on the 2020 dataset for detecting both sites and requests that use CNAME cloaking-based tracking.
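
The automatic labeling step described in the abstract (look up the CNAME record of a subdomain, then apply wildcard matching against tracking filter lists) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The filter patterns, the pre-resolved CNAME table, and the helper names `registrable_suffix` and `is_cloaked_request` are all hypothetical, and the "last two labels" rule is a crude stand-in for a proper Public Suffix List lookup.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical filter-list patterns (illustrative, not from a real list).
TRACKER_PATTERNS = ["*.tracker.example", "*.analytics.example"]

# Hypothetical pre-resolved CNAME records: subdomain -> canonical name.
# A real crawler would obtain these from DNS lookups at labeling time.
CNAME_RECORDS = {
    "metrics.shop.example": "shop.tracker.example",
    "static.shop.example": "cdn.shop.example",
}


def registrable_suffix(host, labels=2):
    """Assumption: approximate the site by the last `labels` DNS labels."""
    return ".".join(host.split(".")[-labels:])


def is_cloaked_request(page_url, request_url):
    """Label a request as CNAME-cloaked tracking under the assumptions above."""
    page_host = urlsplit(page_url).hostname or ""
    req_host = urlsplit(request_url).hostname or ""

    # Cloaking hides behind a first-party subdomain, so the request
    # must look same-site from the browser's point of view.
    if registrable_suffix(req_host) != registrable_suffix(page_host):
        return False

    cname = CNAME_RECORDS.get(req_host)
    if cname is None:
        return False  # no CNAME record known for this subdomain

    # The canonical name must point outside the visited site...
    if registrable_suffix(cname) == registrable_suffix(page_host):
        return False

    # ...and match a tracker pattern via wildcard matching.
    return any(fnmatchcase(cname, pattern) for pattern in TRACKER_PATTERNS)


if __name__ == "__main__":
    page = "https://www.shop.example/"
    for url in (
        "https://metrics.shop.example/collect?id=1",
        "https://static.shop.example/app.js",
        "https://ads.tracker.example/p.js",
    ):
        print(url, is_cloaked_request(page, url))
```

In this sketch, an ordinary third-party request to a tracker domain is deliberately not labeled as cloaked, since conventional filter lists already cover that case; only same-site subdomains whose canonical name resolves to a listed tracker are flagged.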