Paper title
Look Before You Leap: Detecting Phishing Web Pages by Exploiting Raw URL And HTML Characteristics
Authors
Abstract
Phishing websites distribute unsolicited content and are frequently used to commit email and internet fraud, so detecting them before a user submits any information is critical. Several efforts have been made in recent years to detect such websites. Most existing approaches train classification models on hand-crafted lexical and statistical features extracted from a website's textual content. These approaches face two main challenges: 1) extracting hand-crafted features is tedious and requires specialized domain knowledge to determine which features are useful for a particular platform; and 2) models built on hand-crafted features have difficulty capturing the semantic patterns in the words and characters of URL and HTML content. To address these challenges, this paper proposes WebPhish, an end-to-end deep neural network trained on embedded raw URLs and HTML content to detect website phishing attacks. First, the model uses an embedding technique to map the characters of the URL and the HTML to corresponding dense vectors. Then, a concatenation layer merges the URL and HTML embedding matrices. Finally, convolutional layers model the semantic dependencies in the merged representation. Extensive experiments on real-world phishing data yielded an accuracy of 98.1%, showing that WebPhish outperforms baseline detection approaches in identifying phishing pages.
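The pipeline described in the abstract (character embedding, concatenation, convolution) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. The following are all assumptions made for illustration: the printable-ASCII vocabulary, the sequence lengths, the embedding size, the separate embedding tables for URL and HTML, concatenation along the sequence axis, the single convolutional layer with global max pooling, the sigmoid output, and every function name. Weights are random and untrained, so the score carries no meaning; the sketch only shows how data would flow through such a model.

```python
import math
import random

# Assumption: printable-ASCII vocabulary; id 0 is reserved for padding/unknown.
VOCAB = {chr(c): i + 1 for i, c in enumerate(range(32, 127))}


def encode(text, max_len):
    """Map characters to integer ids, then truncate or zero-pad to max_len."""
    ids = [VOCAB.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(embed_dim=8, n_filters=4, width=3, seed=0):
    """Hypothetical hyperparameters; none of them come from the paper."""
    rng = random.Random(seed)
    vocab_size = len(VOCAB) + 1
    return {
        "url_table": make_matrix(vocab_size, embed_dim, rng),
        "html_table": make_matrix(vocab_size, embed_dim, rng),
        "kernels": [make_matrix(width, embed_dim, rng) for _ in range(n_filters)],
        "conv_bias": [0.0] * n_filters,
        "out_w": [rng.uniform(-0.1, 0.1) for _ in range(n_filters)],
        "out_b": 0.0,
    }


def embed(ids, table):
    """Embedding lookup: one dense vector per character id."""
    return [table[i] for i in ids]


def conv1d_relu(seq, kernels, bias):
    """Valid 1-D convolution over the sequence axis, followed by ReLU."""
    width = len(kernels[0])
    out = []
    for t in range(len(seq) - width + 1):
        row = []
        for f, kernel in enumerate(kernels):
            s = bias[f]
            for k in range(width):
                s += sum(w * x for w, x in zip(kernel[k], seq[t + k]))
            row.append(max(0.0, s))
        out.append(row)
    return out


def global_max_pool(feature_map):
    """Keep the strongest response of each filter across all positions."""
    return [max(column) for column in zip(*feature_map)]


def phishing_score(url, html, params, url_len=32, html_len=64):
    """Forward pass: embed both inputs, concatenate, convolve, pool, classify."""
    url_emb = embed(encode(url, url_len), params["url_table"])
    html_emb = embed(encode(html, html_len), params["html_table"])
    merged = url_emb + html_emb  # assumption: concatenate along the sequence axis
    features = conv1d_relu(merged, params["kernels"], params["conv_bias"])
    pooled = global_max_pool(features)
    logit = sum(w * x for w, x in zip(params["out_w"], pooled)) + params["out_b"]
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    params = init_params()
    page = "<html><body><form action='/login'></form></body></html>"
    print(phishing_score("http://example.com/login", page, params))
```

In practice the parameters would be learned end to end from labeled pages using a deep learning framework. The plain-Python lists here only make the tensor shapes explicit: the merged input has `url_len + html_len` rows of `embed_dim` values each.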