VeriDark: A Large-Scale Benchmark for Authorship Verification on the Dark Web

Paper title

VeriDark: A Large-Scale Benchmark for Authorship Verification on the Dark Web

Paper authors

Andrei Manolache, Florin Brad, Antonio Barbalau, Radu Tudor Ionescu, Marius Popescu

Paper abstract

The Dark Web is a hotbed of illicit activity, where users communicate on different market forums to exchange goods and services. Law enforcement agencies benefit from forensic tools that perform authorship analysis in order to identify and profile users based on their textual content. However, authorship analysis has traditionally been studied using corpora of literary texts, such as fragments from novels or fan fiction, which may not be suitable in a cybercrime context. Moreover, the few works that apply authorship analysis tools to cybercrime prevention usually rely on ad-hoc experimental setups and datasets. To address these issues, we release VeriDark: a benchmark comprising three large-scale authorship verification datasets and one authorship identification dataset, obtained from user activity on either Dark Web related Reddit communities or popular illicit Dark Web market forums. We evaluate competitive NLP baselines on the three datasets and analyze the predictions to better understand the limitations of such approaches. We make the datasets and baselines publicly available at https://github.com/bit-ml/VeriDark.

Paper title