Paper title

Customs Import Declaration Datasets

Authors

Jeong, Chaeyoon; Kim, Sundong; Park, Jaewoo; Choi, Yeonsoo

Abstract

Given the huge volume of cross-border flows, effective and efficient control of trade has become more crucial in protecting people and society from illicit trade. However, the limited accessibility of transaction-level trade datasets hinders the progress of open research, and many customs administrations have not benefited from recent progress in data-based risk management. In this paper, we introduce an import declaration dataset to facilitate collaboration between domain experts in customs administrations and researchers from diverse domains, such as data science and machine learning. The dataset contains 54,000 artificially generated trades with 22 key attributes, synthesized with a conditional tabular GAN while maintaining correlated features. Synthetic data has several advantages. First, releasing the dataset is free from the restrictions that prohibit disclosing the original import data, and the fabrication step minimizes the identity risk that may exist in trade statistics. Second, the published data follow a distribution similar to that of the source data, so they can be used in various downstream tasks. Hence, our dataset can be used as a benchmark for testing the performance of any classification algorithm. Along with the data and its generation process, we release baseline code for fraud detection tasks and show empirically that more advanced algorithms detect fraud better.
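
The abstract states that the published data follow a distribution similar to that of the source data. As a minimal sketch of how a reader might check such a claim for a single categorical attribute, the snippet below computes the total variation distance between two empirical distributions. This is an illustrative assumption, not the authors' evaluation method; the function name `total_variation_distance` and the toy column values are hypothetical and do not come from the dataset.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two empirical categorical
    distributions: 0.0 means identical, 1.0 means no shared categories."""
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    # Counter returns 0 for categories missing from one of the samples.
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values)
            - synth_counts[c] / len(synthetic_values))
        for c in categories
    )


# Hypothetical toy values for one categorical attribute (not real data).
real_column = ["A", "A", "B", "C"]
synthetic_column = ["A", "B", "B", "C"]
tvd = total_variation_distance(real_column, synthetic_column)
print(f"total variation distance: {tvd:.2f}")
```

A value near 0 suggests the synthetic column reproduces the marginal distribution of the source column. It says nothing about correlations between attributes, which would need a separate check.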
