Paper Title


Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Paper Authors

Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik

Abstract


Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on the Web Data Commons Training Dataset and Gold Standard for Large-Scale Product Matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases the results were even better. Additionally, for research purposes we prepared a new dataset entirely in Polish, based on offers in selected categories obtained from several online stores. It is the first open dataset for product matching tasks in Polish, which enables comparison of the effectiveness of pre-trained models. Thus, we also report the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish dataset.
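The abstract notes that product features are multimodal, non-homogeneous, and often incomplete, and that the Transformer models consume textual features for pair classification. As a minimal sketch (not code from the paper; field names and the formatting scheme are illustrative assumptions), heterogeneous offer records might be serialized into the text pair a cross-encoder such as mBERT or XLM-RoBERTa classifies as match/no-match:

```python
from typing import Dict, Optional, Tuple

# Illustrative attribute set; real offers may carry different fields.
FIELDS = ["title", "brand", "category", "description"]

def serialize_offer(offer: Dict[str, Optional[str]]) -> str:
    # Skip missing or empty attributes so incomplete records still
    # produce a usable input string.
    parts = [f"{name}: {offer[name]}" for name in FIELDS if offer.get(name)]
    return "; ".join(parts)

def make_pair(offer_a: Dict[str, Optional[str]],
              offer_b: Dict[str, Optional[str]]) -> Tuple[str, str]:
    # The two serialized texts would be passed together to the model's
    # tokenizer (e.g. tokenizer(text_a, text_b, truncation=True)), which
    # inserts the model-specific separator tokens itself.
    return serialize_offer(offer_a), serialize_offer(offer_b)

# Example: an English and a Polish offer for the same product.
a = {"title": "Acme Kettle 1.7L", "brand": "Acme", "description": None}
b = {"title": "Czajnik Acme 1,7 l", "brand": "Acme", "category": "AGD"}
text_a, text_b = make_pair(a, b)
print(text_a)  # title: Acme Kettle 1.7L; brand: Acme
```

Feeding both serialized offers to one multilingual encoder is what lets a single fine-tuned model handle cross-lingual pairs like the example above.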
