Paper Title
Homepage2Vec: Language-Agnostic Website Embedding and Classification
Paper Authors
Paper Abstract
Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than two million category-labeled websites in 92 languages collected from Curlie, the largest multilingual human-edited Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and generates embedding-based representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources. We make publicly available the curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec model, and libraries.
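The abstract reports a macro-averaged F1-score, the standard metric for multi-label classification tasks like assigning the 14 Curlie categories to a website. The sketch below, using hypothetical labels (not data from the paper), shows how this metric is computed: an F1-score per class, then an unweighted average across classes.

```python
# Illustrative computation of the macro-averaged F1-score used to
# evaluate multi-label website classification. The category names and
# label sets below are hypothetical, not from the Curlie dataset.

def f1_per_class(y_true, y_pred, cls):
    # Treat class `cls` as a binary problem over all samples.
    tp = sum(1 for t, p in zip(y_true, y_pred) if cls in t and cls in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if cls not in t and cls in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if cls in t and cls not in p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes):
    # Macro-averaging weights every class equally, so performance on
    # rare categories counts as much as on frequent ones.
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical gold and predicted label sets for four websites.
gold = [{"News"}, {"Sports", "News"}, {"Arts"}, {"Sports"}]
pred = [{"News"}, {"Sports"}, {"Arts", "News"}, {"Sports"}]
print(round(macro_f1(gold, pred, ["News", "Sports", "Arts"]), 3))  # → 0.833
```

Because each class contributes equally to the average, a model must perform well on low-resource languages and rare categories to reach a high macro-F1, which is why the paper's 0.90 figure implies stable cross-lingual performance.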