Paper Title
Homepage2Vec: Language-Agnostic Website Embedding and Classification
Paper Authors
Paper Abstract
Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than two million category-labeled websites in 92 languages collected from Curlie, the largest multilingual human-edited Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and generates embedding-based representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources. We make publicly available the curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec model, and libraries.
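The abstract reports a macro-averaged F1-score, the standard metric for multi-label classification tasks like assigning the 14 Curlie categories to a website. The sketch below, using hypothetical labels (not data from the paper), shows how this metric is computed: an F1-score per class, then an unweighted average across classes.

```python
# Illustrative computation of the macro-averaged F1-score used to
# evaluate multi-label website classification. The category names and
# label sets below are hypothetical, not from the Curlie dataset.

def f1_per_class(y_true, y_pred, cls):
    # Treat class `cls` as a binary problem over all samples.
    tp = sum(1 for t, p in zip(y_true, y_pred) if cls in t and cls in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if cls not in t and cls in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if cls in t and cls not in p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes):
    # Macro-averaging weights every class equally, so performance on
    # rare categories counts as much as on frequent ones.
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical gold and predicted label sets for four websites.
gold = [{"News"}, {"Sports", "News"}, {"Arts"}, {"Sports"}]
pred = [{"News"}, {"Sports"}, {"Arts", "News"}, {"Sports"}]
print(round(macro_f1(gold, pred, ["News", "Sports", "Arts"]), 3))  # → 0.833
```

Because each class contributes equally to the average, a model must perform well on low-resource languages and rare categories to reach a high macro-F1, which is why the paper's 0.90 figure implies stable cross-lingual performance.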