Paper title
Extraction of Product Specifications from the Web -- Going Beyond Tables and Lists
Paper authors
Paper abstract
E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications such as product catalogue curation, search, and question answering. However, different websites render these blocks with a wide variety of HTML elements (such as <table>, <ul>, <div>, <span>, and <dl>), which makes their automatic extraction a challenge. Most current research has focused on extracting product specifications from tables and lists and therefore suffers from low recall when applied to a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables and lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded features and deep-learned spatial and token features, we first identify the specification blocks on a product page. We then extract the product attribute-value pairs from these blocks following an approach inspired by wrapper induction. We created a labeled dataset of product specifications extracted from 14,111 diverse specification blocks taken from a range of different product websites. Our experiments show the efficacy of our approach compared to current specification extraction models and support our claim about its applicability to large-scale product specification extraction.
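To make the markup-diversity problem concrete, the following is a minimal sketch of a rule-based attribute-value extractor built on Python's standard `html.parser`. It is not the method described in the abstract: the class `NaivePairExtractor`, the helper `extract_pairs`, the tag sets, and the sample snippets are all hypothetical and chosen only for illustration. It handles four hand-picked rendering patterns and silently misses anything else, which is the kind of recall loss the abstract attributes to table- and list-only extractors.

```python
from html.parser import HTMLParser

# Assumed tag sets for this illustration only; real pages use many more patterns.
CELL_TAGS = {"td", "th", "dt", "dd", "span"}  # elements holding one name or one value
ROW_TAGS = {"tr", "li", "div"}                # elements grouping one name-value pair


class NaivePairExtractor(HTMLParser):
    """Collect (attribute, value) pairs from a few common markup patterns."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self.cells = []       # finished cell texts waiting to be paired
        self.cell_buf = None  # text of the cell being read, or None outside a cell
        self.row_buf = []     # text found outside any cell (e.g. "name: value" in <li>)

    def handle_starttag(self, tag, attrs):
        if tag in ROW_TAGS:
            # A new row starts: forget any unpaired cells and stray text.
            self.cells, self.row_buf = [], []
        elif tag in CELL_TAGS:
            self.cell_buf = []

    def handle_data(self, data):
        if self.cell_buf is not None:
            self.cell_buf.append(data)
        else:
            self.row_buf.append(data)

    def handle_endtag(self, tag):
        if tag in CELL_TAGS and self.cell_buf is not None:
            self.cells.append("".join(self.cell_buf).strip())
            self.cell_buf = None
            # Two consecutive cells are treated as one (name, value) pair.
            if len(self.cells) == 2:
                self.pairs.append(tuple(self.cells))
                self.cells = []
        elif tag in ROW_TAGS:
            # Fallback for rows without cells: split "name: value" on the first colon.
            text = "".join(self.row_buf).strip()
            if ":" in text:
                name, value = text.split(":", 1)
                self.pairs.append((name.strip(), value.strip()))
            self.row_buf = []


def extract_pairs(html):
    """Return the list of (attribute, value) pairs found in an HTML fragment."""
    parser = NaivePairExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


# The same specification rendered five different ways (made-up snippets).
SAMPLES = {
    "table": "<table><tr><th>Weight</th><td>1.2 kg</td></tr></table>",
    "dl": "<dl><dt>Weight</dt><dd>1.2 kg</dd></dl>",
    "ul": "<ul><li>Weight: 1.2 kg</li></ul>",
    "div/span": "<div><span>Weight</span><span>1.2 kg</span></div>",
    "p/b": "<p><b>Weight</b> 1.2 kg</p>",  # not covered by the rules above
}

if __name__ == "__main__":
    for name, html in SAMPLES.items():
        print(name, extract_pairs(html))
```

The first four snippets yield the same pair, while the last one yields nothing because no rule anticipates it. Every new rendering pattern needs another hand-written rule, which is the motivation for an approach that generalizes across HTML elements instead of enumerating them.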