Paper Title
Understanding HTML with Large Language Models
Paper Authors
Paper Abstract
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.
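As an illustration of how the semantic-classification task described in the abstract can be framed as a text-to-text problem, here is a minimal sketch using the Hugging Face `transformers` library and a public `t5-base` checkpoint. The prompt format, checkpoint, and label vocabulary are assumptions for illustration only, not the authors' released code or fine-tuned models.

```python
# Illustrative sketch: framing HTML element classification as text-to-text
# generation with a T5 model. The prompt template and labels are hypothetical;
# the paper's actual fine-tuning setup may differ.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Raw HTML snippet containing the element to classify (marked here with a
# hypothetical "target" attribute). The model would be fine-tuned to emit a
# category label such as "username", "password", or "submit button".
html_snippet = (
    '<form><label for="user">Email</label>'
    '<input id="user" type="text" target></form>'
)
prompt = f"classify html element: {html_snippet}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```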