Paper Title
Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs
Paper Authors
Paper Abstract
Pre-trained Transformers are good foundations for unified multi-task models owing to their task-agnostic representations. Pre-trained Transformers are often combined with a text-to-text framework so that a single model can execute multiple tasks. Performing tasks through a graphical user interface (GUI) is another candidate for accommodating various tasks, including multi-step tasks with vision and language inputs. However, few papers have combined pre-trained Transformers with task execution through GUIs. To fill this gap, we explore a framework in which a model performs a task over multiple steps by manipulating a GUI implemented as web pages. We develop task pages with and without page transitions and propose a BERT extension for the framework. We jointly trained our BERT extension on those task pages and made the following observations. (1) The model learned to use task pages both with and without page transitions. (2) In four out of five tasks without page transitions, the model achieved more than 75% of the performance of the original BERT, which does not use a browser. (3) The model did not generalize effectively to unseen tasks. These results suggest that BERTs can be fine-tuned for multi-step tasks through GUIs and that there is room for improvement in their generalizability. Code will be available online.
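To make the framework in the abstract concrete, the sketch below shows a generic multi-step interaction loop in which an agent observes a task page, emits a GUI action, and repeats until the task terminates. This is a minimal illustration only, not the paper's implementation: the class and method names (TaskPage, VisionLanguageAgent, Action, observe, apply, predict) are hypothetical stand-ins for a browser-backed environment and the proposed BERT extension.

```python
# Hedged sketch of a multi-step GUI-interaction loop like the one described in the
# abstract. All names here are illustrative assumptions, not the authors' API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """A single GUI action the model can emit (kinds are illustrative)."""
    kind: str                    # e.g. "click", "type", "submit"
    target: str                  # identifier of an element on the task page
    text: Optional[str] = None   # text to enter for "type" actions


class TaskPage:
    """Placeholder for a web task page; a real setup would wrap a browser."""

    def observe(self) -> dict:
        # Return the current page state (screenshot / text features); stubbed here.
        return {"screenshot": None, "tokens": []}

    def apply(self, action: Action) -> bool:
        # Apply the action and report whether the task has terminated; stubbed here.
        return True


class VisionLanguageAgent:
    """Stand-in for a vision-and-language model that maps observations to actions."""

    def predict(self, observation: dict) -> Action:
        # A real agent would encode the screenshot and text, then decode an action.
        return Action(kind="click", target="submit-button")


def run_episode(agent: VisionLanguageAgent, page: TaskPage, max_steps: int = 10) -> int:
    """Interact with the task page for up to max_steps steps; return steps taken."""
    for step in range(1, max_steps + 1):
        action = agent.predict(page.observe())
        if page.apply(action):
            return step
    return max_steps


if __name__ == "__main__":
    steps_taken = run_episode(VisionLanguageAgent(), TaskPage())
    print(f"Episode finished in {steps_taken} step(s).")
```

Tasks without page transitions would terminate within a single page, while tasks with transitions would require the loop to carry state across several pages; the sketch treats both cases uniformly through the termination flag.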