Paper Title
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
Paper Authors
Paper Abstract
As mobile devices are becoming ubiquitous, regularly interacting with a variety of user interfaces (UIs) is a common aspect of daily life for many people. To improve the accessibility of these devices and to enable their usage in a variety of settings, building models that can assist users and accomplish tasks through the UI is vitally important. However, there are several challenges to achieving this. First, UI components of similar appearance can have different functionalities, making it important to understand their function rather than merely analyzing their appearance. Second, domain-specific features like the Document Object Model (DOM) in web pages and the View Hierarchy (VH) in mobile applications provide important signals about the semantics of UI elements, but these features are not in a natural language format. Third, owing to the large diversity of UIs and the absence of standard DOM or VH representations, building a UI understanding model with high coverage requires large amounts of training data. Inspired by the success of pre-training-based approaches in NLP for tackling a variety of problems in a data-efficient way, we introduce a new pre-trained UI representation model called ActionBert. Our methodology is designed to leverage visual, linguistic, and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components. Our key intuition is that user actions, e.g., a sequence of clicks on different UI components, reveal important information about their functionality. We evaluate the proposed model on a wide variety of downstream tasks, ranging from icon classification to UI component retrieval based on a natural language description. Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
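To make the multi-modal setup concrete, below is a minimal sketch of how a BERT-style encoder could consume an interaction trace: each clicked UI element is represented by concatenated visual, text, and DOM/VH features, and the click sequence is then contextualized with a Transformer. All class names, feature dimensions, and the fusion scheme are illustrative assumptions; the abstract does not specify ActionBert's actual architecture or pre-training objectives.

```python
# Hypothetical sketch of an ActionBert-style trace encoder (PyTorch).
# Names, dimensions, and the fusion scheme are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

class UITraceEncoder(nn.Module):
    """Fuses per-element visual, text, and DOM/VH features, then encodes
    the sequence of clicked elements with a BERT-style Transformer."""
    def __init__(self, d_visual=512, d_text=300, d_dom=128,
                 d_model=256, n_heads=4, n_layers=4, max_trace_len=32):
        super().__init__()
        # Project the concatenated multi-modal features to a shared space.
        self.fuse = nn.Linear(d_visual + d_text + d_dom, d_model)
        # Learned position embeddings encode the order of user actions.
        self.pos = nn.Embedding(max_trace_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visual, text, dom):
        # visual: (B, T, d_visual), text: (B, T, d_text), dom: (B, T, d_dom),
        # where T is the number of clicked UI elements in the trace.
        x = self.fuse(torch.cat([visual, text, dom], dim=-1))
        positions = torch.arange(x.size(1), device=x.device)
        x = x + self.pos(positions)
        # Returns (B, T, d_model): a contextual embedding per UI element,
        # usable for downstream tasks such as icon classification or
        # natural-language-based component retrieval.
        return self.encoder(x)
```

Under this reading, the click order itself carries the supervision signal: because the Transformer attends across the whole trace, an element's representation reflects not just its appearance and text but also where it occurs in realistic user workflows, which is the paper's stated intuition.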