Paper Title
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
Paper Authors
Paper Abstract
We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. Existing screen datasets focus either on low-level structural and component understanding, or on much higher-level composite tasks such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86K question-answer pairs over the RICO dataset, we aim to benchmark screen reading comprehension capability, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and the corresponding UI contents with bounding boxes, enabling four subtasks that address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We further demonstrate positive transfer to web applications, highlighting the dataset's potential beyond mobile apps.
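To make the annotation structure described above concrete, here is a minimal sketch of what a single ScreenQA-style record could look like. All field names and values are hypothetical illustrations, not the released schema; the actual dataset format may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    # Pixel coordinates of a UI element on the screenshot (assumed convention).
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class ScreenQARecord:
    # Illustrative only: names are assumptions, not the official schema.
    screenshot_id: str                  # identifier of a RICO screenshot
    question: str                       # natural-language question about the screen
    full_answer: str                    # complete-sentence answer
    short_answer: str                   # short answer phrase
    ui_contents: List[str]              # text of UI elements supporting the answer
    bounding_boxes: List[BoundingBox] = field(default_factory=list)


# A hypothetical example record (contents invented for illustration):
example = ScreenQARecord(
    screenshot_id="12345",
    question="What is the current temperature?",
    full_answer="The current temperature is 72 degrees Fahrenheit.",
    short_answer="72°F",
    ui_contents=["72°F"],
    bounding_boxes=[BoundingBox(left=40, top=120, right=160, bottom=180)],
)
```

The four annotation components (full answer, short answer phrase, UI contents, bounding boxes) map naturally onto the four subtasks the abstract mentions, since each component can serve as a distinct prediction target.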