Google Research has introduced ScreenAI, a vision-language model for understanding and interacting with user interfaces (UIs) and infographics. ScreenAI aims to improve upon existing models such as PaLI and pix2struct by training on a unique mixture of datasets and tasks.
To create a diverse pre-training dataset, ScreenAI utilizes a combination of techniques, including web crawling, layout annotation with computer vision models, and data generation using large language models like PaLM 2.
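The paper does not ship this pipeline as code, but the core idea can be sketched: UI elements detected by the annotation models are serialized into a textual screen schema, which is then embedded in a prompt asking an LLM to invent question-answer pairs for that screen. The `UIElement` structure, schema format, and prompt wording below are illustrative assumptions, not ScreenAI's actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UIElement:
    """A UI element produced by layout/OCR annotators (hypothetical structure)."""
    kind: str                        # e.g. "BUTTON", "TEXT", "IMAGE"
    text: str                        # OCR'd or alt text, empty if none
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1), normalized coordinates

def screen_schema(elements: List[UIElement]) -> str:
    """Serialize detected elements into a compact textual screen schema."""
    lines = []
    for e in elements:
        x0, y0, x1, y1 = e.bbox
        label = f'{e.kind} "{e.text}"' if e.text else e.kind
        lines.append(f"{label} {x0} {y0} {x1} {y1}")
    return "\n".join(lines)

def qa_generation_prompt(schema: str) -> str:
    """Build a prompt asking an LLM (e.g. PaLM 2) to invent QA pairs for the screen."""
    return (
        "You are given a textual description of a screenshot.\n"
        "Generate diverse question-answer pairs a user might ask about it.\n\n"
        f"Screen:\n{schema}\n\nQ/A pairs:"
    )

if __name__ == "__main__":
    elements = [
        UIElement("TEXT", "Weather in Zurich", (40, 20, 600, 60)),
        UIElement("TEXT", "18°C, partly cloudy", (40, 70, 500, 110)),
        UIElement("BUTTON", "Refresh", (700, 20, 950, 70)),
    ]
    # The resulting prompt would be sent to an LLM; its responses become
    # synthetic training data for question answering, summarization, etc.
    print(qa_generation_prompt(screen_schema(elements)))
```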
ScreenAI's architecture is based on PaLI, consisting of a multimodal encoder and an autoregressive decoder. The model is trained in two stages: a pre-training stage that uses self-supervised learning on the automatically generated data, followed by a fine-tuning stage on human-labeled data.
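For intuition, here is a minimal, hypothetical PyTorch sketch of such an encoder-decoder: screenshot patches and input-text tokens are embedded, concatenated, and passed through a shared multimodal encoder, while an autoregressive decoder cross-attends to the result to generate the output text. All dimensions, layer counts, and the convolutional patchifier are assumptions for illustration; they are not the released ScreenAI configuration, which also adopts pix2struct-style flexible patching.

```python
import torch
import torch.nn as nn

class ScreenAISketch(nn.Module):
    """Minimal PaLI-style encoder-decoder sketch (not the released model)."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 enc_layers=6, dec_layers=6, patch=16):
        super().__init__()
        # Patchify the screenshot with a strided convolution (stand-in for a ViT).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=enc_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=dec_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, input_ids, target_ids):
        # image: (B, 3, H, W); input_ids/target_ids: (B, T) token ids.
        # Positional embeddings are omitted for brevity.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, P, D)
        text = self.tok_embed(input_ids)                               # (B, T, D)
        memory = self.encoder(torch.cat([patches, text], dim=1))       # multimodal encoder
        tgt = self.tok_embed(target_ids)
        t = tgt.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)               # autoregressive decoder
        return self.lm_head(out)                                       # next-token logits

model = ScreenAISketch()
logits = model(torch.randn(1, 3, 224, 224),
               torch.randint(0, 32000, (1, 12)),
               torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

In both training stages the objective is next-token prediction over the target text, so the same sketch covers pre-training on generated data and fine-tuning on human-labeled tasks.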
ScreenAI is fine-tuned and evaluated on a range of public benchmarks covering UI, infographic, and document understanding, including ChartQA, DocVQA, InfographicVQA, WebSRC, and MoTIF.
Additionally, three new benchmarks are released alongside the model: Screen Annotation, for evaluating layout understanding, plus ScreenQA Short and Complex ScreenQA for screen-based question answering.
ScreenAI achieves state-of-the-art results on UI and infographic-based tasks like WebSRC and MoTIF. It also demonstrates best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size.
Experiments show that increasing the model size improves performance across all tasks, and improvements have not saturated even at the largest size of 5B parameters.
ScreenAI presents a unified representation for understanding UIs and infographics, leveraging data from both domains and applying self-supervised learning techniques. While the model performs competitively on various benchmarks, the authors acknowledge that further research is needed to bridge the gap with larger models.