Summary of ScreenAI: A visual language model for UI and visually-situated language understanding

Source: blog.research.google

A Vision-Language Model for Screen Understanding

Google Research has introduced ScreenAI, a vision-language model for understanding and interacting with user interfaces (UIs) and infographics. ScreenAI aims to improve upon existing models like PaLI and pix2struct by training on a unique mixture of datasets and tasks, much of it generated with the help of large language models.

• UIs and infographics share similar design principles, so a single model can be used to understand both.
• ScreenAI uses a flexible patching strategy, inspired by pix2struct, so it can handle screenshots with very different aspect ratios (a small sketch of the idea follows this list).
    • It is pre-trained using self-supervised learning and fine-tuned on manually labeled data.
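The patching idea can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration of a pix2struct-style flexible grid: it picks a rows x cols patch grid that roughly preserves the screenshot's aspect ratio within a fixed patch budget. The patch size and budget here are illustrative assumptions, not ScreenAI's actual configuration.

```python
import math

def choose_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Pick a rows x cols patch grid that preserves the image's aspect
    ratio while keeping rows * cols within a fixed patch budget.

    Hypothetical sketch of a pix2struct-style flexible patching strategy;
    the real ScreenAI configuration may differ.
    """
    # Scale factor so that (scale*H/patch) * (scale*W/patch) ~= max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(1, math.floor(scale * height / patch_size))
    cols = max(1, math.floor(scale * width / patch_size))
    return rows, cols

# A tall phone screenshot and a wide desktop page get different grids
# from the same patch budget.
print(choose_patch_grid(2400, 1080))   # portrait phone screen -> (47, 21)
print(choose_patch_grid(1080, 1920))   # landscape desktop page -> (24, 42)
```

Because the grid adapts to the input rather than forcing a square resize, the same model can ingest tall mobile screens, wide desktop pages, and infographics without distorting their layout.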

    Data Generation with Large Language Models

    To create a diverse pre-training dataset, ScreenAI utilizes a combination of techniques, including web crawling, layout annotation with computer vision models, and data generation using large language models like PaLM 2.

    • Web pages and mobile apps are crawled to gather a variety of UI screenshots.
    • Layout annotators identify UI elements, icons, text, and spatial relationships.
• Large language models generate synthetic data for tasks like question answering, navigation, and summarization (a prompt sketch follows this list).
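The LLM-based generation step can be sketched as prompting a model with the layout annotator's output and asking for task-specific examples. The template and element format below are assumptions for illustration only, not the actual prompts or schema used for ScreenAI.

```python
# Hypothetical sketch of prompting an LLM to generate synthetic QA pairs
# from a screen annotation; the real ScreenAI prompts are not reproduced here.

QA_PROMPT_TEMPLATE = """You are given a textual description of a screenshot.
Each line lists one UI element as: TYPE "text" (x0, y0, x1, y1).

Screen annotation:
{annotation}

Generate three question-answer pairs a user might ask about this screen.
Answer only from the information above.
Format each pair as: Q: ... | A: ...
"""

def build_qa_prompt(annotation: str) -> str:
    """Fill the template with a layout-annotator output string."""
    return QA_PROMPT_TEMPLATE.format(annotation=annotation)

example_annotation = (
    'BUTTON "Sign in" (820, 40, 940, 90)\n'
    'TEXT "Weather in Paris: 18°C, cloudy" (40, 200, 620, 260)\n'
    'ICON "umbrella" (640, 200, 700, 260)'
)

prompt = build_qa_prompt(example_annotation)
# `prompt` would then be sent to an LLM such as PaLM 2, and the generated
# pairs would be added to the pre-training mixture.
print(prompt)
```

The same annotation can be reused with different instructions to produce navigation or summarization data, which is what makes the pipeline scale to a large, diverse pre-training mixture.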

    Model Architecture and Training

    ScreenAI's architecture is based on PaLI, consisting of a multimodal encoder and an autoregressive decoder. The model is trained in two stages:

    • Pre-training stage: Self-supervised learning on automatically generated labels.
    • Fine-tuning stage: Training on manually labeled data for a range of UI and infographic tasks.
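At a shape level, the encoder-decoder layout described above can be sketched in a few lines of PyTorch. Everything below (layer counts, dimensions, the toy vocabulary) is an illustrative assumption and far smaller than ScreenAI; it only shows how image patches and text tokens are fused by a multimodal encoder and consumed by an autoregressive decoder.

```python
# Toy, shape-level sketch of a PaLI-style encoder-decoder; not ScreenAI's code.
import torch
import torch.nn as nn

class ToyScreenModel(nn.Module):
    def __init__(self, d_model=256, vocab_size=32000, patch_dim=16 * 16 * 3):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)  # ViT-style patch projection
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.multimodal_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, input_ids, target_ids):
        img_tokens = self.vision_encoder(self.patch_embed(patches))
        txt_tokens = self.text_embed(input_ids)
        # Image and text tokens are fused by a single multimodal encoder.
        memory = self.multimodal_encoder(torch.cat([img_tokens, txt_tokens], dim=1))
        # Causal mask so the text decoder generates autoregressively.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(target_ids.size(1))
        out = self.decoder(self.text_embed(target_ids), memory, tgt_mask=tgt_mask)
        return self.lm_head(out)

model = ToyScreenModel()
patches = torch.randn(1, 100, 16 * 16 * 3)  # 100 flattened image patches
logits = model(patches,
               torch.zeros(1, 8, dtype=torch.long),   # input text tokens
               torch.zeros(1, 4, dtype=torch.long))   # target text tokens
print(logits.shape)  # (1, 4, 32000)
```

In pre-training, the target text comes from the automatically generated labels; in fine-tuning, it comes from the manually labeled task data, so the same architecture serves both stages.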

    Experiments and Results

    ScreenAI is fine-tuned and evaluated on various public benchmarks, including:

    • Question answering: ChartQA, DocVQA, InfographicVQA, WebSRC, ScreenQA
• Navigation: Referring Expressions, MoTIF, MUG, Android in the Wild
    • Summarization: Screen2Words, Widget Captioning

    Additionally, three new benchmarks are introduced:

• Screen Annotation: Evaluates layout understanding and spatial capabilities (an illustrative example follows this list).
    • ScreenQA Short: Variation of ScreenQA with shortened ground truth answers.
    • Complex ScreenQA: Harder questions, various aspect ratios, and more complex scenarios.
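To make the first of these concrete, the snippet below sketches what a screen-annotation style input/target pair might look like, with UI element types, text, and bounding boxes serialized as a string. The schema, element names, and coordinate format are assumptions for illustration; the benchmark's real serialization is not reproduced here.

```python
# Purely illustrative example of a screen-annotation style training pair.
# The actual Screen Annotation benchmark format may differ.
screen_annotation_example = {
    # The model receives a screenshot plus an instruction like this:
    "input": "Annotate the UI elements on this screen.",
    # The target is a textual list of elements with bounding boxes:
    "target": 'TEXT "Settings" (24, 48, 180, 88); '
              'BUTTON "Save" (620, 900, 760, 960); '
              'ICON "back arrow" (10, 40, 50, 90)',
}
```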

    Performance and Scaling

    ScreenAI achieves state-of-the-art results on UI and infographic-based tasks like WebSRC and MoTIF. It also demonstrates best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size.

    Experiments show that increasing the model size improves performance across all tasks, and improvements have not saturated even at the largest size of 5B parameters.

    Conclusion and Future Work

    ScreenAI presents a unified representation for understanding UIs and infographics, leveraging data from both domains and applying self-supervised learning techniques. While the model performs competitively on various benchmarks, the authors acknowledge that further research is needed to bridge the gap with larger models.
