Summary of LLaVA-1.6: Improved reasoning, OCR, and world knowledge

  Source: llava-vl.github.io

    Introducing LLaVA-1.6

    LLaVA-1.6 is the latest version of the open-source large multimodal model (LMM) with improved reasoning, OCR, and world knowledge capabilities. It builds upon the success of LLaVA-1.5 and incorporates several enhancements:

    • Higher input image resolution (up to 672x672 pixels) to capture more visual details.
    • Better visual reasoning and OCR performance through an improved visual instruction tuning data mixture.
    • Enhanced visual conversation skills for various applications, including better world knowledge and logical reasoning.
    • Efficient deployment and inference with SGLang (see the client sketch after this list).
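
    The blog highlights SGLang as the serving stack. Below is a minimal client sketch, assuming an SGLang server has already been launched for a LLaVA-1.6 checkpoint; the port, checkpoint name, and file names are illustrative assumptions, not details from the article.

    ```python
    # Minimal SGLang client sketch for visual question answering.
    # Assumes a server was started separately, e.g. (illustrative):
    #   python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --port 30000
    import sglang as sgl

    @sgl.function
    def image_qa(s, image_path, question):
        # Interleave the image and the question in the user turn,
        # then sample the model's answer.
        s += sgl.user(sgl.image(image_path) + question)
        s += sgl.assistant(sgl.gen("answer", max_tokens=256))

    # Point the client at the running endpoint (URL is an assumption).
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    state = image_qa.run(image_path="receipt.png",
                         question="What is the total amount on this receipt?")
    print(state["answer"])
    ```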

    OCR and Visual Reasoning Improvements

    One of the key improvements in LLaVA-1.6 is its enhanced OCR and visual reasoning capabilities. The model can now process higher-resolution images, allowing it to grasp intricate visual details more accurately. Additionally, the visual instruction tuning data mixture has been optimized, leading to better performance on tasks that require OCR and visual reasoning.

    Open-Source Release and Performance

    LLaVA-1.6 maintains the minimalist design and data efficiency of its predecessor, LLaVA-1.5, while achieving state-of-the-art performance compared to other open-source LMMs. Notably, it outperforms commercial models like Gemini Pro on several benchmarks, showcasing its competitive edge in the open-source realm.

    • The model achieves strong results across benchmarks including MMMU, MathVista, MMB-ENG, MMB-CN, MM-Vet, LLaVA-Wild, and SEED-IMG.
    • LLaVA-1.6 demonstrates zero-shot Chinese capability, performing exceptionally well on Chinese multimodal scenarios like MMBench-CN.
    • The training process is highly efficient, requiring only 32 GPUs for approximately one day and utilizing less than 1M visual instruction tuning samples.

    High-Resolution Image Processing

    LLaVA-1.6 employs a dynamic high-resolution scheme designed to accommodate images of various resolutions while preserving data efficiency. The model chooses among the candidate grid configurations {2×2, 1×{2,3,4}, {2,3,4}×1}, balancing performance against operational cost.
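
    To make the grid scheme concrete, the sketch below picks a best-fit grid for an input image, assuming a 336×336 base tile (the resolution of the CLIP-ViT-L-336px encoder LLaVA builds on, so the 2×2 grid yields the 672×672 maximum). The fit heuristic shown here, maximize retained resolution and then minimize padding, is an illustration rather than the exact LLaVA-1.6 implementation.

    ```python
    # Candidate grids from the blog: {2x2, 1x{2,3,4}, {2,3,4}x1},
    # expressed here as (columns, rows) of 336-pixel tiles.
    GRIDS = [(2, 2), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)]
    BASE = 336  # base tile size; a 2x2 grid gives the 672x672 maximum

    def fit(width, height, grid):
        """Scale the image to fit inside a grid and score the fit.

        Returns (useful_pixels, -wasted_pixels) so tuple comparison
        prefers more retained resolution, then less padding.
        """
        gw, gh = grid[0] * BASE, grid[1] * BASE
        scale = min(gw / width, gh / height)
        scaled = int(width * scale) * int(height * scale)
        useful = min(scaled, width * height)  # upscaling adds no detail
        return useful, -(gw * gh - scaled)

    def select_grid(width, height):
        """Pick the candidate grid with the best fit score."""
        return max(GRIDS, key=lambda g: fit(width, height, g))

    print(select_grid(1344, 672))   # wide 2:1 image -> (2, 1)
    print(select_grid(800, 1600))   # tall 1:2 image -> (1, 2)
    ```

    Matching the grid to the image's aspect ratio is what lets a wide receipt or a tall document keep its text legible instead of being squeezed into a single square tile.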

    Data Mixture and Model Scaling

    To further improve the model's capabilities, LLaVA-1.6 incorporates several enhancements in its data mixture and model scaling:

    • High-quality user instruction data from various sources, including existing GPT-4V data and a new 15K visual instruction tuning dataset covering a range of applications.
    • Multimodal document, chart, and OCR data from sources like DocVQA, SynDog-EN, ChartQA, DVQA, and AI2D.
    • Scaling the language model backbone with additional LLMs, including Mistral-7B and Nous-Hermes-2-Yi-34B, to support a wider range of scenarios and users (a loading sketch follows this list).
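
    As a concrete way to try one of these backbones, the sketch below loads the Mistral-7B variant through the Hugging Face Transformers port of LLaVA-1.6 (published there under the name LLaVA-NeXT, transformers >= 4.39). The image URL is a placeholder, and the prompt template follows that port's conventions; neither comes from the article itself.

    ```python
    # Minimal inference sketch with the Transformers LLaVA-NeXT port.
    import requests
    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    # Placeholder image; substitute any chart, document, or photo.
    url = "https://example.com/chart.png"
    image = Image.open(requests.get(url, stream=True).raw)

    # Mistral-style chat prompt with the <image> token the processor expands.
    prompt = "[INST] <image>\nSummarize the key trend in this chart. [/INST]"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))
    ```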

    Qualitative Results and Model Performance

    LLaVA-1.6 showcases impressive qualitative results, demonstrating its capabilities in understanding and reasoning about various types of visual information, including images, text, and charts. The model provides detailed and insightful responses to prompts involving high-resolution images, flight information, and social media posts.

    Detailed model performance metrics are provided, comparing LLaVA-1.6 with other state-of-the-art models across benchmarks including VQAv2, GQA, VizWiz, TextVQA, ScienceQA, MMMU, MathVista, MMB-ENG, MMB-CN, MM-Vet, LLaVA-Wild, SEED-IMG, MME, and POPE.

    Open-Source Commitment and Future Development

    LLaVA-1.6 is an open-source release, with code, data, and models made publicly available to facilitate further development and research in the LMM community. The team behind LLaVA-1.6 expresses their commitment to responsible open-sourcing and continued improvements in OCR, reasoning, and world knowledge capabilities.

    Technical Details and Model Card

    The article details the technical improvements, including the dynamic high-resolution scheme, the data mixture enhancements, and the model scaling choices. A comprehensive model card is also included, covering each LLaVA-1.6 variant's size, input resolution, training data sources and sample counts, and compute requirements.

    Acknowledgments and Citations

    The article acknowledges the support and contributions of various organizations and individuals, including the A16Z Open Source AI Grants Program, NSF CAREER IIS2150012, Microsoft Accelerate Foundation Models Research, and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government. Proper citation information is provided for referencing the work.
