LLaVA-1.6 is the latest version of the open-source large multimodal model (LMM), with improved reasoning, OCR, and world knowledge capabilities. It builds on the success of LLaVA-1.5 and incorporates several enhancements.
One of the key improvements in LLaVA-1.6 is its enhanced OCR and visual reasoning capabilities. The model can now process higher-resolution images, allowing it to grasp intricate visual details more accurately. Additionally, the visual instruction tuning data mixture has been optimized, leading to better performance on tasks that require OCR and visual reasoning.
LLaVA-1.6 maintains the minimalist design and data efficiency of its predecessor, LLaVA-1.5, while achieving state-of-the-art performance compared to other open-source LMMs. Notably, it outperforms commercial models like Gemini Pro on several benchmarks, showcasing its competitive edge in the open-source realm.
LLaVA-1.6 employs a dynamic high-resolution scheme designed to accommodate images of various resolutions while preserving data efficiency. The model selects from a set of grid configurations, {2 x 2, 1 x {2,3,4}, {2,3,4} x 1}, to balance performance against operational cost.
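The grid-selection step can be sketched roughly as follows. This is a hedged illustration, not the official implementation: the base tile size of 336 px (matching a CLIP ViT-L/336 encoder) and the exact selection criterion (maximize effective, non-upscaled resolution; break ties by minimal wasted canvas area) are assumptions modeled on common open-source LLaVA-NeXT code.

```python
import math

# Candidate tile grids (columns, rows), expanding {2x2, 1x{2,3,4}, {2,3,4}x1}.
GRIDS = [(2, 2), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)]

def select_best_grid(width, height, grids=GRIDS, patch=336):
    """Pick the tile grid whose canvas best fits an image (assumed criterion).

    Maximizes the effective (non-upscaled) resolution after fitting the
    image into the grid's canvas, breaking ties by minimal wasted area.
    patch=336 assumes a 336-px vision-encoder input; both are assumptions.
    """
    best, best_eff, best_waste = None, -1, math.inf
    for cols, rows in grids:
        canvas_w, canvas_h = cols * patch, rows * patch
        # Scale the image to fit inside the canvas, preserving aspect ratio.
        scale = min(canvas_w / width, canvas_h / height)
        scaled = int(width * scale) * int(height * scale)
        # Never count upscaled pixels as useful resolution.
        effective = min(scaled, width * height)
        wasted = canvas_w * canvas_h - effective
        if effective > best_eff or (effective == best_eff and wasted < best_waste):
            best, best_eff, best_waste = (cols, rows), effective, wasted
    return best

# A wide 1000x300 screenshot maps to a 3x1 grid; a large square image to 2x2.
print(select_best_grid(1000, 300))  # -> (3, 1)
print(select_best_grid(800, 800))   # -> (2, 2)
```

Under this criterion, wide images get horizontal grids, tall images vertical ones, and large square images the 2 x 2 grid, which is the behavior the grid set above is designed to enable.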
To further improve the model's capabilities, LLaVA-1.6 also incorporates enhancements to its visual instruction tuning data mixture and scales up to larger language model backbones.
LLaVA-1.6 showcases impressive qualitative results, demonstrating its capabilities in understanding and reasoning about various types of visual information, including images, text, and charts. The model provides detailed and insightful responses to prompts involving high-resolution images, flight information, and social media posts.
Detailed model performance metrics are provided, comparing LLaVA-1.6 with other state-of-the-art models across various benchmarks, including VQAv2, GQA, VizWiz, TextVQA, ScienceQA, MMMU, MathVista, MMB-ENG, MMB-CN, MM-Vet, LLaVA-Wild, SEED-IMG, MME, and POPE.
LLaVA-1.6 is an open-source release, with code, data, and models made publicly available to facilitate further development and research in the LMM community. The team behind LLaVA-1.6 expresses their commitment to responsible open-sourcing and continued improvements in OCR, reasoning, and world knowledge capabilities.
The article provides detailed technical improvements, including the dynamic high-resolution scheme, data mixture enhancements, and model scaling techniques. A comprehensive model card is also included, outlining the specifics of the LLaVA-1.6 variants, including model sizes, resolutions, training data, compute requirements, and training data sample counts.
The article acknowledges the support and contributions from various organizations and individuals, including the A16Z Open Source AI Grants Program, NSF CAREER IIS2150012, Microsoft Accelerate Foundation Models Research, and the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government. Proper citation information is provided for referencing the work.