Here's all you need to know about it:
1. It understands both images and text.
2. It handles variable image resolutions, supporting images of arbitrary size.
3. It can process large documents with interleaved text and images (see the sketch after this list).
4. It has a 128k-token context window.
5. And... it is Open Source! The weights are available under the Apache 2.0 license.
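To make point 3 concrete, here is a minimal sketch of what an interleaved text-and-image prompt could look like, assuming the model is served behind an OpenAI-compatible chat endpoint; the base URL, API key, model ID, and image URL are placeholders for illustration, not the product's actual values.

# Minimal sketch (assumptions: the model sits behind an OpenAI-compatible
# chat endpoint; BASE_URL, API key, model ID, and image URL are placeholders).
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint/v1", api_key="YOUR_API_KEY")

# One user message interleaving text and an image, the document-style
# prompting pattern described in the list above.
response = client.chat.completions.create(
    model="YOUR_MODEL_ID",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here is a chart from the report:"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize the trend it shows."},
            ],
        }
    ],
)
print(response.choices[0].message.content)

The same pattern extends to several images and text segments in one message, which is how a long document with figures would be fed in.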
Its performance on multimodal and text benchmarks is the best among comparable multimodal models, whether open source (Phi-3 Vision, LLaVA-OV 7B, Qwen2-VL 7B) or closed (Claude-3 Haiku).
The best part? This 12B open-source model beats closed commercial models of similar size, and it is competitive with much larger closed models such as GPT-4o and Claude-3.5 Sonnet.
See the LinkedIn post by Armand Ruiz, VP of AI Platform at IBM.
Bye for now,
Nick
#watsonx.ai
#GenerativeAI