Cohere For AI, the nonprofit research lab of AI startup Cohere, has launched Aya Vision, a multimodal AI model it calls best-in-class.
Aya Vision can perform tasks such as generating image captions, answering questions about images, translating text, and summarizing content in 23 languages. Cohere is offering the model for free through WhatsApp to make it widely accessible and to support AI research worldwide.
Despite AI’s progress, a significant performance gap still exists across different languages, especially in multimodal applications that combine text and images. Cohere designed Aya Vision to help close this gap.
The model comes in two versions: Aya Vision 32B and Aya Vision 8B. Cohere claims the 32B model outperforms competitors more than twice its size, including Meta's Llama 3.2 90B Vision, on certain visual understanding benchmarks, while the smaller 8B version reportedly beats models ten times its size on specific evaluations.
Cohere has made both models available on Hugging Face under a Creative Commons 4.0 license whose usage restrictions rule out commercial applications.
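For readers who want to try the weights, here is a minimal sketch of loading the smaller model with the Hugging Face transformers library. The repo ID "CohereForAI/aya-vision-8b", the "image-text-to-text" pipeline task, and the example image URL are assumptions for illustration; check the model card for the exact identifier and supported call format.

```python
# Minimal sketch: run Aya Vision 8B via the transformers pipeline API.
# The repo ID and chat-message format below are assumptions, not confirmed
# details from Cohere; the exact signature may vary by transformers version.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="CohereForAI/aya-vision-8b",  # assumed repo ID
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in Spanish."},
        ],
    }
]

# Generate a short multilingual caption/answer for the supplied image.
print(pipe(text=messages, max_new_tokens=128))
```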
To train Aya Vision, Cohere used a diverse pool of English-language datasets, which it translated and used to create synthetic (AI-generated) annotations. These synthetic annotations helped the model learn more efficiently while reducing its reliance on expensive human-labeled data. As real-world data sources become scarce, major AI companies such as OpenAI increasingly rely on synthetic data for training.
Alongside Aya Vision, Cohere introduced AyaVisionBench, a benchmark suite designed to test vision-language capabilities. The suite covers tasks such as spotting differences between two images and converting screenshots into code, and is intended to give a more realistic picture of multilingual, multimodal performance.
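As a rough illustration of how such a suite might be consumed, the sketch below iterates over the benchmark with the Hugging Face datasets library. The dataset ID "CohereForAI/AyaVisionBench", the split name, and the column names ("language", "question") are assumptions for illustration only.

```python
# Minimal sketch: inspect a few AyaVisionBench examples with the datasets library.
# The dataset ID, split, and column names are assumptions; consult the dataset
# card for the real schema before building an evaluation harness around it.
from datasets import load_dataset

bench = load_dataset("CohereForAI/AyaVisionBench", split="test")  # assumed ID/split

for example in bench.select(range(3)):
    # Each example is assumed to pair an image with a language-tagged prompt;
    # a model's answer would be generated here and scored downstream.
    print(example["language"], example["question"][:80])
```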