Alibaba Introduces New AI Models to Challenge Global Tech Giants: Alibaba Group Holding rolled out a fresh set of artificial intelligence models on Tuesday, stepping up the race against both Chinese and international rivals.
At the center of the announcement was Qwen3-Omni, a multimodal model built to take on OpenAI’s GPT-4o and Google’s Gemini 2.5 Flash. (The “Nano Banana” nickname, sometimes attached to Google’s lineup, actually refers to Gemini 2.5 Flash Image, the image-generation sibling of the Flash model.)
Unlike traditional AI tools, Qwen3-Omni can handle text, audio, images, and even video, then respond with either text or voice.
Qwen3-Omni: A Unified Multimodal System
Alibaba described Qwen3-Omni as the first native end-to-end model that “unifies text, images, audio and video in one model.” The team behind the project shared the update on social media, noting that the launch puts Alibaba in direct competition with advanced AI systems already available outside China.
Stronger Performance in Benchmarks
According to its developers, two versions of Qwen3-Omni have already outperformed their predecessor, Qwen2.5-Omni-7B, as well as OpenAI’s GPT-4o and Google’s Gemini 2.5 Flash. The gains were clearest in audio recognition and comprehension, and in understanding of images and videos.
Lin Junyang, a researcher on Alibaba’s Qwen team, credited these upgrades to major work on data projects. “This year, our audio team has spent great efforts on building large-scale audio data sets for both pretraining and post-training,” Lin said. “We have combined everything … to build our Qwen3-Omni.”
Multilingual Capabilities
One of the biggest highlights of Qwen3-Omni is its global reach. The model supports inputs in 119 text languages and can understand 19 spoken languages, including English, Chinese, Japanese, Spanish, Arabic and Urdu.

For outputs, it can generate spoken replies in 10 languages: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese and Korean.
This makes it more than just a chat tool. When paired with devices that have cameras, microphones, and speakers, Qwen3-Omni can see and hear its surroundings, then reply out loud. Alibaba even shared a video demo showing the model in action.
Available to Developers Worldwide
To give developers early access, Alibaba has released three versions of Qwen3-Omni on open-source platforms like Hugging Face and GitHub.
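For developers who want to try the open-weight checkpoints, the usual route is pulling them from Hugging Face. The sketch below is a minimal, hedged example: the repository id follows Qwen's typical naming convention but is an assumption, so the official Qwen organization page on Hugging Face should be checked for the exact model names.

```python
"""Minimal sketch: fetching an open-weight Qwen3-Omni checkpoint from Hugging Face.

The repo id below is an assumption based on Qwen's usual naming scheme;
verify it against the Qwen organization page before use.
"""

# Assumed repository id -- confirm on https://huggingface.co/Qwen
REPO_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"


def download_weights(local_dir: str = "./qwen3-omni") -> str:
    """Download the full model snapshot and return the local path."""
    # Third-party dependency, imported lazily: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_weights()
    print(f"Model files saved to {path}")
```

Note the download is multi-gigabyte; inference on the larger variants typically requires a recent `transformers` release and GPU-class hardware.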

Alongside Qwen3-Omni, the company also launched an updated image-editing model called Qwen-Image-Edit-2509 and a new speech model named Qwen3-TTS-Flash.
The image tool is designed to make edits smoother and more consistent, while the speech model – available only on Alibaba Cloud – can create expressive, humanlike voices that adjust tone based on the text.
Looking Ahead to the Apsara Conference
These releases come just in time for Alibaba Cloud’s annual Apsara Conference, which will run from Wednesday to Friday in Hangzhou, Zhejiang province. With its latest updates, Alibaba is clearly signaling that it wants a bigger seat at the global AI table.