Multimodal LLMs pursuing AGI now
Multimodal large language models have made substantial advances over the past year, and their practical application is moving toward artificial general intelligence, with diverse vertical industrial large models and AI agents emerging, said experts at the 2024 World Artificial Intelligence Conference, which wrapped up on July 6 in Shanghai.
Multimodal LLMs integrate and process diverse types of data — such as text, images, audio and video — to enhance understanding and generate comprehensive responses.
In May, the launch of GPT-4o, the latest LLM developed by OpenAI, caused a global sensation. The new flagship generative AI model features capabilities across text, voice and visuals, making interaction between humans and machines much more natural and seamless, the company said.
Spurred by GPT-4o, Chinese AI companies including Baidu, Alibaba, Tencent, Huawei, SenseTime and Ant Group, as well as emerging companies such as Minimax, Baichuan Intelligence and Zhipu AI, also showcased their LLM updates during the conference.
Chinese AI pioneer SenseTime launched its latest multimodal LLM on July 5. The new model integrates diverse types of data and supports real-time streaming multimodal interaction with users, closely rivaling GPT-4o in interaction quality and on multiple core metrics, the company said.
Chinese financial tech firm Ant Group shared its latest LLM product on the same day.
"The Ant BaiLing Foundation Model has been equipped with native multimodal capabilities. It can directly understand and train on various types of data including audio, video, images and text," said Xu Peng, vice-president of the group, who regards such native multimodal capabilities as the "right path to achieving artificial general intelligence" as they will enable LLMs to interact like humans.
Compared with the previous edition, this year's WAIC showcased remarkable advances in LLMs. The number of LLMs in China exceeds 330, according to official statistics.
The practical industrial application of large models, such as applying vertical large models, AI agents or MaaS (model as a service), was another hot topic at this year's WAIC.
"The creation of large models is only the starting point. Deploying LLMs in industrial scenarios to generate value is the goal," said Wu Yunsheng, vice-president of Tencent Cloud and head of Tencent YouTu Lab.
Tencent Hunyuan, the company's general model, was one of the highlighted exhibits at this year's conference.
Jiang Jie, Tencent's vice-president, said: "In the future, general models will exist as infrastructure — like water, electricity and networks — for on-demand access. More models of different sizes and modalities will appear, and businesses can coordinate with large and small models to meet customized needs while improving performance."
Hu Shiwei, co-founder and president of Chinese AI company 4Paradigm, said it is certain that such large models will become the new "infrastructure" in the future.
"Our industrial large models have seen remarkable results in application. For example, in the financial services sector, AI has improved the accuracy of identifying fraudulent transactions. In the retail sector, personalized services have led to a significant increase in sales," Hu said.
In addition to developing vertical large models, many companies and developers are also using MaaS — a type of cloud-based service that offers users access to machine-learning models to develop AI applications.
Zhipu AI, a Beijing-based startup dubbed one of the four new "AI tigers" of China, has accumulated over 400,000 corporate users.