Google has officially announced advanced voice conversation and generation capabilities for Gemini 2.5, marking a major advance in AI voice technology. The new capabilities support real-time voice conversations in more than 24 languages and provide unprecedented voice control, allowing developers to build richer and more interactive applications. Gemini 2.5 is already integrated into products including NotebookLM's Audio Overviews and Project Astra.
Google said that the native voice conversation function of Gemini 2.5 Flash Preview introduces a number of new features. The system is capable of natural, fluid voice interaction with expressive delivery and prosody, and its very low latency makes conversations feel immediate. Style control lets users adjust the conversation style through natural language prompts, including adopting specific accents, varying tone and expression, and even whispering. Tool integration lets the model use Google Search or developer-defined custom tools during a conversation.
Gemini 2.5 also introduces emotional dialogue capabilities: it can respond to the user's tone of voice, recognizing that the same words can carry very different meanings depending on intonation. The system is trained to recognize and ignore irrelevant audio such as background speech and ambient conversations, and to respond only when appropriate. Audio-video understanding allows Gemini 2.5 to converse with users about streamed audio and video, discussing video content or interacting via screen sharing.
Multilingual support allows users to hold conversations in more than 24 languages, and even to mix multiple languages within the same sentence. However, neither Mandarin nor Cantonese is among the currently supported languages, so further updates are still needed.
Google pointed out that text-to-speech technology is developing rapidly, and that the latest Gemini 2.5 models provide unprecedented control over voice generation. Users can generate a range of content, from short clips to long narratives, with precise control over style, tone, emotional expression and delivery. The system supports dynamic performance, bringing vivid expression to poetry, news broadcasts and engaging storytelling. The models can express specific emotions and produce accents when needed, and can control speaking pace and pronunciation, including the pronunciation of specific words.
Another notable feature of Gemini 2.5 is multi-speaker conversation generation, which can produce NotebookLM-style two-person audio overviews from text input, making content more engaging through dialogue. The system supports multilingual voice content creation in more than 24 languages. Developers can choose Gemini 2.5 Pro Preview for the highest quality on complex prompts, or Gemini 2.5 Flash Preview for cost-effective everyday use. This allows developers to dynamically create voice content for a variety of uses, such as announcements, stories, podcasts and video games.
Gemini 2.5 multilingual real-time voice generation: Google demonstrates voices that are almost indistinguishable from real ones