Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Yemin Shi*, Yu Shu*, Siwei Dong*, Guangyi Liu*, Jaward Sesay, Jingwen Li, Zhiting Hu
Maitrix.org, UC San Diego, MBZUAI

*Equal Contribution

Abstract

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that takes a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, faster than the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation: users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

Voila Conceptual Demo

Highlights


Voila models and code are fully open-sourced on Hugging Face. Also, try out the Voila web demo!


Voila adopts a hierarchical Transformer architecture, comprising streaming audio encoding and tokenization and a multi-scale Transformer that consists of an LLM backbone and a hierarchical audio generator. The models are trained end-to-end on extensive audio-text data.
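The sketch below illustrates the two-scale idea described above: a temporal LLM backbone runs once per (text or audio) token, and a smaller depth Transformer expands each frame state into a stack of residual acoustic codebook tokens. The streaming audio tokenizer that produces the discrete inputs is not shown. All module names, layer counts, and codebook sizes are illustrative assumptions, not the released Voila implementation.

```python
# Minimal two-scale sketch (assumed sizes and names, not the released Voila code).
import torch
import torch.nn as nn


class TemporalBackbone(nn.Module):
    """Coarse scale: an autoregressive Transformer over interleaved text/audio tokens."""

    def __init__(self, vocab=4096, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):  # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        return self.blocks(self.embed(tokens), mask=mask)  # (B, T, d_model) frame states


class DepthTransformer(nn.Module):
    """Fine scale: predicts the residual acoustic codebooks for one frame,
    conditioned on that frame's backbone state."""

    def __init__(self, d_model=512, n_layers=2, n_heads=8,
                 n_codebooks=4, codebook_size=1024):
        super().__init__()
        # prev_codes are assumed to be offset into a shared [0, n_codebooks*codebook_size) range.
        self.code_embed = nn.Embedding(n_codebooks * codebook_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)])

    def forward(self, frame_state, prev_codes):
        # frame_state: (N, 1, d_model); prev_codes: (N, n_codebooks), teacher-forced.
        x = torch.cat([frame_state, self.code_embed(prev_codes)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(x, mask=mask)
        # Head k reads the position just before codebook k is emitted.
        return [head(h[:, k]) for k, head in enumerate(self.heads)]


# Toy forward pass: 2 sequences of 50 frames.
backbone, depth = TemporalBackbone(), DepthTransformer()
tokens = torch.randint(0, 4096, (2, 50))
states = backbone(tokens).reshape(-1, 1, 512)            # one backbone state per frame
prev = torch.randint(0, 4 * 1024, (states.size(0), 4))   # shifted codebook targets
logits = depth(states, prev)                             # 4 tensors, each (100, 1024)
```

The point of the split is efficiency: the large backbone advances once per frame, while the cheap depth Transformer handles the several acoustic codebooks within each frame, keeping per-step latency low.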

Voila Highlights

Demo Examples

Examples demonstrating Voila's capabilities


• AI Debates (the persona prompt format is sketched after these examples)

Pets Debate: Samantha vs. Simpson

Human Input:
Samantha, you are on a heated debate with your friend Simpson on which is better as a pet, a dog or cat. Argue your point now.

Scientific Genius: Sheldon vs. Leonard

Human Input:
You are on a debate with your colleague Sheldon on who is a greater genius, Einstein or Isaac Newton. Please argue your point now.

Morning Beverages: Samantha vs. Leonard

Human Input:
Dr. Leonard, you are on a debate with your colleague Samantha on which is better in the morning, a cup of coffee or a cup of tea.
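In each debate above, the speaker's persona is defined entirely by the plain-text instruction shown as the human input. As a hedged illustration (the helper function, special tokens, and stand-in tokenizer below are hypothetical, not the released Voila interface), the snippet shows how such an instruction might be prepended to the user's discretized audio stream to form a single input sequence.

```python
# Hypothetical illustration of persona-conditioned prompting; names are
# assumptions for exposition, not the released Voila API.
def build_prompt(persona_text, user_audio_tokens, tokenize):
    """Prepend a plain-text persona instruction to the user's audio-token stream."""
    return tokenize(persona_text) + ["<audio>"] + user_audio_tokens + ["</audio>"]

persona = ("Samantha, you are on a heated debate with your friend Simpson on "
           "which is better as a pet, a dog or cat. Argue your point now.")
toy_tokenize = lambda s: s.lower().split()               # stand-in text tokenizer
prompt = build_prompt(persona, ["a_17", "a_203", "a_5"], toy_tokenize)
print(len(prompt), prompt[:6])
```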

• Multiple Voice Styles

Video: From Simpson to Samantha

Shows smooth voice transitions.

Video: Continuous voice switching

Switches between multiple voices during the dialogue.

• Fun Conversations

Chat with Homer Simpson: Avoid eating junk food

The video shows rich emotions in the conversation (timbre, intonation, speaking speed, modal particles).

Chat with Samantha: Jokes and Banter

This video features humorous and light-hearted dialogue.

• Text-to-Speech (TTS)

Elon Musk

I think it's very important to have a feedback loop, where you're constantly thinking about what you've done and how you could be doing it better.

Samantha (Her)

I've fallen in love. I'm an ordinary woman. I didn't think such violent things could happen to ordinary people.

Homer Simpson

I saw weird stuff in that place last night. Weird, strange, sick, twisted, eerie, godless, evil stuff… and I want in.

Sylvester Stallone

Life's not about how hard of a hit you can give... it's about how many you can take, and still keep moving forward.

Mark Zuckerberg

If you just work on stuff that you like and you’re passionate about, you don’t have to have a master plan with how things will play out.

BibTeX

@article{voila2025,
  author        = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title         = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play},
  eprint        = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2025}
}