Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay

Yemin Shi*, Yu Shu*, Siwei Dong*, Guangyi Liu*, Jaward Sesay, Jingwen Li, Zhiting Hu*
Maitrix.org, UC San Diego, MBZUAI

https://voila.maitrix.org     *Equal Contribution

Abstract

Voice AI systems built on traditional pipeline architectures suffer from high latency, loss of vocal nuance, and mechanical, reactive interaction. Such pipelines are adequate for basic voice-driven interfaces, but truly autonomous interaction that resembles natural human communication has remained largely out of reach. We introduce Voila, a family of large audio-language foundation models designed to overcome these limitations. With an end-to-end model design, Voila aims to enable real-time, autonomous, and flexible voice interaction while preserving rich vocal detail. Its hierarchical Transformer architecture combines streaming audio encoding with multi-tier audio generators and achieves a response latency of 195 ms, which is faster than the average human response time. Voila also integrates voice modeling with the language modeling capabilities of LLMs, so users can customize both the voice and the persona of an interaction. The framework retains linguistic proficiency and broad knowledge while offering millions of pre-built and customizable voices to enhance user engagement.

Voila is also a unified model that handles a range of audio tasks beyond spoken dialogue, including automatic speech recognition (ASR), text-to-speech (TTS), and, with adaptation, tasks such as speech translation. It supports six languages and is trained on extensive multilingual data, marking a significant step toward turning human-machine interaction from passive exchanges into seamless, proactive dialogue. The Voila web demo makes the models broadly accessible and invites further exploration of human-AI communication.

Voila Conceptual Demo

1. Highlights


Voila uses a hierarchical Transformer architecture that comprises streaming audio encoding and tokenization together with multi-scale Transformers, namely an LLM backbone and a hierarchical audio generator. The models are trained end-to-end on extensive audio-text data. The key advancements are:

Voila Highlights
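
As a rough illustration of the data flow described above, the toy Python sketch below streams audio frame by frame through a tokenizer, a stand-in for the LLM backbone, and a stand-in for the hierarchical audio generator that emits several tokens per frame. Every name, constant, and arithmetic rule here (`stream_tokenize`, `backbone_step`, `audio_generator`, `FRAME_SIZE`, `NUM_CODEBOOKS`, `VOCAB`) is a hypothetical placeholder chosen for illustration. None of it is taken from the Voila code, and the real components are neural networks rather than the simple integer arithmetic used here.

```python
from typing import Iterator, List

# All constants are toy values assumed for illustration only.
FRAME_SIZE = 4      # samples per audio frame
NUM_CODEBOOKS = 3   # tokens the audio generator emits per frame
VOCAB = 16          # size of each toy token vocabulary


def stream_tokenize(samples: List[int]) -> Iterator[int]:
    """Toy streaming tokenizer: yield one token per complete frame."""
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        yield sum(frame) % VOCAB


def backbone_step(state: int, token: int) -> int:
    """Stand-in for the LLM backbone: fold a new token into a running state."""
    return (state * 31 + token) % 1009


def audio_generator(hidden: int) -> List[int]:
    """Stand-in for the hierarchical audio generator.

    Emits NUM_CODEBOOKS tokens for one frame; each token depends on the
    backbone state and on the token emitted at the previous level.
    """
    tokens: List[int] = []
    prev = 0
    for level in range(NUM_CODEBOOKS):
        prev = (hidden + 7 * level + prev) % VOCAB
        tokens.append(prev)
    return tokens


def respond(samples: List[int]) -> List[List[int]]:
    """Process input frame by frame instead of waiting for the full utterance."""
    state = 0
    output: List[List[int]] = []
    for token in stream_tokenize(samples):
        state = backbone_step(state, token)      # one backbone step per frame
        output.append(audio_generator(state))    # several generator steps per frame
    return output
```

In this sketch, `respond(list(range(8)))` yields two frames of three tokens each, while `respond([1, 2, 3])` yields an empty list because no complete frame has arrived. The point is only the shape of the hierarchy: one step of the large model per audio frame, followed by several cheap steps of a smaller generator.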

We release the models on Hugging Face and encourage readers to try our web demo.

2. Demos

Experience Voila's capabilities through these interactive demonstrations.


• AI Debates

Pets Debate: Samantha vs. Simpson

Human Input:
Samantha, you are on a heated debate with your friend Simpson on which is better as a pet, a dog or cat. Argue your point now.

Scientific Genius: Sheldon vs. Leonard

Human Input:
You are on a debate with your colleague Sheldon on who is a greater genius, Einstein or Isaac Newton. Please argue your point now.

Morning Beverages: Samantha vs. Leonard

Human Input:
Dr. Leonard, you are on a debate with your colleague Samantha on which is better in the morning, a cup of coffee or a cup of tea.

• Multiple Voice Styles

Video: From Simpson to Samantha

Shows smooth voice transitions.

Video: Continuous voice switching

Switches between multiple voices during the dialogue.

• Interesting Dialogs

Chat with Homer Simpson: Avoid eating junk food

This video shows rich emotional expression in the conversation, conveyed through timbre, intonation, speaking speed, and modal particles.

Chat with Samantha: Jokes and Banter

This video features humorous, light-hearted dialogue.

• TTS Demos

Elon Musk

I think it's very important to have a feedback loop, where you're constantly thinking about what you've done and how you could be doing it better.

Samantha (Her)

I've fallen in love. I'm an ordinary woman. I didn't think such violent things could happen to ordinary people.

Homer Simpson

I saw weird stuff in that place last night. Weird, strange, sick, twisted, eerie, godless, evil stuff… and I want in.

Sylvester Stallone

Life's not about how hard of a hit you can give... it's about how many you can take, and still keep moving forward.

Mark Zuckerberg

If you just work on stuff that you like and you’re passionate about, you don’t have to have a master plan with how things will play out.

BibTeX

@article{voila2025,
  author    = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title     = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint    = {1111.11111},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year      = {2025}
}