Amphion Team | Apr 12, 2025
Controllability of the human voice has long been an important topic in audio generation, spanning speech generation (e.g., text-to-speech, TTS), singing voice generation (e.g., singing voice synthesis, SVS), and song generation that combines vocals with musical accompaniment.
In our previous work, Vevo, we demonstrated how a unified framework can achieve controllable generation over various speech attributes (such as timbre and style). In this study, we propose Vevo1.5, which extends Vevo to unified, controllable generation of both speech and singing voice.
Beyond controlling timbre and style, Vevo1.5 notably introduces a prosody-based control mechanism: when generating speech or singing voice, we can provide a prosodic source waveform, and the generated voice will mimic the source's prosodic contour (which, from a musical perspective, can also be interpreted as melody). This approach offers several advantages:
- For singing voice, we achieve melody-controllable singing voice generation. In particular, unlike the SVS task, which relies on expert-annotated musical scores (i.e., MIDI) as input for melody control, Vevo1.5 can take a waveform directly as input and extract its prosodic contour (specifically, chromagram tokens, described below) for melody control.
- For speech generation, with prosody as an additional control condition, editing-based tasks can modify the linguistic content more naturally while effectively preserving the original voice's intonation and emotion. Notably, this advantage applies equally to targeted lyric editing in singing voice while the melody is maintained.
- When controlling the prosody of speech and singing voice, we can even provide non-human sounds, such as instrumental recordings, as prosodic references. This opens up diverse possibilities for entertainment and artistic generation in downstream applications.
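To make the melody-control idea above concrete, here is a minimal, from-scratch sketch of how a chromagram (the prosodic contour feature mentioned above) can be computed from a raw waveform: the STFT magnitude bins are folded into 12 pitch classes. This is an illustrative NumPy reimplementation, not Vevo1.5's actual feature extractor; production code (e.g., librosa's `chroma_stft`) adds tuning estimation and smoothing.

```python
import numpy as np

def chromagram(wav: np.ndarray, sr: int, n_fft: int = 2048, hop: int = 512) -> np.ndarray:
    """Fold STFT magnitude bins into 12 pitch classes (C, C#, ..., B).

    Returns an array of shape (num_frames, 12), each frame normalized
    so the melodic contour, not loudness, is what gets captured.
    """
    # Magnitude spectrogram via a short-time Fourier transform.
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(window * wav[i:i + n_fft]))
        for i in range(0, len(wav) - n_fft + 1, hop)
    ]
    spec = np.array(frames)                          # (T, n_fft // 2 + 1)

    # Assign each FFT bin to one of 12 pitch classes via its MIDI pitch.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    midi = 69 + 12 * np.log2(np.maximum(freqs, 1e-6) / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    pitch_class[freqs < 20] = -1                     # ignore sub-audio bins

    chroma = np.zeros((spec.shape[0], 12))
    for pc in range(12):
        chroma[:, pc] = spec[:, pitch_class == pc].sum(axis=1)
    return chroma / np.maximum(chroma.max(axis=1, keepdims=True), 1e-8)

# Demo: a pure 440 Hz tone (concert A) should light up pitch class 9 (A).
sr = 16000
wav = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
chroma = chromagram(wav, sr)
```

Because the feature only keeps pitch-class energy per frame, any sound with a pitch contour, including an instrument, yields a usable melodic reference, which is what enables the non-human prosodic references mentioned above.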
This blog post presents the core concepts of Vevo1.5 and showcases several illustrative demos.
<aside>
💡
We have released the pre-trained model and training recipe of Vevo1.5 at Amphion.
</aside>
Overview of Vevo1.5

We illustrate Vevo1.5's inference pipeline in the figure above, which employs our two proposed tokenizers:
- Prosody Tokenizer: This tokenizer is designed to encode only the coarse-grained prosodic information of the audio. Specifically, we apply a VQ-VAE to the chromagram, with a relatively low frame rate (6.25 Hz) and a limited vocabulary size (512).
- Content-Style Tokenizer: This tokenizer aims to encode all information except timbre (e.g., text, articulation, and fine-grained prosody). Specifically, we use both chromagram and Whisper features as the VQ-VAE's input and reconstruction objective, with a higher frame rate (12.5 Hz) and a larger vocabulary size (16,384).
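Both tokenizers share the same discretization step: each continuous feature frame is mapped to the index of its nearest codebook vector. The sketch below shows that nearest-neighbor lookup with the frame rates and vocabulary sizes stated above; the feature dimensions and the random codebook are illustrative assumptions, not the trained Vevo1.5 codebooks.

```python
import numpy as np

# Frame rates and codebook sizes as stated for Vevo1.5's two tokenizers;
# the feature dimensions ("dim") are illustrative assumptions.
TOKENIZERS = {
    "prosody":       {"frame_rate": 6.25, "vocab": 512,   "dim": 12},   # chromagram only
    "content_style": {"frame_rate": 12.5, "vocab": 16384, "dim": 512},  # chromagram + Whisper
}

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """VQ-VAE quantization step: map each frame to its nearest code.

    features: (T, dim) continuous frames; codebook: (vocab, dim).
    Returns (T,) integer token ids.
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
cfg = TOKENIZERS["prosody"]
codebook = rng.normal(size=(cfg["vocab"], cfg["dim"]))  # stand-in for a trained codebook
frames = rng.normal(size=(25, cfg["dim"]))              # 4 seconds of frames at 6.25 Hz
tokens = quantize(frames, codebook)
```

The contrast between the two configurations reflects the division of labor: a slow, small codebook can only retain a coarse melodic contour, while the faster, much larger content-style codebook has the capacity to preserve text, articulation, and fine-grained prosody.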