Demo of Singing Voice Synthesis in Muskits-ESPnet

Singing Voice Synthesis (SVS) takes a music score as input and generates singing vocals in the voice of a specific singer.

A music score usually includes the lyrics, as well as the duration and pitch of each word in the lyrics.

How to use:

Choose Model-Language:
- Choose "zh" for Chinese lyrics input or "jp" for Japanese lyrics input.
- For example, "Model②(Multilingual)-zh" means the model "Model②(Multilingual)" with lyrics input in Chinese.
[Optional] Choose Singer: Choose a singer from the drop-down menu.
Input lyrics:
- Input Chinese characters for "zh" and hiragana for "jp".
- You may include special symbols: 'AP' for breath, 'SP' for silence, and '-' for slur (Chinese lyrics only).
- Separate each lyric by either a space (' ') or a newline ('\n') (no quotation marks needed).
Input durations:
- Input durations as float numbers.
- The duration sequence should match the lyric sequence in length, with each duration aligned to a lyric.
- Separate each duration by a space (' ') or a newline ('\n') (no quotation marks needed).
Input pitches:
- Input MIDI note names or MIDI note numbers (e.g., the MIDI note number "69" corresponds to the note name "A4", and other notes follow accordingly).
- The pitch sequence should match the lyric sequence in length, with each pitch corresponding to a lyric.
- Separate each pitch by a space (' ') or a newline ('\n') (no quotation marks needed).
Hit "Generate" and listen:
- "Running Status" shows the status of singing generation. If an error occurs, the error information is shown here.
- "Pseudo MOS" is the predicted mean opinion score (MOS) for the generated song.
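
The input rules above can be checked before pasting values into the demo. The following is a minimal Python sketch, not the demo's own code: `to_midi_number` and `parse_inputs` are hypothetical helper names, and the note-name convention (scientific pitch notation with C4 = 60, so A4 = 69, with '#' and 'b' as accidentals) is an assumption, since the exact spellings the demo accepts are not listed here. It also assumes every lyric token, including 'AP', 'SP', and '-', has its own duration and pitch entry.

```python
import re

# Semitone offsets of the natural notes within one octave.
_NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
# Assumed note-name spelling: letter, optional '#' or 'b', octave (e.g. "A4", "C#4").
_NOTE_RE = re.compile(r"^([A-Ga-g])([#b]?)(-?\d+)$")


def to_midi_number(token):
    """Convert a pitch token such as "69" or "A4" to a MIDI note number."""
    if token.isdigit():
        return int(token)
    match = _NOTE_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognized pitch: {token!r}")
    letter, accidental, octave = match.groups()
    shift = {"": 0, "#": 1, "b": -1}[accidental]
    # Assumed convention: C4 = 60, which gives A4 = 69.
    return 12 * (int(octave) + 1) + _NOTE_OFFSETS[letter.upper()] + shift


def parse_inputs(lyrics, durations, pitches):
    """Split the three text fields and check that they align one-to-one."""
    # str.split() with no argument splits on spaces and newlines alike.
    lyric_tokens = lyrics.split()
    duration_values = [float(d) for d in durations.split()]
    pitch_numbers = [to_midi_number(p) for p in pitches.split()]
    if not (len(lyric_tokens) == len(duration_values) == len(pitch_numbers)):
        raise ValueError(
            f"length mismatch: {len(lyric_tokens)} lyrics, "
            f"{len(duration_values)} durations, {len(pitch_numbers)} pitches"
        )
    return list(zip(lyric_tokens, duration_values, pitch_numbers))
```

Each returned tuple pairs one lyric with its duration and MIDI note number, so a missing or extra entry in any field is reported as a length mismatch instead of silently shifting the alignment.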

Examples

Model-Language	Singer	Lyrics	Duration	Pitch