Demo of Singing Voice Synthesis in Muskits-ESPnet

This is the demo page of our toolkit Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm.

Singing Voice Synthesis (SVS) takes a music score as input and generates singing vocals in the voice of a specific singer.

A music score usually includes the lyrics, together with the duration and pitch of each word in the lyrics.
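
As a rough illustration of the score described above, each lyric can be paired with one duration and one pitch. The sketch below is not the toolkit's internal format: the `Note` class and its field names are hypothetical, and treating durations as seconds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One entry of a music score (hypothetical structure, for illustration only)."""
    lyric: str       # one lyric token, e.g. a Chinese character or a hiragana
    duration: float  # note length (assumed here to be in seconds)
    pitch: int       # MIDI note number, e.g. 69 for A4


# A three-note score: one duration and one pitch per lyric.
score = [
    Note(lyric="你", duration=0.5, pitch=60),
    Note(lyric="好", duration=0.5, pitch=62),
    Note(lyric="吗", duration=1.0, pitch=64),
]

total_duration = sum(note.duration for note in score)
```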

How to use:

  1. Choose Model-Language:
    • Choose "zh" for Chinese lyrics input or "jp" for Japanese lyrics input.
    • For example, "Model②(Multilingual)-zh" means model "Model②(Multilingual)" with lyrics input in Chinese.
  2. [Optional] Choose Singer: Choose a singer from the drop-down menu.
  3. Input lyrics:
    • Input Chinese characters for "zh" and hiragana for "jp".
    • You may include special symbols: 'AP' for breath, 'SP' for silence, and '-' for slur (Chinese lyrics only).
    • Separate each lyric by either a space (' ') or a newline ('\n') (no quotation marks needed).
  4. Input durations:
    • Input durations as float numbers.
    • The duration sequence should match the lyric sequence in length, with each duration aligned to a lyric.
    • Separate each duration by a space (' ') or a newline ('\n') (no quotation marks needed).
  5. Input pitches:
    • Input MIDI note names or MIDI note numbers (e.g., MIDI note number "69" corresponds to the MIDI note name "A4", and others follow accordingly).
    • The pitch sequence should match the lyric sequence in length, with each pitch corresponding to a lyric.
    • Separate each pitch by a space (' ') or a newline ('\n') (no quotation marks needed).
  6. Hit "Generate" and listen:
    • "Running Status" shows the status of singing generation. If an error occurs, the error information is shown there.
    • "Pseudo MOS" is the predicted mean opinion score for the generated song.

Notice:

  • Plenty of examples are provided.
  • Extreme values may result in suboptimal generation quality!