In the current wave of AI development, generative AI is the most eye-catching area; with the emergence of large language models (LLMs), AIGC has made qualitative breakthroughs in natural language, image processing, robotics, and more. In this project, we use mainstream AIGC technology on the AMD AI PC platform to generate highly natural singing voice from text and musical scores.
A singing voice synthesis (SVS) system is built to synthesize high-quality and expressive singing voices, in which an acoustic model generates acoustic features such as the Mel spectrum from a given musical score. Previous singing acoustic models reconstructed acoustic features with simple losses such as L1 or L2, or with generative adversarial networks (GANs), but they suffered from over-smoothing and unstable training, which limited the naturalness of the synthesized singing voice.
DiffSinger [Jinglin Liu, et al.] is an SVS acoustic model based on a diffusion probabilistic model: a parameterized Markov chain that converts noise into a Mel spectrum step by step, conditioned on the musical score. By implicitly optimizing the variational bound, DiffSinger can be trained stably and generate realistic outputs. To further improve voice quality and accelerate inference, the authors of the paper introduce a shallow diffusion mechanism to better utilize the prior knowledge learned through the previous simple losses.
DiffSinger first uses a dictionary to convert text into phonemes, and converts the MIDI music score into notes. It represents and analyzes the singing voice with a Mel spectrum, which is obtained by transforming the sound into the frequency domain via the FFT, mapping the resulting power spectrum onto the Mel scale, and arranging it along the time axis. The Mel spectrum of a piece of sound is shown in the figure below:
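For reference, a Mel spectrum like the one in the figure can be computed with a few lines of librosa; the file name and the STFT/Mel parameters below are placeholders, not the exact settings used by DiffSinger:

import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=24000)         # load audio (path is a placeholder)
mel = librosa.feature.melspectrogram(                 # FFT -> power spectrum -> Mel filter bank
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80, power=2.0)
mel_db = librosa.power_to_db(mel, ref=np.max)         # log scale, as typically visualized
print(mel_db.shape)                                   # (n_mels, n_frames)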
With the Mel spectrum, sound can be synthesized, so one goal of the generative model is to produce the Mel spectrum of the singing voice and then use it to synthesize the audio. To obtain this Mel spectrum, an encoder and an auxiliary decoder (aux-decoder) are constructed; after training on data, they produce an estimate M~ of the Mel spectrum.
The Mel spectrum predicted by the aux-decoder is only a rough estimate and cannot produce high-quality singing. A generative model is therefore needed to refine the Mel spectrum prediction, incorporating vocal characteristics such as rhythm and other stylistic details into the final generated sound.
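Conceptually, this two-stage decoding can be sketched as follows (encoder, aux_decoder, and refine_decoder are illustrative placeholders, not the actual DiffSinger class names):

def synthesize_mel(score_tokens, encoder, aux_decoder, refine_decoder):
    cond = encoder(score_tokens)                 # phonemes/notes -> conditioning sequence
    mel_coarse = aux_decoder(cond)               # rough, over-smoothed Mel estimate
    mel_fine = refine_decoder(cond, mel_coarse)  # generative model restores fine detail
    return mel_fine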
1. Shallow Diffusion Model
Specifically, DiffSinger starts generation at a shallow step k that is smaller than the total number of diffusion steps, based on the intersection between the diffusion trajectory of the ground-truth Mel spectrum and the diffusion trajectory of the Mel spectrum predicted by a simple decoder. In addition, the paper proposes a boundary prediction method to locate this intersection and adaptively determine the shallow step.
1.2 Diffusion Model
Since DiffSinger is constructed on top of the diffusion model, let us briefly introduce the diffusion model here.
The diffusion model has several advantages over other generative models, including its ability to generate high-quality samples, its flexibility in modeling complex data distributions, and its potential for scaling to large datasets. In the context of DiffSinger, the diffusion model is used to generate high-quality singing voices by predicting the Mel spectrum of the sound at each step of the reverse diffusion process.
The diffusion model consists of two main processes: the diffusion process and the reverse process.
- Diffusion Process
The diffusion process is a Markov chain with fixed parameters that gradually adds small amounts of Gaussian noise to the data at each step, eventually transforming the original data into a Gaussian distribution. This process is designed to destroy the structural information in the data, making it increasingly difficult to distinguish from pure noise (a closed-form sketch of both processes follows after this list).
- Reverse Process
The reverse process is a Markov chain with learnable parameters that aims to reverse the diffusion process, recovering the original data from the Gaussian noise. In this process, a neural network, known as the denoiser, is trained to predict and subtract the noise added at each step of the diffusion process, effectively reversing the diffusion process and generating samples from the original data distribution.
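As a concrete reference, here is a minimal sketch of both processes in the standard DDPM formulation; the step count T, the linear beta schedule, and the name denoiser are illustrative assumptions, not DiffSinger's actual hyperparameters:

import torch

T = 100                                    # total diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.06, T)      # fixed noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products alpha_bar_t

def q_sample(m0, t, noise=None):
    # Forward (diffusion) process in closed form: diffuse the clean Mel m0 to step t
    if noise is None:
        noise = torch.randn_like(m0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * m0 + (1.0 - a_bar).sqrt() * noise

def p_sample(denoiser, m_t, t, cond):
    # Reverse process, one step: predict the added noise, then estimate M_{t-1}
    eps_hat = denoiser(m_t, torch.tensor([t]), cond)
    alpha_t, a_bar_t = alphas[t], alpha_bars[t]
    mean = (m_t - (1.0 - alpha_t) / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(m_t)   # sampling noise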
DiffSinger is built upon the diffusion model. Since the task of singing voice synthesis (SVS) involves modeling the conditional distribution p(M0|x), where M0 is the Mel spectrum and x the corresponding music score, the music score x is used as a condition in the reverse process of the diffusion denoiser. To improve the performance and efficiency of the model, a novel shallow diffusion mechanism is proposed. Additionally, a boundary prediction network is introduced, which can adaptively find the intersection boundary required by the shallow diffusion mechanism.
1.2.3 Workflow
During training, DiffSinger trains the denoiser to predict the random noise that was added, given the timestep t, the music score x, and the noisy Mel spectrum Mt at step t.
During inference, the process starts from Gaussian noise sampled from the N(0, 1) distribution and removes the noise step by step. Each denoising step consists of two parts:
- Use the denoiser to predict the random noise added at the current timestep.
- Based on the predicted random noise, infer the Mel spectrum Mt-1 from Mt.
By iteratively applying these two steps in reverse order, DiffSinger can generate the Mel spectrum of the singing voice, which can then be used to synthesize the final audio.
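Putting the two steps together, a plain (non-shallow) inference loop would look roughly like this, reusing q_sample/p_sample from the sketch above; the output shape is whatever Mel shape the model expects:

def generate_mel(denoiser, cond, shape):
    # Start from pure Gaussian noise M_T and denoise step by step down to M_0
    m_t = torch.randn(shape)
    for t in reversed(range(T)):
        m_t = p_sample(denoiser, m_t, t, cond)
    return m_t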
1.3 Shallow Diffusion Mechanism
While acoustic models trained with simple loss functions may suffer from blurriness and oversmoothing, they still produce samples that have a strong connection to the real data distribution. This connection can provide valuable prior knowledge for DiffSinger.
To explore this connection and find ways to better utilize the prior knowledge, the authors of the paper conduct empirical observations using the diffusion process. They propose a shallow diffusion mechanism as a way to leverage this prior knowledge more effectively.
The shallow diffusion mechanism aims to reduce the number of diffusion steps required to achieve a similar level of denoising performance. By shortening the diffusion path, the model can focus more on the relevant features and patterns in the data, while avoiding unnecessary diffusion into irrelevant noise.
This mechanism allows DiffSinger to start from a partially denoised state, rather than pure Gaussian noise, and to perform a shorter and more targeted denoising process. By doing so, DiffSinger can make better use of the prior knowledge provided by the acoustic models and generate singing voices that are more faithful to the original data distribution.
Comparing a ground-truth Mel spectrum M and the aux-decoder prediction M~ as both are diffused, two observations stand out:
- When t=0, M contains rich details between adjacent harmonics, which affect the naturalness of the synthesized singing voice, whereas M~ is overly smoothed.
- As t increases, the samples from the two processes become indistinguishable.
Based on these observations, the authors of the paper propose the following hypothesis: When the number of diffusion steps is sufficiently large, the trajectory from the manifold of M~ to the Gaussian noise manifold intersects with the trajectory from the manifold of M to the Gaussian noise manifold. Inspired by this observation, the authors propose the shallow diffusion mechanism: instead of starting from pure Gaussian white noise, the reverse process begins at the intersection point of the two trajectories as shown in the figure below. This allows the model to leverage the prior information encoded in M and focus on denoising only the necessary parts, thereby improving the efficiency and quality of the generated singing voices.
Therefore, the computational overhead of the reverse process can be significantly alleviated. Specifically, during the inference stage, the authors propose the following steps:
- An auxiliary decoder, conditioned on the output of the music score encoder and trained with an L1 loss, is used to generate M~, an approximation that is close to the intersection of the two trajectories.
- An intermediate sample M~k is generated by running the forward diffusion process on M~ for a shallow number of steps k. This initializes the reverse process at a state close to the desired output, reducing the number of denoising steps needed.
- Starting from the Mel spectrogram M~k at the intersection of the two trajectories, the reverse process performs k rounds of denoising to reconstruct the final Mel spectrum. By initiating the reverse process from this intermediate state, the model can generate high-quality singing voices at a much lower computational cost (a sketch of these steps follows after this list).
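Here is a minimal sketch of this shallow-diffusion inference, reusing q_sample and p_sample from the earlier sketches; aux_decoder is a placeholder for the auxiliary decoder, and in DiffSinger the shallow step k is located by the boundary prediction network:

def shallow_generate_mel(denoiser, aux_decoder, cond, k):
    # 1) Rough, over-smoothed estimate ~M from the auxiliary decoder
    mel_tilde = aux_decoder(cond)
    # 2) Diffuse ~M forward to the intersection step k (instead of starting from pure noise)
    m = q_sample(mel_tilde, k - 1)
    # 3) Only k reverse (denoising) steps are needed to reach M_0
    for t in reversed(range(k)):
        m = p_sample(denoiser, m, t, cond)
    return m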
2. Part Sourcing
- DiffSinger source code
https://github.com/MoonInTheRiver/DiffSinger
- Vitis AI library support
https://github.com/Xilinx/Vitis-AI
- Ryzen AI PC
PC with AMD Ryzen AI support
3. Building
3.1 Model conversion using ONNX
The ONNX (Open Neural Network Exchange) format mainly supports static computation graphs, so it cannot directly handle a model as complex as DiffSinger, which contains dynamic and conditional computation paths. We therefore split DiffSinger into multiple sub-models, including GaussianDiffusion, Hifigan, and PitchEncoder. For the more complex GaussianDiffusion module, we create a new class to wrap its model structure and use ONNX's export function to convert it to ONNX format. This keeps the model manageable and compatible while taking advantage of ONNX's cross-platform portability. The following simplified code shows how the GaussianDiffusion module is wrapped and exported:
class GaussianDiffusionWrap(GaussianDiffusion):
    def forward(self, txt_tokens,
                # Wrapped arguments
                spk_id,
                pitch_midi,
                midi_dur,
                is_slur,
                # mel2ph,
                ):
        # Run the parent model in inference mode with a fixed, traceable argument list
        return super().forward(txt_tokens, spk_id=spk_id, ref_mels=None, infer=True,
                               pitch_midi=pitch_midi, midi_dur=midi_dur,
                               is_slur=is_slur)  # , mel2ph=mel2ph
class DFSInferWrapped(e2e.DiffSingerE2EInfer):
    def build_model(self):
        model = GaussianDiffusionWrap(
            phone_encoder=self.ph_encoder,
            out_dims=hparams['audio_num_mel_bins'],
            denoise_fn=DIFF_DECODERS[hparams['diff_decoder_type']](hparams),
            timesteps=hparams['timesteps'],
            K_step=hparams['K_step'],
            loss_type=hparams['diff_loss_type'],
            spec_min=hparams['spec_min'],
            spec_max=hparams['spec_max'],
        )
        model.eval()
        load_ckpt(model, hparams['fs2_ckpt'], 'model')
        if hparams.get('pe_enable') is not None and hparams['pe_enable']:
            self.pe = PitchExtractor().to(self.device)
            load_ckpt(self.pe, hparams['pe_ckpt'], 'model', strict=True)
            self.pe.eval()
        return model
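The wrapped model can then be exported with torch.onnx.export. A sketch is shown below; the dummy input shapes, tensor names, dynamic axes, and opset version are assumptions for illustration, not necessarily the exact values used in onnx_export_singer.py:

import torch

def export_singer(wrapped_model, out_path="singer_fs.onnx"):
    # wrapped_model: the GaussianDiffusionWrap instance returned by build_model() above
    wrapped_model = wrapped_model.cpu().eval()
    # Dummy inputs only fix ranks/dtypes; the token dimension is marked dynamic below
    txt_tokens = torch.zeros(1, 50, dtype=torch.long)
    spk_id     = torch.zeros(1, dtype=torch.long)
    pitch_midi = torch.zeros(1, 50, dtype=torch.long)
    midi_dur   = torch.zeros(1, 50, dtype=torch.float)
    is_slur    = torch.zeros(1, 50, dtype=torch.long)
    torch.onnx.export(
        wrapped_model,
        (txt_tokens, spk_id, pitch_midi, midi_dur, is_slur),
        out_path,
        input_names=["txt_tokens", "spk_id", "pitch_midi", "midi_dur", "is_slur"],
        output_names=["mel_out"],
        dynamic_axes={"txt_tokens": {1: "tokens"}, "pitch_midi": {1: "tokens"},
                      "midi_dur": {1: "tokens"}, "is_slur": {1: "tokens"}},
        opset_version=11)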
3.2 Pipeline process and deployment
In the inference process, the GaussianDiffusion module first receives the music score and text input and converts them into the Mel spectrum, a frequency-domain representation of the audio, using the shallow diffusion mechanism described above. The Mel spectrum is then passed to the PitchEncoder module to further process the pitch information, and finally the Hifigan module generates the final audio file in WAV format. The whole process forms a complete chain from text and music score to sound output.
c = {
    'text': '你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP',
    'notes': 'D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest',
    'notes_duration': '0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590',
    'input_type': 'word'
}
target = "./infer_out/onnx_test_singer_res.wav"
set_hparams(print_hparams=False)
spec_min = torch.FloatTensor(hparams['spec_min'])[None, None, :hparams['keep_bins']]
spec_max = torch.FloatTensor(hparams['spec_max'])[None, None, :hparams['keep_bins']]
infer_ins = TestAllInfer(hparams)
out = infer_ins.infer_once(c)
os.makedirs(os.path.dirname(target), exist_ok=True)
print(f'| save audio: {target}')
save_wav(out, target, hparams['audio_sample_rate'])
print("OK")
To maximize the performance of the Ryzen AI PC, the Vitis AI runtime is adopted as the underlying execution provider for the ONNX models during inference, improving their real-time performance. Since the model parameters are not very large, and to preserve high sound quality, the models are not quantized for inference.
config_file_path = "vaip_config.json"
aie_options = ort.SessionOptions()
aie_options.enable_profiling = True

print("load pe")
self.pe2 = ort.InferenceSession("xiaoma_pe.onnx",
                                providers=['VitisAIExecutionProvider'],
                                sess_options=aie_options,
                                provider_options=[{'config_file': config_file_path}])
print("load hifigan")
self.vocoder2 = ort.InferenceSession("hifigan.onnx",
                                     providers=['VitisAIExecutionProvider'],
                                     sess_options=aie_options,
                                     provider_options=[{'config_file': config_file_path}])
print("load singer_fs")
self.model2 = ort.InferenceSession("singer_fs.onnx",
                                   providers=['VitisAIExecutionProvider'],
                                   sess_options=aie_options,
                                   provider_options=[{'config_file': config_file_path}])
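Once the three sessions are created, inference chains them roughly as follows. This is only a sketch of the flow: the input/output tensor names ("txt_tokens", "mel", "f0", ...) are assumptions for illustration and must match whatever names were chosen at export time:

def run_pipeline(self, txt_tokens, spk_id, pitch_midi, midi_dur, is_slur):
    # 1) Acoustic model: text/score features -> Mel spectrum
    mel = self.model2.run(None, {"txt_tokens": txt_tokens, "spk_id": spk_id,
                                 "pitch_midi": pitch_midi, "midi_dur": midi_dur,
                                 "is_slur": is_slur})[0]
    # 2) Pitch extractor: refine F0 from the predicted Mel spectrum
    f0 = self.pe2.run(None, {"mel": mel})[0]
    # 3) Hifigan vocoder: Mel spectrum (+ F0) -> waveform samples
    wav = self.vocoder2.run(None, {"mel": mel, "f0": f0})[0]
    return wav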
A link to the experiment's output WAV file is provided here.
4. Coding and algorithm
I have uploaded the model ONNX conversion code and the inference code, listed below:
- Model ONNX export:
onnx_export_singer.py
onnx_export_hifigan.py
onnx_export_pe.py
- ONNX model inference:
onnx_test_singer.py
[Jinglin Liu, et al.] DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. AAAI Conference on Artificial Intelligence (AAAI-22).