In the current wave of AI development, generative AI is the most eye-catching area; with the emergence of large language models (LLMs), AIGC has made qualitative breakthroughs in natural language, image processing, robotics, and more. In this project, we use mainstream AIGC technology on the AMD AI PC platform to generate highly natural singing voice from text and musical scores.
A singing voice synthesis (SVS) system is built to synthesize high-quality and expressive singing voices, in which an acoustic model generates acoustic features such as the Mel spectrum from a given musical score. Previous singing acoustic models reconstructed acoustic features with simple losses such as L1 or L2, or with generative adversarial networks (GANs), but they suffered from over-smoothing and unstable training, which limited the naturalness of the synthesized singing voice.
DiffSinger [Jinglin Liu, et al.] is an SVS acoustic model based on a diffusion probabilistic model: a parameterized Markov chain that converts noise into a Mel spectrum step by step, conditioned on the musical score. By implicitly optimizing the variational bound, DiffSinger can be trained stably and generate realistic outputs. To further improve voice quality and accelerate inference, the authors of the paper introduce a shallow diffusion mechanism to better utilize the prior knowledge learned through the previous simple losses.
DiffSinger first uses a dictionary to convert text into phonemes, and converts the MIDI music score into notes. It represents and analyzes the singing voice with a Mel spectrum, which is obtained by transforming the sound into the frequency domain via the FFT, mapping the resulting power spectrum onto the Mel scale, and arranging it along the time axis. The Mel spectrum of a piece of sound is shown in the figure below:
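For reference, a Mel spectrum like the one in the figure can be computed with a few lines of librosa; the file name and the STFT/Mel parameters below are placeholders, not the exact settings used by DiffSinger:

import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=24000)         # load audio (path is a placeholder)
mel = librosa.feature.melspectrogram(                 # FFT -> power spectrum -> Mel filter bank
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80, power=2.0)
mel_db = librosa.power_to_db(mel, ref=np.max)         # log scale, as typically visualized
print(mel_db.shape)                                   # (n_mels, n_frames)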
With the Mel spectrum, sound can be synthesized, so one goal of the generative model is to produce the Mel spectrum of the singing voice and then use it to synthesize the audio. To obtain this Mel spectrum, an encoder and an auxiliary decoder (aux-decoder) are constructed; after training on data, they produce an estimate M~ of the Mel spectrum.
The Mel spectrum predicted by the aux-decoder is only a rough estimate and cannot produce high-quality singing. A generative model is therefore needed to refine the Mel spectrum prediction, incorporating vocal characteristics such as rhythm and other stylistic details into the final generated sound.
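Conceptually, this two-stage decoding can be sketched as follows (encoder, aux_decoder, and refine_decoder are illustrative placeholders, not the actual DiffSinger class names):

def synthesize_mel(score_tokens, encoder, aux_decoder, refine_decoder):
    cond = encoder(score_tokens)                 # phonemes/notes -> conditioning sequence
    mel_coarse = aux_decoder(cond)               # rough, over-smoothed Mel estimate
    mel_fine = refine_decoder(cond, mel_coarse)  # generative model restores fine detail
    return mel_fine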
1. Shallow Diffusion Model
Specifically, DiffSinger starts generation at a shallow step k that is smaller than the total number of diffusion steps, based on the intersection between the diffusion trajectory of the ground-truth Mel spectrum and the diffusion trajectory of the Mel spectrum predicted by a simple decoder. In addition, the paper proposes a boundary prediction method to locate this intersection and adaptively determine the shallow step.
1.2 Diffusion Model
Since DiffSinger is constructed on top of the diffusion model, let us briefly introduce the diffusion model here.
The diffusion model has several advantages over other generative models, including its ability to generate high-quality samples, its flexibility in modeling complex data distributions, and its potential for scaling to large datasets. In the context of DiffSinger, the diffusion model is used to generate high-quality singing voices by predicting the Mel spectrum of the sound at each step of the reverse diffusion process.
The diffusion model consists of two main processes: the diffusion process and the reverse process.
- Diffusion Process
The diffusion process is a Markov chain with fixed parameters that gradually adds small amounts of Gaussian noise to the data at each step, eventually transforming the original data into a Gaussian distribution. This process is designed to destroy the structural information in the data, making it increasingly difficult to distinguish from pure noise (a closed-form sketch of both processes follows after this list).
- Reverse Process
The reverse process is a Markov chain with learnable parameters that aims to reverse the diffusion process, recovering the original data from the Gaussian noise. In this process, a neural network, known as the denoiser, is trained to predict and subtract the noise added at each step of the diffusion process, effectively reversing the diffusion process and generating samples from the original data distribution.
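As a concrete reference, here is a minimal sketch of both processes in the standard DDPM formulation; the step count T, the linear beta schedule, and the name denoiser are illustrative assumptions, not DiffSinger's actual hyperparameters:

import torch

T = 100                                    # total diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.06, T)      # fixed noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products alpha_bar_t

def q_sample(m0, t, noise=None):
    # Forward (diffusion) process in closed form: diffuse the clean Mel m0 to step t
    if noise is None:
        noise = torch.randn_like(m0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * m0 + (1.0 - a_bar).sqrt() * noise

def p_sample(denoiser, m_t, t, cond):
    # Reverse process, one step: predict the added noise, then estimate M_{t-1}
    eps_hat = denoiser(m_t, torch.tensor([t]), cond)
    alpha_t, a_bar_t = alphas[t], alpha_bars[t]
    mean = (m_t - (1.0 - alpha_t) / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(m_t)   # sampling noise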
DiffSinger is built upon the diffusion model. Since the task of singing voice synthesis (SVS) involves modeling the conditional distribution p(M0|x), where M0 is the Mel spectrum and x the corresponding music score, the music score x is used as a condition in the reverse process of the diffusion denoiser. To improve the performance and efficiency of the model, a novel shallow diffusion mechanism is proposed. Additionally, a boundary prediction network is introduced, which can adaptively find the intersection boundary required by the shallow diffusion mechanism.
1.2.3 Workflow
During training, DiffSinger trains the denoiser to predict the random noise that was added, given the timestep t, the music score x, and the noisy Mel spectrum Mt at step t.
During inference, the process starts from Gaussian noise sampled from the N(0, 1) distribution and removes the noise step by step. Each denoising step consists of two parts:
- Use the denoiser to predict the random noise added at the current timestep.
- Based on the predicted random noise, infer the Mel spectrum Mt-1 from Mt.
By iteratively applying these two steps in reverse order, DiffSinger can generate the Mel spectrum of the singing voice, which can then be used to synthesize the final audio.
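Putting the two steps together, a plain (non-shallow) inference loop would look roughly like this, reusing q_sample/p_sample from the sketch above; the output shape is whatever Mel shape the model expects:

def generate_mel(denoiser, cond, shape):
    # Start from pure Gaussian noise M_T and denoise step by step down to M_0
    m_t = torch.randn(shape)
    for t in reversed(range(T)):
        m_t = p_sample(denoiser, m_t, t, cond)
    return m_t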
1.3 Shallow Diffusion Mechanism
While acoustic models trained with simple loss functions may suffer from blurriness and oversmoothing, they still produce samples that have a strong connection to the real data distribution. This connection can provide valuable prior knowledge for DiffSinger.
To explore this connection and find ways to better utilize the prior knowledge, the authors of the paper conduct empirical observations using the diffusion process. They propose a shallow diffusion mechanism as a way to leverage this prior knowledge more effectively.
The shallow diffusion mechanism aims to reduce the number of diffusion steps required to achieve a similar level of denoising performance. By shortening the diffusion path, the model can focus more on the relevant features and patterns in the data, while avoiding unnecessary diffusion into irrelevant noise.
This mechanism allows DiffSinger to start from a partially denoised state, rather than pure Gaussian noise, and to perform a shorter and more targeted denoising process. By doing so, DiffSinger can make better use of the prior knowledge provided by the acoustic models and generate singing voices that are more faithful to the original data distribution.
Comparing a ground-truth Mel spectrum M and the aux-decoder prediction M~ as both are diffused, two observations stand out:
- When t=0, M contains rich details between adjacent harmonics, which affect the naturalness of the synthesized singing voice, whereas M~ is overly smoothed.
- As t increases, the samples from the two processes become indistinguishable.
Based on these observations, the authors of the paper propose the following hypothesis: When the number of diffusion steps is sufficiently large, the trajectory from the manifold of M~ to the Gaussian noise manifold intersects with the trajectory from the manifold of M to the Gaussian noise manifold. Inspired by this observation, the authors propose the shallow diffusion mechanism: instead of starting from pure Gaussian white noise, the reverse process begins at the intersection point of the two trajectories as shown in the figure below. This allows the model to leverage the prior information encoded in M and focus on denoising only the necessary parts, thereby improving the efficiency and quality of the generated singing voices.
Therefore, the computational overhead of the reverse process can be significantly alleviated. Specifically, during the inference stage, the authors propose the following steps:
- An auxiliary decoder, conditioned on the output of the music score encoder and trained with an L1 loss, is used to generate M~, an approximation that is close to the intersection of the two trajectories.
- An intermediate sample M~k is generated by running the forward diffusion process on M~ for a shallow number of steps k. This initializes the reverse process at a state close to the desired output, reducing the number of denoising steps needed.
- Starting from the Mel spectrogram M~k at the intersection of the two trajectories, the reverse process performs k rounds of denoising to reconstruct the final Mel spectrum. By initiating the reverse process from this intermediate state, the model can generate high-quality singing voices at a much lower computational cost (a sketch of these steps follows after this list).
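Here is a minimal sketch of this shallow-diffusion inference, reusing q_sample and p_sample from the earlier sketches; aux_decoder is a placeholder for the auxiliary decoder, and in DiffSinger the shallow step k is located by the boundary prediction network:

def shallow_generate_mel(denoiser, aux_decoder, cond, k):
    # 1) Rough, over-smoothed estimate ~M from the auxiliary decoder
    mel_tilde = aux_decoder(cond)
    # 2) Diffuse ~M forward to the intersection step k (instead of starting from pure noise)
    m = q_sample(mel_tilde, k - 1)
    # 3) Only k reverse (denoising) steps are needed to reach M_0
    for t in reversed(range(k)):
        m = p_sample(denoiser, m, t, cond)
    return m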
2. Part Sourcing
- DiffSinger source code
https://github.com/MoonInTheRiver/DiffSinger
- Vitis AI library support
https://github.com/Xilinx/Vitis-AI
- Ryzen AI PC
PC with AMD Ryzen AI support
3. Building
3.1 Model conversion using ONNX
The ONNX (Open Neural Network Exchange) format mainly supports static computation graphs, so it cannot directly handle a model as complex as DiffSinger, which contains dynamic and conditional computation paths. We therefore split DiffSinger into multiple sub-models, including GaussianDiffusion, Hifigan, and PitchEncoder. For the more complex GaussianDiffusion module, we create a new class to wrap its model structure and use ONNX's export function to convert it to ONNX format. This keeps the model manageable and compatible while taking advantage of ONNX's cross-platform portability. The following simplified code shows how the GaussianDiffusion module is wrapped and exported:
class GaussianDiffusionWrap(GaussianDiffusion):
    def forward(self, txt_tokens,
                # Wrapped arguments
                spk_id,
                pitch_midi,
                midi_dur,
                is_slur,
                # mel2ph,
                ):
        # Run the parent model in inference mode with a fixed, traceable argument list
        return super().forward(txt_tokens, spk_id=spk_id, ref_mels=None, infer=True,
                               pitch_midi=pitch_midi, midi_dur=midi_dur,
                               is_slur=is_slur)  # , mel2ph=mel2ph
class DFSInferWrapped(e2e.DiffSingerE2EInfer):
    def build_model(self):
        model = GaussianDiffusionWrap(
            phone_encoder=self.ph_encoder,
            out_dims=hparams['audio_num_mel_bins'],
            denoise_fn=DIFF_DECODERS[hparams['diff_decoder_type']](hparams),
            timesteps=hparams['timesteps'],
            K_step=hparams['K_step'],
            loss_type=hparams['diff_loss_type'],
            spec_min=hparams['spec_min'],
            spec_max=hparams['spec_max'],
        )
        model.eval()
        load_ckpt(model, hparams['fs2_ckpt'], 'model')
        if hparams.get('pe_enable') is not None and hparams['pe_enable']:
            self.pe = PitchExtractor().to(self.device)
            load_ckpt(self.pe, hparams['pe_ckpt'], 'model', strict=True)
            self.pe.eval()
        return model
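The wrapped model can then be exported with torch.onnx.export. A sketch is shown below; the dummy input shapes, tensor names, dynamic axes, and opset version are assumptions for illustration, not necessarily the exact values used in onnx_export_singer.py:

import torch

def export_singer(wrapped_model, out_path="singer_fs.onnx"):
    # wrapped_model: the GaussianDiffusionWrap instance returned by build_model() above
    wrapped_model = wrapped_model.cpu().eval()
    # Dummy inputs only fix ranks/dtypes; the token dimension is marked dynamic below
    txt_tokens = torch.zeros(1, 50, dtype=torch.long)
    spk_id     = torch.zeros(1, dtype=torch.long)
    pitch_midi = torch.zeros(1, 50, dtype=torch.long)
    midi_dur   = torch.zeros(1, 50, dtype=torch.float)
    is_slur    = torch.zeros(1, 50, dtype=torch.long)
    torch.onnx.export(
        wrapped_model,
        (txt_tokens, spk_id, pitch_midi, midi_dur, is_slur),
        out_path,
        input_names=["txt_tokens", "spk_id", "pitch_midi", "midi_dur", "is_slur"],
        output_names=["mel_out"],
        dynamic_axes={"txt_tokens": {1: "tokens"}, "pitch_midi": {1: "tokens"},
                      "midi_dur": {1: "tokens"}, "is_slur": {1: "tokens"}},
        opset_version=11)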
3.2 Pipeline process and deployment
In the inference process, the GaussianDiffusion module first receives the music score and text input and converts them into the Mel spectrum, a frequency-domain representation of the audio, using the shallow diffusion mechanism described above. The Mel spectrum is then passed to the PitchEncoder module to further process the pitch information, and finally the Hifigan module generates the final audio file in WAV format. The whole process forms a complete chain from text and music score to sound output.
c = {
    'text': '你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP',
    'notes': 'D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest',
    'notes_duration': '0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590',
    'input_type': 'word'
}
target = "./infer_out/onnx_test_singer_res.wav"
set_hparams(print_hparams=False)
spec_min = torch.FloatTensor(hparams['spec_min'])[None, None, :hparams['keep_bins']]
spec_max = torch.FloatTensor(hparams['spec_max'])[None, None, :hparams['keep_bins']]
infer_ins = TestAllInfer(hparams)
out = infer_ins.infer_once(c)
os.makedirs(os.path.dirname(target), exist_ok=True)
print(f'| save audio: {target}')
save_wav(out, target, hparams['audio_sample_rate'])
print("OK")
To maximize the performance of the Ryzen AI PC, the Vitis AI runtime is adopted as the underlying execution provider for the ONNX models during inference, improving their real-time performance. Since the model parameters are not very large, and to preserve high sound quality, the models are not quantized for inference.
config_file_path = "vaip_config.json"
aie_options = ort.SessionOptions()
aie_options.enable_profiling = True

print("load pe")
self.pe2 = ort.InferenceSession("xiaoma_pe.onnx",
                                providers=['VitisAIExecutionProvider'],
                                sess_options=aie_options,
                                provider_options=[{'config_file': config_file_path}])
print("load hifigan")
self.vocoder2 = ort.InferenceSession("hifigan.onnx",
                                     providers=['VitisAIExecutionProvider'],
                                     sess_options=aie_options,
                                     provider_options=[{'config_file': config_file_path}])
print("load singer_fs")
self.model2 = ort.InferenceSession("singer_fs.onnx",
                                   providers=['VitisAIExecutionProvider'],
                                   sess_options=aie_options,
                                   provider_options=[{'config_file': config_file_path}])
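Once the three sessions are created, inference chains them roughly as follows. This is only a sketch of the flow: the input/output tensor names ("txt_tokens", "mel", "f0", ...) are assumptions for illustration and must match whatever names were chosen at export time:

def run_pipeline(self, txt_tokens, spk_id, pitch_midi, midi_dur, is_slur):
    # 1) Acoustic model: text/score features -> Mel spectrum
    mel = self.model2.run(None, {"txt_tokens": txt_tokens, "spk_id": spk_id,
                                 "pitch_midi": pitch_midi, "midi_dur": midi_dur,
                                 "is_slur": is_slur})[0]
    # 2) Pitch extractor: refine F0 from the predicted Mel spectrum
    f0 = self.pe2.run(None, {"mel": mel})[0]
    # 3) Hifigan vocoder: Mel spectrum (+ F0) -> waveform samples
    wav = self.vocoder2.run(None, {"mel": mel, "f0": f0})[0]
    return wav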
A link to the experiment's output WAV file is provided here.
4. Coding and algorithm
I have uploaded the model ONNX conversion code and the inference code, listed below:
- Model ONNX export:
onnx_export_singer.py
onnx_export_hifigan.py
onnx_export_pe.py
- ONNX model inference:
onnx_test_singer.py
[Jinglin Liu, et al.] DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. AAAI Conference on Artificial Intelligence (AAAI-22).