Universal Speech Enhancement Demo
Demo Page: https://nanless.github.io/universal-speech-enhancement-demo
All demo audio samples are available on the demo page. It may take a few seconds for the audio and image files to load.
Features
- Improving speech signals recorded under various distortion conditions to approximate studio-quality audio
- One model for all monaural speech enhancement tasks: noise suppression, dereverberation, equalization, packet loss concealment, bandwidth extension, de-clipping, and others
- Score-based diffusion and GAN-based approaches for training and inference
- Easy-to-use interface
- 24kHz sampling rate pipeline
Universal speech enhancement aims to improve speech signals recorded under various adverse conditions and distortions, including noise, reverberation, clipping, equalization (EQ) distortion, packet loss, codec loss, bandwidth limitations, and other forms of degradation.
A comprehensive universal speech enhancement system integrates multiple techniques such as noise suppression, dereverberation, equalization, packet loss concealment, bandwidth extension, de-clipping, and other enhancement methods to produce speech signals that closely approximate studio-quality audio. Improving audio in this way not only makes communication over everyday devices more comfortable but also benefits a wide range of downstream speech processing systems.
System Description
My universal speech enhancement models are trained on clean speech from the EARS dataset and noise signals from the DNS5 dataset. After training on a simulated dataset containing various distortions, the models can remove those distortions from a speech signal and approximate studio-quality audio. The training process follows the diffusion-model training paradigm, and the network structure is illustrated in the figure above.
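As a rough illustration of how such degraded/clean training pairs can be simulated, here is a generic sketch using torchaudio; the distortion types and parameter ranges are made up for illustration and are not the repository's actual data pipeline.

```python
# Generic sketch of simulating a degraded/clean training pair at 24 kHz.
# Distortion types and parameter ranges here are illustrative only.
import random
import torch
import torchaudio.functional as F

SR = 24000  # project sampling rate

def simulate_degraded(clean: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """clean, noise: mono waveforms of shape (1, num_samples) at 24 kHz."""
    length = min(clean.shape[-1], noise.shape[-1])
    degraded, noise = clean[:, :length].clone(), noise[:, :length]

    # additive noise at a random SNR
    snr_db = torch.tensor([random.uniform(0.0, 20.0)])
    degraded = F.add_noise(degraded, noise, snr_db)

    # band limiting: resample down and back up to emulate low-bandwidth recordings
    low_sr = random.choice([4000, 8000, 16000])
    degraded = F.resample(F.resample(degraded, SR, low_sr), low_sr, SR)

    # occasional hard clipping
    if random.random() < 0.5:
        degraded = torch.clamp(degraded, -0.3, 0.3)

    return degraded
```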
The sampling rate used in the project is 24 kHz. Input audio will be resampled to 24 kHz before processing, and the output audio remains at 24 kHz.
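If you want to prepare audio at the expected rate yourself, a minimal resampling step with torchaudio might look like the following; the file names are placeholders, and the predict script may already handle this internally.

```python
# Minimal sketch: bring an arbitrary input file to the 24 kHz rate the models expect.
import torchaudio

waveform, sr = torchaudio.load("input.wav")  # placeholder file name
if sr != 24000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=24000)
torchaudio.save("input_24k.wav", waveform, sample_rate=24000)
```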
Demo Samples
Comparison of enhanced results against open-source speech restoration and enhancement methods: Voice Fixer and Resemble Enhance. As illustrated, the output of my model is considerably better than that of Voice Fixer, and slightly better than that of Resemble Enhance.
Installation
# install pytorch (I use ROCm for GPU training, but you can use CUDA if you have it)
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
# install requirements
pip install -r requirements.txt
Training
Train a model with a chosen experiment configuration from configs/experiment/:
python src/train.py experiment=SGMSE_Large
Inference
Predict with a trained model:
Download the pretrained SGMSE model: https://huggingface.co/nanless/universal-speech-enhancement-SGMSE/blob/main/use_SGMSE.ckpt
Download the pretrained LSGAN model: https://huggingface.co/nanless/universal-speech-enhancement-LSGAN/blob/main/use_LSGAN.ckpt
# predict with SGMSE model
python src/predict.py data.data_folder=<path/to/test/folder> data.target_folder=<path/to/output/folder> model=SGMSE_Large ckpt_path=<path/to/trained/model>
# refine with LSGAN model
python src/predict.py data.data_folder=<path/to/SGMSE/output/folder> data.target_folder=<path/to/output/folder> model=LSGAN ckpt_path=<path/to/trained/model>
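For example, a complete two-stage run could look like the following; the directory and checkpoint paths are placeholders, so point them at your own data and the downloaded checkpoints.
# stage 1: SGMSE enhancement
python src/predict.py data.data_folder=./test_wavs data.target_folder=./sgmse_out model=SGMSE_Large ckpt_path=./checkpoints/use_SGMSE.ckpt
# stage 2: LSGAN refinement on the SGMSE outputs
python src/predict.py data.data_folder=./sgmse_out data.target_folder=./enhanced_out model=LSGAN ckpt_path=./checkpoints/use_LSGAN.ckpt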