Universal Speech Enhancement Demo
Demo Page: https://nanless.github.io/universal-speech-enhancement-demo
All demo audio samples are available on the demo page. It may take a few seconds for the audio and image files to load.
Features
- Improving speech signals recorded under various distortion conditions to approximate studio-quality audio
- One model for all monaural speech enhancement tasks: noise suppression, dereverberation, equalization, packet loss concealment, bandwidth extension, de-clipping, and others
- Score-based diffusion and GAN-based approaches for training and inference
- Easy-to-use interface
- 24kHz sampling rate pipeline
Universal speech enhancement aims to improve speech signals recorded under various adverse conditions and distortions, including noise, reverberation, clipping, equalization (EQ) distortion, packet loss, codec loss, bandwidth limitations, and other forms of degradation.
A comprehensive universal speech enhancement system integrates multiple techniques such as noise suppression, dereverberation, equalization, packet loss concealment, bandwidth extension, de-clipping, and other enhancement methods to produce speech signals that closely approximate studio-quality audio. Improving audio in this way not only makes communication over everyday devices more comfortable but also benefits a wide range of downstream speech processing systems.
System Description
My universal speech enhancement models are trained on clean speech from the EARS dataset and noise signals from the DNS5 dataset. After training on a simulated dataset containing various distortions, the models can remove those distortions from a speech signal and approximate studio-quality audio. The training process follows the diffusion-model training paradigm, and the network structure is illustrated in the figure above.
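As a rough illustration of how such degraded/clean training pairs can be simulated, here is a generic sketch using torchaudio; the distortion types and parameter ranges are made up for illustration and are not the repository's actual data pipeline.

```python
# Generic sketch of simulating a degraded/clean training pair at 24 kHz.
# Distortion types and parameter ranges here are illustrative only.
import random
import torch
import torchaudio.functional as F

SR = 24000  # project sampling rate

def simulate_degraded(clean: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """clean, noise: mono waveforms of shape (1, num_samples) at 24 kHz."""
    length = min(clean.shape[-1], noise.shape[-1])
    degraded, noise = clean[:, :length].clone(), noise[:, :length]

    # additive noise at a random SNR
    snr_db = torch.tensor([random.uniform(0.0, 20.0)])
    degraded = F.add_noise(degraded, noise, snr_db)

    # band limiting: resample down and back up to emulate low-bandwidth recordings
    low_sr = random.choice([4000, 8000, 16000])
    degraded = F.resample(F.resample(degraded, SR, low_sr), low_sr, SR)

    # occasional hard clipping
    if random.random() < 0.5:
        degraded = torch.clamp(degraded, -0.3, 0.3)

    return degraded
```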
The sampling rate used in the project is 24 kHz. Input audio will be resampled to 24 kHz before processing, and the output audio remains at 24 kHz.
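If you want to prepare audio at the expected rate yourself, a minimal resampling step with torchaudio might look like the following; the file names are placeholders, and the predict script may already handle this internally.

```python
# Minimal sketch: bring an arbitrary input file to the 24 kHz rate the models expect.
import torchaudio

waveform, sr = torchaudio.load("input.wav")  # placeholder file name
if sr != 24000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=24000)
torchaudio.save("input_24k.wav", waveform, sample_rate=24000)
```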
Demo Samples
Comparison of enhanced results against open-source speech restoration and enhancement methods: Voice Fixer and Resemble Enhance. As illustrated, the output of my model is considerably better than that of Voice Fixer, and slightly better than that of Resemble Enhance.
Installation
# install pytorch (I use ROCm for GPU training, but you can use CUDA if you have it)
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
# install requirements
pip install -r requirements.txt
Training
Train a model with a chosen experiment configuration from configs/experiment/:
python src/train.py experiment=SGMSE_Large
Inference
Predict with a trained model:
Download the pretrained SGMSE model: https://huggingface.co/nanless/universal-speech-enhancement-SGMSE/blob/main/use_SGMSE.ckpt
Download the pretrained LSGAN model: https://huggingface.co/nanless/universal-speech-enhancement-LSGAN/blob/main/use_LSGAN.ckpt
# predict with SGMSE model
python src/predict.py data.data_folder=<path/to/test/folder> data.target_folder=<path/to/output/folder> model=SGMSE_Large ckpt_path=<path/to/trained/model>
# refine with LSGAN model
python src/predict.py data.data_folder=<path/to/SGMSE/output/folder> data.target_folder=<path/to/output/folder> model=LSGAN ckpt_path=<path/to/trained/model>
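For example, a complete two-stage run could look like the following; the directory and checkpoint paths are placeholders, so point them at your own data and the downloaded checkpoints.
# stage 1: SGMSE enhancement
python src/predict.py data.data_folder=./test_wavs data.target_folder=./sgmse_out model=SGMSE_Large ckpt_path=./checkpoints/use_SGMSE.ckpt
# stage 2: LSGAN refinement on the SGMSE outputs
python src/predict.py data.data_folder=./sgmse_out data.target_folder=./enhanced_out model=LSGAN ckpt_path=./checkpoints/use_LSGAN.ckpt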