Sound demos for "DiffWave: A Versatile Diffusion Model for Audio Synthesis"



Reimplementations on GitHub:

Vocoder in PyTorch;

Another vocoder in PyTorch;

Vocoder in TensorFlow;

Unconditional generator in PyTorch.

Section Ⅰ: Neural vocoding on the LJ Speech dataset

The audio samples are generated by conditioning on the ground-truth mel spectrogram. Let C denote the number of residual channels in the WaveNet-like architecture and T the number of diffusion steps in the reverse process.

Models compared: DiffWave (C = 128, T = 200), WaveNet (C = 128), WaveFlow (C = 128), and the ground-truth recording.


Models compared: DiffWave (C = 64, T = 50), ClariNet (C = 64), and WaveFlow (C = 64).
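To make the notation above concrete, here is a minimal sketch of one step of the reverse process, assuming a standard DDPM parameterization, an assumed linear noise schedule, and a hypothetical noise-prediction network `model(x_t, t, mel)` conditioned on the mel spectrogram; it illustrates the technique, not the authors' exact code.

```python
import torch

# Assumed linear noise schedule over T = 200 diffusion steps.
T = 200
betas = torch.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def reverse_step(model, x_t, t, mel):
    """One denoising step: predict the noise in x_t, then sample x_{t-1}."""
    eps = model(x_t, torch.tensor([t]), mel)           # hypothetical network
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean                                    # no noise at the final step
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
```

Sampling starts from white noise and applies this step from t = T - 1 down to t = 0.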


Fast sampling:

Audio samples can be generated directly from the DiffWave models above (trained with T = 200 or T = 50 diffusion steps) in as few as Tinfer = 6 steps at synthesis, which makes synthesis much faster.

Models compared: DiffWave (C = 128, T = 200, Tinfer = 6) and DiffWave (C = 64, T = 50, Tinfer = 6).
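Conceptually, fast sampling runs the same reverse update over a much shorter, hand-chosen noise schedule. The sketch below uses placeholder values for a 6-entry schedule; the paper additionally aligns each inference step with the training noise levels, which is omitted here for brevity.

```python
import torch

# Hypothetical 6-step inference schedule; placeholder values, not the paper's.
betas_infer = torch.tensor([1e-4, 1e-3, 1e-2, 5e-2, 0.2, 0.5])
alphas_infer = 1.0 - betas_infer
alpha_bars_infer = torch.cumprod(alphas_infer, dim=0)

@torch.no_grad()
def fast_sample(model, mel, length):
    """Generate audio with only 6 reverse steps instead of 200 (or 50)."""
    x = torch.randn(1, length)                         # start from white noise
    for s in reversed(range(len(betas_infer))):
        # A real implementation would map s to the matching training step.
        eps = model(x, torch.tensor([s]), mel)
        x = (x - betas_infer[s] / torch.sqrt(1.0 - alpha_bars_infer[s]) * eps) / torch.sqrt(alphas_infer[s])
        if s > 0:
            x = x + torch.sqrt(betas_infer[s]) * torch.randn_like(x)
    return x
```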


Section Ⅱ: Class-conditional waveform generation on the SC09 dataset

The audio samples are generated by conditioning on the digit labels (0-9), and the results are arranged by the conditioning digit label. More samples can be downloaded via this link.

Models compared: DiffWave, WaveNet-256, and the ground-truth recording.
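One simple way to implement the digit conditioning is to embed the label and broadcast it along the time axis as a global conditioner; the sketch below is a hypothetical stand-in (the channel count and injection point are assumptions, not necessarily the paper's architecture).

```python
import torch
import torch.nn as nn

class LabelConditioner(nn.Module):
    """Embed a digit label (0-9) and broadcast it over the time axis."""

    def __init__(self, num_classes=10, channels=64):   # channel count assumed
        super().__init__()
        self.embed = nn.Embedding(num_classes, channels)

    def forward(self, label, length):
        e = self.embed(label)                          # (batch, channels)
        # Repeat the same vector at every time step: (batch, channels, length).
        return e.unsqueeze(-1).expand(-1, -1, length)
```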


Section Ⅲ: Unconditional waveform generation on the SC09 dataset

The audio samples are generated without any conditioning information. Results are arranged according to the digit labels assigned by human listeners. More samples can be downloaded via this link.

Models compared: DiffWave, WaveGAN, WaveNet-256, and the ground-truth recording.


Section Ⅳ: Denoising steps in the reverse process of DiffWave

The audio samples are the intermediate outputs of the DiffWave vocoder above (C = 128, T = 200) during the reverse process, which gradually transforms white noise (t = 200) into speech (t = 0). Note that the most effective denoising steps occur near t = 0; a code sketch for collecting these snapshots follows the list below.

WARNING: Loud volume!

Intermediate samples at t = 200, 100, 50, 20, 10, 5, 1, and 0.
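A sketch of how these snapshots might be collected during sampling, under the same assumed schedule and hypothetical `model(x_t, t, mel)` as in the Section Ⅰ sketch:

```python
import torch

T = 200
betas = torch.linspace(1e-4, 0.05, T)                  # assumed schedule
alphas, alpha_bars = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample_with_snapshots(model, mel, length, keep=(100, 50, 20, 10, 5, 1, 0)):
    """Run the full reverse process, saving the intermediate x_t at chosen steps."""
    x = torch.randn(1, length)
    snapshots = {T: x.clone()}                         # t = 200: pure white noise
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t]), mel)
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        if t in keep:
            snapshots[t] = x.clone()
    return snapshots
```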


Section Ⅴ: Zero-shot speech denoising (bonus!)

We found that our unconditional DiffWave model can readily perform speech denoising. The SC09 dataset provides six kinds of noise for data augmentation in the recognition task: (1) white noise, (2) pink noise, (3) running tap, (4) exercise bike, (5) dude miaowing, and (6) doing the dishes. We did not use them when training our generative model. We add 10% of each kind of noise to the test data and feed these noisy utterances into the reverse process at t = 25. Note that our model is not trained on a denoising task and has zero knowledge of any noise type other than the white noise added in the diffusion process. The results indicate that DiffWave learns a good prior over audio.

Samples: for each noise type (1-6), the noisy input and the denoised output, alongside the ground-truth recording.
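In code, the trick amounts to starting the reverse process from the noisy utterance instead of from white noise. Below is a minimal sketch with the same assumed schedule as in Section Ⅰ, where `model(x_t, t)` is the hypothetical unconditional noise-prediction network:

```python
import torch

T = 200
betas = torch.linspace(1e-4, 0.05, T)                  # assumed schedule
alphas, alpha_bars = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def zero_shot_denoise(model, noisy_audio, t_start=25):
    """Treat the noisy utterance as x_{t_start} and denoise it down to t = 0."""
    x = noisy_audio
    for t in reversed(range(t_start)):
        eps = model(x, torch.tensor([t]))
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```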


Section Ⅵ: Interpolation (bonus!)

We can perform interpolation with the digit-conditioned DiffWave model on the SC09 dataset. Interpolation between the voices of two speakers is done in the latent space at t = 50. The listed numbers (e.g., 0.4) are the linear interpolation weights on the second speaker.

Samples: three interpolation examples, each at weights 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0.
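A sketch of the procedure described above: forward-diffuse both utterances to t = 50 with q(x_t | x_0), mix the two latents linearly with weight lam, and denoise the mixture with the digit-conditioned model (the `model(x_t, t, label)` signature and the schedule are assumptions carried over from Section Ⅰ).

```python
import torch

T = 200
betas = torch.linspace(1e-4, 0.05, T)                  # assumed schedule
alphas, alpha_bars = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def interpolate(model, x0_a, x0_b, label, lam, t_mid=50):
    """Interpolate two same-digit utterances in the latent space at step t_mid.

    lam is the linear interpolation weight on the second speaker.
    """
    scale = torch.sqrt(alpha_bars[t_mid])
    sigma = torch.sqrt(1.0 - alpha_bars[t_mid])
    # Forward process q(x_t | x_0): diffuse both utterances to level t_mid.
    xa = scale * x0_a + sigma * torch.randn_like(x0_a)
    xb = scale * x0_b + sigma * torch.randn_like(x0_b)
    x = (1.0 - lam) * xa + lam * xb
    # Reverse process from t_mid back to 0, conditioned on the digit label.
    for t in reversed(range(t_mid)):
        eps = model(x, torch.tensor([t]), label)
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```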