WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

Paper: arXiv

Code: GitHub

Authors: Po-chun Hsu, Hung-yi Lee

Abstract: In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions in the frequency domain. Because the flow-based model is heavily compressed, the proposed model requires far fewer computational resources than other waveform generation models during both training and inference; despite this compression, the post-filter maintains the quality of the generated waveform. Our PyTorch implementation can be trained using less than 8 GB of GPU memory and generates audio samples at a rate of more than 5000 kHz on an NVIDIA 1080Ti GPU. Furthermore, even when synthesizing on a CPU, we show that the proposed method is capable of generating 44.1 kHz speech waveforms 1.2 times faster than real time. Experiments also show that the quality of the generated audio is comparable to that of other methods.
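
To make the speed figures in the abstract easier to compare, here is a minimal Python sketch that converts between a generation rate (samples per second) and a real-time factor. The function names are hypothetical and are not part of the authors' code. The abstract does not state the sampling rate behind the GPU figure, so the 44.1 kHz value used for it below is only an assumption.

```python
def real_time_factor(samples_per_second: float, sample_rate_hz: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return samples_per_second / sample_rate_hz


def required_throughput(speedup: float, sample_rate_hz: float) -> float:
    """Samples per second needed to run `speedup` times faster than real time."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return speedup * sample_rate_hz


if __name__ == "__main__":
    # CPU claim from the abstract: 44.1 kHz audio, 1.2 times faster than real time.
    cpu_rate = required_throughput(1.2, 44_100)
    print(f"CPU throughput: about {cpu_rate / 1000:.1f} kHz")

    # GPU claim from the abstract: more than 5000 kHz.
    # ASSUMPTION: 44.1 kHz output; the abstract does not give the rate used.
    gpu_rtf = real_time_factor(5_000_000, 44_100)
    print(f"GPU real-time factor at 44.1 kHz (assumed): about {gpu_rtf:.0f}x")
```

Under these assumptions, the CPU claim corresponds to roughly 53 thousand samples per second, which is about two orders of magnitude below the reported GPU rate.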

Audio Samples

Audio Quality Comparison

1.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

2.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

3.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

4.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

5.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

6.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

7.
WG-WaveNet (Proposed)	WG-WaveNet (Proposed g-20)	WaveNet	ParallelWaveGAN

WaveGlow	SqueezeWave	Griffin-Lim	Ground Truth

High-Fidelity Audio Generation

1.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

2.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

3.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

4.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

5.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

6.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

7.
WG-WaveNet (Proposed w1600)	WG-WaveNet (Proposed w800)	WG-WaveNet (Proposed w800+g-20)	ParallelWaveGAN (w1600)

ParallelWaveGAN (w800)	Ground Truth (16k)	Ground Truth (22k)	Ground Truth (44k)

Text-to-Speech

Pseudo Inverse+Griffin-Lim	WG-WaveNet (Proposed)	WaveNet	ParallelWaveGAN

Ablation Study

1.
λ = 1, n = 3	λ = 1, n = 1	λ = 0, n = 1

g-20	WN-WaveNet	Ground Truth

2.
λ = 1, n = 3	λ = 1, n = 1	λ = 0, n = 1

g-20	WN-WaveNet	Ground Truth

16 kHz Audio Samples

WG-WaveNet	Ground Truth