Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge
Published
Annual Conference of the International Speech Communication Association (INTERSPEECH)
Abstract
Text-to-Speech (TTS) services that run on edge devices have
many advantages over cloud TTS, e.g., lower latency and
fewer privacy concerns. However, neural vocoders with low complexity and a small model footprint inevitably generate annoying sounds. This study proposes Bunched LPCNet2, an
improved LPCNet architecture that runs efficiently both in a
high-quality configuration for cloud servers and in a low-complexity configuration for low-resource edge devices. A single logistic distribution achieves computational efficiency, and insightful tricks
reduce the model footprint while maintaining speech quality. A
DualRate architecture, which generates speech at a lower sampling rate
from a prosody model, is also proposed to reduce maintenance
costs. The experiments demonstrate that Bunched LPCNet2
generates satisfactory speech quality with a model footprint of
1.1 MB while operating faster than real time on a Raspberry Pi 3B. Our
audio samples are available at https://srtts.github.io/bunchedLPCNet2
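
To illustrate why a single logistic output distribution can be cheap to sample from, here is a minimal Python sketch that draws a sample by inverting the logistic CDF. It is not the authors' implementation: the function names and the example values of mu and s are assumptions chosen for illustration. The point is only that such a distribution is described by two numbers (a location and a scale) rather than by one probability per quantization level, as a categorical output would require.

```python
import math
import random


def sample_logistic(mu, s, rng):
    """Draw one sample from Logistic(mu, s) by inverting its CDF."""
    # Keep u strictly inside (0, 1) so both logarithms stay finite.
    u = min(max(rng.random(), 1e-7), 1.0 - 1e-7)
    # Inverse CDF of the logistic distribution: mu + s * logit(u).
    return mu + s * (math.log(u) - math.log(1.0 - u))


def logistic_cdf(x, mu, s):
    """CDF of Logistic(mu, s), useful for checking the sampler."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical location and scale, standing in for values that a
    # network output layer might predict for one audio sample.
    mu, s = 0.1, 0.05
    draws = [sample_logistic(mu, s, rng) for _ in range(20000)]
    mean = sum(draws) / len(draws)
    print(f"empirical mean {mean:.3f} (target location {mu})")
```

In this sketch, each draw costs one uniform random number and two logarithms, regardless of how finely the output signal is quantized afterwards.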