Attention based on-device streaming speech recognition with large speech corpus
Published
IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)
Abstract
In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models. MoChA model performance finally surpassed that of previous conventional ASR systems through connectionist temporal classification (CTC) and cross entropy (CE) joint training, layer-wise pretraining, and minimum word error rate (MWER) training. In addition, we reduce the model size by around 4.7 times using a hyper low-rank approximation (LRA) method with minimal sacrifice in recognition accuracy. The memory footprint was further reduced to a quarter using 8-bit quantization, bringing the final model size down to around 40 MB. For the personalized on-device ASR system, we fused n-gram models with the results of the MoChA models and also used a weighted finite state transducer (WFST) based method to improve on-demand, user-context based speech recognition.
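To illustrate the kind of compression pipeline summarized above, the sketch below factorizes a weight matrix with a truncated SVD and stores the factors as 8-bit integers. This is not the paper's exact hyper-LRA procedure; the matrix shapes, rank, and helper names are assumptions chosen only for demonstration.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by U @ V with U (m x rank) and V (rank x n) via truncated SVD."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    U = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    V = vt[:rank, :]
    return U, V

def quantize_int8(x):
    """Symmetric 8-bit quantization: returns an int8 array and a float scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy example: a 1024 x 1024 float32 layer compressed with rank 128.
W = np.random.randn(1024, 1024).astype(np.float32)
U, V = low_rank_factorize(W, rank=128)
qU, sU = quantize_int8(U)
qV, sV = quantize_int8(V)

original_bytes = W.nbytes                  # float32 weights
compressed_bytes = qU.nbytes + qV.nbytes   # int8 low-rank factors
print(f"compression ratio: {original_bytes / compressed_bytes:.1f}x")
```

In practice the rank would be chosen per layer to balance accuracy against size; it is the combination of low-rank factorization and 8-bit storage that yields the overall footprint reduction the abstract reports.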