Improved Small-Footprint ASR-Based Solution for Open Vocabulary Keyword Spotting

Published

IEEE Access

Date

2024.07.01

Abstract

This article presents an improved solution to the open vocabulary keyword spotting task in which the keyword is given as text. The solution is based on an acoustic model and a unigram language model, architectures commonly used in automatic speech recognition. Such models can also be reduced in size and deployed to mobile devices while preserving the ability to transcribe extensive vocabulary data. Our improvements can be applied to any sequence-to-sequence model architecture that generates token probabilities. Furthermore, they do not increase latency, since they are applied to the acoustic model output. We propose three modifications: 1) leveraging multiple hypotheses generated by the beam search algorithm, 2) modifying the method of language model initialization, and 3) smoothing the acoustic model outputs. We evaluated these improvements on public test sets (MOCKS: Multilingual Open Custom Keyword Spotting Testset and Google Speech Commands) with an exemplary highly compressed acoustic model. Compared with the baseline solution, the equal error rate was reduced by 1.6–1.7% relative, depending on the test set.
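
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an equal error rate (EER) and a relative reduction between two systems might be computed from detection scores. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_reduction`, the threshold sweep, and the toy scores are all assumptions made for illustration, and the abstract does not state the exact evaluation protocol.

```python
def equal_error_rate(positive_scores, negative_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    positive_scores: detection scores for utterances containing the keyword.
    negative_scores: detection scores for utterances without the keyword.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False acceptance rate: negatives wrongly accepted.
        far = sum(s >= t for s in negative_scores) / len(negative_scores)
        # False rejection rate: positives wrongly rejected.
        frr = sum(s < t for s in positive_scores) / len(positive_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline_eer, improved_eer):
    """Relative (not absolute) reduction of the improved EER vs. the baseline."""
    return (baseline_eer - improved_eer) / baseline_eer


# Toy scores, invented purely for illustration (not data from the article).
toy_positives = [0.9, 0.8, 0.7, 0.4]
toy_negatives = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_positives, toy_negatives)
```

A "relative" reduction, as used in the abstract, divides the difference by the baseline EER rather than reporting the difference in percentage points.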