In the ever-growing landscape of technology, voice-intelligence solutions have evolved and reshaped how we interact with devices, applications, and services. From virtual assistants (like our Bixby) to automotive infotainment systems, the ability to communicate with technology through speech is becoming a seamless part of our daily lives. Currently, all leading voice assistants use keyword detection to initialize conversations. This solution requires the user to start the conversation with the voice assistant by uttering a keyword – the keyword phrase depends on the assistant and cannot be chosen by the user.
One of the groundbreaking innovations in this field is Custom Keyword Spotting. The ability to set a custom wake-up phrase offers personalization and adaptation to individual needs and preferences. So instead of starting our command with the keyword “Hi Bixby!” we could use “Hello Sunshine!” or something similar. Along with the tremendous benefits of introducing this solution come several difficulties in building a system that recognizes any keyword we set with high accuracy.
The biggest obstacle in training and evaluating custom keyword spotting models, or even comparing existing solutions, is the lack of public datasets with a large number of keywords and challenging negative examples for those keywords. Therefore, we decided to fill this gap by proposing a public testset built upon data from two public datasets: LibriSpeech [1] and Mozilla Common Voice [2]. We named it MOCKS: Multilingual Open Custom Keyword Spotting Testset.
Examining previous works on keyword spotting showed us that existing testsets were insufficient for several reasons: a small number of keywords, keywords being very short, only positive samples available, negative samples not being challenging, or testsets designed only for offline evaluation. To start building a suitable dataset, we defined a list of requirements that such a dataset should meet:
• Keywords should be selected from phrases with a phonetic transcription length p such that 6 ≤ p ≤ 16.
• The testset should contain positive and negative samples for each keyword.
• Similarity between phrases should be measured with normalized phonetic Levenshtein distance (see the sketch after this list).
• Keywords should be selected from words with at least two occurrences.
• Positive samples for each keyword should contain phrases whose phonetic transcription exactly matches the keyword’s.
• Negative samples for each keyword should contain the following:
o Recordings containing “similar phrases,” i.e., the phonetic distance between the keyword and the tested phrase is in the interval (0.0, 0.5).
o Recordings containing “different phrases,” i.e., the phonetic distance between the keyword and the tested phrase is in the interval (0.5, inf).
• The testset should allow for evaluating performance in challenging conditions and both offline and online modes.
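To make the distance-based selection criteria above concrete, here is a minimal sketch in Python of how a candidate recording could be assigned to the positive, “similar,” or “different” bucket. The phoneme transcriptions, the helper names (phonetic_distance, categorize), the normalization by keyword length, and the handling of the 0.5 boundary are illustrative assumptions, not the exact procedure used to build MOCKS.

```python
# A minimal sketch, assuming ARPAbet-style phoneme lists as input.
# Helper names, normalization by keyword length, and the 0.5 boundary
# handling are illustrative choices, not the exact MOCKS procedure.

def levenshtein(a, b):
    """Plain edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def phonetic_distance(keyword_phones, phrase_phones):
    """Levenshtein distance normalized by the keyword's transcription length."""
    return levenshtein(keyword_phones, phrase_phones) / len(keyword_phones)

def categorize(keyword_phones, phrase_phones):
    """Assign a candidate recording to the positive / similar / different bucket."""
    d = phonetic_distance(keyword_phones, phrase_phones)
    if d == 0.0:
        return "positive"           # identical phonetic transcription
    if d < 0.5:
        return "negative-similar"   # distance in (0.0, 0.5)
    return "negative-different"     # distance in (0.5, inf)

# Made-up transcriptions for illustration:
keyword = ["HH", "AH", "L", "OW", "S", "AH", "N", "SH", "AY", "N"]  # "hello sunshine"
candidate = ["HH", "AH", "L", "OW", "S", "AH", "N", "D", "EY"]      # "hello sunday"
print(categorize(keyword, candidate))  # -> "negative-similar"
```

Under these assumptions, a candidate such as “hello sunday” lands in the “similar phrases” bucket for the keyword “hello sunshine,” since only a few phonemes differ, which makes it a challenging negative example.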
Table 1. Properties of MOCKS subsets based on LibriSpeech and MCV
The table above shows the properties of MOCKS subsets based on LibriSpeech (LS) and Mozilla Common Voice (MCV). Our dataset supports five languages: English (en), German (de), Spanish (es), French (fr), and Italian (it). To clarify the difference between the offline and online scenarios: the online scenario targets streaming models, which process audio as it arrives and may trigger before the clip ends, whereas in the offline scenario the whole audio clip is given and needs to be classified correctly. We have kept LibriSpeech’s original split into the “clean” scenario, which includes less challenging data, and the “other” scenario, which contains more challenging, unclear speech. Additionally, we provide other statistics of the created dataset, such as the distribution of keyword lengths and the gender distribution.
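As an illustration of the offline/online distinction, the sketch below contrasts the two evaluation modes with a toy detector. The interface (score_clip, new_stream, push), the energy-based scoring stand-in, and the chunk size are assumptions made for this example; they are not part of the MOCKS tooling or of any particular keyword-spotting model.

```python
import numpy as np

# Illustrative only: an energy-based stand-in plays the role of a keyword
# detector; the score_clip / new_stream / push interface is an assumption
# made for this sketch rather than the MOCKS evaluation API.

class ToyDetector:
    def score_clip(self, audio):
        # Offline mode: one score for the whole clip.
        return float(np.mean(audio ** 2))

    def new_stream(self):
        return ToyStream()

class ToyStream:
    def __init__(self):
        self.best = 0.0

    def push(self, chunk):
        # Online mode: update the running score with each incoming chunk.
        self.best = max(self.best, float(np.mean(chunk ** 2)))
        return self.best

CHUNK = 1600  # 100 ms at 16 kHz; an arbitrary choice for the example

def offline_decision(model, audio, threshold=0.5):
    """Offline: the whole clip is available before a single decision is made."""
    return model.score_clip(audio) >= threshold

def online_decision(model, audio, threshold=0.5):
    """Online (streaming): audio arrives chunk by chunk and the detector
    may accept the keyword before the clip ends."""
    stream = model.new_stream()
    for start in range(0, len(audio), CHUNK):
        if stream.push(audio[start:start + CHUNK]) >= threshold:
            return True
    return False

audio = np.random.randn(16000).astype(np.float32)  # 1 s of fake audio
model = ToyDetector()
print(offline_decision(model, audio), online_decision(model, audio))
```

The point is only the data flow: in offline mode the model sees the entire clip at once, while in online mode it must accumulate evidence chunk by chunk and may fire before the clip ends.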
Figure 1. Distribution of keyword lengths in MOCKS subsets
Figure 2. Gender distribution in MCV and MCV-originated MOCKS subsets
In this post, we have defined custom keyword spotting and described one of the problems that arise when preparing systems for custom wake-up. We introduced our work on the MOCKS dataset, described its key characteristics, and emphasized the versatility of our proposal.
Our hard work has not gone unnoticed: our dataset publication “MOCKS 1.0: Multilingual Open Custom Keyword Spotting Testset” was accepted to this year’s edition of the INTERSPEECH conference, one of the most respected conferences in the field of speech recognition. This is a great honor for us and confirms that the problem we defined is real and significant for keyword spotting.
[1] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015). Brisbane, Australia: IEEE, 2015, pp. 5206–5210.
[2] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in International Conference on Language Resources and Evaluation, 2019.
[3] T. Bluche and T. Gisselbrecht, “Predicting Detection Filters for Small Footprint Open-Vocabulary Keyword Spotting,” in Proc. Interspeech 2020, 2020, pp. 2552–2556.
[4] X. Qin, H. Bu, and M. Li, “Hi-mia: A far-field text-dependent speaker verification database and the baselines,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7609–7613.
[5] B. Kim, M. Lee, J. Lee, Y. Kim, and K. Hwang, “Query-by-example on-device keyword spotting,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 532–538.
[6] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, “Region proposal network based small-footprint keyword spotting,” IEEE Signal Processing Letters, vol. 26, no. 10, pp. 1471–1475, 2019.
[7] L. Lugosch, S. Myer, and V. S. Tomar, “Donut: CTC-based query-by-example keyword spotting,” arXiv preprint arXiv:1811.10736, 2018.