Text-to-speech (TTS) is a subfield of artificial intelligence (AI) that converts text into audio. It draws on acoustics, linguistics, digital signal processing, and computer science. TTS technology gives machines the ability to produce speech, a crucial component of human-computer interaction.
Samsung Electronics has been conducting research in the TTS field for over a decade and has developed on-device TTS applications for 36 languages. Users can easily download the app from the Samsung App Store and enjoy safe, enhanced technological benefits anytime, anywhere, even without an internet connection. Samsung’s in-house TTS solution has also been integrated into the Bixby Voice Assistant, providing a natural and smooth voice experience.
In addition, Samsung Electronics continuously pursues technological innovation and has published several papers on TTS at international academic conferences in the speech field, such as the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), INTERSPEECH, and the Automatic Speech Recognition and Understanding (ASRU) workshop. In the past few years, Samsung ranked 1st in the Blizzard Challenge 2020 and 2nd in the Voice Conversion Challenge international speech synthesis competition in 2021. This year, the Language Intelligence Team at Samsung R&D Institute China – Beijing (SRC-B) and the AI Team at MX (Mobile eXperience) Business participated as one Samsung Team in the Blizzard Challenge 2023.
The Blizzard Challenge was established in 2005 and is the largest and most significant competition in speech synthesis globally. It offers an opportunity to better understand and compare research techniques in building corpus-based speech synthesizers on the same dataset. It attracts numerous top-level scientific research institutions and first-class enterprises worldwide every year. More than 20 teams from around the world participated in Blizzard Challenge 2023.
The Blizzard Challenge 2023 comprised two tasks. The first was to develop a French TTS system for audiobook reading from 50 hours of speech data, and the second was to develop a TTS system for a specified speaker from two hours of speech data. The organizers subjectively evaluated the submitted audio files using mean opinion scores (MOS) along three dimensions: intelligibility, similarity to the speaker’s pronunciation, and quality.
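As a rough illustration of how a mean opinion score is computed, the sketch below averages listener ratings for one system and reports a normal-approximation 95% confidence interval. The 1-to-5 rating scale, the function name `mean_opinion_score`, and the sample ratings are assumptions made for this example; they are not taken from the organizers' evaluation protocol.

```python
from math import sqrt
from statistics import stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of listener ratings.

    Assumes each rating is on a 1-5 scale (an assumption of this sketch).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")

    n = len(ratings)
    mos = sum(ratings) / n  # arithmetic mean of all listener ratings

    # Normal-approximation confidence interval; needs two or more ratings.
    half_width = 1.96 * stdev(ratings) / sqrt(n) if n > 1 else 0.0
    return mos, half_width


# Hypothetical ratings from four listeners for one synthesized utterance.
mos, ci = mean_opinion_score([4, 5, 3, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, a score like this would be computed separately for each evaluated dimension and each system, so that systems can be compared on the same listeners and test sentences.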
The Samsung team managed the competition as a mini-project: it established clear project goals at an early stage, synchronized progress weekly, and worked through the tasks together. After two months of collaboration, the team achieved excellent results, ranking 1st in intelligibility (for both general text and homograph handling) and 3rd in similarity (in both the 50-hour task and the 2-hour task). These results demonstrate that the Samsung team can build highly intelligible, personalized TTS models from a small amount of data and quickly expand to new languages.
Members of the Samsung Team at SRC-B Language Intelligence Team and MX AI Team