[SR Talks] ① Interview with a Speech and Language AI Expert at Samsung Research AI Center

Q: Please briefly introduce yourself, Samsung Research AI Center, and the kind of work that goes on there. What projects are you working on?

I am Hoon-Young Cho, a vice president who joined the Language & Voice team of Samsung Research AI Center in February 2022. We are working on advanced technologies in the areas of natural language processing, speech and audio processing, visual processing, data research, and several major applications such as digital avatars and large-scale language modeling.

This year, I am focusing on two research areas: text-to-speech (TTS) and audio processing technologies for earbuds and mobile devices. For TTS, we have developed highly efficient, small-footprint neural TTS technology, which has been successfully deployed to Galaxy phones, watches, and Bixby. The key feature of our TTS technology is that it generates artificial intelligence (AI) voices for multiple speakers and multiple languages within a single end-to-end deep neural network. Additionally, it can be easily personalized with only a few recorded sentences. Our ongoing work includes the generation of highly expressive and natural AI voices that can convey various emotions, dialects, and the like. In the future, we will try exciting ideas on robots and avatars. A robot’s voice should vary according to its intelligence level—for example, it may hesitate or laugh while talking; this is something we can try later.

Some of the major research topics related to earbuds and other wearable devices are noise cancellation and intelligent hearing technology. With noise cancellation, users can enjoy much clearer music in noisy environments. Sound recognition technology offers users new experiences; for example, it has been applied to robot vacuum cleaners to detect dogs barking while users are away and send the users video recordings of their pets. Currently, our main challenge is intelligent hearing when earbud wearers are in noisy environments, such as crowded restaurants. It would be great if users could choose what they want to focus on and suppress all the other sounds.

Q: Please tell me about the importance of your research field or technology.

These days, the meaning of AI is very broad. However, the first image that pops up in people’s minds when they hear the words “artificial intelligence” would be a humanoid robot that converses fluently with humans. At the core of this are speech recognition, language understanding and generation, and speech synthesis. As such, spoken language technology is the most essential within the entire AI research field.

One of Samsung’s great advantages for AI researchers is that it offers a variety of devices such as refrigerators, laundry machines, air purifiers, televisions, robots, mobile devices, and wearables, to name a few. These individual devices will become more intelligent and connected to each other. Groups of devices will share their context, and users may enjoy the new value created by cooperation between multiple home consumer electronics (CE) devices in the future. Robots will play a central role in the home environment. Unlike other devices, robots can move around freely at home while checking security, cleaning rooms, communicating with other devices, or even controlling them if necessary. They can read exciting books to children in any language they want and can even talk to elderly people living alone on any topic. They will always be there for you, helping you with their natural voices, and they’ll evolve on their own.

Wearable devices are also a very interesting area. With smart glasses on, you will be able to hear the voice of the person sitting in front of you much more clearly even in a noisy restaurant by utilizing lip and face information together with a microphone array–based speech-boosting technique.

Q: You recently joined Samsung Research. Please tell me about your previous research history and main achievements.

From 2006 to 2011, I was a senior researcher working on a speech recognition system at a government-funded research institute. One day, one of my friends visited me and asked if I could develop a music identification system based on audio fingerprints. Over the course of several weekends, I implemented a binary audio fingerprint extraction and music search system. It worked well in noisy acoustic environments, and based on that technology we established a startup company, of which I became a cofounder. We applied the same technology to searching for video clips online, tracking illegal multimedia content, and automatic background-music monitoring, and the company was eventually acquired by a major telecommunications company in Korea at the end of 2011.
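For readers curious how such a system works, below is a minimal, illustrative sketch of the general binary audio fingerprint idea—hashing the sign of band-energy differences across time and frequency into a robust bit pattern. This is a common textbook approach, not the actual system described in the interview; all function names and parameters here are illustrative assumptions.

```python
import numpy as np

def binary_fingerprint(samples, frame_size=2048, hop=1024, n_bands=33):
    """Toy binary audio fingerprint (illustrative, not the real system).

    Each analysis frame contributes (n_bands - 1) bits, derived from the
    sign of band-energy differences across frequency and across time.
    The sign pattern is fairly stable under additive noise, which is why
    this style of fingerprint works in noisy environments.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size] * np.hanning(frame_size)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Split the power spectrum into n_bands bands and sum each band.
        bands = np.array_split(spectrum, n_bands)
        frames.append(np.array([b.sum() for b in bands]))
    energies = np.array(frames)  # shape: (n_frames, n_bands)
    if len(energies) < 2:
        return np.zeros((0, n_bands - 1), dtype=np.uint8)
    d = np.diff(energies, axis=1)      # difference across frequency bands
    bits = (d[1:] - d[:-1]) > 0        # difference of that across time
    return bits.astype(np.uint8)

def hamming_similarity(fp_a, fp_b):
    """Fraction of matching bits between two fingerprints (1.0 = identical)."""
    n = min(len(fp_a), len(fp_b))
    return float((fp_a[:n] == fp_b[:n]).mean())
```

A lookup system would index these bit patterns (e.g., 32-bit words per frame) in a hash table and match a noisy query against the database by Hamming distance.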

In 2016, I moved to the AI center of a major online video game company and began to set up a speech AI lab. Within the company, there was strong demand for AI-generated voices: hundreds of voice actors record voices for thousands of game characters all year, and the voice styles are very dynamic and extremely expressive. In 2017, we began working on deep learning–based TTS technology. We broadcast our AI voices in the company’s restaurants and rest areas, introducing new books using the AI voices of company employees, the regional dialects of voice actors, and the like. The reactions were extremely positive and very motivating for the researchers. The technology was later applied to a K-pop fan service called UNIVERSE in early 2021, where artists could communicate with their fans using their own AI voices.

Q: What is your team's recent achievement that you want to show off?

I would like to briefly introduce our ‘SR Translate’ service.
The SR Translate beta service is a web translation service developed by my team, the Language & Voice Team. It is designed to provide high-quality translations while letting users check translation quality through an automatic reverse-translation (back-translation) function, which builds trust in the results before users rely on them. It also provides more convenient and intuitive translation editing and allows users to efficiently reuse their own edited translations. We will continue to provide convenient, excellent, customer-centered translation technology and services through SR Translate, so please keep watching us with high expectations.
*Service link: https://translate.samsung.com/
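The back-translation check described above can be sketched in a few lines: translate the source into the target language, translate the result back, and score how close the round trip lands to the original as a rough confidence signal. The `translate` function below is a hypothetical stand-in for any machine-translation API—SR Translate's internals are not public—and the similarity metric is a simple character-level ratio chosen for illustration.

```python
from difflib import SequenceMatcher

def back_translation_check(source, translate, src_lang, tgt_lang):
    """Round-trip a sentence through a translation function and score
    how similar the back-translation is to the original.

    `translate(text, from_lang, to_lang)` is a hypothetical callable;
    plug in any MT API. Returns (forward, back, similarity in [0, 1]).
    """
    forward = translate(source, src_lang, tgt_lang)
    back = translate(forward, tgt_lang, src_lang)
    score = SequenceMatcher(None, source.lower(), back.lower()).ratio()
    return forward, back, score
```

A low score does not prove the forward translation is wrong (paraphrases lower it too), but it flags sentences worth a human look—which is exactly the trust-building role back-translation plays in the service.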

Q: What is your vision for the future, and what goal would you want to achieve?

I strongly hope that our researchers work together happily, learning from each other by making use of and exploring the vast amount of knowledge and experiences possessed by each person. From my past experiences, my conclusion is simple: “Researchers become the most creative and productive when they are happy at work.” As such, one of my goals is to create that kind of research atmosphere.

As a speech and language AI researcher, I think people need more natural, humanlike interaction than current technology provides. The AI should be able to understand the context and history of a conversation and learn more about its user by using all available multimodal information at once. I dream of having a personal avatar on my phone, or a robot, that can really interact with me as a friend.