Samsung Research America (SRA) was founded in 1988 in the Bay Area. SRA is currently headquartered in Mountain View, at the heart of Silicon Valley, with offices across the United States and Canada. SRA works at the forefront of cutting-edge technologies, including artificial intelligence (AI), 5G/6G, digital health, enterprise security, and mobile innovation. SRA also plays a key role in providing the infrastructure that supports Samsung’s open innovation and university collaboration activities. Through this work, SRA aims to create new businesses and develop core technology that enhance the competitiveness of Samsung products and shape the future.
At the SRA AI Center, we focus on building the next-generation human assistant powered by voice interaction. I am Vijay Srinivasan, Head of the Knowledge and Dialog Lab within the AI Center at SRA. Our team has two major, complementary research themes. The first is language AI, which focuses on problems such as spoken natural language understanding, natural language generation, and multi-turn dialog systems. The second is knowledge, which includes (a) exploring how to leverage external world/domain knowledge and personal knowledge in deep learning-based language AI solutions, such as natural language generation for proactive voice interactions, and (b) knowledge discovery and mining from various sources, including phone usage and sensor data, visual documents, and user voice interactions with the phone.
Natural language understanding (NLU) is a key focus area of our team and is critical to improving the voice interaction experience on Samsung devices. For example, consider a user command such as “Turn off the living room lights when I leave home.” While such a command is easy for a human to understand, an AI system has to break the sentence down and detect the overall intent (“turn off device”) along with multiple semantic elements (“slots”): which device to turn off (“light”), where the device is located (“living room”), the condition upon which the lights must be turned off (“leave location”), and what location is associated with this condition (“home”). Understanding and acting on such complex utterances accurately can significantly improve the end-user experience with our Bixby voice assistant. Our team’s recent focus has been on developing a novel and effective joint NLU model that solves this intent and slot detection problem for Bixby.
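To make this breakdown concrete, here is a minimal, hypothetical sketch (in Python) of the kind of structured output a joint NLU model produces for this command. The intent and slot names are illustrative assumptions, not Bixby’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class NLUResult:
    intent: str
    slots: dict[str, str]  # slot type -> surface value

# Hypothetical parse of "Turn off the living room lights when I leave home."
parse = NLUResult(
    intent="turn_off_device",
    slots={
        "device": "lights",
        "device_location": "living room",
        "trigger_condition": "leave location",
        "condition_location": "home",
    },
)

# Equivalent token-level view: joint NLU models typically predict one
# intent per utterance plus a BIO tag per token for the slots.
tokens = ["Turn", "off", "the", "living", "room", "lights",
          "when", "I", "leave", "home"]
bio_tags = ["O", "O", "O", "B-device_location", "I-device_location",
            "B-device", "O", "O", "B-trigger_condition", "B-condition_location"]
assert len(tokens) == len(bio_tags)
```

The “joint” aspect is that a single model predicts the intent and all slot tags together, letting the two decisions inform each other.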
Knowledge discovery and mining from heterogeneous data sources is another key theme of our team. For example, knowledge about user activities and behavior patterns mined from device usage and sensor data can be used to provide smart reminders and a more personalized and intuitive UI on devices, offer recommendations for content and device actions, and preload content or apps to reduce latency. As voice assistants grow increasingly intelligent and human-like, knowledge about the user can also be created from the voice interactions that users have with personal assistants. A key focus of our team is developing completely on-device knowledge discovery approaches that do not leak the user’s private knowledge outside the device.
Finally, the two themes of language AI and knowledge discovery come together naturally in the third focus of our team: natural language generation (NLG) and, more broadly, proactive multi-turn dialog systems, including conversational state tracking and response generation. We explore how knowledge about the user, together with domain/world knowledge, can be used effectively to generate responses either proactively or in reply to user utterances. NLG and dialog have been active focus areas of our team since the start of this year.
We have had many rewarding moments and achievements across our three focus areas.
In the knowledge discovery and mining theme, a major highlight that comes to mind is our work on on-device user behavior pattern mining on smartphones (MobileMiner). MobileMiner can mine user behavior patterns entirely on the device in an efficient manner, enabling a plethora of personalization applications. This work received a best paper nomination at ACM UbiComp, a prestigious top-tier conference. Another highlight was applying our research on indoor location sensing and place clustering to the Samsung Pay product to provide just-in-time reminders for Samsung Pay users to use our payment service. It was very rewarding to see our research applied in the product and quickly improve the number of monthly active users for Samsung Pay.
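MobileMiner’s actual algorithm is not described in this overview; as a rough illustration of on-device behavior pattern mining, the sketch below counts frequently co-occurring device events within usage sessions. The event names and support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_cooccurrence_patterns(sessions, min_support=3):
    """Count event pairs that co-occur within a usage session.

    `sessions` is a list of sets of device events (contextual signals,
    app launches) observed in the same time window. Returns patterns
    seen at least `min_support` times.
    """
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(session), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical on-device event log, one set per time window.
sessions = [
    {"place:home", "time:evening", "app:tv_remote"},
    {"place:home", "time:evening", "app:tv_remote"},
    {"place:gym", "time:morning", "app:music"},
    {"place:home", "time:evening", "app:tv_remote"},
]
print(mine_cooccurrence_patterns(sessions))
# ('app:tv_remote', 'place:home'): 3 -> "opens the TV remote app at
# home in the evening", a pattern usable for preloading or reminders.
```

Because the log never leaves the device, mining of this kind preserves privacy while still enabling personalization.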
In addition to these achievements, other highlights in our knowledge discovery theme include our research on power-efficient sensor-based activity recognition on smartphones and wearable devices, behavior pattern compression/summarization from device usage and sensor data, and user behavior prediction using contextual sequence prediction models. Most recently, given our focus on language AI, we have been working on on-device knowledge discovery from complex visual documents that the user may interact with, such as event posters or product webpages, and published this work at a top-tier knowledge management conference (CIKM). This work has applications in “smart scanning”: scanning documents you see in the real world, converting them into structured knowledge, and absorbing them into the digital world. For example, you can scan an event poster and automatically add it to your digital calendar based on the extracted event title, location, time, and other details.
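The document-understanding models themselves are beyond the scope of this overview; the snippet below is a deliberately naive sketch of only the final step, turning already-OCR’d poster text into a structured calendar entry. The field names and regex heuristics are illustrative assumptions, not our published method.

```python
import re

def extract_event(ocr_text: str) -> dict:
    """Naive heuristic extraction of event fields from OCR'd poster text.

    Real visual-document models use layout and image features; this
    sketch only illustrates the structured output, not the method.
    """
    lines = [l.strip() for l in ocr_text.splitlines() if l.strip()]
    date = re.search(
        r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2}\b",
        ocr_text, re.I)
    time = re.search(r"\b\d{1,2}(?::\d{2})?\s*(?:AM|PM)\b", ocr_text, re.I)
    loc = re.search(r"\bat\s+(.+)", ocr_text, re.I)
    return {
        "title": lines[0] if lines else None,  # assume the title is the top line
        "date": date.group(0) if date else None,
        "time": time.group(0) if time else None,
        "location": loc.group(1) if loc else None,
    }

poster = "Jazz Night\nMarch 14, 7:30 PM\nat Mountain View Community Center"
print(extract_event(poster))
# {'title': 'Jazz Night', 'date': 'March 14', 'time': '7:30 PM',
#  'location': 'Mountain View Community Center'}
```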
In the NLU theme, our recent focus has been on improving the accuracy and interpretability of joint intent and slot labeling NLU models for the Bixby voice assistant. We are working with the Bixby team to transition Bixby’s NLU from the existing random forest-based model to a state-of-the-art model built on a pretrained language model. Existing joint NLU models compute features collectively for all slot types and have no way to inherently explain their intent or slot-filling decisions. We proposed a novel approach that learns to generate intent- and slot-type-specific self-attentions, providing inherent explanations of intent and slot-filling decisions for the first time in a joint NLU model while also improving accuracy. This line of research was very rewarding, as we were able to publish it at a top-tier NLP conference (EMNLP). We were also able to push some of our research contributions into the production code base, reducing the error rate of the Bixby NLU engine compared to the existing model in production.
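Our published architecture is not reproduced here, but the PyTorch sketch below conveys the general idea of a joint model whose intent decision is pooled through label-specific attention, so the attention weights double as a built-in explanation. The dimensions, encoder, and attention scheme are all simplifying assumptions, not the EMNLP paper’s design.

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Sketch: shared encoder, one intent head, one slot-tagging head.

    Each intent has a learned query vector; its attention over the
    tokens both pools the sentence and shows which words drove the
    intent decision (a crude analogue of label-specific self-attention).
    """
    def __init__(self, vocab_size, hidden, n_intents, n_slot_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.intent_queries = nn.Parameter(torch.randn(n_intents, hidden))
        self.slot_head = nn.Linear(hidden, n_slot_labels)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))               # (B, T, H)
        scores = self.intent_queries @ h.transpose(1, 2)      # (B, I, T)
        attn = torch.softmax(scores / h.size(-1) ** 0.5, -1)  # explanation weights
        pooled = attn @ h                                     # (B, I, H)
        intent_logits = (pooled * self.intent_queries).sum(-1)  # (B, I)
        slot_logits = self.slot_head(h)                       # (B, T, n_slot_labels)
        return intent_logits, slot_logits, attn

model = JointNLU(vocab_size=1000, hidden=64, n_intents=5, n_slot_labels=9)
intent_logits, slot_logits, attn = model(torch.randint(0, 1000, (2, 10)))
```

Training minimizes the sum of the intent and slot cross-entropy losses, which is what makes the model “joint.”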
In the NLG and proactive multi-turn dialog theme, we have been exploring knowledge-augmented NLG techniques since earlier this year, particularly for automatic, information-seeking, curiosity-driven question generation given a single topic sentence or query. This work was published at AAAI (the conference of the Association for the Advancement of Artificial Intelligence), and we plan to use it to automatically generate curiosity-driven question-answer pairs that proactively start a conversation with the user based on the current context.
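As a task illustration only, the snippet below shows how one might prompt a generic sequence-to-sequence model to produce a question from a topic sentence. The prompt format is an assumption, and t5-small is used here only so the snippet runs out of the box; it is not fine-tuned for question generation, so a real system would swap in a checkpoint trained for this task. None of this is our actual AAAI model.

```python
from transformers import pipeline

# Off-the-shelf seq2seq model, used purely for a runnable illustration
# of the task setup (topic sentence in, question out).
qg = pipeline("text2text-generation", model="t5-small")

topic = "The James Webb Space Telescope observes the universe in infrared."
outputs = qg(f"generate question: {topic}",
             num_beams=5, num_return_sequences=3, max_length=48)
for out in outputs:
    print(out["generated_text"])

# A proactive assistant could pair such generated questions with
# retrieved answers to start a contextually relevant conversation.
```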
For AI to match human-level intelligence in the future, we expect AI agents such as voice assistants to proactively generate intelligent and personalized responses, sometimes even without a user query or command. A key focus area for our team going forward is personalized, knowledge-augmented natural language response generation, both in reply to a user command or question and proactively in response to the user’s current context. Our NLG research also ties in very effectively with our past work on knowledge discovery: a central question is how to incorporate personal knowledge discovered about the user, along with world/domain knowledge, to generate natural responses to the end user. Our team is working on advanced personalized and world knowledge-augmented NLG models that can make proactive voice assistant scenarios possible. Another closely related area we plan to work on is multi-turn dialog, which requires careful tracking of the conversation state over time. Samsung’s unique device ecosystem makes this problem more challenging, since we need to track both the conversational state and the device state to generate an effective response to the end user.
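To illustrate that last point, here is one hypothetical way to represent a joint conversational-plus-device state; the field names are invented for illustration and are not a Bixby data structure.

```python
from dataclasses import dataclass, field

@dataclass
class DeviceState:
    """Snapshot of relevant device-ecosystem state (hypothetical fields)."""
    lights_on: dict[str, bool] = field(default_factory=dict)  # room -> on/off
    active_device: str = "phone"                               # phone, watch, TV, ...

@dataclass
class DialogState:
    """What a multi-turn dialog system must track in a device ecosystem:
    the conversation so far plus the state of the devices themselves."""
    history: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    slots: dict[str, str] = field(default_factory=dict)           # values carried across turns
    device: DeviceState = field(default_factory=DeviceState)

    def update(self, speaker: str, text: str, new_slots: dict[str, str]):
        self.history.append((speaker, text))
        self.slots.update(new_slots)  # later turns can reuse earlier slot values

state = DialogState(device=DeviceState(lights_on={"living room": True}))
state.update("user", "Turn off the living room lights when I leave home.",
             {"device": "lights", "device_location": "living room"})
state.update("user", "Actually, the bedroom ones too.",
             {"device_location_2": "bedroom"})
# Resolving "the bedroom ones" requires both the conversation history and
# the device state, which is why the two must be tracked together.
```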