Enabling Device Control Planning Capabilities of Small Language Model

By Sudipta Paul, Samsung Research America
By Lingyu Zhang, Samsung Research America
By Yilin Shen, Samsung Research America
By Hongxia Jin, Samsung Research America

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. Its technical program presents the latest developments in research and technology and attracts thousands of professionals.

In this blog series, we introduce our research papers presented at ICASSP 2024. Here is the list:

#1. MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-To-Speech System via Disentangled Style Tokens (Samsung Research)

#2. Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis (Samsung Research)

#3. FSPEN: An Ultra-Lightweight Network for Real Time Speech Enhancement (Samsung R&D Institute China-Beijing)

#4. Enabling Device Control Planning Capabilities of Small Language Model (Samsung Research America)

#5. Dynamic Video Frame Interpolator with Integrated Difficulty Pre-Assessment (Samsung R&D Institute China-Nanjing)

#6. Object-Conditioned Bag of Instances for Few-Shot Personalized Instance Recognition (Samsung R&D Institute United Kingdom)

#7. Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition (Samsung R&D Institute India-Bangalore)


Wouldn't it be nice if you could say ‘the room is too bright’ and an automated system could plan to close the window and/or dim the lights to make the room less bright? To perform such a device control task, the system must not only understand the intent of the utterance but also devise a multi-step plan to achieve the intended goal. Moreover, the plan has to be adjusted to the home configuration, whose set of available devices varies, which makes the task even more complex.

With proper prompting techniques and in-context examples [1], Large Language Models (LLMs) such as GPT-3 [2] and GPT-4 [3] can perform these planning tasks across different home configurations. However, these large models, with billions of parameters, cannot be deployed on devices with limited memory. One possible solution is to deploy small language models, but their planning capabilities are weak, which leads them to infer wrong plans.

In this blog post, we focus on how to enable smart home device control planning in small language models across different home configurations, without using manually annotated device control data. Motivated by the strong capabilities of LLMs, our objective is to develop an automated approach that transfers the device control planning capabilities of an LLM to a small language model. Toward this goal, we use GPT-3 to automatically synthesize instruction-devices-plan triplets for the device control task in a self-regulatory manner. We generate base plans and contrastive plans by systematically altering the home configuration for the same instruction. To ensure that the small language model can adjust its plans to different home configurations, we finetune it with both base plans and contrastive plans.

Our Contributions

We propose an approach to equip a small language model with device control planning capabilities. Our main contributions are listed below:

      • We propose a novel approach to equip a small language model with home configuration-based device control planning
        capability without any manually annotated data.
      • We propose a novel method to generate contrastive plans for the device control task and show that leveraging these
        contrastive plans can make the model more sensitive to different home configurations.
      • We introduce a new smart home device control dataset with instruction-devices-plan triplets. We use this dataset to
        empirically evaluate the performance of our learned small language model on the device control planning task.


Setup: We consider a standard room equipped with multiple devices that can be controlled by an AI assistant. Let 𝒟 be the set of all possible devices and 𝒰 the set of all possible instructions/utterances. For a user instruction u ∈ 𝒰 and a set of available devices D ⊆ 𝒟, our objective is to learn a small language model that can come up with the n necessary device control steps S = {s_1, ..., s_n} to achieve the intended goal. We also assume that no set of manually annotated (u, D, S) triplets is available to finetune the model.
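To make the setup concrete, one instruction-devices-plan triplet could be represented as below. This is only a minimal sketch: the class name `Triplet`, the constant `ALL_DEVICES`, and the device names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical stand-in for the set of all possible devices.
ALL_DEVICES: FrozenSet[str] = frozenset(
    {"light", "window blind", "speaker", "thermostat"}
)


@dataclass(frozen=True)
class Triplet:
    instruction: str         # u: the user utterance
    devices: FrozenSet[str]  # D: devices available in this home configuration
    plan: Tuple[str, ...]    # S: the ordered device control steps

    def __post_init__(self) -> None:
        # The available devices must be a subset of all possible devices.
        unknown = self.devices - ALL_DEVICES
        if unknown:
            raise ValueError(f"unknown devices: {sorted(unknown)}")


example = Triplet(
    instruction="the room is too bright",
    devices=frozenset({"light", "window blind"}),
    plan=("close the window blind", "dim the light"),
)
```

Keeping the available devices as an explicit field means the same instruction can appear in several triplets with different home configurations.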

Figure 1. (a) Automated instruction generation using GPT-3 and filter in the loop. (b) Base plan generation for all instructions (considering all possible devices are available)

Diverse Instruction Generation: First, we generate a large set of diverse instructions leveraging GPT-3. Similar to Self-Instruct [4], the process starts with a seed set of manually generated instructions (30 samples). A pool of generated instructions is maintained throughout the process. At each step, we randomly sample 6 instructions from the seed set and 2 instructions from the pool of generated instructions as in-context examples. The LLM then generates new instructions based on these examples. A generated instruction is added to the pool only if its ROUGE-L similarity with every existing instruction is less than 0.7.
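The sampling and filtering loop can be sketched as follows. Several parts are assumptions made for illustration: `generate` is a hypothetical callable standing in for the GPT-3 call, the ROUGE-L score is a simplified F1 over lowercased whitespace tokens (no stemming), and a candidate is compared against both the seed set and the pool.

```python
import random
from typing import Callable, List


def lcs_length(a: List[str], b: List[str]) -> int:
    # Dynamic program for the length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    # Simplified ROUGE-L F1 on lowercased whitespace tokens.
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, existing: List[str], threshold: float = 0.7) -> bool:
    # Keep a candidate only if it is below the threshold against every instruction.
    return all(rouge_l(candidate, e) < threshold for e in existing)


def grow_pool(
    seed: List[str],
    generate: Callable[[List[str]], List[str]],  # hypothetical LLM wrapper
    rounds: int,
    rng: random.Random,
) -> List[str]:
    pool: List[str] = []
    for _ in range(rounds):
        # 6 in-context examples from the seed set, 2 from the generated pool.
        examples = rng.sample(seed, min(6, len(seed)))
        examples += rng.sample(pool, min(2, len(pool)))
        for candidate in generate(examples):
            if is_novel(candidate, seed + pool):
                pool.append(candidate)
    return pool
```

With a stub generator that always proposes the same two instructions, a near-duplicate of a seed instruction is rejected and a new instruction is added only once, however many rounds are run.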

Base Plan Generation: For each instruction in the pool, we generate a device control plan assuming that all devices are available (i.e., D = 𝒟, the full set of possible devices). We use a few manually annotated samples as in-context examples.
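A few-shot prompt for base plan generation might be assembled along these lines. The prompt layout, the field labels, and the function names `format_block` and `build_prompt` are hypothetical; the post does not give the actual prompt sent to GPT-3.

```python
from typing import Optional, Sequence, Tuple

# (instruction, available devices, plan steps)
Example = Tuple[str, Sequence[str], Sequence[str]]


def format_block(
    instruction: str,
    devices: Sequence[str],
    plan: Optional[Sequence[str]] = None,
) -> str:
    lines = [
        f"Instruction: {instruction}",
        "Available devices: " + ", ".join(devices),
        "Plan:",
    ]
    if plan is not None:
        # Annotated examples carry their numbered steps; the query does not.
        lines += [f"{i}. {step}" for i, step in enumerate(plan, start=1)]
    return "\n".join(lines)


def build_prompt(
    examples: Sequence[Example], instruction: str, devices: Sequence[str]
) -> str:
    blocks = [format_block(*ex) for ex in examples]
    # The final block is left open after "Plan:" for the LLM to complete.
    blocks.append(format_block(instruction, devices))
    return "\n\n".join(blocks)


all_devices = ["light", "window blind", "speaker"]
prompt = build_prompt(
    examples=[("the room is too bright", all_devices,
               ["close the window blind", "dim the light"])],
    instruction="I want to relax",
    devices=all_devices,  # base plans assume every device is available
)
```

For base plans, the same full device list is passed for every instruction; only the instruction changes.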

Figure 2. (a) Abstract instruction detection and identification of relevant devices in base plan. (b) Contrastive plan generation by removing relevant devices identified in the base plan

Contrastive Plan Generation: The base plan above assumes that all devices are available. In a realistic scenario, however, home configurations are dynamic, with varying lists of available devices, so the device control plan should be adjusted to the home configuration. One possible way is to randomly sample different sets of available devices and use the LLM to generate plans. However, the number of possible device combinations is large, and learning the configuration dependency this way would require sampling abundant examples from many combinations. Instead, we propose a method that chooses the available device combinations based on the base plan.

First, we use the LLM to categorize all instructions into two types: i) direct instructions and ii) abstract instructions. A direct instruction mentions one or more devices specifically, whereas an abstract instruction does not mention any specific device. An abstract instruction can therefore lead to different plans depending on user preferences and the availability of devices. We consider only abstract instructions for contrastive plan generation. The idea is to remove devices from the set of all possible devices based on the base plan. Specifically, we take the steps of the base plan and use a retrieval model to identify the devices they mention. We then remove the retrieved device(s) from the list of all possible devices and use the LLM to generate a new plan from the updated list of available devices. This forces the LLM to generate a new plan that excludes the retrieved device(s), yielding a new plan for the same instruction with a different set of available devices.
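The device-removal step can be sketched as below. Note the assumptions: a naive case-insensitive substring match stands in for the retrieval model, the abstract/direct label is passed in as a flag rather than obtained from an LLM, and all function and device names are hypothetical.

```python
from typing import FrozenSet, Optional, Sequence


def mentioned_devices(
    plan: Sequence[str], all_devices: FrozenSet[str]
) -> FrozenSet[str]:
    # Stand-in for the retrieval model: substring match on the plan text.
    text = " ".join(plan).lower()
    return frozenset(d for d in all_devices if d.lower() in text)


def contrastive_devices(
    base_plan: Sequence[str],
    all_devices: FrozenSet[str],
    is_abstract: bool,
) -> Optional[FrozenSet[str]]:
    # Only abstract instructions get a contrastive configuration.
    if not is_abstract:
        return None
    used = mentioned_devices(base_plan, all_devices)
    if not used:
        return None
    # Remove the devices the base plan relied on; the LLM is then prompted
    # again with this reduced list to produce the contrastive plan.
    return all_devices - used


devices = frozenset({"light", "window blind", "speaker"})
reduced = contrastive_devices(
    ["close the window blind", "dim the light"], devices, is_abstract=True
)
```

In this example the base plan uses the light and the window blind, so the contrastive configuration keeps only the speaker. This sketch removes every retrieved device at once; removing them one or two at a time is another possible variant.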

Aligning Small Language Model: Using the LLM, we generate both base plans and contrastive plans for a diverse set of instructions. Combining them, we aggregate N triplets to finetune a small language model. We use the standard autoregressive language modeling objective [5] to finetune the small language model.
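For reference, the standard autoregressive objective can be written as follows. The notation is an assumption made for this sketch: w^{(i)}_1, ..., w^{(i)}_{T_i} denotes the token sequence obtained by serializing the i-th triplet, and θ denotes the parameters of the small language model. The post does not state whether the loss is restricted to the plan tokens, so the sum here runs over all tokens.

```latex
\mathcal{L}(\theta)
  = -\sum_{i=1}^{N} \sum_{t=1}^{T_i}
    \log p_{\theta}\!\left( w^{(i)}_{t} \,\middle|\, w^{(i)}_{1}, \dots, w^{(i)}_{t-1} \right)
```

Minimizing this loss trains the model to predict each token of a triplet from the tokens that precede it.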


The smart home device control planning dataset we introduce contains 28,273 instruction-devices-plan triplets. It has 15,830 unique instructions with base plans. It also has 12,439 contrastive plans for varying home configurations with abstract instructions.

Figure 3. An example scenario where the pretrained LLaMA-7B model fails to generate a proper plan, compared with GPT-3. (Text marked in yellow shows repeated mentions of devices.)

Is a Pretrained Small Language Model a Good Device Control Planner? We consider LLaMA-7B [6] and Vicuna-7B [7] as the small language models. We first analyze the performance of these two pretrained 7B-parameter models on the device control planning task. As shown in Figure 3, pretrained LLaMA-7B is unable to perform device control planning with in-context examples. When prompted with an instruction and the list of available devices, it mostly generates steps involving all the available devices, even when those devices are not relevant to the instruction.

Table 1. This table reports the performance improvement of finetuned Vicuna-7B using synthesized data for different plans.

Does Finetuning with Synthesized Triplets Help? Since our goal is for the small language model to mimic the behavior of GPT-3, we evaluate whether the finetuned model can generate plans similar to those of GPT-3. We consider three setups to generate plans for the validation set and the testing set: i) Pretrained: the pretrained LLaMA-7B or Vicuna-7B model is used directly, ii) Base: the pretrained model is finetuned with base plans, and iii) Ours (Base + Contrastive): the pretrained model is finetuned with both base plans and contrastive plans. Table 1 shows the evaluation scores for both contrastive plans (plans generated assuming one or two relevant devices are not available) and all plans (base plans + contrastive plans) of the validation set and the testing set. In Table 1, for both sets, the pretrained Vicuna-7B model has low BLEU and ROUGE scores, indicating that its plans differ from those of GPT-3. Finetuning the model with base plans results in a significant improvement in the evaluation scores, which indicates that the generated plans are more similar to those generated by GPT-3. Finetuning the pretrained model with both base plans and contrastive plans improves the scores further.
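To illustrate how such overlap scores compare a generated plan with a GPT-3 reference plan, the sketch below computes clipped n-gram precision, the core ingredient of BLEU. It is a simplified illustration, not the full BLEU metric (no brevity penalty and no averaging over n-gram orders) and not the evaluation code behind Table 1; the function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference, so repetition is not rewarded.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference_plan = "close the window blind then dim the light"
good_plan = "close the window blind then dim the light"
repetitive_plan = "the the the the"
```

A plan identical to the reference scores 1.0, while a degenerate plan that repeats a common word is clipped to the number of times that word appears in the reference.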

Figure 4. An example scenario where Vicuna-7B finetuned with only base plans fails, while Ours succeeds in generating a plan based on the home configuration. (Text marked in yellow shows the wrong plan, as it refers to an unavailable device.)

Are the Contrastive Plans Useful? Figure 4 shows an example case where the Vicuna-7B model, finetuned with only base plans, fails to generate a plan that uses only available devices. In this case, the generated plan mentions a ‘music player’, which is not available. However, the Vicuna-7B model finetuned with base + contrastive plans can generate a proper plan based on the available devices.
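The failure mode described here, a plan that refers to a device outside the home configuration, can be detected with a simple check. This is a hypothetical helper for illustration, not part of the paper's evaluation; it relies on naive substring matching and made-up device names.

```python
from typing import FrozenSet, List, Sequence


def unavailable_mentions(
    plan: Sequence[str],
    available: FrozenSet[str],
    all_devices: FrozenSet[str],
) -> List[str]:
    # Return the devices a plan mentions that are not in the home configuration.
    text = " ".join(plan).lower()
    return sorted(d for d in all_devices - available if d.lower() in text)


all_devices = frozenset({"light", "speaker", "music player"})
available = frozenset({"light", "speaker"})

bad_plan = ["dim the light", "play relaxing music on the music player"]
ok_plan = ["dim the light", "play relaxing music on the speaker"]
```

Here `unavailable_mentions(bad_plan, available, all_devices)` flags the music player, while the second plan passes because it uses only available devices.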


In summary, we propose an approach to equip a small language model with device control planning capability without using manually annotated data. We show that pretrained small language models are not capable of performing the device control planning task. Motivated by the success of LLMs, we propose a method to synthesize planning data with GPT-3 in an automated manner. We also propose contrastive plan generation so that the model learns to plan better under varying home configurations.


[1] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?,” arXiv preprint arXiv:2202.12837, 2022.

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.

[3] OpenAI, “GPT-4 technical report,” 2023.

[4] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” arXiv preprint arXiv:2212.10560, 2022.

[5] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al., “Improving language understanding by generative pre-training,” 2018.

[6] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.

[7] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica, “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” 2023.
