Task-Driven and Experience-Based Question Answering Corpus for In-Home Robot Application in the House3D Virtual Environment

By Yang Liu, Samsung R&D Institute China - Beijing
By Zhuoqun Xu, Samsung R&D Institute China - Beijing


Recently, more and more work has begun to pay attention to long-term household robot scenarios. Naturally, we wonder whether a robot can answer questions raised by its owner based on the actual situation in the home.

These questions usually have no explicit textual context, are directly tied to the actual scene, and their answers are difficult to find in a general knowledge base (such as Wikipedia). Therefore, the experience accumulated from tasks seems a more natural source of answers[1][2].

In this paper, we present TEQA (task-driven and experience-based question answering), a corpus for long-term household tasks. Based on a popular indoor virtual environment (AI2-THOR) and the agent task experiences of ALFRED, we design six types of questions and answers, comprising 24 question templates, 37 answer templates, and nearly 10k distinct question-answer pairs. Our corpus aims to investigate agents' ability to understand task experience in daily question-answering scenarios built on the ALFRED dataset.

Task-Driven and Experience-Based Question Answering

We propose the task-driven and experience-based question answering corpus based on the ALFRED benchmark[3]. There are some premises to state before we discuss TEQA. First, the virtual environment in which we build the Q&A is interactive and fully observable, so semantics and object information can be easily obtained. Second, the agent has been trained and has a certain ability to complete tasks. Finally, humans can complete tasks in a similar environment and use that experience to answer these latent questions. Under these premises, we consider TEQA effective and meaningful.

Figure 1.  Main workflow of TEQA

A trained agent is required to complete a daily task in the virtual environment, and it can learn experience from its action sequences. (1) The agent completes the goal through a series of actions. (2) As its actions change the environment (the attributes and positions of objects), the agent collects object information through vision. (3) QA is grounded in the content of the task, the objects in the current environment, and the agent's action sequence. (4) Words from the dataset are filled into a template to generate a question; in the same way, we can create a massive number of questions. (5) The answer is generated in the same manner.
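Steps (4) and (5) can be sketched as simple template filling. The templates and slot names below are illustrative placeholders, not the actual TEQA templates:

```python
# Hypothetical question/answer templates with named slots; the slot values
# would be drawn from one episode's object and action records.
QUESTION_TEMPLATES = [
    "What tool do you need to {action} the {target}?",
    "Why is the {obj} in the {receptacle}?",
    "Can you finish the task with a {obj}?",
]
ANSWER_TEMPLATES = [
    "The {obj} is on the {receptacle}.",
]

def fill(template: str, slots: dict) -> str:
    """Fill a template's slots with words taken from the task experience."""
    return template.format(**slots)

# Example slot values for one kitchen episode (illustrative).
slots = {"action": "slice", "target": "tomato",
         "obj": "knife", "receptacle": "table"}

question = fill(QUESTION_TEMPLATES[0], slots)
answer = fill(ANSWER_TEMPLATES[0], slots)
print(question)  # What tool do you need to slice the tomato?
print(answer)    # The knife is on the table.
```

Enumerating slot values over all objects and actions in an episode is what scales a few dozen templates into thousands of distinct QA pairs.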

Figure 2.  Example of TEQA

For example, suppose the task takes place in the kitchen and the agent is washing dishes to complete a task. When a tool is needed to slice a tomato, you ask the agent, "What tool do you need to slice the tomato?" (Replacing words in the template generates different questions, such as "How do you cut a potato?", "Why is the knife in the fridge?", "Why did the position of the pot change?", or "Can you finish the task with a spoon?".)
Regarding this question, the agent has seen a knife on the kitchen table and has previously used a knife to slice something. Through experience, the agent tries to associate the butter knife with the 'slice' action. After retrieving that experience, we hope the agent answers "The knife is on the table."[4] It may also generate answers such as "The butter knife is in the sink." or "The knife is on the shelf." The subject may not even be a knife: "The cup is on the counter." This QA tests whether the agent can learn knowledge from the environment and its actions.
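One minimal way to realize this kind of experience retrieval is to scan the episode's action log for the object's most recent known location. The log format and action names below are assumptions for illustration (loosely styled after ALFRED's action vocabulary), not the actual TEQA implementation:

```python
# Hypothetical episode log: each record notes an action, the object it
# involved, and the receptacle where the object ended up.
episode_log = [
    {"step": 1, "action": "PickupObject", "object": "butter knife", "receptacle": "table"},
    {"step": 2, "action": "SliceObject",  "object": "tomato",       "receptacle": "counter"},
    {"step": 3, "action": "PutObject",    "object": "butter knife", "receptacle": "sink"},
]

def last_known_location(log, obj_name):
    """Scan the experience log backwards for the object's most recent receptacle."""
    for record in reversed(log):
        if record["object"] == obj_name:
            return record["receptacle"]
    return None

def answer_where(log, obj_name):
    """Answer a 'where is X?' question from accumulated experience."""
    loc = last_known_location(log, obj_name)
    if loc is None:
        return f"I have not seen the {obj_name}."
    return f"The {obj_name} is in the {loc}."

print(answer_where(episode_log, "butter knife"))  # The butter knife is in the sink.
```

Because the answer comes from the logged experience rather than a guess, a wrong answer ("The knife is on the shelf.") directly exposes a gap in what the agent learned from its actions.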

After the interaction and behavior, we pose a series of questions covering six types. From the agent's answers, we can understand its action process and reasoning, rather than just the result. This helps improve the agent's task success rate, language understanding, and learning ability.

The following table shows some question-answer pairs in TEQA.

Figure 3.  Instance QA in TEQA

Discussion and Conclusion

We believe that the agent's learning ability is the key factor in improving task success rate. Therefore, we introduce TEQA to study the comprehension of agents. From the agent's feedback, we can understand the steps of its actions (not just the results) and further improve its internal algorithms.

In natural language text, changing a single word may make the meaning completely different. Semantic comprehension should consider not only the connections between symbols; the agent should also learn logic similar to a human's. Thus, we consider that the agent should not obtain answers by surmise. We therefore use grounded information to force the agent to understand knowledge rather than guess. On this basis, we propose a corpus to accomplish TEQA.

Grounded language learning is closer to the human language environment, which presets the context of a conversation. The richness and authenticity of grounded information cannot be fully captured by data and text alone: a finite set of objects admits infinitely many random arrangements, and there are many possibilities for agent behavior. A virtual 3D environment is closer to the scenes of daily life; the agent in it is like a baby, constantly exploring and learning, and TEQA is like its test papers and examinations.

Within the scope of a problem domain and natural language rules, our QA corpus is feasible and efficient, and it will benefit research on domestic robots and grounded language learning.

Link to the paper


[1] Winograd, T. (1974). Understanding natural language. Leonardo, 3(1), 1-1.

[2] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., & Batra, D. (2018). Embodied Question Answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., & Fox, D. (2020). ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Wang, P., Wu, Q., Shen, C., Hengel, A., & Dick, A. (2015). Explicit knowledge-based reasoning for visual question answering. Computer Science.