ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation


International Conference on Machine Learning (ICML)



Navigating to the right place to localize a desired object is a fundamental ability for embodied agents that interact with objects and complete real-world tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects. In this work, we find that the knowledge in pre-trained models for semantic scene understanding and commonsense reasoning can be transferred to open-world object navigation without any navigation experience or any other training on the visual environments, achieving training-free zero-shot object navigation. However, these large pre-trained models may not directly generate navigation actions well. To bridge the gap between pre-trained knowledge and navigation actions, we propose a framework that combines commonsense knowledge with an existing exploration method, enabling exploration with commonsense (EwC) using Probabilistic Soft Logic (PSL). Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks (Chang et al., 2017; Ramakrishnan et al., 2021; Deitke et al., 2020) show that our method improves significantly over prior baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 215% relative Success Rate improvement over CoW (Gadre et al., 2022) on MP3D). Our ablation studies also validate the efficacy of commonsense reasoning.
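
To make the idea of combining commonsense knowledge with an exploration method more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a frontier-style explorer that must choose among candidate frontiers, and it scores each one with a single soft rule, "an object near the frontier AND that object co-occurs with the target implies the frontier is worth visiting", evaluated with the Łukasiewicz conjunction that PSL uses to relax logical AND. The names `Frontier`, `frontier_score`, `select_frontier`, the weights, and all numeric values are hypothetical, and full PSL inference (which optimizes over all rules jointly) is replaced here by a simple linear score.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    """A candidate exploration frontier (hypothetical structure)."""
    name: str
    distance: float  # distance from the agent, in arbitrary units
    # object label -> detection confidence in [0, 1]
    nearby_objects: dict = field(default_factory=dict)


def lukasiewicz_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm: soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def frontier_score(frontier, cooccurrence, rule_weight=1.0, distance_weight=0.1):
    """Score a frontier with one soft commonsense rule and a distance penalty.

    cooccurrence maps an object label to an assumed commonsense prior in
    [0, 1] that the object is found near the target.
    """
    support = sum(
        lukasiewicz_and(confidence, cooccurrence.get(label, 0.0))
        for label, confidence in frontier.nearby_objects.items()
    )
    return rule_weight * support - distance_weight * frontier.distance


def select_frontier(frontiers, cooccurrence, **weights):
    """Pick the frontier with the highest soft-rule score."""
    return max(frontiers, key=lambda f: frontier_score(f, cooccurrence, **weights))


if __name__ == "__main__":
    # Made-up priors for a hypothetical target "tv".
    tv_priors = {"sofa": 0.9, "sink": 0.1}
    candidates = [
        Frontier("kitchen_door", distance=2.0, nearby_objects={"sink": 0.8}),
        Frontier("living_room", distance=5.0, nearby_objects={"sofa": 0.9}),
    ]
    print(select_frontier(candidates, tv_priors).name)
```

In this toy setting the farther frontier wins because the commonsense support outweighs the distance penalty. A purely distance-based explorer would pick the nearer one.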