Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
Published in: Computer Vision and Pattern Recognition (CVPR)
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) is typically performed at the category level. Most previous works simply learn to associate sketches and photos through class information, leaving the details of visual correspondence between them under-explored.
Moreover, auxiliary semantic knowledge (e.g., class descriptions) is often required to assist category transfer. However, such text information requires extra annotation and is sometimes unavailable. In this work, we seek to pair sketches and images by exploring their local correspondences, which naturally lends explainability to our proposed framework.
The key to our proposed method is a transformer-based hybrid attention network. Specifically, a self-attention module with a learnable tokenizer is first exploited to produce structure-aware visual tokens from both sketch and photo images, and the most informative tokens (i.e., regions) are identified in association with a proposed novel retrieval token. A cross-attention relation module then examines the full semantic relationship between the visual tokens across the two modalities. The key hypothesis is that the learned visual correspondence generalizes across different scenarios. Extensive experiments on category-level ZS-SBIR, fine-grained SBIR, and open-world ZS-SBIR all demonstrate the superiority of our model.
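To make the described architecture concrete, the following is a minimal illustrative sketch in PyTorch of a hybrid attention block with a learnable retrieval token and a cross-attention relation module. It is not the authors' implementation: the module name `HybridAttention`, the token dimensions, the tokenizer being abstracted away as pre-computed visual tokens, and the cosine-similarity matching score are all assumptions made for illustration.

```python
# Illustrative sketch only (assumed structure, not the paper's released code).
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, dim=256, heads=8, num_layers=2):
        super().__init__()
        # Learnable retrieval token, prepended to the visual tokens of each modality.
        self.retrieval_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Self-attention over the tokens of a single modality (sketch or photo).
        self.self_attn = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Cross-attention relation module: sketch tokens attend to photo tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def encode(self, tokens):
        # tokens: (B, N, dim) visual tokens produced upstream by a tokenizer
        # over patch features (the tokenizer itself is omitted here).
        r = self.retrieval_token.expand(tokens.size(0), -1, -1)
        return self.self_attn(torch.cat([r, tokens], dim=1))

    def forward(self, sketch_tokens, photo_tokens):
        s = self.encode(sketch_tokens)   # (B, 1+Ns, dim); s[:, 0] is the retrieval token
        p = self.encode(photo_tokens)    # (B, 1+Np, dim)
        # Each sketch token queries the photo tokens to form local correspondences.
        s2p, attn = self.cross_attn(s, p, p, need_weights=True)
        # A simple matching score between the retrieval-token representations
        # (assumed scoring rule, chosen only for this sketch).
        score = torch.cosine_similarity(s2p[:, 0], p[:, 0], dim=-1)
        return score, attn  # attn maps can be visualized for explainability
```

In this sketch, the cross-attention weights returned by `forward` serve as the explainable component: they indicate which photo regions each sketch token attends to when the match score is produced.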