[CVPR 2023 Series #7] Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

By Da Li Samsung AI Centre - Cambridge
By Timothy Hospedales Samsung AI Centre - Cambridge

The Computer Vision and Pattern Recognition Conference (CVPR) is a world-renowned international Artificial Intelligence (AI) conference co-hosted by the Institute of Electrical and Electronics Engineers (IEEE) and the Computer Vision Foundation (CVF) which has been running since 1983. CVPR is widely considered to be one of the three most significant international conferences in the field of computer vision, alongside the International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV).

In this relay series, we introduce summaries of seven research papers presented at CVPR 2023. The papers are listed below.

- Part 1 : SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields (by Samsung AI Center – Toronto)

- Part 2 : StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos (by Samsung AI Center – Toronto)

- Part 3 : GENIE: Show Me the Data for Quantization (by Samsung Research)

- Part 4 : A Unified Pyramid Recurrent Network for Video Frame Interpolation (By Samsung R&D Institute - Nanjing)

- Part 5 : MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation (By Samsung R&D Institute United Kingdom)

- Part 6 : LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models (By Samsung AI Center - Cambridge)

- Part 7 : Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style (By Samsung AI Center - Cambridge)

This paper was selected as a highlight at CVPR 2023. Highlights are chosen by the program committee and constitute roughly 10% of accepted papers, or about 2.5% of submissions.


Zero-shot sketch-based image retrieval (ZS-SBIR) is a central problem in sketch understanding [6]. This paper aims to tackle all problems associated with the current status quo for ZS-SBIR, including the category-level (standard) [4], fine-grained [1], and cross-dataset [3] settings. In particular, we advocate for (i) a single model that tackles all three settings of ZS-SBIR, (ii) dropping the requirement for external knowledge to conduct category transfer, and more importantly, (iii) a way to explain why our model works (or does not).

Our solution is a transformer-based cross-modal network that (i) sources local patches independently in each modality, (ii) establishes patch-to-patch correspondences across the two modalities, and (iii) computes matching scores based on the putative correspondences. We approach (i) by proposing a novel CNN-based learnable tokenizer that is specifically tailored to sketch data. This is because the vanilla non-overlapping patch-wise tokenization proposed in ViT [2] is not friendly to the sparse nature of sketches, as most patches would fall on the uninformative blank background. In the same spirit as the class token developed in ViT for image recognition, we introduce a learnable retrieval token to prioritize tokens for cross-modal matching.

To establish (ii), the patch-to-patch correspondences, we propose a novel cross-attention module that operates across the sketch and photo modalities. Specifically, we propose cross-modal multi-head attention, in which the query embeddings are exchanged between the sketch and photo branches to reason about patch-level correspondences with only category-level supervision. Finally, with the putative matches in place, and inspired by relation networks [5], we propose a kernel-based relation network that aggregates the correspondences and calculates a similarity score for each sketch-photo pair. We achieve state-of-the-art performance across all of these ZS-SBIR settings, with some illustrations in Figure 1.
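To make the sparsity argument concrete, the following minimal sketch counts how many non-overlapping 16x16 patches of a toy line drawing contain no ink at all. The synthetic square outline, the 224x224 canvas, and the function names are illustrative assumptions, not data or code from the paper.

```python
def blank_patch_fraction(image, patch):
    """Fraction of non-overlapping patch x patch tiles that contain no ink."""
    h, w = len(image), len(image[0])
    total = blank = 0
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            total += 1
            has_ink = any(
                image[r][c]
                for r in range(top, min(top + patch, h))
                for c in range(left, min(left + patch, w))
            )
            if not has_ink:
                blank += 1
    return blank / total


def toy_sketch(size):
    """Hypothetical 'sketch': a one-pixel-wide square outline on a blank canvas."""
    img = [[0] * size for _ in range(size)]
    lo, hi = size // 4, 3 * size // 4
    for i in range(lo, hi + 1):
        img[lo][i] = img[hi][i] = 1  # top and bottom strokes
        img[i][lo] = img[i][hi] = 1  # left and right strokes
    return img


sketch = toy_sketch(224)
fraction = blank_patch_fraction(sketch, patch=16)
print(f"{fraction:.2f} of the 16x16 patches are blank")
```

On this toy input, 168 of the 196 patches (about 86%) carry no stroke information, which is the kind of waste that motivates a tokenizer tailored to sketches.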

Figure 1.  Attentive regions of self-/cross-attention and the learned visual correspondence for tackling unseen cases. (a) The proposed retrieval token [Ret] can attend to informative regions. Different colors are attention maps from different heads. (b) Cross-attention offers explainability by explicitly constructing local visual correspondence. The local matches learned from training data are shareable knowledge, which enables ZS-SBIR to work under diverse settings (inter- / intra-category and cross datasets) with just one model. (c) An input sketch can be transformed into its image by the learned correspondence, i.e., sketch patches are replaced by the closest image patches from the retrieved image.

Our Solution

Technically, our method is built on a transformer-based cross-modal network with three novel components: (i) a self-attention module with a learnable tokenizer that produces visual tokens corresponding to the most informative local regions, (ii) a cross-attention module that computes local correspondences between the visual tokens of the two modalities, and (iii) a kernel-based relation network that assembles the local putative matches and produces an overall similarity metric for a sketch-photo pair. Experiments show that our method delivers superior performance across all ZS-SBIR settings. The important goal of explainability is achieved by visualizing cross-modal token correspondences and, for the first time, via sketch-to-photo synthesis that replaces sketch patches with their matched photo patches from the gallery, as shown in Figure 1(c).
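The cross-attention idea can be sketched in a few lines: queries from one modality attend over keys and values from the other, and the strongest attention weight per token gives a putative correspondence. This is a simplified single-head version with no learned projections; the token values and function names are hypothetical, and it is not the paper's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other. Returns (outputs, attention weights)."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Toy token embeddings (made-up values): 3 sketch tokens, 4 photo tokens.
sketch_tokens = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
photo_tokens = [[0.0, 0.0, 5.0], [5.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 5.0, 0.0]]

# Both directions: sketch queries over photo tokens, and photo queries over sketch tokens.
_, sketch_to_photo = cross_attention(sketch_tokens, photo_tokens, photo_tokens)
_, photo_to_sketch = cross_attention(photo_tokens, sketch_tokens, sketch_tokens)

# Putative correspondence: the photo token each sketch token attends to most.
matches = [max(range(len(row)), key=row.__getitem__) for row in sketch_to_photo]
print(matches)  # [1, 3, 0]
```

The `matches` list is the kind of token-level correspondence that can be visualized, or used to swap each sketch patch for its matched photo patch as in Figure 1(c).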

As shown in Figure 2, (a) a learnable tokenizer using filters with a large receptive field generates structure-preserving tokens and avoids generating uninformative tokens. (b) A self-attention module finds the most informative regions of each input sketch/photo for the subsequent cross-modal local matching. Meanwhile, the original classification token [CLS] is replaced by a retrieval token [Ret] for the retrieval task. (c) A cross-attention module learns the visual correspondences between the visual tokens generated from each modality. (d) A token-level relation network infers the final matching result from the local correspondences generated by the cross-attention module.
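As a rough illustration of step (d), the sketch below aggregates token-level matches into one score per sketch-photo pair and ranks a small gallery. It uses a fixed Gaussian (RBF) kernel and a simple average as a stand-in; the paper's relation network is learned, and the kernel choice, `gamma` value, token values, and function names here are all assumptions.

```python
import math


def rbf(u, v, gamma=0.5):
    """Gaussian (RBF) kernel between two token vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def match_score(sketch_tokens, photo_tokens, gamma=0.5):
    """Average, over sketch tokens, of the kernel value of the best photo match."""
    best = [max(rbf(s, p, gamma) for p in photo_tokens) for s in sketch_tokens]
    return sum(best) / len(best)


def rank_gallery(sketch_tokens, gallery, gamma=0.5):
    """Return (names sorted from best to worst match, score per name)."""
    scores = {
        name: match_score(sketch_tokens, tokens, gamma)
        for name, tokens in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Made-up 2-D tokens: photo_a contains close matches for both sketch tokens.
sketch = [[1.0, 0.0], [0.0, 1.0]]
gallery = {
    "photo_a": [[0.9, 0.1], [0.1, 0.9], [3.0, 3.0]],
    "photo_b": [[3.0, 3.0], [-2.0, 2.0]],
}
ranking, scores = rank_gallery(sketch, gallery)
print(ranking)  # ['photo_a', 'photo_b']
```

Because the score is built from individual token matches, each contribution can be traced back to a specific sketch-photo patch pair, which is the property that makes this style of matching explainable.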

Figure 2.  Network overview.

Experimental Results

Table 1.  Category-level ZS-SBIR comparison results. “ESI” : External Semantic Information. “-” : not reported. The best and second best scores are color-coded in red and blue.

Table 2.  Zero-shot FG-SBIR results (%). Note that none of the competitors are zero-shot models; they are trained on Chair-V2. Best in bold.

Table 3.  Cross-dataset ZS-SBIR results. “S”, “T” and “Q” denote Sketchy Ext, TU-Berlin Ext, and QuickDraw Ext, respectively. “(·)” denotes the number of test categories, all of which are unseen to ensure the zero-shot setting. For example, S->T(21) denotes that we train on the training split of Sketchy Ext and then test on a subset (21 unseen classes) of the testing split of TU-Berlin Ext. Rows with a grey background use a ViT backbone for fair comparison. Best in bold.

The experimental results in Tables 1-3 cover category-level zero-shot SBIR, zero-shot fine-grained SBIR, and cross-dataset zero-shot SBIR, and demonstrate the effectiveness of our proposed approach. Our method outperforms existing state-of-the-art (SoTA) techniques in the majority of cases.


In our endeavor to address Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) and emphasize the importance of interpretability, we drew inspiration from a traditional vision technique, bag-of-words. Our approach introduces a patch-matching framework that not only achieves explainability but also handles all ZS-SBIR scenarios with a single model. We believe that this retrieval system will advance progress in ZS-SBIR and bring SBIR closer to practical applications.

About Samsung AI Center – Cambridge

Established in 2018, the Cambridge AI Center performs world-class blue-sky AI research with an open and collaborative approach to science. The center publishes its results in top scientific venues and releases open source code and datasets to facilitate engagement with the wider academic community. In service of our larger mission to Samsung consumers, the Cambridge AI Center has been successful in transferring technology developed in-house into the hands of users. Through collaborations with other parts of the company, techniques and AI models developed in the center are now used by millions of consumers worldwide on Samsung platforms, such as the Galaxy smartphone. The center’s high-level themes include on-device AI and foundational vision & language models. Broader scientific interests span video understanding, AutoML, action recognition, neuro-symbolic models, meta-learning, domain adaptation, unsupervised and self-supervised learning, efficient reasoning, speech recognition and audio modelling, and federated learning.

Link to the paper


[1] Ayan Kumar Bhunia, Yongxin Yang, Timothy M Hospedales, Tao Xiang, and Yi-Zhe Song. Sketch less for more: On-the-fly fine-grained sketch-based image retrieval. In CVPR, 2020.

[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

[3] Kaiyue Pang, Ke Li, Yongxin Yang, Honggang Zhang, Timothy M Hospedales, Tao Xiang, and Yi-Zhe Song. Generalising fine-grained sketch-based image retrieval. In CVPR, 2019.

[4] Yuming Shen, Li Liu, Fumin Shen, and Ling Shao. Zero-shot sketch-image hashing. In CVPR, 2018.

[5] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.

[6] Jialin Tian, Xing Xu, Fumin Shen, Yang Yang, and Heng Tao Shen. TVT: Three-way vision transformer through multi-modal hypersphere learning for zero-shot sketch-based image retrieval. In AAAI, 2022.