SRC-B's New Achievements in the Computer Vision Field

Four papers from Samsung R&D Institute China – Beijing (SRC-B) have recently been accepted by the International Conference on Computer Vision (ICCV), which will take place at the Paris Convention Center in Paris, France, in October 2023. Hosted by the Institute of Electrical and Electronics Engineers (IEEE), ICCV is one of the top three conferences in the computer vision field, along with the Conference on Computer Vision and Pattern Recognition (CVPR) and the European Conference on Computer Vision (ECCV). Its paper acceptance rate is around 25%, and the conference is highly regarded in the computer vision community. It attracts leading enterprises and universities from around the world to exchange the most advanced ideas in computer vision. The papers accepted by ICCV represent Samsung’s breakthroughs and innovations in computer vision.

1. CLIP-Cluster: CLIP-Guided Attribute Hallucination for Face Clustering
This study proposes CLIP-Cluster, an attribute hallucination framework built on Contrastive Language-Image Pre-training (CLIP), to address the large intra-class variance problem in face clustering. It outperforms state-of-the-art face clustering methods on the public MS Celeb-1M (MS1M) dataset with high inference efficiency. The proposed method could be commercialized in Samsung smartphones to improve face clustering performance in the future.

Figure 1. Overview of the proposed CLIP-Cluster to mitigate the effect of intra-cluster facial variations for better face clustering

CLIP-Cluster leverages the powerful language-visual model CLIP for text-driven face attribute hallucination and opens a new avenue for face clustering. With CLIP-guided text-driven face attribute hallucination, faces can be morphed to exhibit various ages, poses, and expressions. Benefiting from CLIP’s zero-shot image classification capability, we can eliminate the demand for a large amount of annotated paired data and turn to more convenient text-based manipulation. Furthermore, once a face has been varied across these attributes, a neighbor-aware proxy generator fuses the resulting features into a single proxy feature by learning neighbor-adaptive attention. Because the affinity graph is constructed from these proxy representations, the subsequent Graph Convolutional Network (GCN)-based edge predictor can perform face clustering more easily. Extensive experiments show that the proposed CLIP-Cluster boosts face clustering performance on the standard partial MS1M benchmark from 93.22 to 94.22 pairwise F-score, and the inference process can be completed efficiently within 280 seconds.
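
For readers unfamiliar with the metric, the pairwise F-score quoted above can be sketched as follows. This is a minimal illustration of the standard definition (precision and recall over pairs of faces placed in the same cluster), not the evaluation code used in the paper; the function name `pairwise_f_score` and the toy labels are our own assumptions. The function returns a value between 0 and 1, whereas the figures above are reported as percentages.

```python
from collections import Counter
from math import comb


def pairwise_f_score(true_labels, pred_labels):
    """Pairwise F-score between ground-truth identities and predicted clusters.

    Illustrative helper (hypothetical name), following the usual definition:
    a pair of faces counts as correct when both the prediction and the
    ground truth put the two faces in the same group.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    # Number of same-group pairs according to the prediction, the ground
    # truth, and both at once (faces sharing identity AND predicted cluster).
    pred_pairs = sum(comb(n, 2) for n in Counter(pred_labels).values())
    true_pairs = sum(comb(n, 2) for n in Counter(true_labels).values())
    both_pairs = sum(
        comb(n, 2) for n in Counter(zip(true_labels, pred_labels)).values()
    )

    # Convention chosen here: return 0.0 when no pair is correctly grouped.
    if both_pairs == 0:
        return 0.0

    precision = both_pairs / pred_pairs
    recall = both_pairs / true_pairs
    return 2 * precision * recall / (precision + recall)


# Toy example: five faces, two identities, one face placed in the wrong cluster.
score = pairwise_f_score([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

Counting pairs per group, rather than enumerating every pair of faces, keeps the computation linear in the number of faces, which matters at the scale of MS1M.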


2. Coordinate Transformer: Achieving Single-Stage Multi-Person Mesh Recovery from Videos
We collaborated with Sun Yat-sen University on multi-person three-dimensional (3D) mesh recovery from videos. We propose the Coordinate transFormer (CoordFormer), which directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery end-to-end, achieving state-of-the-art (SOTA) performance on the public 3D Poses in the Wild (3DPW) dataset. The results are significant both practically and theoretically.

Figure 2. Comparison of video-based multi-person mesh recovery pipelines

Multi-person 3D mesh recovery from videos is a critical first step toward automatic perception of group behavior in virtual reality, physical therapy, and more. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of interpersonal interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose CoordFormer, which directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery end-to-end. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. In addition, we propose a simple yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves on the state of the art, outperforming the previous best results by 4.2%, 8.8%, and 4.7% on the Mean Per Joint Position Error (MPJPE), Procrustes Analysis Mean Per Joint Position Error (PAMPJPE), and Per Vertex Error (PVE) metrics, respectively, while being 40% faster than recent video-based approaches.
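
To make the reported metrics concrete, the sketch below computes MPJPE as the mean Euclidean distance between predicted and ground-truth 3D joints. It is a minimal illustration of the standard definition rather than the paper's evaluation code; the function name `mpjpe` and the sample coordinates are our own assumptions. The Procrustes-aligned variant first rigidly aligns the prediction to the ground truth before measuring the same error, and PVE applies the same distance to mesh vertices instead of joints.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Per Joint Position Error for one person in one frame.

    Both arguments are sequences of (x, y, z) joint positions in the same
    unit and joint order. Illustrative helper with a hypothetical name.
    """
    if not pred_joints or len(pred_joints) != len(gt_joints):
        raise ValueError("expected two non-empty joint lists of equal length")

    # Average the per-joint Euclidean distances.
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


# Toy example: two joints that are off by 3 and 4 units respectively.
error = mpjpe(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)],
)
```

Because all three metrics are errors, lower values are better, so the percentages above describe reductions relative to the previous best results.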


3. Multi-Frequency Representation Enhancement with Privilege Information (MFPI) for Video Super-Resolution
This paper proposes a new video super-resolution (VSR) model that combines a novel multi-frequency representation enhancement module (MFE) with a novel privilege training method. The new model outperforms state-of-the-art methods by a large margin while maintaining good efficiency on the REDS4, Vimeo, Vid4, and UDM10 datasets.

Figure 3. (a) An overview of the MFPI framework, (b) An overview of the MFE, (c) Details of the spatial-frequency representation enhancement branch (SFE) in MFE, which extracts spatial-level and long-range dependencies, (d) Details of the energy-frequency representation enhancement branch (EFE)

This paper proposes a VSR model consisting of an MFE, which improves the representation of low-resolution frames in the frequency domain, and a novel VSR training method called privilege training, which encodes privilege information from high-resolution videos to facilitate model training. MFE aggregates information in the frequency domain by operating on the spatial-frequency and energy-frequency components via SFE and EFE, respectively. It enables Convolutional Neural Network (CNN)-based VSR models to capture long-range dependencies with only minor additional parameters and computation. Extensive experiments demonstrate the effectiveness of the method on various datasets.


4. TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses
In this paper, we present TrajectoryFormer, a novel point-cloud-based 3D multi-object tracking (MOT) framework. To recover objects missed by the detector, we generate multiple trajectory hypotheses with hybrid candidate boxes, including temporally predicted boxes and current-frame detection boxes, for trajectory-box association. The predicted boxes propagate an object’s trajectory history to the current frame; thus, the network can tolerate short-term missed detections of tracked objects. We combine a long-term object motion feature and a short-term object appearance feature to create a per-hypothesis feature embedding, which reduces the computational overhead of spatial-temporal encoding. In addition, we introduce a Global–Local Interaction Module that exchanges information among all hypotheses and models their spatial relations, leading to accurate estimation of the hypotheses. Our TrajectoryFormer achieves state-of-the-art performance on the Waymo 3D MOT benchmarks. (Ours/SOTA: Vehicle 64.6/60.6, Pedestrian 62.3/60.6, Cyclist 64.6/61.6)
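
The idea of hybrid candidate boxes can be sketched roughly as follows. This is only an illustration under strong simplifying assumptions, not the TrajectoryFormer implementation: boxes are reduced to 2D centers, a constant-velocity extrapolation stands in for the temporally predicted boxes, and the names `Box`, `predict_next`, `build_hypotheses`, and the `radius` parameter are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Simplified box: only a 2D center and where the box came from."""
    x: float
    y: float
    source: str  # "predicted" or "detected"


def predict_next(history):
    """Constant-velocity extrapolation (an assumed stand-in for the
    temporally predicted boxes described in the text)."""
    if not history:
        raise ValueError("trajectory history must not be empty")
    last = history[-1]
    if len(history) == 1:
        return Box(last.x, last.y, "predicted")
    prev = history[-2]
    return Box(2 * last.x - prev.x, 2 * last.y - prev.y, "predicted")


def build_hypotheses(history, detections, radius=2.0):
    """Return one trajectory hypothesis per candidate box.

    Candidates are the predicted box plus every current-frame detection
    within `radius` of it, so a hypothesis exists even with no detection.
    """
    predicted = predict_next(history)
    nearby = [
        d for d in detections
        if math.hypot(d.x - predicted.x, d.y - predicted.y) <= radius
    ]
    return [history + [candidate] for candidate in [predicted] + nearby]


# Toy example: an object moving along x, with one nearby and one far detection.
track = [Box(0.0, 0.0, "detected"), Box(1.0, 0.0, "detected")]
frame = [Box(2.1, 0.0, "detected"), Box(10.0, 10.0, "detected")]
hypotheses = build_hypotheses(track, frame)
missed = build_hypotheses(track, [])  # detector missed the object
```

In the toy example, the frame with a missed detection still yields a hypothesis ending in the predicted box, which is the property that lets the tracker bridge short gaps. Choosing among the hypotheses is the job of the learned network and is not modeled here.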

Figure 4. The overall framework of the proposed TrajectoryFormer