Object-Centric Representation Learning for Video Scene Understanding

By Yi Zhou Samsung R&D Institute China-Beijing
By Hui Zhang Samsung R&D Institute China-Beijing

1 Introduction

Video understanding is crucial for various real-world applications, such as surveillance, autonomous driving, augmented reality, and the Metaverse. The Depth-aware Video Panoptic Segmentation (DVPS) task [1] has been proposed to achieve this goal. In general, all panoptic objects in the video can be categorized into foreground instances (things), e.g., people, cars, etc., and background semantics (stuff), e.g., sky, road, etc. The DVPS task requires estimating the semantic class and 3D depth for all things and stuff and uniquely segmenting and consistently tracking all items in the scene.

Figure 1. Comparison of previous methods and Slot-IVPS on object representation and relationship utilization.

Prior methods [1], [2], [3] typically approach the DVPS task as a multi-task learning problem, employing a multi-branch framework and handling things and stuff separately in the video. As illustrated in Fig. 1(a), Qiao et al. [1] decompose the DVPS task into several sub-tasks: monocular depth estimation, semantic segmentation, instance segmentation, and multiple object tracking. However, this approach requires intricate post-processing and does not fully exploit the relationships between the tasks. Follow-up methods [2], [3] attempt to address these limitations by introducing object representations and unifying some of the sub-tasks. TubeFormer [2] (see Fig. 1(b)) introduces semantic-only object representations and unifies the related sub-tasks of video segmentation by directly predicting tube-level classes and masks with a Transformer structure. PolyphonicFormer [3] (see Fig. 1(c)) further employs separate object representations (depth and mask queries) and aligns depth estimation with the segmentation pipeline by predicting instance-level depth maps from the depth queries. Although this task-level unification or alignment streamlines the framework and removes the need for complex post-processing, separate object representations are insufficient to fully utilize the relationships between depth and semantic information, leading to sub-optimal performance.

To effectively utilize relationships between depth and semantics, we introduce Slot-IVPS, a unified object-centric pipeline to learn robust and compact object representations (see Fig. 1(d)) in an end-to-end fashion. Integrated Panoptic Slots (IPS) are introduced to unify both semantic and depth information for all panoptic objects within a video, incorporating both stuff and things. Each IPS corresponds to a stuff class or an object instance in a video and can be dynamically updated through interaction with video features. The updated IPS for each panoptic object further empowers the model to directly predict the corresponding depth map, class, mask, and object ID.

To the best of our knowledge, this is the first fully unified end-to-end pipeline for the DVPS task. It fully leverages the semantic and depth information without requiring multiple separate representations or task-specific branches. It is validated on both the DVPS and VPS tasks, surpassing the state-of-the-art methods on the Cityscapes-DVPS, SemKITTI-DVPS, Cityscapes-VPS, and VIPER datasets.

2 Method

Figure 2. Overview of the Slot-IVPS. (a): The pipeline of Slot-IVPS. (b)(c)(d): Detailed structures of Panoptic Retriever, Temporal Connector, and Retriever. Best viewed in color.

2.1 Slot-IVPS pipeline

Our proposed Slot-IVPS pipeline, depicted in Fig. 2(a), comprises a backbone, an integrated feature generator, and N stages of the Integrated Video Panoptic Retriever (IVPR). The backbone, consisting of ResNet50, FPN, and several deformable convolutions, extracts multi-scale features. The integrated feature generator produces depth-aware features. Each IVPR stage contains several IVPR modules and processes features at one scale. Taking two successive frames of an input video as an example, the features of the two frames at a given scale extracted by the backbone are denoted $X_t, X_{t-1} \in \mathbb{R}^{D \times C}$, where $D$ and $C$ are the spatial size (height × width) and the number of channels of the feature maps, respectively, and $t$ is the time index. The depth-aware features generated by the integrated feature generator are denoted $X^m_t, X^m_{t-1} \in \mathbb{R}^{D \times C}$. The IVPR takes the depth-aware features of the two frames at that scale, the position embedding, and the IPS ($S \in \mathbb{R}^{L \times C}$) as inputs and produces spatiotemporally coherent IPS, where $L$ denotes the number of slots. The output IPS are fed directly into the prediction heads to predict the classes, masks, depths, and IDs of the panoptic objects in the video. Note that, for clarity, the mini-batch dimension is omitted.

Integrated panoptic slots (IPS). To unify the object representations within a video, we define IPS to represent both the semantic and depth information for all panoptic objects in the video, including both things and stuff. Each slot corresponds to one object and encodes its 3D and 2D information. IPS is represented by a set of parameters, denoted $S \in \mathbb{R}^{L \times C}$. It is initialized randomly and optimized gradually by interacting with spatio-temporal information. The slot number $L$ is the maximum number of panoptic objects (e.g., 100) within the video.

2.2 Integrated feature generator

The goal of the integrated feature generator is to effectively leverage pixel-level depth information and the correlations between depth and semantics. It is composed of a feature decomposer that extracts semantic and depth features, a depth predictor that provides auxiliary depth supervision, and a depth-aware feature fuser that fuses semantic and depth information.

2.3 Integrated Video Panoptic Retriever (IVPR)

The IVPR is composed of three modules: the integrated feature enhancer, the Panoptic Retriever, and the Temporal Connector. In the $i$-th IVPR module, $i \in \{1, \dots, U\}$, where $U$ is the total number of IVPR modules in the network, the integrated feature enhancer first polishes the depth-aware features $X^m_t, X^m_{t-1} \in \mathbb{R}^{D \times C}$ with instance-level depth information extracted from the input IPS $S^i_t, S^i_{t-1} \in \mathbb{R}^{L \times C}$. The Panoptic Retriever then associates the IPS with the polished depth-aware features through an attention structure called the Retriever, producing spatially coherent output panoptic slots for each frame, each in $\mathbb{R}^{L \times C}$. Next, the Temporal Connector takes the output slots of both frames as input and again employs the Retriever to extract their temporal correlations, enhancing the IPS temporally. The spatio-temporally refined IPS are then fed to the next IVPR module for iterative refinement.

Integrated feature enhancer. The input depth-aware features mainly contain pixel-level information, whereas the IPS encodes unified 3D instance-level information. The integrated feature enhancer therefore uses the instance-level depth information from the IPS to polish the depth-aware features.

Retriever (RE). The Retriever module, whose structure is depicted in Fig. 2(d), retrieves from the features the information that is correlated with the query. In the spatial domain, it aids in learning the mapping from spatial features to IPS. In the temporal domain, it assists in learning the IPS’s correlations between different frames. Unlike conventional dot-product attention [4], the Retriever incorporates a slot competition mechanism [5], since every object should be distinct from the others in both the spatial and temporal domains. This design helps achieve better object discrimination.
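As a rough illustration of the difference between conventional attention and slot competition, the following NumPy sketch follows the normalization used in slot attention [5]: the softmax is taken over the slot axis, so slots compete for each feature location, and the weights are then renormalized per slot to form a weighted mean. It is a simplified sketch and not the Slot-IVPS implementation: the function name `retrieve`, the toy sizes, and the omission of learned query/key/value projections and position embedding are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve(slots, feats, compete=True, eps=1e-8):
    """Hypothetical helper. slots: (L, C) queries, feats: (D, C) keys/values.

    Returns the retrieved slot updates (L, C) and the attention map (L, D).
    Learned projections are omitted (assumption, for brevity).
    """
    C = slots.shape[1]
    logits = slots @ feats.T / np.sqrt(C)            # (L, D)
    if compete:
        # Slot competition: normalize over the slot axis, so each feature
        # location distributes one unit of attention among the slots.
        attn = softmax(logits, axis=0)
        # Renormalize per slot so the update is a weighted mean of features.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    else:
        # Conventional dot-product attention: normalize over locations.
        attn = softmax(logits, axis=1)
        weights = attn
    return weights @ feats, attn

rng = np.random.default_rng(0)
L, D, C = 4, 6, 8                                    # toy sizes (assumed)
slots = rng.normal(size=(L, C))
feats = rng.normal(size=(D, C))

updated, attn = retrieve(slots, feats, compete=True)
_, attn_plain = retrieve(slots, feats, compete=False)
```

With competition, each column of `attn` (one feature location) sums to one across slots, whereas in the conventional variant each row (one slot) sums to one across locations.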

Panoptic Retriever. Panoptic Retriever sequentially processes each frame’s IPS and depth-aware features, together with position embedding. As illustrated in Fig. 2(b), it consists of three modules: Self-attention, Retriever, and Feed Forward Network (FFN). IPS is refined with Self-attention first, associated with spatial features through Retriever, and then further refined with FFN. In this process, Retriever associates IPS with every pixel in the spatial features to retrieve the object’s information, such as depth, location, and appearance.

Temporal Connector. The Temporal Connector comprises the Retriever and FFN, as illustrated in Fig. 2(c). Its purpose is to link IPS across frames and refine them with information from each other. This can help improve the consistency of IPS that describes the same object across frames. In Temporal Connector, the output IPS from Panoptic Retriever are concatenated along the slot dimension, forwarded into the Retriever, then further refined by the following FFN.
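The data flow described above can be sketched in NumPy as follows. The source states only that the per-frame IPS are concatenated along the slot dimension, passed through the Retriever, and refined by an FFN. The remaining details here are assumptions made for illustration: each frame's slots act as queries over the concatenated set, residual connections are used, the FFN has a ReLU between its two linear layers, and the names `retrieve`, `ffn`, and `temporal_connector` are hypothetical.

```python
import numpy as np

def retrieve(queries, memory, eps=1e-8):
    # Attention with competition among the queries (softmax over the query
    # axis), then a per-query weighted mean; projections omitted (assumption).
    logits = queries @ memory.T / np.sqrt(queries.shape[1])
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)
    attn = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return attn @ memory

def ffn(x, w1, w2):
    # Two linear layers with a ReLU in between (activation is an assumption).
    return np.maximum(x @ w1, 0.0) @ w2

def temporal_connector(s_t, s_prev, w1, w2):
    # Concatenate the two frames' slots along the slot dimension: (2L, C).
    memory = np.concatenate([s_t, s_prev], axis=0)
    outputs = []
    for s in (s_t, s_prev):
        s = s + retrieve(s, memory)   # exchange information across frames
        s = s + ffn(s, w1, w2)        # per-slot refinement
        outputs.append(s)
    return outputs[0], outputs[1], memory

rng = np.random.default_rng(0)
L, C, H = 5, 8, 16                    # slots, channels, FFN width (assumed)
s_t = rng.normal(size=(L, C))
s_prev = rng.normal(size=(L, C))
w1 = rng.normal(scale=0.1, size=(C, H))
w2 = rng.normal(scale=0.1, size=(H, C))

out_t, out_prev, memory = temporal_connector(s_t, s_prev, w1, w2)
```

Because both frames attend to the same concatenated set, a slot in frame t can draw on the slot describing the same object in frame t-1, which is the kind of cross-frame linking the Temporal Connector is meant to provide.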

Prediction heads. To predict depth, class, mask, and object ID from the IPS, we employ four prediction heads, each comprising an FFN (i.e., two linear layers) and a corresponding functional layer. Because the IPS are coherent, all prediction heads can generate precise results without complex fusion operations.
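The head structure (an FFN followed by a functional layer) might look roughly like the NumPy sketch below. The source does not specify the functional layers, so the choices here are assumptions for illustration only: a softmax for classes, a sigmoid over slot-feature dot products for masks, a sigmoid scaled by a hypothetical `max_depth` for per-object depth maps, and a raw embedding for object IDs. The weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, C, K = 5, 12, 8, 3        # slots, pixels, channels, classes (assumed)

def make_ffn(c_in, c_hidden, c_out):
    # FFN = two linear layers; the ReLU in between is an assumption.
    w1 = rng.normal(scale=0.1, size=(c_in, c_hidden))
    w2 = rng.normal(scale=0.1, size=(c_hidden, c_out))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

# One FFN per head, all reading the same slots.
heads = {
    "cls": make_ffn(C, C, K),
    "mask": make_ffn(C, C, C),
    "depth": make_ffn(C, C, C),
    "id": make_ffn(C, C, C),
}

def predict(slots, feats, max_depth=80.0):
    # slots: (L, C) refined IPS; feats: (D, C) per-pixel features.
    return {
        "cls": softmax_rows(heads["cls"](slots)),                      # (L, K)
        "mask": sigmoid(heads["mask"](slots) @ feats.T),               # (L, D)
        "depth": max_depth * sigmoid(heads["depth"](slots) @ feats.T), # (L, D)
        "id": heads["id"](slots),                                      # (L, C)
    }

slots = rng.normal(size=(L, C))
feats = rng.normal(size=(D, C))
pred = predict(slots, feats)
```

All four outputs are computed from the same slot tensor, so no cross-branch fusion step is needed in this sketch.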

3 Results

The DVPS results on Cityscapes-DVPS val are shown in Table 1. With the same backbone (ResNet50-FPN), our model outperforms PolyphonicFormer [3] by 3.8 DVPQ while running at 2× the FPS of [3]. In particular, the performance on things improves from 35.6 to 41.7 DVPQ (+6.1). We also achieve state-of-the-art performance on further datasets, namely SemKITTI-DVPS, Cityscapes-VPS, and VIPER. Please refer to our TPAMI paper for more details.

Table 1. Comparison to the DVPS State-of-the-art on Cityscapes-DVPS Val. DVPQ, DVPQth, DVPQst are the averaged scores for All / Things / Stuff classes respectively.

4 Conclusion

To effectively leverage the relationship between semantic and depth information in video, we introduce Slot-IVPS, a fully unified end-to-end pipeline that learns unified object representations based on object-centric learning. We propose IPS to unify the semantic and depth information for all panoptic objects (including things and stuff) in the video. We design the integrated feature generator and enhancer to extract the depth-aware features, and the IVPR to retrieve the spatio-temporal information of objects in the video and encode it into the IPS. The effectiveness of Slot-IVPS is validated by experiments on four datasets in total across the DVPS and VPS tasks.



[1] S. Qiao, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “ViP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation,” arXiv preprint arXiv:2012.05258, 2020.

[2] D. Kim, J. Xie, H. Wang, S. Qiao, Q. Yu, H.-S. Kim, H. Adam, I. S. Kweon, and L.-C. Chen, “TubeFormer-DeepLab: Video mask transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13914–13924, 2022.

[3] H. Yuan, X. Li, Y. Yang, G. Cheng, J. Zhang, Y. Tong, L. Zhang, and D. Tao, “PolyphonicFormer: Unified query learning for depth-aware video panoptic segmentation,” in European Conference on Computer Vision, pp. 582–599, Springer, 2022.

[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.

[5] F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” arXiv preprint arXiv:2006.15055, 2020.