The scene flow task involves estimating both the 3D structure and the 3D motion of a complex, dynamic scene between a pair of video frames. It has attracted increasing attention in recent years due to its importance in applications such as robotics, augmented reality, and autonomous driving.
Some previous scene flow methods assume that the scene can be well approximated as a collection of rigidly moving objects, so that scene flow can be recovered by estimating the rigid motion of each component. These methods build modular networks that estimate scene flow from multiple sub-tasks. For example, DRISF [1] jointly trains optical flow and instance segmentation networks and then infers the 3D motions by differentiating through Gauss-Newton updates. RigidMask [2] predicts the optical flow and the segmentation masks of the background and of multiple rigidly moving objects, and then parameterizes them by their 3D rigid transformations to update the 3D scene flow. One limitation of these approaches is that an additional instance segmentation branch must be trained for scene flow estimation. Another is that the sub-tasks are optimized independently, which makes it difficult to exploit their complementary properties in a unified network for overall optimization.
Recently, RAFT-3D [3] proposed to learn dense rigid-motion embeddings and to iteratively update a dense SE3 motion field in an end-to-end differentiable architecture. Building on RAFT-3D, M-FUSE [4] estimates scene flow by fusing forward and backward flow over multiple frames, and EMR-MSF [5] performs self-supervised monocular scene flow estimation by exploiting ego-motion rigidity with an ego-motion aggregation module. However, these methods learn the motion embeddings only implicitly in order to softly group a pixel with its neighborhood. Because the learned embeddings are not supervised, they easily break the 3D motion consistency of a rigid object, particularly for pixels in challenging regions that are occluded or have poor matching quality and therefore cannot learn an informative motion representation.
In this paper, we propose a novel scene flow estimation method called OAMaskFlow, an end-to-end differentiable optimization architecture with an occlusion-aware motion (OAM) mask. The proposed OAM mask differs from an instance segmentation mask in that it discriminates pixels in the scene by motion and occlusion rather than by instance.
For example, two different static instances share the same label in the OAM mask, while a single instance can carry different labels because of occlusion between the two frames. We generate the ground-truth annotation for the OAM mask through photometric-consistency and geometric-consistency constraints. With the OAM mask, we explicitly supervise the dense motion embedding to learn an informative motion representation for grouping neighboring pixels that share the same motion. To improve the accuracy of 3D motion in challenging regions, such as occluded dynamic object regions and occluded static regions, we propose a motion propagation module that spreads reliable 3D motion into these regions with an attention network. Finally, the proposed OAM mask can also be applied to other vision geometry estimation tasks, such as optical flow estimation and simultaneous localization and mapping (SLAM), to help improve their performance.
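As a rough illustration of the mask-generation step (not our exact annotation procedure), the sketch below labels pixels from thresholded photometric and geometric consistency residuals. The function name, the thresholds, and the simplified three-way labeling (static / dynamic / occluded) are assumptions for exposition; the full annotation also distinguishes individual dynamic objects and handles occlusion in both viewing directions.

import numpy as np

def generate_oam_mask(img1, img2, flow, depth1_ego, depth2,
                      photo_thresh=0.1, geo_thresh=0.05):
    """Toy OAM-mask labelling from photometric and geometric consistency.

    img1, img2 : (H, W, 3) float RGB frames in [0, 1]
    flow       : (H, W, 2) ground-truth optical flow, frame 1 -> frame 2
    depth1_ego : (H, W) depth of frame-1 points after applying the camera
                 ego-motion, i.e. the depth they would have in frame 2 if static
    depth2     : (H, W) depth map of frame 2
    Returns an (H, W) int mask: 0 = static, 1 = dynamic, 2 = occluded.
    """
    H, W = depth2.shape
    ys, xs = np.mgrid[0:H, 0:W]

    # Correspondences in frame 2 given by the flow (nearest neighbour for brevity).
    xs2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    ys2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)

    # Photometric consistency: occluded pixels have no valid correspondence,
    # so their warped colour disagrees with frame 1.
    photo_err = np.abs(img1 - img2[ys2, xs2]).mean(axis=-1)

    # Geometric consistency: if the observed frame-2 depth differs from the
    # depth predicted by camera ego-motion alone, the surface moved on its own.
    geo_err = np.abs(depth1_ego - depth2[ys2, xs2]) / np.maximum(depth1_ego, 1e-6)

    mask = np.zeros((H, W), dtype=np.int64)   # default: static region
    mask[geo_err > geo_thresh] = 1            # dynamically moving surface
    mask[photo_err > photo_thresh] = 2        # occluded in frame 2
    return mask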
Figure 1 illustrates our OAMaskFlow pipeline. Given a pair of synchronized RGB and depth frames, the network outputs a dense transformation field that represents the 3D motion between the two frames. Each per-pixel 3D motion can be decomposed into a rotation component and a translation component, and it can be projected onto the image to recover the dense optical flow and dense scene flow of the pair. At the beginning, the SE3 motion field is initialized with an identity rotation and zero translation. OAMaskFlow differs from RAFT-3D in two main respects: first, we generate a ground-truth OAM mask to supervise the motion embedding during training; second, we add a 3D motion propagation module that propagates reliable 3D motion into challenging regions with a learned attention feature.
Figure 1. Overview of our OAMaskFlow pipeline with supervision. The components that differ from RAFT-3D are highlighted in orange-red: 1) the motion embedding supervised with our generated ground-truth OAM mask, and 2) the 3D motion propagation module that propagates reliable 3D motion into occluded regions with attention.
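For reference, the sketch below shows how a dense per-pixel SE3 field can be applied to back-projected pixels and re-projected to recover scene flow and optical flow under a pinhole camera model. The tensor layout and names are assumptions for illustration, not the released implementation; with the identity-rotation, zero-translation initialization both recovered flows start at zero.

import numpy as np

def se3_field_to_flows(depth, R, t, K):
    """Recover dense scene flow and optical flow from a per-pixel SE3 field.

    depth : (H, W)       depth of frame 1
    R     : (H, W, 3, 3) per-pixel rotation component
    t     : (H, W, 3)    per-pixel translation component
    K     : (3, 3)       pinhole camera intrinsics
    Returns scene_flow (H, W, 3) and optical_flow (H, W, 2).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]

    # Back-project pixels to 3D points in the camera frame of image 1.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X = (xs - cx) / fx * depth
    Y = (ys - cy) / fy * depth
    P1 = np.stack([X, Y, depth], axis=-1)                 # (H, W, 3)

    # Apply each pixel's rigid transform: P2 = R * P1 + t.
    P2 = np.einsum('hwij,hwj->hwi', R, P1) + t
    scene_flow = P2 - P1

    # Re-project the transformed points to obtain dense optical flow.
    x2 = fx * P2[..., 0] / P2[..., 2] + cx
    y2 = fy * P2[..., 1] / P2[..., 2] + cy
    optical_flow = np.stack([x2 - xs, y2 - ys], axis=-1)
    return scene_flow, optical_flow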
To enhance the pixel-wise motion representation, it is necessary to supervise the motion embedding of each pixel so that it remains consistent in 3D motion with its neighborhood. However, such supervision is non-trivial because the scene provides only 3D scene flow labels rather than dense motion labels, and converting scene flow labels into motion labels is not straightforward. To alleviate this problem, we instead generate a motion mask that distinguishes the different motions in the scene and use it to supervise the motion embedding. From a 3D motion perspective, a scene can be divided into static regions and dynamic objects. Static regions include the background and static objects whose apparent motion equals the camera ego-motion. Dynamic objects move differently from each other, while every pixel on the same rigid object shares the same motion. The challenging regions that are occluded in either of the two views are also labeled to improve the representation of the motion embedding.
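The exact form of the embedding supervision is not spelled out here; one plausible choice, shown purely as an assumption, is a pull-push (discriminative) loss that draws embeddings toward the centroid of their OAM label and pushes centroids of different labels apart.

import torch

def oam_embedding_loss(embeddings, oam_mask, margin=1.0):
    """Hypothetical supervision of per-pixel motion embeddings with an OAM mask:
    pull embeddings toward the mean of their OAM label, push label means apart.

    embeddings : (B, C, H, W) dense motion embeddings
    oam_mask   : (B, H, W)    integer OAM labels
    """
    B, C, H, W = embeddings.shape
    emb = embeddings.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    labels = oam_mask.reshape(-1)

    pull, centers = 0.0, []
    for lab in labels.unique():
        sel = emb[labels == lab]
        center = sel.mean(dim=0)
        centers.append(center)
        # Pull term: keep pixels of one OAM label close to their centroid.
        pull = pull + ((sel - center).norm(dim=1) ** 2).mean()

    centers = torch.stack(centers)                        # (K, C)
    K = centers.shape[0]
    if K > 1:
        # Push term: keep centroids of different labels at least `margin` apart.
        dists = torch.cdist(centers, centers)
        off_diag = dists[~torch.eye(K, dtype=torch.bool, device=dists.device)]
        push = torch.clamp(margin - off_diag, min=0.0).pow(2).mean()
    else:
        push = torch.zeros((), device=emb.device)
    return pull / K + push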
To obtain reliable 3D motions in the challenging occluded regions, we propose a 3D motion propagation module that propagates the accurately estimated 3D motion from non-occluded regions into occluded ones. Since pixels in occluded regions often have an appearance or structure similar to their non-occluded neighborhood, we extract an attention feature from the reference image to measure feature self-similarity globally. The similarity of each pixel in the attention feature to all other pixels in the image is computed with a simple matrix multiplication of their feature correlations. The final SE3 field is obtained by weighting the SE3 field with the normalized similarity.
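A minimal sketch of this propagation step is given below, assuming the per-pixel SE3 parameters are blended in a vector parameterization (e.g., the se(3) Lie algebra) so that a weighted average is meaningful; the tensor names and the softmax scaling are illustrative assumptions, not the actual module.

import torch

def propagate_se3_with_attention(attn_feat, se3_field):
    """Sketch of attention-based 3D motion propagation: each pixel's SE3
    parameters are re-estimated as a similarity-weighted average over all
    pixels, so reliable motions spread into occluded regions.

    attn_feat : (B, C, H, W) attention features from the reference image
    se3_field : (B, D, H, W) per-pixel SE3 parameters (e.g. D = 6 in an
                assumed se(3) parameterization)
    """
    B, C, H, W = attn_feat.shape
    D = se3_field.shape[1]

    q = attn_feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
    k = attn_feat.flatten(2)                          # (B, C, H*W)
    v = se3_field.flatten(2).transpose(1, 2)          # (B, H*W, D)

    # Global self-similarity via matrix multiplication, normalized per query.
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)    # (B, H*W, H*W)

    # Weight the SE3 field by the normalized similarity.
    out = attn @ v                                    # (B, H*W, D)
    return out.transpose(1, 2).reshape(B, D, H, W)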
To validate our OAMaskFlow method, we perform experiments on the synthetic FlyingThings3D dataset and the real-world KITTI dataset. FlyingThings3D consists of stereo and RGB-D images rendered with multiple ShapeNet objects moving along randomized 3D trajectories; it is among the most diverse and challenging scene flow datasets and provides dense, accurate, multi-task ground truth. KITTI consists of stereo images captured in autonomous driving scenes with sparse, multi-task ground truth.
The quantitative results on the FlyingThings3D dataset are listed in Table 1. The EPE2D and EPE3D of our OAMaskFlow decrease by 20.4% and 21.0%, respectively, compared with the baseline RAFT-3D method, which verifies the effectiveness of the proposed OAM mask and motion propagation module. In our OAMaskFlow network, we also redesign the context encoder branch to share the structure of the correlation encoder branch, except that its output dimension is 512. With this design, the parameter count of the entire network drops from 45.8M to 7.7M, a reduction of 83.2%. In addition, our OAMaskFlow surpasses the best-performing CamLiRAFT method on the 2D and 3D threshold metrics with fewer parameters.
Table 1. Quantitative comparison results with the state-of-the-art methods on the FlyingThings3D dataset.
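For reference, the 83.2% parameter reduction quoted above follows directly from the reported model sizes:

# Quick check of the parameter reduction quoted above.
baseline_params, ours_params = 45.8e6, 7.7e6
reduction = (baseline_params - ours_params) / baseline_params
print(f"parameter reduction: {reduction:.1%}")   # -> 83.2%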
Table 2 quantitatively compares our OAMaskFlow with the state-of-the-art methods on the KITTI 2015 scene flow benchmark. With the OAM and motion propagation modules, the percentage of erroneous pixels over all regions is reduced from 5.77% to 4.37% compared with the baseline, which verifies the importance of both modules. Following CamLiFlow and CamLiRAFT, we additionally apply background refinement to our results. Overall, with this background refinement strategy, our OAMaskFlow achieves the lowest scene flow error on both all pixels and background regions on the KITTI 2015 scene flow benchmark leaderboard. This result also shows that the motion embedding branch generalizes well from synthetic to real-world data.
Table 2. Quantitative comparison results on KITTI Scene Flow benchmark.
In this paper, we propose a novel OAMaskFlow method that learns a motion-consistent representation under the supervision of a generated OAM mask in an end-to-end scene flow estimation network. We further apply the OAM mask in the DROID-SLAM framework by weighting the dense flow. The experimental results validate the effectiveness of the proposed OAM mask in both the scene flow and SLAM tasks. One limitation of our approach is that if an object is entirely occluded, its scene flow cannot be accurately estimated due to insufficient measurements. Another limitation is that the depth data are not fully exploited: as in the RAFT-3D baseline, we simply concatenate the RGB and depth images as input, which limits the amount of 3D structural information extracted from depth. In the future, one promising research direction is scene flow estimation from images and point clouds with the OAM mask; another is scene flow estimation with 3D Gaussian Splatting.
https://ojs.aaai.org/index.php/AAAI/article/view/32696
[1] Ma W C, Wang S, Hu R, et al. Deep Rigid Instance Scene Flow[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 3614-3622.
[2] Yang G, Ramanan D. Learning to Segment Rigid Motions from Two Frames[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 1266-1275.
[3] Teed Z, Deng J. RAFT-3D: Scene Flow Using Rigid-Motion Embeddings[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 8375-8384.
[4] Mehl L, Jahedi A, Schmalfuss J, et al. M-FUSE: Multi-Frame Fusion for Scene Flow Estimation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023: 2020-2029.
[5] Jiang Z, Okutomi M. EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 69-78.