Robotics

DVI-SLAM: A Dual Visual Inertial SLAM Network

By Xiongfeng Peng, Samsung R&D Institute China-Beijing
By Zhihua Liu, Samsung R&D Institute China-Beijing

1 Introduction

State-of-the-art deep learning-based SLAM methods can be categorized into two classes according to how they estimate camera pose. Direct regression-based methods such as DeMoN, DeepTAM, and DytanVO estimate the camera pose directly with a deep network, while optimization-based methods such as CodeSLAM, BA-Net, DeepFactors, and DROID-SLAM minimize residuals derived from various related factors. CodeSLAM encodes images into compact representations and estimates camera pose by minimizing classic photometric and depth residuals with a non-differentiable damped Gauss-Newton algorithm. BA-Net integrates a differentiable optimizer into a deep neural network and estimates pose by minimizing the photometric residual in feature space.

DeepFactors builds on CodeSLAM and combines photometric, re-projection, and depth factors, but does not consider each factor's reliability. DROID-SLAM optimizes camera pose and depth iteratively by minimizing the re-projection residual through a dense bundle adjustment (DBA) optimizer. The methods mentioned above use only visual factors. Adding an inertial measurement unit (IMU) improves the robustness of visual SLAM, particularly during rapid motion. Some direct regression-based deep-learning SLAM methods, such as VINet and DeepVIO, have attempted to incorporate IMU factors.

How to effectively fuse different visual factors and IMU factors in an end-to-end deep network remains an open question for the research community. We propose a novel optimization-based deep SLAM method, the Dual Visual Inertial SLAM network (DVI-SLAM), which infers camera pose and dense depth by dynamically fusing multiple factors within an end-to-end trainable, differentiable structure. DVI-SLAM includes feature-metric and re-projection factors, which provide complementary cues for exploiting visual information, and it can be extended to include the IMU factor or other factors. Its key design is a dynamic multi-factor fusion that learns confidence maps to adjust the relative strengths of the different factors at each iteration of the optimization.
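As a rough illustration of the fusion idea, the sketch below combines per-pixel residuals of two visual factors into a single weighted cost using learned confidence maps. The shapes and names are hypothetical and chosen only to make the weighting explicit; they do not reflect the actual network interfaces.

```python
import numpy as np

def fused_visual_cost(r_reproj, r_featmetric, w_reproj, w_featmetric):
    """Confidence-weighted combination of two visual factors (illustrative only).

    r_reproj     : (N, 2) per-pixel re-projection residuals
    r_featmetric : (N, C) per-pixel feature-metric residuals
    w_reproj     : (N,)   learned confidence map for the re-projection factor
    w_featmetric : (N,)   learned confidence map for the feature-metric factor
    Returns the scalar weighted least-squares cost that the optimizer
    would minimize over camera pose and inverse depth.
    """
    cost_reproj = np.sum(w_reproj[:, None] * r_reproj ** 2)
    cost_feat = np.sum(w_featmetric[:, None] * r_featmetric ** 2)
    return cost_reproj + cost_feat
```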

The re-projection factor dominates at the early stage of BA optimization to obtain a good initialization and to prevent the optimization from being trapped in local minima; in the later stage, the feature-metric and re-projection factors together smoothly guide the joint optimization toward the true minimum. The network also dynamically adjusts the confidence map of the IMU factor according to the estimated pose error for reliable visual-inertial fusion. The dual visual and IMU constraints are tightly coupled and minimized within DVI-SLAM. To the best of our knowledge, it is the first deep learning-based framework that tightly couples dual visual and inertial factors within DBA optimization.
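Conceptually, each DBA iteration solves a confidence-weighted nonlinear least-squares problem over poses, IMU motion, and inverse depth. The following is a minimal, generic sketch of one damped Gauss-Newton update on such a weighted objective; the flat state vector, function names, and shapes are simplifications for illustration, not the paper's actual implementation.

```python
import numpy as np

def damped_gauss_newton_step(x, residual_fn, jacobian_fn, weights, damping=1e-4):
    """One damped Gauss-Newton update on a confidence-weighted least-squares
    objective, similar in spirit to a single iteration of a DBA layer.

    x           : (D,) flat state (e.g., poses, IMU motion, inverse depths)
    residual_fn : x -> (M,) stacked residuals (re-projection, feature-metric, IMU)
    jacobian_fn : x -> (M, D) Jacobian of the stacked residuals at x
    weights     : (M,) per-residual confidences predicted by the network
    """
    r = residual_fn(x)
    J = jacobian_fn(x)
    W = np.diag(weights)
    H = J.T @ W @ J + damping * np.eye(x.size)  # damped normal equations
    g = J.T @ W @ r
    return x - np.linalg.solve(H, g)            # state update
```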

2 Method

2.1 Two-view Reconstruction

In this section, we introduce our DVI-SLAM with a two-view reconstruction for clarity; the complete system is described in Section 2.2. Here, two-view reconstruction refers to the estimation of camera pose, IMU motion, and inverse depth from two views.
Figure 1 illustrates our DVI-SLAM network for the two-view reconstruction. Given the input images Ii and Ij and the accelerometer and gyroscope measurements {αk, ωk} of the IMU between the two frames, the network outputs the updated camera poses Ti, Tj, the IMU motions Mi, Mj, and the dense inverse depth di. Here, the camera pose T = (R, p) contains rotation R and translation p, and the IMU motion M = (v, ba, bg) includes velocity v, accelerometer bias ba, and gyroscope bias bg. Our DVI-SLAM network is built upon DROID-SLAM and consists of two modules and one layer: (a) a feature extraction module, (b) a multi-factor data association module, and (c) a multi-factor DBA layer.

Figure 1. Overview of our DVI-SLAM structure (two-view reconstruction only, for clarity)
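To make the interplay of these three components concrete, here is a schematic Python sketch of the two-view reconstruction loop: features are extracted once, then the data association module and DBA layer are applied iteratively to refine poses, IMU motion, and inverse depth. The sub-module names, signatures, and iteration count are placeholders for illustration, not the actual API.

```python
def two_view_reconstruction(net, img_i, img_j, imu_meas,
                            T_i, T_j, M_i, M_j, d_i, num_iters=8):
    """Schematic forward pass of a two-view DVI-SLAM-style update.

    `net` is assumed to expose three hypothetical sub-modules mirroring
    the structure above: feature_extraction, data_association, and
    dba_layer. All names and shapes are illustrative placeholders.
    """
    # (a) Extract feature maps from both input images.
    f_i = net.feature_extraction(img_i)
    f_j = net.feature_extraction(img_j)

    for _ in range(num_iters):
        # (b) Predict flow/feature targets and per-factor confidence maps.
        targets, confidences = net.data_association(f_i, f_j, T_i, T_j, d_i)

        # (c) Jointly refine poses, IMU motion, and inverse depth by
        #     minimizing the confidence-weighted residuals in the DBA layer.
        T_i, T_j, M_i, M_j, d_i = net.dba_layer(
            targets, confidences, imu_meas, T_i, T_j, M_i, M_j, d_i)

    return T_i, T_j, M_i, M_j, d_i
```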

2.2 DVI-SLAM System

In this section, we introduce the DVI-SLAM system initialization and the full processing pipeline in detail.
Initialization: When only a visual stream is available, we initialize DVI-SLAM in the same way as DROID-SLAM. When both visual and inertial streams are available, we follow the VINS-Mono method for initialization. First, multiple keyframes are collected to build a frame graph, and the camera pose and inverse depth states are optimized by the DBA layer. Then we initialize the gravity direction and the IMU motion from the estimated camera poses and the IMU pre-integration result. For monocular visual and inertial streams, the absolute scale is also initialized. Finally, we minimize the re-projection, feature-metric, and inertial residuals with the DBA layer to refine camera pose, IMU motion, and inverse depth.
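The IMU pre-integration used during initialization summarizes the raw accelerometer and gyroscope samples between two keyframes into a relative rotation, velocity, and position. Below is a minimal Euler-integration sketch of this standard operation; a full implementation would also propagate covariances and bias Jacobians, and the function and variable names here are illustrative.

```python
import numpy as np

def exp_so3(phi):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def preintegrate_imu(accels, gyros, dts, ba, bg):
    """Euler pre-integration of IMU samples between two keyframes (sketch).

    accels, gyros : (K, 3) raw accelerometer / gyroscope measurements
    dts           : (K,)   sample intervals in seconds
    ba, bg        : (3,)   current accelerometer / gyroscope bias estimates
    Returns the relative rotation dR, velocity dv, and position dp,
    expressed in the frame of the first keyframe.
    """
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for a, w, dt in zip(accels, gyros, dts):
        a, w = a - ba, w - bg                      # remove current bias estimates
        dp += dv * dt + 0.5 * (dR @ a) * dt ** 2   # update position, velocity, rotation
        dv += (dR @ a) * dt
        dR = dR @ exp_so3(w * dt)
    return dR, dv, dp
```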
VI-SLAM process: After system initialization, for each new frame and its corresponding IMU data, we first estimate the average optical flow between the new frame and the latest keyframe. If the new frame is selected as a keyframe, it is added to the sliding window and to the frame graph, where edges are established with its temporally adjacent keyframes. The new frame's pose and IMU motion are then optimized within the sliding window by local BA. We use the same strategy as DROID-SLAM to keep the sliding window at a constant size and to obtain the poses of non-keyframes. The camera poses obtained after local BA are called the visual-inertial odometry (VIO) output. Based on the VIO result over the whole sequence, we build a new frame graph in which keyframes are connected both spatially and temporally. After an additional global BA, the results are called the VI-SLAM output, which reflects the accuracy of the reconstructed global map.
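The keyframe policy can be summarized as a threshold test on the estimated average flow, followed by sliding-window bookkeeping. The sketch below is a simplified illustration of that logic; the threshold value, the eviction rule, and the data structures are assumptions rather than the paper's exact settings.

```python
def maybe_insert_keyframe(avg_flow, window, frame_graph, frame,
                          flow_thresh=2.0, max_window=10):
    """Simplified keyframe management for the sliding-window front end.

    avg_flow    : estimated mean optical flow to the latest keyframe (pixels)
    window      : list of keyframes currently in the sliding window
    frame_graph : dict mapping a keyframe to its connected keyframes
    All thresholds and structures here are illustrative assumptions.
    """
    if avg_flow < flow_thresh:
        return False                        # treat as a non-keyframe; pose only tracked
    window.append(frame)
    # Connect the new keyframe to its temporally adjacent keyframes.
    frame_graph[frame] = list(window[-3:-1])
    if len(window) > max_window:
        window.pop(0)                       # keep a constant window size (simplified)
    return True                             # local BA runs on the updated window
```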

3 Results

In this section, we compare the VO/VIO and V/VI-SLAM outputs of our proposed DVI-SLAM with SOTA methods. For a fair comparison, we train DROID-SLAM from scratch. Our trained DROID-SLAM model reaches nearly the same accuracy as reported by its authors and serves as the baseline (DROID-SLAM*).

3.1 Results on Synthetic Data

Table 1 shows the trajectory error comparison of our DVI-SLAM with SOTA methods on the TartanAir monocular SLAM challenge. We compare the trajectory error across all Hard sequences. The table shows that our method performs significantly better than the SOTA methods on every sequence except MH004. After examining that sequence, we found that a few frames showing mostly a white wall lead to larger pose estimation errors.

Table 1. The trajectory error comparison of our method with SOTA methods on the TartanAir monocular SLAM challenge


Table 2. The trajectory error comparison of our method with the SOTA VO and VIO methods on the EuRoC dataset

3.2 Results on Real-world Data

Besides the synthetic dataset, we also compare our DVI-SLAM with SOTA methods on real-world visual-inertial datasets. Table 2 compares the trajectory error of our method with SOTA VO and VIO methods on the EuRoC dataset. As shown in the table, our VO result surpasses DROID-SLAM's result on most of the sequences. Compared with other SOTA traditional and learning-based VO and VIO methods, our method also achieves the lowest average ATE.

4 Conclusion

We present DVI-SLAM, a dual visual and inertial SLAM method that leverages dynamic multi-factor fusion to improve SLAM accuracy. Extensive experiments verify that DVI-SLAM significantly outperforms existing SLAM methods in localization accuracy and robustness. In the future, we would like to speed up the algorithm and reduce its model memory requirements, for example by moving from dense to sparse tracking, which is an important step toward real-time pose estimation on consumer-grade mobile processors. In addition, to provide accurate and complete 3D maps, we would also like to better integrate RGB and depth sensors with Neural Radiance Fields (NeRF) or 3D Gaussian Splatting methods.

Link to the paper



https://arxiv.org/abs/2309.13814

References

[1] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, "DeMoN: Depth and motion network for learning monocular stereo," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5038–5047.

[2] H. Zhou, B. Ummenhofer, and T. Brox, "DeepTAM: Deep tracking and mapping," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 822–838.

[3] S. Shen, Y. Cai, W. Wang, and S. Scherer, "DytanVO: Joint refinement of visual odometry and motion segmentation in dynamic environments," in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 4048–4055.

[4] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, "CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2560–2568.

[5] C. Tang and P. Tan, "BA-Net: Dense bundle adjustment network," arXiv preprint arXiv:1806.04807, 2018.

[6] J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, "DeepFactors: Real-time probabilistic dense monocular SLAM," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 721–728, 2020.

[7] Z. Teed and J. Deng, "DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras," Advances in Neural Information Processing Systems, vol. 34, pp. 16558–16569, 2021.