Robotics

DH-LC: Hierarchical Matching and Hybrid Bundle Adjustment Towards Accurate and Robust Loop Closure

By Xiongfeng Peng Samsung R&D Institute China - Beijing
By Zhihua Liu Samsung R&D Institute China - Beijing
By Qiang Wang Samsung R&D Institute China - Beijing

Introduction

Visual simultaneous localization and mapping (SLAM) inevitably accumulates drift in mapping and localization due to camera calibration errors, feature matching errors, and other error sources. This makes it challenging to achieve drift-free localization and obtain an accurate global map. The loop closure (LC) module in most SLAM systems recognizes the current frame in the global map and then optimizes the global map to reduce the accumulated drift. An accurate and robust LC module can therefore significantly improve SLAM performance.

VINS-Mono [1] proposed four degrees of freedom (4DOF) pose graph optimization to enforce global consistency of the camera poses in the global map at a low computational cost. However, it does not maintain and optimize the global map, which results in insufficient localization accuracy. ORB-SLAM3 [2] proposed to improve LC recall by replacing the temporal consistency check of three keyframes with a local consistency check between the query keyframe and three covisible keyframes. However, under large viewpoint changes, fewer inliers are available to estimate the relative pose between the query keyframe and the retrieval keyframe, and LC still fails. In addition, this method uses full bundle adjustment (FBA) to optimize the global map, which has a high computational cost. ReID-SLAM [3] proposed a feature re-identification (ReID) method that identifies existing features using a spatial-temporal sensitive sub-global map with a pose prior. When the pose is not reliable, feature ReID easily fails. In addition, the incremental BA (IBA) it uses cannot adequately optimize the global map when the accumulated drift is large.

In summary, existing LC methods have the following two issues. First, in the relative pose estimation step, feature matching relies on local features computed in a small patch with a limited perception field, which may not be reliable when the camera viewpoint changes are large. Second, in the global optimization step, each optimization method has drawbacks in some scenarios: FBA has a high computational cost when optimizing the global map, IBA is not accurate enough when the accumulated drift is large, and pose graph optimization does not maintain an accurate global map.

To cope with the above two issues, we propose DH-LC, a novel, accurate and robust LC method based on hierarchical spatial feature matching (HSFM) and hybrid BA (HBA). Our main contributions are as follows:

• Our proposed HSFM method estimates a reliable relative pose between the query image and the retrieval image in a coarse-to-fine way and can tolerate large viewpoint changes.

• Our proposed HBA method adaptively exploits the advantages of different BA methods, according to the accumulated drift and a temporal verification of the relative pose, to optimize the global map efficiently.

• When our proposed DH-LC module is plugged into a baseline SLAM system [4], experimental results show that its LC recall and localization accuracy exceed those of state-of-the-art methods on the public EuRoC and KITTI datasets.

Our Method

The pipeline of our proposed DH-LC is shown in Figure 1. The pipeline takes stereo images as inputs. For each query image, we first retrieve an image from the candidate images with DBoW2. The candidate image selection strategy is similar to that of ORB-SLAM3 [2]. Then HSFM estimates an initial relative pose between the query image and the retrieval image in a coarse-to-fine way. After that, given the initial relative pose, the projection-based search method [2] is used to search for point matching pairs between the keypoints of the query image and the local map points corresponding to the retrieval image, and a perspective-n-point (PnP) method then estimates the inliers among the point matching pairs and the relative pose. Finally, according to our proposed optimization strategy, HBA adaptively selects IBA or FBA to optimize the global map efficiently.

Figure 1.  Our proposed DH-LC pipeline

Figure 2.  Our proposed HSFM pipeline

A. HSFM

To tolerate large viewpoint changes in feature matching and improve the recall of the LC module, we propose an HSFM method. It consists of five steps: 3D point generation, 3D point clustering, coarse matching, fine matching and pose-guided matching. Figure 2 visualizes each step of HSFM. 3D points are first triangulated from the query and retrieval images and then clustered into cubes according to their spatial distribution. The descriptor of each cluster center is voted from the descriptors of all 3D points in the cube. The cluster centers are matched first, then the 3D points in the matched cubes are matched, which gives a coarse relative pose. Finally, guided by the coarse relative pose, pose-guided matching finds more point matching pairs to estimate the initial relative pose.

1) 3D point generation: In the first step, we extract dense and uniformly distributed keypoints with ORB descriptors from the image and then triangulate 3D points using stereo epipolar constraints. Each 3D point is described by the ORB descriptor of its keypoint. This provides denser and more uniformly distributed 3D points for matching and for estimating the initial relative pose.
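As a rough illustration of the triangulation in this step, the sketch below assumes a rectified stereo pair and a pinhole camera model, so that depth follows directly from horizontal disparity. The function name and the parameters (fx, fy, cx, cy, baseline) are hypothetical and are not taken from the paper, which does not specify its triangulation details.

```python
def triangulate_rectified(u_left, v_left, u_right, fx, fy, cx, cy, baseline):
    """Triangulate one keypoint from a rectified stereo pair (assumed setup).

    Returns (X, Y, Z) in the left camera frame, or None when the disparity
    is not positive and the point cannot be triangulated.
    """
    disparity = u_left - u_right
    if disparity <= 0.0:
        return None
    z = fx * baseline / disparity   # depth from disparity
    x = (u_left - cx) * z / fx      # back-project along the pixel ray
    y = (v_left - cy) * z / fy
    return (x, y, z)
```

Each triangulated point would then carry the ORB descriptor of the keypoint it came from.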

2) 3D point clustering: To enlarge the perception field of the 3D points and speed up 3D point matching, the 3D points are clustered based on their spatial distribution. Figure 2 (a) visualizes the 3D point clustering process. The 3D points are clustered into cubes, and the descriptor of each cluster center is obtained by voting over the descriptors of all 3D points in the cube.
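The clustering and voting can be sketched as follows. This is a minimal illustration under assumptions that the text does not confirm: cubes are axis-aligned cells of a hypothetical edge length cube_size, descriptors are stored as Python integers, and the vote is a per-bit majority vote (ties resolve to 0).

```python
import math
from collections import defaultdict

def cube_index(point, cube_size):
    """Integer (i, j, k) index of the cube that contains a 3D point."""
    return tuple(math.floor(c / cube_size) for c in point)

def vote_descriptor(descriptors, n_bits):
    """Per-bit majority vote over binary descriptors stored as ints (assumed rule)."""
    voted = 0
    for bit in range(n_bits):
        ones = sum((d >> bit) & 1 for d in descriptors)
        if 2 * ones > len(descriptors):  # strict majority; ties give 0
            voted |= 1 << bit
    return voted

def cluster_points(points, descriptors, cube_size, n_bits=256):
    """Group points by cube and return {cube_index: (point_ids, voted_descriptor)}."""
    members = defaultdict(list)
    for idx, p in enumerate(points):
        members[cube_index(p, cube_size)].append(idx)
    return {
        key: (ids, vote_descriptor([descriptors[i] for i in ids], n_bits))
        for key, ids in members.items()
    }
```

The default of 256 bits matches the usual ORB descriptor length; a smaller n_bits is convenient for toy inputs.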

3) Coarse matching: After obtaining all cluster centers, we compute coarse cube-level matching pairs through a nearest-neighbor (NN) search followed by a mutual check. As shown in Figure 2 (b), the cubes connected by dotted lines are coarse matching pairs between the query image and the retrieval image.
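A brute-force version of the NN search with mutual check is sketched below, assuming binary descriptors stored as Python integers and compared by Hamming distance. The optional max_dist rejection threshold is a hypothetical addition for illustration, not a parameter from the paper.

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def mutual_nn_matches(desc_a, desc_b, max_dist=None):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are each other's nearest neighbor."""
    if not desc_a or not desc_b:
        return []
    # Nearest neighbor in B for every descriptor in A, and vice versa.
    nn_ab = [min(range(len(desc_b)), key=lambda j: hamming(a, desc_b[j])) for a in desc_a]
    nn_ba = [min(range(len(desc_a)), key=lambda i: hamming(desc_a[i], b)) for b in desc_b]
    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] != i:
            continue  # mutual check failed
        if max_dist is not None and hamming(desc_a[i], desc_b[j]) > max_dist:
            continue  # optional distance gate (assumption)
        matches.append((i, j))
    return matches
```

At the coarse level, desc_a and desc_b would hold the voted cluster-center descriptors of the query and retrieval images.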

4) Fine matching: After coarse matching, we apply the NN search and the mutual check to all points that lie in the spatial neighborhoods of each matched cube pair. For each of the two matched cubes, one from the query image and one from the retrieval image, the neighborhood is the set of 27 cubes around that cube. We then estimate the coarse relative pose between the query image and the retrieval image from the resulting 3D point matching pairs. As visualized in Figure 2 (c), the points connected by solid lines are fine matching pairs between the query image and the retrieval image.
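The candidate set for fine matching can be illustrated as below. The sketch assumes that the 27-cube neighborhood is the 3x3x3 block formed by the matched cube and its 26 adjacent cubes, and that clusters maps a cube index to a list of point ids; both are assumptions made for illustration.

```python
from itertools import product

def neighborhood_27(cube):
    """The cube itself plus its 26 adjacent cubes (assumed 3x3x3 block)."""
    i, j, k = cube
    return [(i + di, j + dj, k + dk) for di, dj, dk in product((-1, 0, 1), repeat=3)]

def candidate_points(cube, clusters):
    """Collect point ids from every non-empty cube in the neighborhood of `cube`."""
    ids = []
    for key in neighborhood_27(cube):
        ids.extend(clusters.get(key, []))
    return ids
```

For a matched cube pair, the candidates gathered on the query side and on the retrieval side would then be matched point by point with the same NN search and mutual check used at the cube level.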

5) Pose-guided matching: Guided by the coarse relative pose, we project the 3D points of the retrieval image into the query image coordinate system. Similar to the fine matching step, we perform the NN search and the mutual check according to the distances between point positions and the Hamming distances between ORB descriptors. Finally, the initial relative pose between the query image and the retrieval image is estimated from the 3D point matching pairs. As visualized in Figure 2 (d), the red 3D points and the black 3D points that overlap are matched pairs, and the gray 3D points represent outliers.
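One possible reading of this step is sketched below. How the position distance and the Hamming distance are combined is not stated, so the sketch assumes both act as gates (hypothetical thresholds max_pos_dist and max_hamming) and that the surviving candidate with the smallest Hamming distance wins, followed by a mutual check.

```python
import math

def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def transform(R, t, p):
    """Apply p' = R p + t, with R given as a 3x3 nested list."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))

def pose_guided_matches(query_pts, query_desc, retr_pts, retr_desc, R, t,
                        max_pos_dist, max_hamming):
    """Match query points to retrieval points moved into the query frame by (R, t)."""
    projected = [transform(R, t, p) for p in retr_pts]

    def best(src_pts, src_desc, dst_pts, dst_desc):
        # For each source point: gated candidate with the smallest Hamming distance.
        out = []
        for p, d in zip(src_pts, src_desc):
            cands = [(hamming(d, dst_desc[j]), j) for j, q in enumerate(dst_pts)
                     if math.dist(p, q) <= max_pos_dist
                     and hamming(d, dst_desc[j]) <= max_hamming]
            out.append(min(cands)[1] if cands else None)
        return out

    q2r = best(query_pts, query_desc, projected, retr_desc)
    r2q = best(projected, retr_desc, query_pts, query_desc)
    # Keep only pairs that choose each other (mutual check).
    return [(i, j) for i, j in enumerate(q2r) if j is not None and r2q[j] == i]
```

Because the coarse pose already brings the two point sets close together, the position gate can be tight, which is what lets this step recover more matching pairs than descriptor matching alone.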

B. HBA

After obtaining the relative pose and the point matching pairs between the keypoints of the query image and the local map points corresponding to the retrieval image, the next step is to optimize the global map. To keep the optimization efficient without sacrificing accuracy, we propose HBA. HBA combines the advantages of IBA and FBA to optimize the global map effectively at a low computational cost.

In the SLAM system, the accumulated drift can be calculated from the relative pose given by SLAM estimation and the relative pose given by LC estimation. From the accumulated drift we then calculate the angle drift and the distance drift. To judge the accuracy of the relative pose from the LC module, we calculate the error between the loop drifts of the previous query images within a temporal window and the loop drift of the current query image. From these errors we again obtain an angle error and a distance error within the temporal window. Finally, we count the number of temporally consistent loops. If the estimated relative pose is accurate enough, it should be temporally consistent with the previous query images within the temporal window, so this count should be greater than zero.
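The drift quantities can be illustrated with rigid transforms written as (R, t) pairs. The sketch below is an assumption rather than the paper's formulation: it takes the drift as the relative transform between the SLAM-estimated and LC-estimated poses, reads the angle from the rotation trace and the distance from the translation norm, and counts a previous loop as consistent when its drift differs from the current drift by less than hypothetical tolerances.

```python
import math

def mat_mul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)] for r in range(3)]

def transpose(A):
    return [[A[c][r] for c in range(3)] for r in range(3)]

def mat_vec(A, v):
    return [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]

def angle_and_distance(R_a, t_a, R_b, t_b):
    """Angle (rad) and distance of the relative transform T_a^{-1} * T_b."""
    R_inv = transpose(R_a)                       # inverse of a rotation is its transpose
    R_d = mat_mul(R_inv, R_b)
    t_d = mat_vec(R_inv, [b - a for a, b in zip(t_a, t_b)])
    cos_angle = (R_d[0][0] + R_d[1][1] + R_d[2][2] - 1.0) / 2.0
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against rounding
    distance = math.sqrt(sum(c * c for c in t_d))
    return angle, distance

def count_consistent_loops(current, previous, max_angle, max_dist):
    """Count previous loop drifts (R, t) that agree with the current loop drift."""
    R_c, t_c = current
    count = 0
    for R_p, t_p in previous:
        angle_err, dist_err = angle_and_distance(R_p, t_p, R_c, t_c)
        if angle_err < max_angle and dist_err < max_dist:
            count += 1
    return count
```

Calling angle_and_distance with the SLAM-estimated and LC-estimated relative poses gives the angle drift and distance drift; calling it on two loop drifts gives the angle error and distance error used for the consistency count.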

With the angle drift, the distance drift and the number of temporally consistent loops, we adopt the following optimization strategy. If the accumulated drift is small (the angle drift and the distance drift are both below their given thresholds) or the estimated relative pose fails the temporal consistency check (the number of temporally consistent loops is below its given threshold), we only add point matching pair constraints and optimize the related keyframe poses and map points with IBA. Otherwise, the SLAM system currently has a large accumulated drift, and the estimated relative pose is temporally consistent and accurate enough. In that case we add the estimated relative pose and the point matching pair constraints and optimize all keyframe poses and all map points with FBA. To accelerate the optimization, we first optimize the 6DOF poses of all keyframes and then optimize all keyframe poses together with all map points by FBA.
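The selection rule described above reduces to a small decision function. The sketch below mirrors the stated logic; the function name, the parameter names and any threshold values passed to it are hypothetical, since the paper's thresholds are not given here.

```python
def select_ba_mode(angle_drift, dist_drift, n_consistent,
                   angle_thresh, dist_thresh, min_consistent):
    """Choose between incremental BA ("IBA") and full BA ("FBA").

    IBA is used when the accumulated drift is small (both drifts below their
    thresholds) or when the loop is not temporally verified; otherwise FBA.
    """
    small_drift = angle_drift < angle_thresh and dist_drift < dist_thresh
    unverified = n_consistent < min_consistent
    return "IBA" if small_drift or unverified else "FBA"
```

With this rule, the expensive FBA only runs when it is both needed (large drift) and safe (the relative pose has been verified over the temporal window).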

Experiments

In this section, we evaluate the performance of our proposed DH-LC module on the public EuRoC [6] and KITTI [7] datasets. The evaluation metric is the absolute trajectory error (ATE) [8]. All experiments are carried out on a desktop PC with a 3.4 GHz i7 CPU and 8 GB of memory.
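For readers unfamiliar with the metric, ATE is commonly reported as the root-mean-square of the translational differences between estimated and ground-truth positions. The sketch below assumes the two trajectories are already time-associated and aligned to a common frame; that alignment step is omitted, and the function name is hypothetical.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMSE of position errors between two aligned, equally sampled trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```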

Table 1 compares the localization accuracy of our DH-LC module, plugged into the baseline SLAM system, with state-of-the-art methods on the EuRoC dataset, which consists of indoor scenes. The results for VINS-Fusion [5], ORB-SLAM3 [2] and ReID-SLAM [3] are taken from the published ReID-SLAM paper. We run the code for the other SLAM systems. Our DH-LC module plugged into the baseline SLAM system achieves the best localization accuracy among all compared systems, namely the baseline SLAM, VINS-Fusion, ORB-SLAM3 and ReID-SLAM. The baseline SLAM has no LC module. VINS-Fusion has an LC module, but it tends to fail under large viewpoint changes and does not optimize the global map. The experiments show that the localization accuracy of the baseline and VINS-Fusion is lower than that of the other SLAM systems. In the ORB-SLAM3 system, the local mapping and LC modules use BA to optimize the map, which achieves accurate localization in indoor scenes. In the ReID-SLAM system, both the feature ReID and LC modules use IBA to optimize the global map, which achieves accurate localization in indoor scenes where the accumulated drift is small. In addition, compared with variants that replace HSFM with ORB feature matching or BoW+ORB feature matching, our proposed HSFM method achieves the best localization accuracy.

Table 1.  Localization accuracy compared with the state-of-the-art methods on EuRoC dataset. Red means the smallest; Blue means the second smallest

Table 2.  Localization accuracy compared with the state-of-the-art methods on KITTI dataset. * means that there are loops in the sequence; Red means the smallest; Blue means the second smallest

To further validate our proposed DH-LC module in outdoor scenes, Table 2 quantitatively compares the localization accuracy of our DH-LC module, plugged into the baseline SLAM system, with state-of-the-art methods on the KITTI dataset. We run the released source code to obtain the VINS-Fusion and ORB-SLAM3 results. We also modify the ReID-SLAM and baseline SLAM code to support pure visual stereo camera inputs. Our method achieves the best localization accuracy among the compared SLAM systems, which shows that our proposed DH-LC module reduces the accumulated drift effectively. In the ORB-SLAM3 system, the LC module uses FBA to optimize the global map, but the LC module also tends to fail under large viewpoint changes. In the ReID-SLAM system, both the feature ReID and LC modules use IBA to optimize the global map. The experiments show that IBA is not accurate enough when the accumulated drift is large. In addition, compared with variants that replace HSFM with ORB feature matching or BoW+ORB feature matching, our proposed HSFM method achieves the best localization accuracy.

Conclusions

We propose the HSFM and HBA methods to improve LC recall, localization accuracy and the efficiency of BA optimization when plugged into a SLAM system. Our proposed method can be easily plugged into any stereo visual SLAM system and applies to a wide range of scenarios, including virtual/augmented reality and robotics.

Link to the paper

https://ieeexplore.ieee.org/document/9981061

References

[1] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. PP, no. 99, pp. 1–17, 2017.

[2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, 2021.

[3] X. Peng, Z. Liu, Q. Wang, Y.-T. Kim, M. Jeon, and H.-S. Lee, “Accurate visual-inertial slam by feature re-identification,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.

[4] H. Liu, M. Chen, G. Zhang, H. Bao, and Y. Bao, “Ice-ba: Incremental, consistent and efficient bundle adjustment for visual-inertial slam,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1974–1982.

[5] T. Qin, S. Cao, J. Pan, and S. Shen, “A general optimization-based framework for global pose estimation with multiple sensors,” arXiv preprint arXiv:1901.03642, 2019.

[6] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.

[7] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3354–3361.

[8] J. Sturm et al., “A benchmark for the evaluation of RGB-D SLAM systems,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012.