
MambaNext: An Enhanced Backbone Network with Focus Linear Attention

By Dafeng Zhang, Samsung R&D Institute China-Beijing
By Shizhuo Liu, Samsung R&D Institute China-Beijing

1 Introduction

When observing an object, the human visual system first focuses on it and, if the object remains unclear, draws on surrounding information to re-judge it. Existing linear attention, however, does not conform to these characteristics of the human visual system. To enhance the recognition and representational capabilities of the model, we propose a module aligned with the human visual system, named the Focus Linear Attention Module (FLAM). In this module, a focus attention mechanism simulates the center of human visual focus, reducing interference from redundant background information and enhancing the model's recognition ability. Concurrently, a separate global linear attention module perceives the global information of the image. To utilize global information more effectively and avoid background interference, we eschew the traditional element-wise sum or concatenation operations for fusing local and global information. Instead, we introduce a novel operation called Star Fusion, which treats global information as an attention map that guides the local information towards more relevant features and reduces background interference. In addition, Star Fusion implicitly increases the feature dimensions, further improving the recognition capability of the model.

2 Method

2.1 Overall Architecture

The MambaNext architecture is depicted in Figure 1 (a). Unlike VMamba, which directly uses a patch-sized convolution kernel to segment the input RGB image into non-overlapping patches, we utilize a stem module to gradually extract deeper features, thereby reducing the loss of spatial information. Each position in the feature maps is treated as a token. In our experiments, the patch size is set to 4×4, so the stem block yields (h/4)×(w/4) tokens with a channel dimension of C. The token sequence is then fed into a module composed of N1 Focus Linear Attention Modules (FLAMs) to extract higher-level semantic features. We refer to this process as stage 1, where the number of tokens remains unchanged.

To extract richer hierarchical features and expand the receptive field, the patch merging module in stage 2 merges the token sequence while increasing the model's width. As in stage 1, the token sequence is then passed through a module composed of N2 FLAMs for further feature extraction. Stages 3 and 4 follow the same strategy as stage 2: the number of tokens is reduced to 1/4 of the previous stage, and the number of channels is doubled.
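To make the stage layout concrete, the sketch below builds a four-stage hierarchy in PyTorch: a stem that downsamples by 4× in two steps, patch merging before stages 2-4 that quarters the tokens and doubles the channels, and a placeholder block standing in for the FLAM described in Section 2.2. The stem layers and the depth setting (2, 2, 9, 2) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Downsamples the input by 4x in two steps instead of a single large
    patchify convolution, so less spatial detail is discarded at once."""
    def __init__(self, in_ch=3, dim=96):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
        )

    def forward(self, x):          # (B, 3, H, W) -> (B, C, H/4, W/4)
        return self.body(x)

class PatchMerging(nn.Module):
    """Reduces the number of tokens to 1/4 and doubles the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2)

    def forward(self, x):
        return self.reduce(x)

def build_mambanext_like(dim=96, depths=(2, 2, 9, 2), block=nn.Identity):
    """Four-stage hierarchy: stage i applies depths[i] blocks; stages 2-4 are
    preceded by patch merging. `block` is a stand-in for the FLAM."""
    layers, chans = [Stem(3, dim)], dim
    for i, depth in enumerate(depths):
        if i > 0:
            layers.append(PatchMerging(chans))
            chans *= 2
        layers.extend(block() for _ in range(depth))
    return nn.Sequential(*layers)

features = build_mambanext_like()(torch.randn(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 768, 7, 7])
```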

Figure 1. (a) is the architecture of MambaNext. (b) is the Focus Linear Attention Module (FLAM). (c)-(e) are the Multi-Head Self-Attention, Window-based Multi-Head Self-Attention, and Sliding-Window Multi-Head Self-Attention, respectively. The red box represents the pixel size for calculating attention.

2.2 Focus Linear Attention Module (FLAM)

The FLAM is the core component of MambaNext, inspired by the human visual system [1]. The Focus Attention Module (FAM) within FLAM, based on sliding windows, focuses attention on the object of interest, enhancing the model's concentration. The Linear Attention Module (LAM) is used to perceive global context, providing additional information to enhance the recognition capabilities of the FAM.

1) Focus Attention Module (Sliding-Window Attention Module). In the human visual system, the focal point of the eyes primarily concentrates on the object of interest. Existing self-attention or linear attention mechanisms, such as those in ViT or Mamba, do not align with these characteristics of the human visual system. To address this limitation, we introduce the FAM, also called the Sliding-Window Multi-Head Self-Attention (SLW-MSA) module, which is designed to mimic human visual characteristics, as depicted in Figure 1 (e).

In Figure 1, MSA represents Multi-Head Self-Attention, exemplified by ViT, which performs self-attention across the entire image. Due to its quadratic computational complexity, MSA demands substantial computational resources. W-MSA stands for Window-based Multi-Head Self-Attention, as implemented in the Swin Transformer, which confines self-attention within fixed-size windows to balance computational efficiency and performance. Our SLW-MSA differs from W-MSA by simulating eye movements: attention is focused on the center of each sliding window through self-attention.
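As a rough illustration of the sliding-window idea, the sketch below computes single-head attention in which every spatial position attends only to a window centered on itself, in contrast to Swin's fixed window partition. This is a simplified approximation under our own assumptions, not the paper's SLW-MSA implementation; multi-head splitting and the usual projections are omitted.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=7):
    """Single-head attention where each query position attends only to the
    window x window neighbourhood centred on it. q, k, v: (B, C, H, W)."""
    B, C, H, W = q.shape
    pad = window // 2
    # Collect the local neighbourhood of keys/values around every position.
    k_win = F.unfold(k, window, padding=pad).view(B, C, window * window, H * W)
    v_win = F.unfold(v, window, padding=pad).view(B, C, window * window, H * W)
    q_flat = q.view(B, C, 1, H * W)
    # Scaled dot-product between each query and its local keys.
    attn = (q_flat * k_win).sum(dim=1) / C ** 0.5      # (B, window*window, H*W)
    attn = attn.softmax(dim=1)
    # Weighted sum over the neighbourhood values.
    out = (v_win * attn.unsqueeze(1)).sum(dim=2)       # (B, C, H*W)
    return out.view(B, C, H, W)

x = torch.randn(1, 32, 14, 14)
print(sliding_window_attention(x, x, x).shape)  # torch.Size([1, 32, 14, 14])
```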

2) Linear Attention Module. The effective receptive field is positively correlated with model performance [2-3]. However, traditional self-attention has quadratic computational complexity in the number of tokens. Hence, to enlarge the receptive field while maintaining computational efficiency, we utilize linear attention to extract global features. Since the self-attention operation is permutation invariant, it disregards crucial positional information in two-dimensional images. To incorporate positional encoding, we use Locally-Enhanced Positional Encoding (LePE) [4], which imposes positional information onto the linearly projected values and improves performance, as shown in Figure 1 (b).
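The sketch below shows one common way to realize linear attention with a LePE-style positional term: softmax kernel feature maps make the attention computation linear in the number of tokens, and a depthwise convolution over the values injects local positional information. The specific kernel choice and single-head layout are assumptions made for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LinearAttentionLePE(nn.Module):
    """Single-head linear attention with a depthwise-conv positional term.
    An illustrative sketch; the paper's exact design may differ."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.lepe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # LePE on V
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(dim=-1), k.softmax(dim=1)   # kernel feature maps
        context = k.transpose(1, 2) @ v              # (B, C, C), linear in N
        out = q @ context                            # (B, N, C) global response
        # Add locally-enhanced positional encoding computed on the values.
        v_map = v.transpose(1, 2).reshape(B, C, H, W)
        out = out + self.lepe(v_map).flatten(2).transpose(1, 2)
        return self.proj(out)

attn = LinearAttentionLePE(96)
print(attn(torch.randn(1, 14 * 14, 96), 14, 14).shape)  # torch.Size([1, 196, 96])
```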

2.3 Star Fusion

In order to leverage global information more efficiently while mitigating background interference, we dispense with the conventional element-wise sum or concatenation operations for fusing local and global information. Instead, we propose a novel operation known as Star Fusion. By treating global information as an attention map, this operation guides local information towards more relevant features, thereby reducing background interference. Moreover, Star Fusion implicitly expands the feature dimensions [5-7], ultimately enhancing the recognition capacity of our model.
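Following the description above and the star operation of [5], a plausible reading of Star Fusion is an element-wise product between the projected local (focus) branch and the projected global (linear-attention) branch, so that the global branch acts as an attention map over the local features. The concrete layers below are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class StarFusion(nn.Module):
    """Fuses local and global features with an element-wise product instead of
    sum or concatenation. Illustrative sketch; layer choices are assumed."""
    def __init__(self, dim):
        super().__init__()
        self.local_proj = nn.Linear(dim, dim)
        self.global_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, local_feat, global_feat):
        # The projected global branch re-weights the local branch element-wise,
        # guiding it towards relevant features; the product also implicitly maps
        # the features into a higher-dimensional space (cf. the star operation [5]).
        gate = self.global_proj(global_feat)
        fused = self.local_proj(local_feat) * gate
        return self.out_proj(fused)

sf = StarFusion(96)
print(sf(torch.randn(1, 196, 96), torch.randn(1, 196, 96)).shape)  # (1, 196, 96)
```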

3 Results

3.1 Image Classification

Table 1 presents the image classification performance of MambaNext on the ImageNet-1K dataset, comparing it with ConvNeXt, Swin Transformer, and VMamba. MambaNext is designed to be computationally efficient while maintaining high accuracy. For instance, the smallest variant, MambaNext-T, has 27.3 million parameters and requires 4.4 GFLOPs, less than VMamba-T (30.0M parameters, 4.9 GFLOPs). This shows that MambaNext can achieve similar or even better performance with fewer resources. Compared to ConvNeXt-T and Swin-T, MambaNext also demonstrates a favorable balance between parameter count and FLOPs. The performance gains of MambaNext can be attributed to its Focus Linear Attention Module (FLAM) and Star Fusion strategy, which enable more effective feature extraction and representation and thus lead to higher accuracy.

Table 1. Classification benchmarks on ImageNet-1K.

3.2 Object Detection and Instance Segmentation

Table 2 presents the object detection and instance segmentation results on the COCO dataset. MambaNext achieves higher AP scores for both box and mask predictions than ConvNeXt, Swin Transformer, and VMamba. For instance, MambaNext-T achieves an APbox of 47.4 and an APmask of 43.1, outperforming ConvNeXt-T, Swin-T, and VMamba-T. MambaNext-S and MambaNext-B similarly outperform their counterparts, achieving APbox scores of 49.0 and 49.5 and APmask scores of 43.9 and 44.3, respectively. This suggests that MambaNext's architecture is more adept at capturing fine-grained details and contextual information, leading to better detection and segmentation.

Table 2. Object detection and instance segmentation results on COCO.

3.3 Semantic Segmentation

Table 3 evaluates the semantic segmentation performance of MambaNext with UPerNet on the ADE20K dataset. MambaNext-T achieves a mean Intersection over Union (mIoU) of 49.2, significantly higher than ConvNeXt-T (46.0), Swin-T (44.5), and VMamba-T (47.9). The larger models, MambaNext-S and MambaNext-B, also outperform their counterparts, reaching mIoU scores of 50.8 and 51.3, respectively. This indicates that MambaNext's architecture is better suited for understanding scenes and segmenting them into different categories.

Table 3. Semantic segmentation results with UPerNet on ADE20K.

4 Conclusion

Our MambaNext model has demonstrated its effectiveness through extensive experiments on various benchmarks, including image classification, object detection, and segmentation. Drawing inspiration from the human visual system, MambaNext introduces the Focus Linear Attention Module (FLAM) and the Star Fusion strategy, which together enhance the model's capacity to recognize and represent visual information. The FLAM component effectively reduces background interference, allowing the model to focus more accurately on salient features within images, while Star Fusion incorporates global context, improving the model's ability to understand complex scenes. Star Fusion also implicitly increases the feature dimensions to improve linear separability. MambaNext's architecture ensures scalability and adaptability, making it suitable for a wide range of visual tasks.

Link to the paper

https://ieeexplore.ieee.org/document/10887988

References

[1] Shi D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 17773-17783.

[2] Luo W, Li Y, Urtasun R, et al. Understanding the effective receptive field in deep convolutional neural networks[J]. Advances in neural information processing systems, 2016, 29.

[3] Ding X, Zhang X, Han J, et al. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 11963-11975.

[4] Dong X, Bao J, Chen D, et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 12124-12134.

[5] Ma X, Dai X, Bai Y, et al. Rewrite the Stars[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 5694-5703.

[6] Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis[M]. Cambridge: Cambridge University Press, 2004.

[7] Cortes C, Vapnik V. Support-Vector Networks[J]. Machine Learning, 1995, 20(3): 273-297.