Academic Journal of Computing & Information Science, 2025, 8(11); doi: 10.25236/AJCIS.2025.081102.

MambaVision-Count: A Crowd Counting System Based on a Hybrid Architecture

Author(s)

Lu Chen, Lei Ding

Corresponding Author

Lu Chen

Affiliation(s)

School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an, Shaanxi, China

Abstract

Counting people in highly crowded, heavily occluded, and scale-varying scenes remains challenging. This paper presents MambaVision-Count, a crowd counting framework built on the efficient visual backbone MambaVision. By combining convolution, state-space modeling, and self-attention, the framework captures long-range dependencies and global context, which helps it handle complex variations in crowd distribution. A dual-branch regression head predicts the density map and the total count simultaneously, and an EFC feature fusion module strengthens the representation of small-target regions, improving the accuracy and robustness of counting. Extensive experiments on datasets such as ShanghaiTech show that the proposed method outperforms existing state-of-the-art approaches in both accuracy and inference efficiency, indicating strong practical potential for real-world applications.
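The abstract does not say how the two outputs of the dual-branch head are supervised. As a rough illustration only, the sketch below assumes a common setup in density-based counting: the ground-truth count is the sum of the ground-truth density map, and the two branches are trained with a pixel-wise MSE term plus a weighted absolute count error. The function names, the loss form, and the weight `lam` are hypothetical and are not taken from the paper.

```python
from typing import List

DensityMap = List[List[float]]


def count_from_density(density: DensityMap) -> float:
    """Sum all pixels of a density map to obtain a crowd count."""
    return sum(sum(row) for row in density)


def dual_branch_loss(
    pred_density: DensityMap,
    pred_count: float,
    gt_density: DensityMap,
    lam: float = 0.1,  # hypothetical weight for the count branch
) -> float:
    """Assumed joint loss: density-map MSE plus lam * absolute count error."""
    n_pixels = sum(len(row) for row in gt_density)
    # Density branch: mean squared error over all pixels.
    mse = sum(
        (p - g) ** 2
        for pred_row, gt_row in zip(pred_density, gt_density)
        for p, g in zip(pred_row, gt_row)
    ) / n_pixels
    # Count branch: absolute error against the integrated ground-truth map.
    count_error = abs(pred_count - count_from_density(gt_density))
    return mse + lam * count_error


if __name__ == "__main__":
    gt = [[0.0, 1.0], [1.0, 0.0]]  # toy 2x2 map containing two people
    print(count_from_density(gt))
    print(dual_branch_loss(gt, 3.0, gt, lam=0.5))
```

In this toy case the density branch is perfect, so the whole loss comes from the count branch overestimating by one person. How the paper actually weights or combines the two branches is not stated in the abstract.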

Keywords

Crowd Counting; Feature Fusion; Transformer; Mamba

Cite This Paper

Lu Chen, Lei Ding. MambaVision-Count: A Crowd Counting System Based on a Hybrid Architecture. Academic Journal of Computing & Information Science (2025), Vol. 8, Issue 11: 14-22. https://doi.org/10.25236/AJCIS.2025.081102.
