Academic Journal of Computing & Information Science, 2025, 8(11); doi: 10.25236/AJCIS.2025.081107.
Mengran Zhou1, Chao Qin2
1School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
2School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China
High-concentration coal dust in confined underground coal mine spaces severely degrades video quality, which makes behavior detection difficult and hinders the learning of discriminative features. To address this, this study proposes CRR-YOLO, an improved algorithm based on YOLOv11n. First, a cross-modal scene-object matching module, CM-SOM, is designed to ease discriminative feature learning. By introducing a Vision-Language Model (VLM), it establishes interaction between the visual and linguistic modalities and widens the feature-space separation between targets and backgrounds, improving the detector's semantic discrimination in scenes that lack discriminative features. Second, a context prior-guided feature extraction network, RepVIT, is embedded in the backbone. It builds a dynamic contextual information flow through gated dynamic spatial aggregation, providing dual guidance of features and weights and strengthening the model's global semantic understanding and contextual dependency modeling of the scene. Third, a feature fusion network with a recalibration mechanism, Re-FPN, is designed. Its selective boundary aggregation module and lightweight feature enhancement module interact bidirectionally so that boundary details and high-level semantic information complement each other, optimizing multi-scale feature fusion. Experiments on DsLMF+, a dedicated underground coal mine behavior dataset, show that CRR-YOLO achieves 84.3% mAP@0.5 and a 79.1% F1-score, outperforming several advanced models. With only 2.4M parameters and 6.2 GFLOPs, it reaches an inference speed of 253 FPS, striking a favorable balance among accuracy, speed, and complexity and showing strong potential for practical application.
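The abstract does not specify the internals of CM-SOM, so the following is only a minimal sketch of the general idea of text-guided feature reweighting, not the authors' implementation. It assumes each spatial location has a visual feature vector and that a text embedding of the target class is available from some language model. The function names, the sigmoid gate, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def text_guided_reweight(visual_feats, text_emb, temperature=0.1):
    """Scale each visual feature vector by a gate derived from its text similarity.

    Hypothetical illustration: the gate is a sigmoid of the cosine similarity
    divided by `temperature`, so locations that match the text embedding are
    kept and locations that oppose it are suppressed.
    """
    reweighted, gates = [], []
    for feat in visual_feats:
        sim = cosine(feat, text_emb)
        gate = 1.0 / (1.0 + math.exp(-sim / temperature))
        gates.append(gate)
        reweighted.append([gate * x for x in feat])
    return reweighted, gates


# Toy inputs (made up for illustration, not taken from the paper).
text_emb = [1.0, 0.0, 0.0]
visual_feats = [
    [0.9, 0.1, 0.0],   # closely aligned with the text embedding
    [0.0, 1.0, 0.0],   # orthogonal to the text embedding
    [-1.0, 0.0, 0.0],  # opposite to the text embedding
]
reweighted, gates = text_guided_reweight(visual_feats, text_emb)
```

In this toy setup the aligned location keeps almost all of its magnitude, the orthogonal one is halved, and the opposed one is nearly zeroed, which is one simple way to widen the separation between target-like and background-like features.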
Behavior Detection, Real-Time Monitoring, Cross-Modal Guidance, YOLOv11n
Mengran Zhou, Chao Qin. Real-Time Personnel Behavior Detection in Dusty Coal Mines via Dehazing-Enhanced YOLO with Cross-Modal Guidance. Academic Journal of Computing & Information Science (2025), Vol. 8, Issue 11: 62-70. https://doi.org/10.25236/AJCIS.2025.081107.