Academic Journal of Computing & Information Science, 2023, 6(3); doi: 10.25236/AJCIS.2023.060310.
Yifan Deng, Shaoqing Mo, Haiyun Gan, Jiangjiang Wu
School of Automobile and Transportation, Tianjin University of Technology and Education, Tianjin, China
Deep-learning-based perception of traffic scenes currently follows two mainstream directions: object detection and semantic segmentation. Object detection localizes individual objects in the road scene, while semantic segmentation classifies every pixel into object or background categories. However, when pedestrians and other objects occlude one another, semantic segmentation alone cannot readily separate individual instances, and the bounding boxes produced by object detection contain redundant information. To address this problem, this paper proposes a method that combines object detection and semantic segmentation. One branch uses the YOLOv5 model to detect people, vehicles, and other objects in the captured traffic scene image. In parallel, an improved DeepLabv3+ network captures the semantic and regional information of the road in the same image. Finally, the predictions of the two task branches are drawn onto the image to be detected, and the combined result is output as a single image. Because the two branches complement each other, the method distinguishes individual people, vehicles, roads, and other elements of the traffic scene, which supports road-scene understanding for driverless vehicles and improves detection accuracy. Experimental results show that the method achieves an mAP of 79.11% with high segmentation accuracy, making it suitable for autonomous driving on real urban roads.
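The final fusion step described above, drawing both branches' predictions onto one image, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`overlay_mask`, `draw_box`, `fuse`), the colors, and the blending weight are assumptions made for illustration, and the boxes and road mask are hand-written placeholders standing in for the outputs of YOLOv5 and the improved DeepLabv3+ network.

```python
import numpy as np

def overlay_mask(image, mask, color, alpha=0.5):
    """Blend `color` into the pixels of an RGB image where `mask` is True."""
    out = image.astype(np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float32)
    return out.round().astype(np.uint8)

def draw_box(image, box, color, thickness=1):
    """Draw a rectangle outline; `box` is (x1, y1, x2, y2), inclusive pixel coordinates."""
    out = image.copy()
    x1, y1, x2, y2 = box
    t = thickness
    out[y1:y1 + t, x1:x2 + 1] = color          # top edge
    out[y2 - t + 1:y2 + 1, x1:x2 + 1] = color  # bottom edge
    out[y1:y2 + 1, x1:x1 + t] = color          # left edge
    out[y1:y2 + 1, x2 - t + 1:x2 + 1] = color  # right edge
    return out

def fuse(image, road_mask, detections):
    """Draw the segmentation result first, then the detection boxes on top.

    `road_mask` is a boolean HxW array (placeholder for the segmentation branch);
    `detections` is a list of (box, label) pairs (placeholder for the detection branch).
    """
    out = overlay_mask(image, road_mask, color=(0, 255, 0))
    for box, _label in detections:
        out = draw_box(out, box, color=(255, 0, 0))
    return out

if __name__ == "__main__":
    image = np.zeros((8, 8, 3), dtype=np.uint8)   # stand-in for a camera frame
    road_mask = np.zeros((8, 8), dtype=bool)
    road_mask[4:, :] = True                       # pretend the lower half is road
    detections = [((1, 1, 5, 5), "person")]       # pretend one detected pedestrian
    fused = fuse(image, road_mask, detections)
    print(fused.shape, fused.dtype)
```

Drawing the mask before the boxes keeps the box outlines fully visible where they cross the road region; the opposite order would tint them with the road color.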
Target detection; Semantic segmentation; Instance segmentation; Complex traffic scenarios
Yifan Deng, Shaoqing Mo, Haiyun Gan, Jiangjiang Wu. Research on Traffic Scene Recognition Algorithm Combining Object Detection and Semantic Segmentation. Academic Journal of Computing & Information Science (2023), Vol. 6, Issue 3: 75-83. https://doi.org/10.25236/AJCIS.2023.060310.