
Academic Journal of Computing & Information Science, 2022, 5(14); doi: 10.25236/AJCIS.2022.051417.

Building Image Segmentation Method with Multi-Attention Mechanism


Xinyang Tian¹, Mingkun Xu¹

Corresponding Author:
Mingkun Xu

¹ School of Computer Science (National Pilot Software Engineer School), Beijing University of Posts and Telecommunications, Beijing, China


Abstract

UNet has difficulty capturing global and local features at the same time, which leads to inaccurate edge segmentation and the loss of small buildings. To address this problem, CSUNet is proposed, based on coordinate attention and self-attention. CSUNet fuses coordinate attention into the encoder, uses a Double-channel Skip Connection Transformer (DSCT) model in the skip connections, and uses a feature fusion module (FFM) based on CBAM channel attention to fuse the output of each skip connection with the upsampling result of the decoder. The model is tested on the instance dataset of typical Chinese cities and on the WHU East Asia satellite dataset. On the instance dataset of typical Chinese cities, pixel accuracy (PA) reaches 0.9390 and intersection over union (IoU) reaches 0.8227; on the WHU East Asia satellite dataset, PA reaches 0.9847 and IoU reaches 0.8332. All of these indicators improve on UNet. Visually, CSUNet extracts building details such as edges and corners more accurately and recovers the location and contour of small buildings. The experiments show that CSUNet improves the performance of building feature extraction.
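The two reported indicators can be illustrated with a minimal Python sketch. It assumes binary masks (1 = building, 0 = background) and the usual definitions PA = (TP + TN) / all pixels and IoU = TP / (TP + FP + FN) for the building class. The abstract does not state whether IoU is computed for the building class only or averaged over classes, so the building-class form used here is an assumption, and the function names and the toy masks are hypothetical.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN, TN for two binary masks of the same shape."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1  # building predicted, building present
            elif p == 1 and t == 0:
                fp += 1  # building predicted, background present
            elif p == 0 and t == 1:
                fn += 1  # building missed
            else:
                tn += 1  # background predicted, background present
    return tp, fp, fn, tn


def pixel_accuracy(pred: Mask, truth: Mask) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def building_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union for the building class (assumed definition)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    # Convention chosen for this sketch: two empty masks agree perfectly.
    return tp / union if union else 1.0


# Toy 3x4 masks, invented for illustration only.
pred_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
truth_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

pa = pixel_accuracy(pred_mask, truth_mask)
iou = building_iou(pred_mask, truth_mask)
print(f"PA = {pa:.4f}, IoU = {iou:.4f}")
```

In the toy example one building pixel is missed and one background pixel is mislabelled, so PA stays fairly high while IoU drops more sharply. This is why IoU is the stricter indicator when buildings occupy a small share of the image.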


Keywords

building image segmentation, UNet model, coordinate attention, double-channel skip connection transformer, feature fusion

Cite This Paper

Xinyang Tian, Mingkun Xu. Building Image Segmentation Method with Multi-Attention Mechanism. Academic Journal of Computing & Information Science (2022), Vol. 5, Issue 14: 113-125. https://doi.org/10.25236/AJCIS.2022.051417.

