Academic Journal of Computing & Information Science, 2024, 7(3); doi: 10.25236/AJCIS.2024.070303.

Bi-Branch Weakly Supervised Semantic Segmentation with Transformer

Author(s)

Yijiang Wang1, Hongxu Zhang2

Corresponding Author:
Yijiang Wang
Affiliation(s)

1School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China

2School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China

Abstract

Weakly supervised semantic segmentation (WSSS) based on image-level labels has garnered widespread attention due to its cost-effectiveness. In image-level WSSS, existing methods typically rely on CNNs to generate Class Activation Maps (CAMs) for locating object regions and obtaining pseudo labels. However, CAMs often focus solely on the most discriminative regions, neglecting other valuable information in each image and producing incomplete localization maps. To address this partial-activation issue, we propose Bi-Branch Weakly Supervised Semantic Segmentation with Transformer (Bi-Trans), which comprises class-specific seed (SC-CAM) generation, a seed consistency loss (SCC Loss), and a pairwise affinity consistency loss (PAC Loss). Specifically, class-specific initial seeds are extracted directly using the Multi-Head Self-Attention (MHSA) mechanism in the Transformer encoder, bypassing the need for complex training. The SCC Loss minimizes the distance between initial seeds generated from two different views, thereby enhancing the feature representation of the original seeds and improving their quality. The PAC Loss enforces consistency of regional affinity within each view, enhances target similarity in the affinity matrix, and effectively mitigates background noise in the seed regions. We evaluate our method on the PASCAL VOC 2012 and MS COCO 2014 segmentation benchmarks. The results demonstrate that Bi-Trans produces superior pseudo-masks using only image-level labels, achieving improved WSSS performance.

Keywords

WSSS, CAM, Transformer, SCC Loss, PAC Loss
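The full loss definitions do not appear in this excerpt; as a rough illustration of the two consistency objectives named above, the following is a minimal NumPy sketch. All function names, shapes, and the choice of cosine similarity and mean absolute distance are our assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def scc_loss(cam_view1, cam_view2):
    """Seed consistency (SCC-style): mean absolute distance between
    class-specific seed maps produced from two augmented views.
    Shapes: (C, H, W) -- C classes over an H x W spatial map."""
    return float(np.mean(np.abs(cam_view1 - cam_view2)))

def pairwise_affinity(features):
    """Pairwise affinity matrix over flattened token features.
    features: (N, D) -> (N, N) cosine-similarity affinities."""
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return norm @ norm.T

def pac_loss(feat_view1, feat_view2):
    """Affinity consistency (PAC-style): mean absolute distance
    between the two views' affinity matrices."""
    a1 = pairwise_affinity(feat_view1)
    a2 = pairwise_affinity(feat_view2)
    return float(np.mean(np.abs(a1 - a2)))

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))   # 16 tokens, 8-dim features
cams = rng.random((3, 4, 4))           # 3 classes, 4x4 seed maps

print(scc_loss(cams, cams))   # identical views -> 0.0
print(pac_loss(feats, feats)) # identical views -> 0.0
```

In practice the two views would come from differently augmented crops of the same image, and both losses would be minimized jointly with the classification objective; the sketch only shows the consistency terms in isolation.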

Cite This Paper

Yijiang Wang, Hongxu Zhang. Bi-Branch Weakly Supervised Semantic Segmentation with Transformer. Academic Journal of Computing & Information Science (2024), Vol. 7, Issue 3: 21-31. https://doi.org/10.25236/AJCIS.2024.070303.
