Academic Journal of Computing & Information Science, 2026, 9(1); doi: 10.25236/AJCIS.2026.090102.
Wei Wang, Yu Xiang, Yonghao Wu, Tongzhu Zhao, Tiancai Zhu
School of Information Science and Technology, Yunnan Normal University, Kunming, China
In traditional knowledge distillation, a significant capacity gap between the teacher and student models often leads to information loss and performance degradation. To address this issue, this study proposes a two-stage knowledge distillation framework. In the first stage, a progressive distillation strategy is employed, transferring knowledge from RoBERTa to BERT and then to BiLSTM, gradually reducing model complexity while sharing model weights to enhance knowledge transfer. In the second stage, a Conditional Generative Adversarial Network (CGAN) is introduced, using the first-stage student model's output as a conditional input for adversarial training and guiding optimization between RoBERTa and BiLSTM to improve the student model's classification performance. Additionally, z-score normalization is applied so that the student model focuses on the relative relationships between classes rather than absolute logit values, effectively mitigating the performance bottleneck caused by the capacity gap. Experimental results on multiple NLP datasets demonstrate that the proposed method significantly improves the student model's classification performance, retaining 85%-90% of the teacher model's performance with substantially fewer parameters and outperforming traditional knowledge distillation methods.
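The z-score normalization of logits described above can be pictured with a short sketch. The following is a minimal illustration, assuming a PyTorch implementation; the temperature value, the epsilon constant, and the function names are our own placeholders rather than the paper's actual configuration. It standardizes the teacher's and student's logits per sample before the usual temperature-softened KL-divergence distillation loss, so the student matches the teacher's relative class relationships rather than its absolute logit scale.

```python
# Minimal sketch (PyTorch assumed): z-score normalization of logits before the
# KL-divergence distillation loss. Values such as the temperature and epsilon
# are illustrative assumptions, not the paper's reported settings.
import torch
import torch.nn.functional as F

def zscore(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize each sample's logits to zero mean and unit variance."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + eps)

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened, z-score-normalized distributions."""
    s = F.log_softmax(zscore(student_logits) / temperature, dim=-1)
    t = F.softmax(zscore(teacher_logits) / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Usage example: a batch of 4 samples over 5 classes; the teacher's logits sit on
# a much larger scale, which the z-score step removes before matching.
student_logits = torch.randn(4, 5)
teacher_logits = torch.randn(4, 5) * 10
print(distillation_loss(student_logits, teacher_logits))
```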
Knowledge Distillation, Generative Adversarial Networks, Text Classification, Artificial Intelligence
Wei Wang, Yu Xiang, Yonghao Wu, Tongzhu Zhao, Tiancai Zhu. Adversarial Distillation: Combining Two-Stage Knowledge Distillation with Conditional Generative Adversarial Networks. Academic Journal of Computing & Information Science (2026), Vol. 9, Issue 1: 12-23. https://doi.org/10.25236/AJCIS.2026.090102.