Academic Journal of Computing & Information Science, 2024, 7(12); doi: 10.25236/AJCIS.2024.071213.
Improving Text Classification by Leveraging Large Language Models for Data Augmentation
Siyun Yu
Tibet National University, School of Information Engineering, Xianyang, Shaanxi, 712082, China
Abstract: With the rapid development of large language models (LLMs) in recent years, using them to improve BERT's performance on small-scale datasets has become an active research topic. To this end, this paper proposes an LLM-based data augmentation method that enhances BERT's performance on text classification tasks. Specifically, we first use an LLM to back-translate the training data: each text is translated into another language and then back into the original language, producing new samples that are semantically consistent with the originals but more varied in expression, thereby increasing the diversity of the training data. The augmented training set is then used to fine-tune the BERT model, which improves classification accuracy on the Reuters news classification dataset. Experimental results show that this method effectively mitigates the limitations of small-scale datasets and significantly enhances the model's generalization ability, providing a novel and efficient solution for text classification tasks.
Keywords: GPT-3.5, BERT, Text Classification, Data Augmentation
Siyun Yu. Improving Text Classification by Leveraging Large Language Models for Data Augmentation. Academic Journal of Computing & Information Science (2024), Vol. 7, Issue 12: 91-95. https://doi.org/10.25236/AJCIS.2024.071213.
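The abstract describes a two-stage pipeline: LLM back-translation to augment the training set, followed by BERT fine-tuning on the augmented data. Since the paper does not publish its implementation, the following is a minimal sketch under stated assumptions: gpt-3.5-turbo via the OpenAI Python client as the back-translation engine, French as the single pivot language, bert-base-uncased fine-tuned with Hugging Face's Trainer, and two toy samples standing in for the Reuters corpus. The helper back_translate and all hyperparameters are illustrative, not the authors' actual configuration.

```python
# Sketch of LLM back-translation augmentation + BERT fine-tuning.
# Assumptions (not from the paper): gpt-3.5-turbo, French pivot,
# bert-base-uncased, Hugging Face Trainer, toy data.
import torch
from openai import OpenAI
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def back_translate(text: str, pivot: str = "French") -> str:
    """Translate text into a pivot language and back with an LLM, yielding a
    paraphrase that is semantically consistent but worded differently."""
    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # some randomness encourages lexical variety
        )
        return resp.choices[0].message.content.strip()

    pivot_text = ask(f"Translate the following text into {pivot}. "
                     f"Reply with the translation only:\n\n{text}")
    return ask(f"Translate the following {pivot} text into English. "
               f"Reply with the translation only:\n\n{pivot_text}")


# Toy training data standing in for the Reuters corpus.
train_texts = ["Oil prices rose sharply after the OPEC meeting.",
               "The central bank left interest rates unchanged."]
train_labels = [0, 1]  # e.g. "crude" vs. "money-fx" topics

# Augment: each original sample gains one back-translated paraphrase
# with the same label, doubling the training set.
aug_texts = train_texts + [back_translate(t) for t in train_texts]
aug_labels = train_labels * 2


class NewsDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the Trainer API."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-reuters-aug",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=NewsDataset(aug_texts, aug_labels, tokenizer),
)
trainer.train()
```

Since the abstract mentions translating into "other languages" in the plural, the same loop could be run with several pivot languages to multiply the augmented set further; each pivot tends to introduce different paraphrasing patterns.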