
Academic Journal of Computing & Information Science, 2022, 5(10); doi: 10.25236/AJCIS.2022.051015.

Research on protein solubility prediction based on ensemble learning and feature fusion


Hongqi Feng, Tao Wu

Corresponding Author:
Tao Wu

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou 213164, China


Solubility is an important property of a protein that can participate in and inhibit the physiological and biochemical processes of cancer cells in the human body. Understanding protein solubility may therefore help reveal the mechanisms of diseases related to it. Existing protein solubility prediction methods have difficulty extracting rich feature information from protein sequences. To address this shortcoming and improve prediction performance, this paper proposes a protein solubility prediction model named EL-FFsol, which is based on the CatBoost ensemble learning framework and the fusion of multiple protein sequence features. First, several protein sequence features were combined into a fused representation: physicochemical properties, one-hot encoding, amino acid composition and statistical features. Next, CatBoost was used to build an ensemble learning model that predicts protein solubility. Finally, EL-FFsol was evaluated on the benchmark dataset. It achieved an accuracy of 0.7679, a Matthews correlation coefficient of 0.5480, a sensitivity of 0.6630, a specificity of 0.8729, an area under the ROC curve of 0.8540 and an area under the P-R curve of 0.8440. Compared with DeepSOL and DDcCNN, the Matthews correlation coefficient increased by 1.68% and 0.79%, the area under the ROC curve increased by 1.60% and 2.20%, and the area under the P-R curve increased by 1.70% and 2.40%, respectively.
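As a rough illustration of the ideas in the abstract, the following Python sketch builds a fused feature vector from a protein sequence and computes the threshold-based evaluation metrics from confusion-matrix counts. It is not the authors' implementation. The function names, the use of the Kyte-Doolittle hydropathy scale as the only physicochemical property, sequence length as the only statistical feature, and the `max_len` padding length are all assumptions made for this example. In a full pipeline, the fused vectors would presumably be passed to a gradient-boosting classifier such as CatBoost, which is left out here to keep the sketch dependency-free.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy scale. ASSUMPTION: used only as a stand-in for
# the physicochemical properties, which the abstract does not enumerate.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in the sequence."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def one_hot(seq, max_len):
    """Flattened one-hot encoding, truncated or zero-padded to max_len residues."""
    vec = [0.0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq[:max_len]):
        idx = AMINO_ACIDS.find(aa)
        if idx >= 0:  # non-standard residues are left as all-zero rows
            vec[pos * len(AMINO_ACIDS) + idx] = 1.0
    return vec


def fuse_features(seq, max_len=8):
    """Concatenate the feature groups into one vector (hypothetical layout)."""
    mean_hydropathy = sum(HYDROPATHY.get(aa, 0.0) for aa in seq) / len(seq)
    stats = [float(len(seq))]  # ASSUMPTION: length as the sole statistical feature
    return (amino_acid_composition(seq) + [mean_hydropathy] + stats
            + one_hot(seq, max_len))


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}


features = fuse_features("ACDA", max_len=4)
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

Simple concatenation is the most basic form of feature fusion. It lets a tree-based learner choose splits across all feature groups at once, at the cost of a fixed-length one-hot block that requires truncating or padding each sequence.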


protein solubility; sequence information; multiple feature fusion; ensemble learning

Cite This Paper

Hongqi Feng, Tao Wu. Research on protein solubility prediction based on ensemble learning and feature fusion. Academic Journal of Computing & Information Science (2022), Vol. 5, Issue 10: 90-100. https://doi.org/10.25236/AJCIS.2022.051015.

