Diabetes mellitus (DM) is a chronic metabolic disease which is characterized by upregulated blood glucose levels (Sajan et al., 2021). This disorder was ranked as the 7th leading cause of mortality in 2016 and caused 1.6 million deaths worldwide (WHO, 2018). It is predicted that roughly one-half of DM patients globally have not been diagnosed, so they are at higher risk of developing serious complications than those receiving treatment (Saeedi et al., 2019). Many bioactive peptides have shown potential anti-diabetic activities and are promising to be alternative treatment measures in the management of diabetes (Acquah et al., 2022).
The occurrence of DM leads to long-term damage and dysfunction of numerous tissues, but the most affected is the vascular system (Aqib et al., 2019). There is no cure for DM so far. Conventional insulin injections have been used for DM treatment, but this treatment causes multiple complications such as hyperinsulinemia, pain, and discomfort. Glucose-lowering therapy with anti-diabetic drugs can also be employed to manage DM (Husain et al., 2019). However, the existing anti-diabetic agents pose negative impacts on patient health (Aqib et al., 2019; Daliri, Oh & Lee, 2017; Toroski et al., 2019). Another problem is that the cost of these drugs is increasingly high and are not affordable to three-quarters of the global low-income population (Saeedi et al., 2019). Since the current therapeutic medications have distinct drawbacks, developing new anti-diabetic drugs to prevent, manage, and treat diabetes is urgently needed.
Many peptides have been demonstrated to have impressive anti-diabetic effects. These anti-diabetic peptides modulate blood glucose levels through multiple mechanisms, such as activating the glucagon-like peptide-1 (GLP-1) receptor (Pratley et al., 2018) and inhibiting specific enzymes (Lu et al., 2019; Hatanaka et al., 2012; Mojica & de Mejia, 2016; Mudgil et al., 2019). The BioDADPep database has been developed for storing and managing anti-diabetic peptides (Roy & Teron, 2019). It holds 2,544 entries with 1,143 unique peptides and their relevant detailed information. Out of 1,143 peptides, 1,097 are natural peptides containing only 20 natural amino acids, while 46 are modified peptides. Each entry includes peptide sequence, peptide source (start position–end position), protein function, assay_method/preclinical/clinical and reference, etc. Most of anti-diabetic peptides in BioDADPep were manually collected from published articles, while a small part of anti-diabetic peptides was from other databases (THPdb, ADP3, SATPdb, etc.). BioDADPep is a valuable resource for the design and discovery of novel anti-diabetic peptides. Machine learning based computational models can effectively identify anti-diabetic peptides. However, there is currently no classifier for predicting such peptides.
Here we propose the first bioinformatics tool, called AntiDMPpred. AntiDMPpred is a machine learning-based model to identify anti-diabetic peptides from sequence information. To improve the ease of use, a web server was constructed for AntiDMPpred (http://i.uestc.edu.cn/AntiDMPpred/cgi-bin/AntiDMPpred.pl). We expect that AntiDMPpred could help users find novel anti-diabetic peptides and further advance peptide drug discovery for diabetes treatment.
Materials and Methods
Overall workflow of AntiDMPpred
The overall framework of AntiDMPpred is shown in Fig. 1. First, a benchmark dataset with 236 anti-diabetic and 236 non-anti-diabetic peptides was established. Second, each peptide was represented by diverse peptide descriptors that were then concatenated into a high-dimensional feature vector. Third, different feature scoring methods were utilized to obtain the optimal features which were then fed into various machine learning classifiers. This step was to select the optimal combination of machine learning and feature scoring methods with the highest accuracy in the nested five-fold cross-validation. Finally, the optimal combination was utilized to develop the final model, and the model was then implemented into a web service, which is freely available at http://i.uestc.edu.cn/AntiDMPpred/cgi-bin/AntiDMPpred.pl.
Anti-diabetic peptides were downloaded from the BioDADPep database (Roy & Teron, 2019). We collected 2,544 peptide records. Modified peptides were first excluded because the number of such peptides in BioDADPep is very limited. In this study, natural peptides refer to peptides containing only 20 natural amino acids, while modified peptides represent peptides with unnatural residues. After eliminating peptides containing less than five amino acids or more than 50 amino acids, 2,125 anti-diabetic peptides were remained. We then used CD-HIT (Fu et al., 2012) to reduce redundant sequences with 80% sequence identity cut-off. Finally, the anti-diabetic dataset consisted of 236 anti-diabetic peptides. For the lack of experimentally validated non-anti-diabetic peptides, the non-anti-diabetic peptides were obtained from the AVPdb database (Qureshi et al., 2014). For these peptides, their anti-diabetic activity has not been reported. AVPdb is a database specifically tailored for experimentally validated antiviral peptides (AVPs) that target over 60 viruses and have exhibited tremendous potential in inhibiting viruses (Qureshi et al., 2014). It contains 2,683 peptides, of which 624 are modified peptides. We downloaded 2,059 natural peptides where 2,025 peptides had between 5–50 amino acids from AVPdb. After excluding highly similar peptides by CD-HIT (80% sequence identity threshold), the non-anti-diabetic dataset was composed of 699 peptide sequences. To balance the sample number of anti-diabetic and non-anti-diabetic dataset, we randomly selected 236 peptides from the non-anti-diabetic dataset for training. Finally, the training dataset was comprised of 236 anti-diabetic and 236 non-anti-diabetic peptides.
Each peptide was encoded by the following four types of peptide descriptors: amino acid composition (AAC), dipeptide composition (DPC), pseudo amino acid composition (PseAAC) and composition of k-spaced amino acid group pairs (CKSAAGP). These sequence-derived features have been successfully utilized to classify various types of peptides (He et al., 2016; He, Chen & Huang, 2019; Li et al., 2017). Each peptide descriptor was calculated by in-house scripts.
AAC calculates the percentage of each amino acid in a peptide sequence and generates a 20-dimensional feature vector, and it is formulated as: (1) where represents the number of any one of 20 amino acids (i.e., “ACDEFGHIKLMNPQRSTVWY”).
DPC, a feature extraction method, calculates the frequency of each amino acid pair within a peptide sequence, generating a 400-dimensional feature vector. It can be defined as follows: (2) where denotes the number of any one of 400 amino acid pairs (i.e., “AA, AC, AD…YY”).
The PseAAC compensates for AAC and incorporates sequence-order information by introducing discrete factors. In the calculation formula of PseAAC, the weight factor ω is designed to put weight to the additional PseAAC with respect to the conventional AAC. λ represents the counted rank correlation factor along a peptide sequence. PseAAC generates a vector of 20 + λ dimensions for each peptide sequence. In this study, we set ω = 0.05 and λ = 5 to compute PseAAC, so the 25-dimensional feature vector consisted of the first 20 features from AAC and last five features representing the rank-different correlation factors for the sequence-order information. The explicit description of PseAAC can be found at (Chou, 2009).
The CKSAAGP feature encoding is transformed from the composition of k-spaced amino acid pairs (CKSAAP) (Chen et al., 2009). CKSAAP calculates the frequency of residue pairs separated by k-spaced residues, while CKSAAGP computes the occurrence of group pairs separated by k-spaced residues. For generating CKSAAGP, 20 amino acid residues are first classified into five categories: including the aliphatic residue group (g1: GAVLMI), the aromatic residue group (g2: FYW), the positive charged residue group (g3: KRH), the negative charged residue group (g4: DE), and the uncharged residue group (g5: STCPNQ) (Lee et al., 2011). For a peptide sequence with L amino acids, if k = 0, a feature vector of the 25 0-spaced amino acid group pairs (i.e., g1g1, g1g2, g1g3, g1g4, g1g5, …, g5g1, g5g2, g5g3, g5g4, g5g5) is provided in the Fig. 2 and Eq. (3).
In this work, k = 0, 1 and 2 were jointly considered. Thus, the CKSAAGP encoding of each anti-diabetic/non-anti-diabetic peptide has a dimensionality of 25 × 3 = 75.
Finally, each peptide was encoded by the four types of feature extraction methods, implemented by the Matlab programming language. After direct mergence of all features, the feature vector with a dimension of 520 was finally obtained (see in Table 1).
|Peptide descriptor||Feature dimension|
|(AAC, DPC, PseAAC, CKSAAGP)||520|
Feature scoring and selection
The 520 features generating by four types of peptide descriptors may contain redundant information or information irrelevant to the classification of anti-diabetic/non-anti-diabetic peptides (Chowdhury et al., 2020). These redundant and irrelevant information may lead to a decrease in the predictive performance of the constructed model (Liang et al., 2021). Therefore, we adopted six feature scoring approaches, including F-score (Fscore), K-means (Kmeans), Lasso, Pearson correlation (Pearson), Spearman correlation (Spearman) and Student’s t-test (Ttest), to rank features in the feature vector (AAC, DPC, PseAAC, CKSAAGP), which was generated by joining the individual feature vector. We then combined machine learning methods used in the training process (see in the next section) with these feature scoring strategies to select features for model construction based on the accuracy of the five-fold cross-validation.
Machine learning method and model training
Machine learning classification algorithms, including support vector machine (SVM), random forest (RF) and K-nearest neighbor (KNN), were employed to construct the prediction models to distinguish the anti-diabetic peptides from non-anti-diabetic peptides. For the SVM algorithm, two commonly used kernel functions, linear function and radial basis function (rbf), were employed to implement classification. In this study, we named the SVM with linear function as LinearSVM and SVM with rbf as rbfSVM. Therefore, there are 24 combinations of feature scoring and machine learning methods in this work. LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was utilized to accomplish SVM (Chang & Lin, 2011). RF and KNN were implemented by MATLAB built-in functions. Parameters were optimized based on each subdataset for training in the five-fold cross-validation through a grid search for each model. The parameter combinations that need to be optimized are listed in the Tables S1 and S2. The optimal features and parameters were then fed into different classifiers for model construction that was performed by in-house scripts.
The classification models were assessed by a nested five-fold cross-validation that has an inner loop for optimizing models/parameters and an outer loop for evaluating the performance of the models constructed in the inner layer. In this study, in the outer layer of the five-fold cross-validation, the training dataset (236 anti-diabetic peptides and 236 non-anti-diabetic peptides) was randomly split into five equally sized subsets where a subset served as the testing-set and the remaining four subsets as the training-set. In the inner loop, the training-set constructed in the outer layer were equally redivided into five subsets, among which four subsets were utilized for tuning parameters and one for estimating models. The pseudo code for the nested five-fold cross-validation can be found in the Supplemental File nested_CV.txt. We utilized Sensitivity (Sn), Precision (Pr), Specificity (Sp), F1 score (F1), Accuracy (Acc) and Matthew’s correlation coefficient (MCC) to evaluate the predictive performance of the models implemented in the study. These metrics are calculated by the following equations:
(9) where TP denotes the number of true positives, TN as the number of true negative, FP as the number of false positives and FN as the number of false negatives.
The receiver operating curve (ROC) was also plotted and the area under the ROC curve (AUCROC) was calculated to assess the classification results.
Data augmentation technology may be potentially more efficient to solve issues of insufficient data, or unbalanced data when data collection is difficult (Han, Xie & Lin, 2020; Mahmud et al., 2021). This is common to the field of computer vision (Chaitanya et al., 2021; Wang et al., 2021a). In the field of bioinformatics, data augmentation technology has also been reported to improve the predictive accuracy of models, such as ACP-DA, a predictor for anticancer peptides (Chen et al., 2021). In this work, we also used the data augmentation technique to build prediction models to test whether it could improve the prediction performance of classifiers. Noise adding oversampling (NAO) (Chen, Liu & Zhang, 2022a) was utilized to augment the positive and negative samples in the feature space, respectively, generating new samples in the feature space. The new samples are generated by calculating as follows:
(10) where represents a random sample from the training dataset (benchmark dataset), V represents a randomly generated group of 520-dimensional random numbers between 0 and 1 that obey uniform distribution, and c represents the coefficient of the perturbation which was set as 0.02 in this study.
Final model construction
The final model was constructed by using the RF algorithm based on the training dataset, which was comprised of 236 anti-diabetic peptides and 236 non-anti-diabetic peptides. Feature selection was performed by RF and Pearson correlation. The optimal features were then fed into the RF classifier. Selection of the optimal parameter combination was executed through a grid search. All these steps were performed by in-house scripts. The pseudo code for final model construction can be found in the Supplemental File final-model-construction.txt. The model construction and evaluation were implemented at a computational server (Sugon I840-G20: Dawning Information Industry Co., LTD., Beijing, China).
Comparison with other machine learning methods
To select the optimal pair of feature scoring and machine learning methods, we combined six feature scoring approaches with four machine learning algorithms in this study. The performance of these 24 combinations in the five-fold cross-validation is given in the Supplemental File nested_CV_results.xlsx. Five-fold cross-validation results of the combinations of RF with six feature ranking strategies are shown in Table 2. The pair of RF and Pearson achieved an accuracy of 77.12% with 80.51% sensitivity, 73.73% specificity, 75.40% precision, 0.5436 MCC, 0.7787 F1 and 0.8193 AUCROC, which suggests that the selected features are able to capture the different discrepancies between anti-diabetic and non-anti-diabetic peptides. Compared with other combinations, the RF/Pearson combination outperformed the other five combinations for all evaluation indicators except AUCROC. The AUCROC achieved the highest level under the combination of Fscore and RF, and AUCROC scores under all combinations are not much different. The ROC curves of the above six predictors are presented in Fig. 3A.
|Machine learning method||Feature scoring method||Accuracy (%)||Sensitivity (%)||Specificity (%)||Precision (%)||MCC||F1||AUCROC|
The highest metric is highlighted in bold.
The performances of the classifiers constructed by the combinations of Pearson and four machine learning algorithms are shown in Table 3. It is observed that the RF-based predictor outperformed the other three machine learning predictors, and it’s all seven evaluation metrics reached the highest level. The ROC curves of these predictions are illustrated in Fig. 3B. Note that the ROC curve of the KNN-based model cannot be drawn, and its AUCROC value is unable to be calculated.
|Feature selection||Machine learning method||Accuracy (%)||Sensitivity (%)||Specificity (%)||Precision (%)||MCC||F1||AUCROC|
The highest metric is highlighted in bold.
Compared with data augmentation method
Here, we used data augmentation technology to randomly sample the original positive and negative samples 100, 200 and 300 times respectively, and finally obtained three new datasets to build predictive models, which were namely as 100% dataset, 200% dataset and 300% dataset respectively. Nested five-fold cross-validation results of the combinations of RF with Pearson strategy for different datasets are shown in Fig. 4. Generally, compared with the results of the original dataset, other model prediction performance metrics of the data augmentation technique on the RF algorithm are not improved, but decreased, except for AUCROC and Pr values when using 300% dataset to construct model. For other three machine learning algorithms, there is also no improvement of predictive performance (see Table S3). Therefore, the data augmentation technology was not applied to the construction of our final model.
The construction and performance of AntiDMPpred
The AntiDMPpred classifier was implemented by using the RF with 347 features that were selected by Pearson. The optimal number of decision trees in this model is 251. As depicted in Table 3, AntiDMPpred achieved an accuracy of 77.12% and AUCROC of 0.8193 with 80.51% sensitivity, 73.73% specificity, 75.40% precision, 0.5436 MCC and 0.7787 F1 in the nested five-fold cross-validation.
Assessment of AntiDMPpred by using other peptide databases
To further evaluate the performance of the AntiDMPpred predictor, 42,973 peptides were collected for testing. We conducted the following procedures to obtain these peptides: (1) collecting 68,390 peptides from different types of peptide databases, including BioPepDB (Li et al., 2018), CAMPR3 (Waghu et al., 2016), CancerPPD (Tyagi et al., 2015), CPPsite (Gautam et al., 2012), DRAMP 2.0 (Kang et al., 2019), HIPdb (Qureshi, Thakur & Kumar, 2013, SATPdb (Singh et al., 2016), THPdb (Usmani et al., 2017) and UniProt with query: “(type:peptide) length:(5 TO 50) AND reviewed: yes” as candidate testing data; (2) preserving the unique peptide sequences containing only 20 natural amino acids; (3) removing the sequences of less than five amino acids or more than 50 amino acids; (4) eliminating the amino acid sequences that appeared in the BioDADpep and AVPdb. It should be noted that whether these peptides have anti-diabetic properties has not been reported. Table 4 demonstrates the detailed information and predictive results of these peptides. Average 75.02% of the testing data are predicted to be non-anti-diabetic peptides, suggesting that our predictor is capable of effectively learning useful feature representations for distinguishing anti-diabetic peptides from non-anti-diabetic peptides.
|Database||Total number of peptides||Number of unique peptides for testing||Sp (%)|
Sp represents the percentage of peptide sequences that are predicted to be non-anti-diabetic peptides.
AntiDMPpred web service
To better serve the scientific community, a user-friendly web service was constructed for AntiDMPpred, which is freely available at http://i.uestc.edu.cn/AntiDMPpred/cgi-bin/AntiDMPpred.pl. The server-side processes were performed by Matlab scripts, and Perl and HTML were used to develop the web interfaces. AntiDMPpred allows users to input the query peptide sequences into the input box or upload a sequence file (Fig. 5A). Eventually, the predictive results can be obtained after pressing the Predict button (Fig. 5B).
Users can also modulate the threshold of the probability (tp) whose value can range from 0 to 1 to differentiate between predicted positives and negatives. In this work, the assessment of model performance under different tp values can serve as a reference for tp value selection. Detailed results are shown in Table 5, with the increase of the tp value, the value of sensitivity decreases, while the specificity increases. When tp is 0.5, the value of specificity and sensitivity is 73.73% and 80.51%, and ACC, MCC and F1 achieves the highest score of 77.12%, 0.5436 and 0.7787, respectively. This is the reason why the tp value was set as 0.5 by default. When tp is 0.95, the value of specificity is 100%, while the value of sensitivity is 0.
|tp||Accuracy (%)||Specificity (%)||Sensitivity (%)||Precision (%)||MCC||F1|
Compared with the current costly experimental verification, there has been growing interest in using machine learning (ML) methods for peptide activity prediction (Basith et al., 2020). The construction of anti-diabetic peptide predictors may promote the discovery of anti-diabetic peptides and even boost the development of anti-diabetic drugs. However, an anti-diabetic peptide predictor is currently lacking, and it is crucial to learn from other peptide predictors. Anticancer peptide predictors based on ML methods have been a research hotspot in the field of bioinformatics in recent years, and more than a dozen of anticancer peptide predictors have been successfully developed, such as ACP-Fuse (Rao et al., 2020), AntiCP 2.0 (Agrawal et al., 2021), CpACpP (Nasiri et al., 2021) and CL-ACP (Wang et al., 2021b). Most of them were designed based on conventional ML algorithms, such as SVM, RF and Naïve Bayes (NB) (Wang et al., 2021b), and SVM is the most popularly used (Liang et al., 2021). Partial predictors were achieved based on deep learning (DL) methods, such as long short-term memory neural network, and convolutional neural network. Generally, both traditional ML methods and DL methods use various features, such as AAC, DPC, PseAAC, Composition/Transition/Distribution (CTD) and Grouped Di-Peptide Composition (GDPC), as input to train prediction models. The successful implementation of these anticancer peptide predictors based on ML methods has greatly promoted the discovery of new anticancer peptides. ML methods were also applied in other peptide predictors, such as antimicrobial peptides (Xiao et al., 2021), blood-brain barrier penetrating peptides (Chen et al., 2022b), streptavidin-binding peptides (He et al., 2016) and therapeutic peptides (Wei et al., 2019). In our work, we also used the RF algorithm to achieve the first anti-diabetic peptide predictor, which may play a role in the discovery of new anti-diabetic peptides.
The construction of a high-quality benchmark dataset is the first step for the successful implementation of the prediction model (Charoenkwan et al., 2020a). In the present study, the benchmark dataset consisted of 236 anti-diabetic peptides from BioDADPep and 236 non-anti-diabetic peptides from AVPdb. However, the non-anti-diabetic peptides have not been experimentally identified with no anti-diabetic activity. We compared the data in BioDADPep database with many other existing peptide databases and analyzed the possibility of AVPdb database containing anti-diabetic peptides. We downloaded peptide sequences from AVPdb, BioPepDB, CancerPPD, CAMPR3, CPPsite, DRAMP 2.0, HIPdb, SATPdb, THPdb and UniProt with query: “(type: peptide) length: (5 TO 50) AND reviewed: yes”. After excluding peptide sequences with unnatural residues (“X”, “B” and “Z”, etc.) and eliminating peptides containing less than five amino acids or more than 50 amino acids, the remaining peptide sequences in each database were compared with anti-diabetic peptides from BioDADPep. It was found that only BioPepDB and SATPdb databases had overlapping peptides with the BioDADpep database, and the number was only 19 and 20, with a frequency of 0.51% and 0.13%, respectively. The average frequency of anti-diabetic peptides in the 10 peptide databases is very small with only 0.068%. Detailed descriptions of these databases are listed in Table S4. Moreover, the BioPepDB database is a database of food-derived bioactive peptides, and part of anti-diabetic peptides are derived from milk protein as the same as some peptides of BioPepDB. SATPdb is a database of structurally annotated therapeutic peptides, which is the combination of many peptide databases. It is not surprising that these two databases have overlapping peptides with the BioDADpep database. Therefore, we can speculate that the probability of anti-diabetic peptides appearing in other peptide databases is very low and we consider the peptides in AVPdb as non-anti-diabetic peptides. In the future, if the data of non-anti-diabetic peptides verified by experiments are enough to construct a predictor, we will further improve the anti-diabetic peptide predictor using experimentally verified non-anti-diabetic peptides.
In the present study, we constructed the negative training dataset by collecting peptides from the other database (AVPdb in this study), which is a common-used strategy for peptide predictor studies (Ma et al., 2022; Chu et al., 2022; Charoenkwan et al., 2020b). Although it is of low probability that these peptides are anti-diabetic peptides, they might show anti-diabetic activity. We will update the AntiDMPpred predictor when sufficient experimentally validated non-anti-diabetic peptides emerge.
The accurate and fast identification of anti-diabetic bioactive peptides can help speed up peptide-based drug discovery in the process of searching newly effective therapeutic peptide agents for diabetes treatment. We have proposed AntiDMPpred, the first web service to predict anti-diabetic peptides from sequence information based on RF and hybrid features. We demonstrated its performance through a nested five-fold cross-validation. The results show that the AntiDMPpred model has achieved promising results. Its AUCROC value reached 0.8193, and its MCC value reached 0.5436. It is freely accessible at http://i.uestc.edu.cn/AntiDMPpred/cgi-bin/AntiDMPpred.pl.
Parameters to be optimized in the machine learning methods and feature selection.
Data generated or analyzed during this study.
Benckmark dataset can be found in the Supplementary benchmark_dataset.fasta. The pseudo code for the nested five-fold cross-validation can be found in the Supplementary file nested_CV.txt. The pseudo code for final model construction can be found in the Supplementary file final-model-construction.txt. The source code can be found in the Supplemental predictioncode.zip.