An analytical study on the identification of N-linked glycosylation sites using machine learning model

PeerJ Computer Science


Introduction

Survey methodology

Review plan

Review conduct

Automated search in digital library

Inclusion and exclusion criteria for selection

  1. Inclusion Criteria

    1. An article included in the review must address the prediction of N-linked glycosylation sites or of glycosylation sites in general.

    2. It must address at least one of the research questions listed in Table 2.

    3. It must have been published in a journal or a preprint repository since 2017.

    4. It must use a computational or semi-computational approach for prediction.

  2. Exclusion Criteria

    1. Exclude articles that do not address N-linked glycosylation or glycosylation.

    2. Exclude articles that identify N-linked sites purely through biological experimentation.

    3. Exclude books that appear in the results of the search query.
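
The inclusion and exclusion criteria above can be read as a simple screening filter. The following is a minimal sketch only: the `Article` record, its field names, and the category strings are hypothetical and do not come from the review, and "since 2017" is assumed to mean a publication year of 2017 or later.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical screening record; field names are illustrative only."""
    title: str
    year: int
    venue_type: str                    # e.g. "journal", "preprint", "book"
    addresses_glycosylation: bool      # N-linked glycosylation or glycosylation
    addresses_research_question: bool  # any question from Table 2
    approach: str                      # "computational", "semi-computational", "experimental"


def is_selected(article: Article) -> bool:
    """Apply the exclusion criteria first, then the inclusion criteria."""
    # Exclusion criteria 1-3.
    if not article.addresses_glycosylation:
        return False
    if article.approach == "experimental":
        return False
    if article.venue_type == "book":
        return False
    # Inclusion criteria 2-4 (criterion 1 is covered by the first check).
    return (
        article.addresses_research_question
        and article.venue_type in {"journal", "preprint"}
        and article.year >= 2017  # assumption: "since 2017" is inclusive
        and article.approach in {"computational", "semi-computational"}
    )


# Made-up records, used only to exercise the filter.
kept = Article("Example predictor", 2019, "journal", True, True, "computational")
too_old = Article("Older predictor", 2015, "journal", True, True, "computational")
wet_lab = Article("Wet-lab study", 2020, "journal", True, True, "experimental")
```

Checking the exclusion criteria first keeps the rejection reasons separate from the inclusion conditions, which mirrors the order in which the lists are stated.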

Quality assessment as selection criteria

  1. A study is awarded a score of 1 if an N-linked predictive tool has been developed; otherwise it scores 0.

  2. A study is awarded a score of 2 if the method developed to extract features from the data is based on a computational approach, 1 for a hybrid approach, and 0 for an experimental approach.

  3. A study is awarded a score of 1 if the computational method used for training is provided; otherwise it scores 0.

  4. A score of 1 is awarded if the data set used is available; otherwise the study scores 0.

  5. A score of 1 is awarded if the organism type is available; otherwise the study scores 0.

  6. The studies were also rated by taking conference and journal rating lists into account. The possible scores for publication are shown in Table 4.
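
The six criteria above add up to a single quality score per study. The sketch below shows one way to compute it; the `Study` record and its field names are hypothetical, and the publication score is passed in as a plain number because the values of Table 4 are not reproduced here.

```python
from dataclasses import dataclass

# Criterion 2: score by feature extraction approach.
FEATURE_APPROACH_SCORE = {"computational": 2, "hybrid": 1, "experimental": 0}


@dataclass
class Study:
    """Hypothetical record of one reviewed study; field names are illustrative."""
    tool_developed: bool            # criterion 1
    feature_approach: str           # criterion 2
    training_method_provided: bool  # criterion 3
    dataset_available: bool         # criterion 4
    organism_type_available: bool   # criterion 5
    publication_score: int          # criterion 6, looked up in Table 4 by the reviewer


def quality_score(study: Study) -> int:
    """Sum the per-criterion scores into one quality assessment score."""
    return (
        int(study.tool_developed)
        + FEATURE_APPROACH_SCORE[study.feature_approach]
        + int(study.training_method_provided)
        + int(study.dataset_available)
        + int(study.organism_type_available)
        + study.publication_score
    )


# Made-up example: all yes/no criteria met, computational features,
# and an assumed publication score of 4.
example = Study(True, "computational", True, True, True, 4)
total = quality_score(example)
```

Criteria 1 to 5 contribute at most 6 points, so the remainder of any total comes from the publication score.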

Selection based on snowballing

Review report

Assessment and discussion

Assessment of q1:

Which are the relevant publishing channels for N-linked glycosylation research? Which channel types and geographical areas target this research?

Assessment of q2:

Which are the existing prediction models (tools) for the identification of N-linked glycosylation sites, and for which species are these sites identified?

Assessment of q3:

Which algorithms or methods are used to construct the N-linked feature vector?

Assessment of q4:

Which algorithms or methods are used to train the N-linked computational model?

Assessment of q5:

How effective are the existing models at predicting N-linked sites?

Discussion and future direction

Taxonomy hierarchy

General observation and future direction

  • (a) Feature set construction method. The performance of a computational model depends heavily on the quality of the feature set extracted from the data set, which is later used to train the machine learning model (Saeed, Mahmood & Khan, 2018; Khan et al., 2019; Naseer et al., 2021a). Discriminating features help the model learn efficiently and make correct predictions, so it is important to identify techniques that extract useful information from the data set. Authors have used various methods to construct the feature set; the most widely used are protein sequence features, protein structure features, statistical moments, word embedding techniques and similarity voting. The majority of the authors (Liu et al., 2019; Bojar et al., 2021b; Magaret et al., 2019; Bojar et al., 2021a) used only sequence-based protein information to train the model. It was also observed that other authors (Akmal, Rasool & Khan, 2017; Taherzadeh et al., 2019; Li et al., 2019; Park et al., 2019; Murad et al., 2021) combined multiple feature types, such as sequence, structural and statistical features, to construct the feature vector. More than 50% of the research articles selected in this study that scored 10 points in the quality assessment used a combination of the features mentioned above. The newer techniques adopted in recent research articles are word embedding vectors, graph statistical features combined with similarity voting, and Chou’s five-step method. Researchers can use these feature extraction techniques to improve the performance of N-linked prediction models or of any PTM site identification model.

  • (b) Machine training algorithm. After feature extraction, the most significant part of a computational model is the method used to train it (Hussain, Rasool & Khan, 2020; Barukab et al., 2022; Khan et al., 2020a). The training technique has the greatest impact on the performance of the model. An appropriate learning algorithm combined with a good feature extraction method yields a model that predicts independent data with high accuracy, so developing an appropriate machine learning method is essential. Researchers have proposed various methods to predict N-linked sites accurately. The most widely used include Artificial Neural Network (ANN), Support Vector Machine (SVM), Deep Neural Network (DNN), Graph Neural Network (GNN) and Radial Basis Function (RBF) Network. The research articles published in Q1 journals according to the JCR widely used ANN (Akmal, Rasool & Khan, 2017; Liu et al., 2019; Dimeglio et al., 2020) along with SVM (Taherzadeh et al., 2019; Pitti et al., 2019). The research articles that provide a web server and report accuracy above 90% (Taherzadeh et al., 2019; Le, Sandag & Ou, 2018; Ruiz-Blanco et al., 2017) used the Jrip classifier, DNN, SVM and RBF algorithms. The authors who proposed prediction models without providing a web server and who also report accuracy above 90% (Akmal, Rasool & Khan, 2017; Tran, Pham & Ou, 2021; Hwang et al., 2020; Magaret et al., 2019; Desaire, Patabandige & Hua, 2021) used ANN, SVM, DNN and RF algorithms. Researchers can use these algorithms to improve the performance of N-linked prediction models or of any PTM site identification model.

  • (c) Performance evaluation. Once the model has been trained, it is validated on independent data to evaluate its performance. There are various techniques to measure the validity of a model; the most significant metrics are accuracy, sensitivity and specificity. Sensitivity measures the proportion of actual positive sites that the model predicts correctly, while specificity measures the proportion of actual negative sites that it predicts correctly. In this study, performance was evaluated on these metrics. Around 40% of the authors did not validate their model on any of these performance metrics, and only 20% of the authors reported all of them. Predictive models specialized to the N-linked PTM type have better accuracy than those in which the PTM type is not specified or that are generalized. The highest accuracy, approximately 99%, was achieved by Akmal, Rasool & Khan (2017) based on these evaluation criteria. That study also reports the sensitivity and specificity of the model, which were 99.8% and 99.9% respectively, but it did not provide a web server. Hwang et al. (2020) claim an accuracy of 99% along with a sensitivity of 100%, but did not provide a working tool, the dataset, or result comparisons with other predictors. The most efficient predictive models with an available web server are the Sequon model (Ruiz-Blanco et al., 2017) and the Sprint-Gly model (Taherzadeh et al., 2019), with accuracies of 97.5% and 97% respectively. The Sequon model was trained on human protein sequences only, while Sprint-Gly is equally effective for both human and rat species. Therefore, Sprint-Gly is considered the most reliable of the currently available web servers.
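
The three metrics named in item (c) follow from the counts of true and false positives and negatives on the independent data. The sketch below uses the standard definitions; the function name, the 0/1 label encoding (1 for a glycosylated site) and the example labels are assumptions made for illustration and are not taken from any reviewed study.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity and specificity for binary labels (1 = site)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four outcomes of a binary prediction.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")

    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up labels for eight candidate sites, used only to exercise the function.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = evaluate(y_true, y_pred)
```

Reporting all three values matters because accuracy alone can look high on imbalanced data even when most positive sites are missed.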

Future direction

  • (a) Identify the O-linked glycosylation sites for threonine and serine using ANN.

  • (b) How can the prediction performance for C-linked glycosylation be enhanced through existing neural network classifiers?

  • (c) Develop a comprehensive predictive model to classify the type of glycosylation.

  • (d) How effective are the existing classifiers at predicting other PTM sites?

Conclusion

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Muhammad Aizaz Akmal conceived and designed the experiments, performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Muhammad Awais Hassan conceived and designed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Muhammad Shoaib performed the experiments, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Khaldoon S Khurshid analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Abdullah Mohamed analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

This is a literature review and does not have raw data.

Funding

The authors received no funding for this work.
