ECMPride: prediction of human extracellular matrix proteins based on the ideal dataset using hybrid features with domain evidence

Bioinformatics and Genomics


Supplemental Information

Detailed list of the standard dataset used in ECMPride

The standard training dataset consists of a positive dataset of 521 ECM proteins and a negative dataset of 11336 non-ECM proteins. For each protein, “UniprotID” gives its UniProt ID, “Class” gives its classification as ECM or non-ECM, and “FastaSequence” gives its sequence in FASTA format.

DOI: 10.7717/peerj.9066/supp-1

Detailed list of studies used to construct the positive dataset

In addition to the studies collected by ECM Atlas, further studies were collected to construct the positive dataset. All collected studies are listed here.

DOI: 10.7717/peerj.9066/supp-2

Detailed list of all 167 features used in ECMPride

In total, 167 features from three classes are introduced into ECMPride to represent the characteristics of ECM proteins: ECM protein-related structural domains (63 features), physicochemical properties (24 features), and the position-specific scoring matrix (PSSM, 80 features). All features are scored using the mRMR method.

DOI: 10.7717/peerj.9066/supp-3

Detailed list of Physicochemical properties used in ECMPride

ECMPride uses 24 physicochemical properties. For each physicochemical property, the rows from “A” to “V” give its amino acid index (AAindex) values for the 20 amino acids.

DOI: 10.7717/peerj.9066/supp-4

Performance of the prediction model with or without the under-sampling ensemble method

When the prediction model is built without the under-sampling ensemble method, it reaches a high specificity of 0.9995, but its sensitivity is very low and its balanced accuracy is poor (0.7579). When the under-sampling ensemble method is used, the model balances specificity (0.9360) against sensitivity (0.8925) and reaches a high balanced accuracy of 0.9142. The under-sampling ensemble method therefore handles the class imbalance of the dataset well while still making full use of the sample information.

DOI: 10.7717/peerj.9066/supp-5
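
The idea behind an under-sampling ensemble can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ECMPride implementation: the function names (`balanced_subsets`, `majority_vote`, `balanced_accuracy`) are hypothetical, the negatives are assumed to be split into disjoint chunks so that every non-ECM protein is used once, and the individual models are assumed to be combined by a simple majority vote. Balanced accuracy is taken here as the mean of sensitivity and specificity.

```python
import random
from typing import Callable, List, Sequence, Tuple


def balanced_subsets(
    positives: Sequence, negatives: Sequence, seed: int = 0
) -> List[Tuple[list, list]]:
    """Pair all positives with disjoint, roughly positive-sized chunks of negatives.

    Assumption: disjoint chunks are one way to use every negative sample
    exactly once; the source does not spell out the exact scheme.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    n_chunks = max(1, len(shuffled) // len(positives))
    # Round-robin slicing puts each negative into exactly one chunk.
    chunks = [shuffled[i::n_chunks] for i in range(n_chunks)]
    return [(list(positives), chunk) for chunk in chunks]


def majority_vote(models: Sequence[Callable[[object], bool]], sample: object) -> bool:
    """Return True (ECM) when more than half of the models vote True."""
    votes = sum(1 for model in models if model(sample))
    return votes * 2 > len(models)


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Dataset sizes from the standard training dataset: 521 ECM, 11336 non-ECM.
subsets = balanced_subsets(range(521), range(11336))
print(len(subsets))                       # number of balanced training subsets
print(balanced_accuracy(0.8925, 0.9360))  # close to the reported 0.9142
```

With the dataset sizes given above, this scheme yields 21 balanced subsets, one model would be trained per subset, and an odd number of voters avoids ties in the majority vote.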

Performance of all 167 candidate prediction models using 167 different feature subsets

All 167 features are ranked by importance score from high to low. The incremental feature selection (IFS) method is then used to generate 167 feature subsets and, from them, 167 corresponding candidate models.

DOI: 10.7717/peerj.9066/supp-6
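
How a ranked feature list turns into 167 nested subsets can be sketched as follows. This is a minimal illustration, assuming that the k-th subset holds the top k ranked features; the names `incremental_subsets`, `select_best_subset`, and the `evaluate` callback are hypothetical, and the toy scoring function stands in for whatever cross-validated metric is used to compare candidate models.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_subsets(ranked_features: Sequence[str]) -> List[List[str]]:
    """Build nested subsets: top 1, top 2, ..., top N ranked features."""
    return [list(ranked_features[:k]) for k in range(1, len(ranked_features) + 1)]


def select_best_subset(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Score one candidate model per subset and keep the best one.

    `evaluate` is a placeholder for training and scoring a model on a subset.
    The strict comparison means ties go to the smaller subset.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    for subset in incremental_subsets(ranked_features):
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Placeholder feature names, already ordered from highest to lowest score.
ranked = [f"feature_{i}" for i in range(1, 168)]
subsets = incremental_subsets(ranked)
print(len(subsets), len(subsets[0]), len(subsets[-1]))  # 167 1 167

# Toy score that peaks at 10 features, purely for demonstration.
best, score = select_best_subset(ranked, lambda s: -abs(len(s) - 10))
print(len(best))  # 10
```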

A comprehensive collection of theoretical human ECM proteins predicted by ECMPride

For each ECM protein predicted by ECMPride, information from six relevant databases (Human Protein Atlas, ExoCarta, GO, MatrixDB, STRING, and Entrez) is used for annotation. “label_HumanProteinAtlas” gives the location annotation from Human Protein Atlas (membrane, secreted, membrane&secreted, or intracellular). “label_ExoCarta” gives the exosome evidence provided by ExoCarta. “label_GO” shows the extracellular-related Cellular Component annotations in GO, including “extracellular matrix”, “extracellular space”, “extracellular region” or “extracellular exosome”; the corresponding GO terms are listed in “label_GO_term”. “InterPro_core_Matrisome” shows the domains that define the core Matrisome and are contained in genes found only in ECMPrideDB. “Interaction_MatriDB” shows all Matrisome genes that interact with genes found only in ECMPrideDB (evidence from the MatrixDB database). “Interaction_String” shows all Matrisome genes that interact with genes found only in ECMPrideDB (evidence from the STRING database). In each of these three columns, “–” indicates that the gene is contained in both ECMPrideDB and Matrisome. “EntreZ link” provides the hyperlink to the corresponding Entrez gene summary for each gene in ECMPrideDB.

DOI: 10.7717/peerj.9066/supp-7

Comparison of the performance assessment parameters on different imbalance datasets

Because the ratio of ECM to non-ECM samples in the training dataset is approximately 1:21, we constructed 21 training datasets in which the ratio of ECM cases to non-ECM cases ranges from 1:1 to 1:21. In each dataset, the ECM cases are the entire ECM dataset and the non-ECM cases are randomly selected from the non-ECM dataset. The prediction model is then run on each of these 21 imbalanced datasets separately using 10-fold cross-validation, and four performance assessment parameters are calculated: Sensitivity, Specificity, Accuracy and Balanced Accuracy. This process is repeated 10 times, and the average over the 10 runs is taken as the final result. As the ratio of non-ECMs to ECMs increases, the specificity increases gradually while the sensitivity decreases significantly, which indicates a declining trend in model performance. At the same time, the Balanced Accuracy decreases as the imbalance ratio increases, indicating that this parameter represents the model performance better than Accuracy does.

DOI: 10.7717/peerj.9066/supp-9
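
The construction of the 21 imbalanced datasets, and the reason Accuracy can mislead on them, can be illustrated with a short Python sketch. It is not taken from the ECMPride code: `ratio_datasets` and `metrics` are hypothetical helper names, and the "always non-ECM" classifier at the end is a deliberately trivial example chosen to show how Accuracy and Balanced Accuracy diverge at a 1:21 ratio.

```python
import random
from typing import Dict, Sequence, Tuple


def ratio_datasets(
    positives: Sequence, negatives: Sequence, max_ratio: int = 21, seed: int = 0
) -> Dict[int, Tuple[list, list]]:
    """For each ratio 1:r, keep all positives and sample r times as many negatives."""
    rng = random.Random(seed)
    datasets = {}
    for r in range(1, max_ratio + 1):
        n_neg = min(r * len(positives), len(negatives))
        datasets[r] = (list(positives), rng.sample(list(negatives), n_neg))
    return datasets


def metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and balanced accuracy from counts.

    Assumes at least one positive and one negative sample.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


datasets = ratio_datasets(range(521), range(11336))
n_pos, n_neg = len(datasets[21][0]), len(datasets[21][1])  # 521 and 10941

# A classifier that labels everything non-ECM finds no ECM protein at all,
# yet its accuracy at 1:21 is 21/22 (about 0.9545); balanced accuracy is 0.5.
trivial = metrics(tp=0, fn=n_pos, tn=n_neg, fp=0)
print(trivial["accuracy"], trivial["balanced_accuracy"])
```

In a full experiment, each of the 21 datasets would be evaluated with 10-fold cross-validation and the whole procedure repeated with different random seeds before averaging.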

Biological validation of new ECM proteins

Immunohistochemistry and immunofluorescence analysis of (A) STAB1, (B) STAB2, (C) JAG1 and (D) JAG2 on normal human skin tissues (scale bar: 100 μm). Black and white arrowheads point to the locations where the new ECM proteins are expressed.

DOI: 10.7717/peerj.9066/supp-10

Biological validation of new ECM proteins predicted by ECMPride

Immunohistochemistry analysis of (A) DLL4 and (B) LRP1 on normal human liver and skin tissues. (C) Immunofluorescence analysis of FCGBP on the RH-30 cell line. Black and white arrowheads point to the locations where the new ECM proteins are expressed. The data come from the Human Protein Atlas database website (https://www.proteinatlas.org).

DOI: 10.7717/peerj.9066/supp-11

A detailed description of the method in the main text

DOI: 10.7717/peerj.9066/supp-12

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Binghui Liu conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Ling Leng performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Xuer Sun performed the experiments, prepared figures and/or tables, and approved the final draft.

Yunfang Wang conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Jie Ma conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Yunping Zhu conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The code of ECMPride is available at GitHub: https://github.com/Binghui-Liu/ECMPride.git.

Funding

This work was supported by the National Key Research Program of China (No. 2016YFB0201702, No. 2016YFC0901601, No. 2017YFC0906602, and No. 2017YFA0505002). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
