DeepMethylation: a deep learning based framework with GloVe and Transformer encoder for DNA methylation prediction

Bioinformatics and Genomics
Note that, to present each method at its best reported performance, the results of the state-of-the-art (SOTA) methods are taken directly from Zhang, Xiao & Xu (2020b), while the proposed method is trained on the same dataset with the same configuration.

Introduction

Materials and Methods

The overall framework

Data processing

where $b_i$ and $b_j$ are offsets (bias terms), and $N$ is the total number of words. $V_i$ represents the word vector in the global dictionary to be obtained, and $\tilde{V}_j$ is the separate context vector that helps solve $V_i$. Since $J$ is a convex function, $V_i$ can be solved via optimization algorithms such as gradient descent. In addition, the weighting factor $f(X_{i,j})$ is defined as
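The defining equation did not survive in this copy of the text. For orientation only, the following minimal sketch shows the weighting function and per-pair loss term from the original GloVe formulation. The cutoff `x_max = 100` and exponent `alpha = 0.75` are the defaults of that formulation and are assumptions here, as are the function names; the values used in this work may differ.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Standard GloVe weighting: (x / x_max) ** alpha below the cutoff, else 1.

    x_max and alpha are the original GloVe defaults (assumed, not taken
    from this article).
    """
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

def pair_loss(v_i, v_j_context, b_i, b_j, x_ij):
    """One term of the GloVe objective J for a word/context pair.

    v_i         : word vector V_i (list of floats)
    v_j_context : context vector V~_j (list of floats)
    b_i, b_j    : offsets
    x_ij        : co-occurrence count X_{i,j}
    """
    if x_ij <= 0:
        # Pairs that never co-occur contribute nothing (f(0) = 0).
        return 0.0
    dot = sum(a * b for a, b in zip(v_i, v_j_context))
    residual = dot + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij) * residual ** 2
```

The weight grows with the co-occurrence count and saturates at 1, so rare pairs contribute little and very frequent pairs do not dominate the sum.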

Feature extraction

Classification

Results and analysis

Dataset

where each site takes a value in {A, T, G, C}. In the implementation, following the WGBS convention, δ is set to 20, so each fragment has 2δ + 1 = 41 sites. In this way, a total of 893,326 DNA fragments are obtained, comprising 69,750 methylation-positive samples and 823,576 negative ones. As shown in Table 1, the ratio between the negative and positive samples is about 11.8 (823,576 / 69,750), which is consistent with the distribution of 5mC in real cases.
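The windowing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: it assumes each fragment is centered on a cytosine (the candidate 5mC site), and the function name `extract_fragments` is hypothetical.

```python
def extract_fragments(sequence, delta=20):
    """Return (center_index, fragment) pairs of length 2 * delta + 1.

    Assumption: a fragment is centered on each cytosine that has at least
    `delta` bases on both sides; windows containing symbols other than
    A, T, G, C are skipped.
    """
    seq = sequence.upper()
    alphabet = set("ATGC")
    fragments = []
    for center in range(delta, len(seq) - delta):
        if seq[center] != "C":
            continue
        window = seq[center - delta:center + delta + 1]
        if set(window) <= alphabet:
            fragments.append((center, window))
    return fragments

# Usage: a toy sequence with a single cytosine in the middle.
toy = "A" * 20 + "C" + "T" * 20
for center, fragment in extract_fragments(toy):
    print(center, len(fragment))
```

With δ = 20 every returned fragment has 41 sites, matching the fragment length stated above.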

Performance evaluation

  • Sensitivity (Sen) refers to the ratio of correctly predicted positive samples to all positive samples.

    $$\mathrm{Sen} = \frac{TP}{TP + FN}$$

  • Specificity (Spe) refers to the ratio of correctly predicted negative samples to all negative samples.

    $$\mathrm{Spe} = \frac{TN}{TN + FP}$$

  • Accuracy (Acc) refers to the ratio of correctly classified samples, both positive and negative, to all tested samples.

    $$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

  • The Matthews Correlation Coefficient (Mcc) considers the joint relationship between TP, TN, FP, and FN, and comprehensively evaluates the consistency between the predicted results and the ground truth.

    $$\mathrm{Mcc} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}$$

  • Area under the curve (AUC) compares the performance of different models by calculating the area under the Receiver Operating Characteristic (ROC) curve; a larger value indicates that the model separates positive from negative samples more reliably.
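As a minimal sketch of how the five indexes above can be computed, the following code derives Sen, Spe, Acc and Mcc from confusion-matrix counts and estimates AUC by pairwise comparison of scores. The function names and the toy counts are illustrative assumptions; the sketch assumes both classes are present, so no denominator in Sen, Spe or AUC is zero.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: Mcc is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Acc": acc, "Mcc": mcc}

def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(P * N) form equals the area
    under the ROC curve and is meant for small illustrative inputs.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Usage with toy values (not results from the article).
print(confusion_metrics(tp=8, tn=80, fp=10, fn=2))
print(auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

On an imbalanced dataset, Acc can stay high even when few positives are found, which is why Mcc and AUC are reported alongside it.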

Performance comparison with SOTA methods

Influence of encoding methods

Influence of feature extraction methods

Influence of sub-sequence length and GloVe characteristic length

The impact of imbalance of dataset

Generality on m1A data

Conclusion

Supplemental Information

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Zhe Wang conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Sen Xiang conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Chao Zhou conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Qing Xu analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data is available at NCBI, GitHub and Zenodo:

- PRJNA523380

- https://github.com/sb111169/tf-5mc/tree/main/shuju.

- Zhe Wang. (2023). DeepMethylation. Zenodo. https://doi.org/10.5281/zenodo.8191512.

The code is available in the Supplemental File and at GitHub:

- https://github.com/sb111169/tf-5mc/tree/main.

Funding

This work was supported by the Natural Science Foundation of Hubei Province (No. 2022CFB349). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
