Dispersion analysis of PoTRA ranked mRNA mediated dysregulated pathways in Breast Invasive Cancer from a TCGA Pan-Cancer study
- Subject Areas
- Bioinformatics, Computational Biology
- Keywords
- dispersion, tcga, pathway rank, cgc, PoTRA, pca, mRNA, dysregulated pathways, cancer, permutation
- Copyright
- © 2018 Linan et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. Dispersion analysis of PoTRA ranked mRNA mediated dysregulated pathways in Breast Invasive Cancer from a TCGA Pan-Cancer study. PeerJ Preprints 6:e27306v1 https://doi.org/10.7287/peerj.preprints.27306v1
Abstract
Background. Our publication of the new pathways of topological rank analysis (PoTRA) algorithm demonstrated a novel approach that uses Google's PageRank algorithm to analyze gene expression networks and identify biological pathways significantly disrupted in hepatocellular carcinoma. Before the PoTRA algorithm can be applied to other cancer gene expression data sets of various sizes and normal:tumor compositions, two important questions must be answered: (1) What is the optimal normal:tumor sample ratio? (2) What is the minimum number of samples that should be used for PoTRA analysis? To address these questions, the average standard deviation (SD) of the ranks of PoTRA-ranked mRNA mediated dysregulated pathways was studied using randomly sampled data sets with various normal:tumor ratios and sizes drawn from the TCGA Breast Invasive Carcinoma (TCGA-BRCA) project.
Methods. To identify the optimal normal:tumor sample ratio, the SD analysis used random combinations of normal:tumor data sets at one balanced ratio (1:1) and five unbalanced 1:N ratios (1:2, 1:3, 1:5, 1:7, 1:9). To identify the minimum sample size, normal and tumor samples were randomly resampled at various balanced sizes: (3 vs 3), (5 vs 5), (10 vs 10), (25 vs 25), (50 vs 50), (75 vs 75), (100 vs 100), and (113 vs 113).
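The resampling design described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the sample IDs, the number of random draws, the pathway names, and the `toy_rank` function are all hypothetical, and `toy_rank` is a random placeholder standing in for a real PoTRA run. The sketch also assumes that "average SD" means the SD of each pathway's rank across random draws, averaged over pathways.

```python
import random
import statistics


def draw_samples(normal_ids, tumor_ids, n_normal, ratio, rng):
    """Draw n_normal normal IDs and n_normal * ratio tumor IDs without replacement."""
    normals = rng.sample(normal_ids, n_normal)
    tumors = rng.sample(tumor_ids, n_normal * ratio)
    return normals, tumors


def average_rank_sd(rank_runs):
    """Mean over pathways of the sample SD of each pathway's rank across draws.

    rank_runs is a list of dicts (pathway name -> rank), one dict per random draw.
    """
    pathways = list(rank_runs[0])
    sds = [statistics.stdev([run[p] for run in rank_runs]) for p in pathways]
    return statistics.mean(sds)


def toy_rank(normals, tumors, pathways, rng):
    """Placeholder ranking (NOT PoTRA): assigns ranks at random.

    A real analysis would rank pathways from the expression data of the
    drawn samples; this stand-in only keeps the sketch runnable.
    """
    order = list(pathways)
    rng.shuffle(order)
    return {p: i + 1 for i, p in enumerate(order)}


# Hypothetical inputs: made-up sample IDs and pathway names.
rng = random.Random(0)
normal_ids = [f"N{i}" for i in range(113)]
tumor_ids = [f"T{i}" for i in range(1000)]
pathways = [f"pathway_{i}" for i in range(20)]

# Ratio experiment: fixed number of normals, tumor count scaled by 1:N.
ratio_results = {}
for ratio in (1, 2, 3, 5, 7, 9):
    runs = []
    for _ in range(10):  # number of random draws is an assumption
        normals, tumors = draw_samples(normal_ids, tumor_ids, 50, ratio, rng)
        runs.append(toy_rank(normals, tumors, pathways, rng))
    ratio_results[ratio] = average_rank_sd(runs)

# Sample-size experiment: balanced draws of increasing size.
size_results = {}
for size in (3, 5, 10, 25, 50, 75, 100, 113):
    runs = []
    for _ in range(10):
        normals, tumors = draw_samples(normal_ids, tumor_ids, size, 1, rng)
        runs.append(toy_rank(normals, tumors, pathways, rng))
    size_results[size] = average_rank_sd(runs)
```

Because `toy_rank` ignores its inputs, the numbers this sketch produces carry no biological meaning. Replacing it with an actual PoTRA call would let `ratio_results` and `size_results` be compared across ratios and sizes in the way the Methods describe.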
Results. This analysis suggests that the 1:1 ratio achieves the lowest average rank variation and that the average rank variation reaches a steady state at a minimum sample size of 50 normal and 50 tumor samples.
Conclusion. Future applications of the PoTRA algorithm to gene expression data sets such as TCGA should use balanced data sets with at least 50 normal and 50 tumor samples to ensure the most robust performance.
Author Comment
This is a submission to PeerJ for review.