A comprehensive simulation study on classification of RNA-Seq data

doi:10.7287/peerj.preprints.2761v2

Gokmen Zararsiz¹, Dinçer Göksülük², Selçuk Korkmaz², Vahap Eldem³, Gözde Ertürk Zararsız¹, İzzet Parug Duru⁴, Ahmet Ozturk¹, Ahmet Öztürk¹

1 Biostatistics, Erciyes University, Faculty of Medicine, Kayseri, Turkey

2 Department of Biostatistics, Hacettepe University, Ankara, Turkey

3 Department of Biology, Istanbul University, Istanbul, Turkey

4 Department of Physics, Marmara University, Istanbul, Turkey

Published: 2017-08-31
Accepted: 2017-08-31

Subject Areas: Bioinformatics, Genomics, Statistics, Computational Science
Keywords: RNA sequencing, Classification, Next-generation sequencing, Overdispersion, Gene expression, Machine learning

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Zararsiz G, Göksülük D, Korkmaz S, Eldem V, Ertürk Zararsız G, Duru İP, Ozturk A, Öztürk A. 2017. A comprehensive simulation study on classification of RNA-Seq data. PeerJ Preprints 5:e2761v2 https://doi.org/10.7287/peerj.preprints.2761v2

Abstract

RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging and powerful approach for diagnosis, disease classification, and monitoring at the molecular level, as well as for identifying potential disease markers. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data, since doing so violates both the data-structure and the distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data hierarchically closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers, including PLDA with and without power transformation, NBLDA, single support vector machines (SVM), bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters, such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method, on model performance. A comprehensive simulation study was conducted, and the results were compared with those from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size, differential-expression rate, and number of genes, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion. We conclude that, as a count-based classifier, the power-transformed PLDA and, as microarray-based classifiers, the vst- or rlog-transformed RF and SVM classifiers may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
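
To make the notion of overdispersion in the abstract concrete, the following minimal Python sketch estimates a dispersion parameter from the counts of a single gene. It is an illustration only and not the procedure used in the study or in MLSeq. It assumes the common negative binomial parameterization in which the variance equals mu + phi * mu^2 (so phi = 0 reduces to the Poisson case), and the function names and toy count vectors are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, phi):
    """Variance of a negative binomial count with mean mu and dispersion phi.

    Assumed parameterization: Var = mu + phi * mu**2 (phi = 0 gives Poisson).
    """
    return mu + phi * mu ** 2


def moment_dispersion(counts):
    """Method-of-moments estimate of phi from the read counts of one gene.

    Solves sample_var = mean + phi * mean**2 for phi and truncates at zero,
    because a negative dispersion has no meaning under this model.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion is undefined when all counts are zero")
    v = variance(counts)  # sample variance with an (n - 1) denominator
    return max(0.0, (v - m) / m ** 2)


# Hypothetical toy counts for one gene across eight samples.
poisson_like = [9, 10, 11, 10, 9, 11, 10, 10]
overdispersed = [2, 30, 5, 18, 1, 25, 8, 11]

print("Poisson-like gene, estimated phi:", moment_dispersion(poisson_like))
print("Overdispersed gene, estimated phi:", moment_dispersion(overdispersed))
```

A gene whose variance does not exceed its mean yields an estimate of zero, while a gene with highly variable counts yields a clearly positive value. This is the quantity that, according to the abstract, reduces classification accuracy as it grows.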

Author Comment

This preprint has been submitted to and published in PLOS ONE. Please cite as follows:

Zararsız G, Goksuluk D, Korkmaz S, Eldem V, Zararsiz GE, Duru IP, et al. (2017) A comprehensive simulation study on classification of RNA-Seq data. PLoS ONE 12(8): e0182507. https://doi.org/10.1371/journal.pone.0182507
