Cross-platform normalization of microarray and RNA-seq data for machine learning applications

doi:10.7287/peerj.preprints.1460v1

Jeffrey A Thompson^1,2, Jie Tan^1,3, Casey S Greene^1,4,5,6

1 Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America

2 Quantitative Biomedical Sciences Program, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America

3 Molecular and Cellular Biology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America

4 Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

5 Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

6 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Published: 2015-10-30
Accepted: 2015-10-30

Subject Areas: Bioinformatics, Computational Biology, Genomics
Keywords: gene expression, normalization, rna-sequencing, microarray, machine learning, quantile normalization, cross-platform normalization, training, distribution

Copyright: © 2015 Thompson et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Thompson JA, Tan J, Greene CS. 2015. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ PrePrints 3:e1460v1 https://doi.org/10.7287/peerj.preprints.1460v1

Abstract

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log₂ transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
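To make the two baseline transformations named in the abstract concrete, the following is a minimal sketch of a log₂ transformation and of quantile normalization against a reference distribution built from microarray training samples. It is not the TDM algorithm and is not taken from the authors' R package. The function names, the pseudocount of 1, and the toy values are all assumptions for illustration, and the sketch assumes every sample measures the same genes in the same order.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each non-negative value.

    The pseudocount (an assumption here) keeps zero counts finite.
    """
    return [math.log2(v + pseudocount) for v in values]


def reference_quantiles(reference_samples):
    """Average the sorted values across equally sized reference samples.

    The result is one reference value per rank, from lowest to highest.
    """
    sorted_samples = [sorted(sample) for sample in reference_samples]
    n_genes = len(sorted_samples[0])
    n_samples = len(sorted_samples)
    return [
        sum(sample[i] for sample in sorted_samples) / n_samples
        for i in range(n_genes)
    ]


def quantile_normalize(sample, ref_quantiles):
    """Replace each value with the reference value of the same rank.

    Ties are broken by position, which is a simplification.
    """
    if len(sample) != len(ref_quantiles):
        raise ValueError("sample and reference must cover the same genes")
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, gene_index in enumerate(order):
        normalized[gene_index] = ref_quantiles[rank]
    return normalized


# Hypothetical toy data: two microarray samples (already on a log scale)
# and one RNA-seq sample given as raw counts for the same three genes.
microarray = [[2.0, 4.0, 6.0], [3.0, 5.0, 7.0]]
rnaseq_counts = [100, 0, 15]

reference = reference_quantiles(microarray)
rnaseq_normalized = quantile_normalize(log2_transform(rnaseq_counts), reference)
print(rnaseq_normalized)  # [6.5, 2.5, 4.5]
```

After this step the RNA-seq sample takes exactly the values of the microarray reference distribution while keeping its own gene ranking, which is what lets a model trained on the microarray data see inputs on a familiar scale.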

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Compiled Supplementary Figures and Legends

DOI: 10.7287/peerj.preprints.1460v1/supp-1
