This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Kleiner M. 2017. Normalization of metatranscriptomic and metaproteomic data for differential gene expression analyses: The importance of accounting for organism abundance. PeerJ Preprints 5:e2846v1 https://doi.org/10.7287/peerj.preprints.2846v1
Metatranscriptomics and metaproteomics make it possible to measure gene expression in microbial communities. So far, these approaches have mostly been used to get a general overview of the dominant metabolisms and physiologies of community members. Recently, environmental microbiologists have started using metatranscriptomics and metaproteomics to examine gene expression differences between environments or conditions. This has mostly been done with makeshift adaptations of differential transcriptomics and proteomics approaches that were developed for pure cultures. However, because meta-omics data are affected by many more variables than data derived from pure cultures, such makeshift adaptations are problematic at best. One particular challenge is posed by the data normalization strategies used to account for technical and biological variables in meta-omics data. Here I discuss the most common normalization strategy for transcriptomic and proteomic data and explain why it is not valid by itself for meta-omics data. I provide logical proof that variation in species abundances between samples is an additional variable that must be accounted for during normalization of meta-omics data. Finally, I show how the existing normalization methods for transcriptomic and proteomic data can be augmented so that they are applicable to meta-omics data.
I wrote this perspectives piece to start a discussion and gather feedback on how to normalize meta-omics data for differential gene expression analyses. I plan to turn this into a more comprehensive article once I have some feedback from the community. So, I am very much looking forward to hearing your thoughts on this matter.
Table S1: Simulated gene expression data for a microbial community with two member species
Expression values for 50 genes of each species are given. For simplicity, all genes have the same expression value. The underlying expression value of each gene is identical for site 1/condition 1 and site 2/condition 2; however, the values given have been skewed according to the different species abundances.
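The kind of simulation described for Table S1 can be sketched in a few lines of Python. This is only an illustration and not the data in the table itself: the species abundances (50/50 at site 1 and 90/10 at site 2), the per-cell expression value, and all function names are assumptions chosen for the example. Normalizing within each species is shown as one possible way of accounting for organism abundance, not necessarily the exact procedure proposed in the article.

```python
N_GENES = 50  # genes per species, as in the Table S1 description


def simulate_counts(abundance_a, abundance_b, per_cell_expression=10.0):
    """Raw signal per gene: identical per-cell expression scaled by species abundance.

    The abundances and the per-cell expression value are hypothetical.
    """
    counts_a = [per_cell_expression * abundance_a] * N_GENES
    counts_b = [per_cell_expression * abundance_b] * N_GENES
    return counts_a, counts_b


def normalize_by_community_total(counts_a, counts_b):
    """Common pure-culture style normalization: divide by the total signal of the sample."""
    total = sum(counts_a) + sum(counts_b)
    return [c / total for c in counts_a], [c / total for c in counts_b]


def normalize_within_species(counts_a, counts_b):
    """Abundance-aware variant: divide each gene by the total signal of its own species."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return [c / total_a for c in counts_a], [c / total_b for c in counts_b]


def fold_changes(values_site1, values_site2):
    """Per-gene ratio of site 2 to site 1."""
    return [v2 / v1 for v1, v2 in zip(values_site1, values_site2)]


# Hypothetical abundances: equal at site 1, 90% species A at site 2.
site1 = simulate_counts(0.5, 0.5)
site2 = simulate_counts(0.9, 0.1)

total1_a, total1_b = normalize_by_community_total(*site1)
total2_a, total2_b = normalize_by_community_total(*site2)
fold_total_a = fold_changes(total1_a, total2_a)
fold_total_b = fold_changes(total1_b, total2_b)

species1_a, species1_b = normalize_within_species(*site1)
species2_a, species2_b = normalize_within_species(*site2)
fold_species_a = fold_changes(species1_a, species2_a)
fold_species_b = fold_changes(species1_b, species2_b)

print(f"community-total normalization: A {fold_total_a[0]:.2f}, B {fold_total_b[0]:.2f}")
print(f"within-species normalization:  A {fold_species_a[0]:.2f}, B {fold_species_b[0]:.2f}")
```

Under these assumed abundances, normalizing by the community total makes every gene of species A appear about 1.8-fold higher and every gene of species B about 5-fold lower at site 2, even though per-cell expression never changed. Normalizing within each species returns a fold change of 1 for all genes.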