T1000: A reduced toxicogenomics gene set for improved decision making

Othman Soufan; Jessica Ewald; Charles Viau; Doug Crump; Markus Hecker; Niladri Basu; Jianguo Xia

doi:10.7287/peerj.preprints.27839v1

Othman Soufan¹, Jessica Ewald², Charles Viau¹, Doug Crump³, Markus Hecker⁴, Niladri Basu², Jianguo Xia¹

1 Institute of Parasitology, McGill University, Montreal, Canada

2 Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Canada

3 National Wildlife Research Centre, Carleton University, Ottawa, Canada

4 School of the Environment & Sustainability and Toxicology Centre, University of Saskatchewan, Saskatoon, Canada

Published: 2019-07-03
Accepted: 2019-07-03

Subject Areas: Bioinformatics, Computational Biology, Toxicology, Data Mining and Machine Learning
Keywords: toxicogenomics, gene signature, co-expression network, graph clustering, machine learning, gene selection

Copyright: © 2019 Soufan et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Soufan O, Ewald J, Viau C, Crump D, Hecker M, Basu N, Xia J. 2019. T1000: A reduced toxicogenomics gene set for improved decision making. PeerJ Preprints 7:e27839v1 https://doi.org/10.7287/peerj.preprints.27839v1

Abstract

There is growing interest within regulatory agencies and toxicological research communities in developing, testing, and applying new approaches, such as toxicogenomics, to evaluate chemical hazards more efficiently. Given the complexity of analyzing thousands of genes simultaneously, there is a need to identify reduced gene sets. Though several gene sets have been defined for toxicological applications, few of these were purposefully derived using toxicogenomics data. Here, we developed and applied a systematic approach to identify 1000 genes (called Toxicogenomics-1000 or T1000) highly responsive to chemical exposures. First, a co-expression network of 11,210 genes was built by leveraging microarray data from the Open TG-GATEs program. The network was then re-weighted based on prior knowledge of the genes' biological (KEGG, MSigDB) and toxicological (CTD) relevance. Finally, weighted correlation network analysis was applied to identify 258 gene clusters. T1000 was defined by selecting genes from each cluster that were most associated with outcome measures. For model evaluation, we compared the performance of T1000 to that of other gene sets (L1000, S1500, genes selected by Limma, and a random set) using two external datasets. Additionally, a smaller (T384) and a larger (T1500) version of T1000 were used for dose-response modeling to test the effect of gene set size. Our findings demonstrated that the T1000 gene set is predictive of apical outcomes across a range of conditions (e.g., in vitro and in vivo, dose-response, multiple species, tissues, and chemicals), and generally performs as well as, or better than, other available gene sets.
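The four steps named in the abstract (co-expression network, prior-based re-weighting, clustering, per-cluster selection) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the re-weighting rule (edge weight multiplied by the mean prior score of the two genes), the threshold, and the use of connected components in place of weighted correlation network analysis are all assumptions made for illustration, and the data are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def select_genes(expr, prior, outcome, threshold=0.7, per_cluster=1):
    """expr: {gene: expression per sample}; prior: {gene: score in [0, 1]};
    outcome: outcome measure per sample. All inputs are hypothetical."""
    genes = sorted(expr)
    # Steps 1-2: co-expression edge weight |r|, re-weighted by the mean
    # prior score of the two genes (an assumed rule, not the paper's).
    neighbours = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            w = abs(pearson(expr[g], expr[h])) * (prior[g] + prior[h]) / 2
            if w >= threshold:
                neighbours[g].add(h)
                neighbours[h].add(g)
    # Step 3: clusters as connected components of the thresholded network
    # (a simple stand-in for weighted correlation network analysis).
    clusters, seen = [], set()
    for g in genes:
        if g in seen:
            continue
        seen.add(g)
        stack, comp = [g], []
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for nb in neighbours[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(comp))
    # Step 4: from each cluster keep the genes most associated with the outcome.
    selected = []
    for comp in clusters:
        ranked = sorted(comp, key=lambda g: (-abs(pearson(expr[g], outcome)), g))
        selected.extend(ranked[:per_cluster])
    return clusters, selected


# Invented toy data: five genes measured in four samples.
expr = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 9],
    "C": [1, -1, 1, -1],
    "D": [2, -2, 2, -3],
    "E": [1, 2, 3, 5],
}
prior = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0, "E": 0.2}
outcome = [1, 2, 3, 4]

clusters, selected = select_genes(expr, prior, outcome)
print(clusters)
print(selected)
```

In this toy run, gene E is strongly co-expressed with A and B but carries a low prior score, so its re-weighted edges fall below the threshold and it ends up in a cluster of its own.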

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Density plot of lactate dehydrogenase (LDH) activity (%) across human and rat in vitro hepatic experiments from the Open TG-GATEs Project

About 86% of experiments fell within the normal range of 95%-105%, and the remaining 14% were classified as cytotoxic. The 95% and 105% cut-offs appear at 5% of the left and right tails, respectively.

DOI: 10.7287/peerj.preprints.27839v1/supp-1
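
The 95%-105% range described for this plot can be read as a simple labeling rule. The sketch below assumes the cut-offs are inclusive and uses invented LDH readings; the function name and labels are hypothetical and not taken from the study.

```python
def label_ldh(ldh_percent, low=95.0, high=105.0):
    """Label one experiment by LDH activity (%).
    Treating both cut-offs as inclusive is an assumption."""
    return "normal" if low <= ldh_percent <= high else "cytotoxic"


# Invented readings, for illustration only.
readings = [98.2, 101.5, 110.3, 94.9, 105.0]
labels = [label_ldh(v) for v in readings]
normal_fraction = labels.count("normal") / len(labels)
print(labels, normal_fraction)
```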

Ratios plot of BMDt/BMDa for each NTP experimental group determined with T384 gene set

DOI: 10.7287/peerj.preprints.27839v1/supp-2

Ratios plot of BMDt/BMDa for each NTP experimental group determined with T1500 gene set

DOI: 10.7287/peerj.preprints.27839v1/supp-3

PCA plots for 158 chemicals of the human in vitro dataset of Open TG-GATEs showing similarity of patterns when all genes are used and when only T1000 genes are considered

DOI: 10.7287/peerj.preprints.27839v1/supp-4

Explanation of the different iterations for computing prior scores

DOI: 10.7287/peerj.preprints.27839v1/supp-5

Discussion of LDH% vs. dose distribution for selection of binarization thresholds needed for the binary classification models

DOI: 10.7287/peerj.preprints.27839v1/supp-6

List of the 258 clusters produced in the study and the set of grouped genes in each cluster. Each line represents a different cluster of genes

DOI: 10.7287/peerj.preprints.27839v1/supp-7

Detailed performance evaluation scores of each of the gene sets when applied to the external kidney dataset from Open TG-GATEs

DOI: 10.7287/peerj.preprints.27839v1/supp-11

The list of T1500 genes, which includes T1000 as the top-ranked 1000 genes

Detailed quantitative comparison of gene expression space coverage

Ranked list of T1000 genes with annotations drawn from the Protein Atlas