Predicting enhancers using a small subset of high confidence examples and co-training

Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Berlin, Germany
RNA Bioinformatics, Max Planck Institute for Molecular Genetics, Berlin, Berlin, Germany
DOI
10.7287/peerj.preprints.2407v1
Subject Areas
Computational Biology, Genomics
Keywords
enhancers, co-training, semi-supervised learning
Copyright
© 2016 Huska et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Huska MR, Ramisch A, Vingron M, Marsico A. 2016. Predicting enhancers using a small subset of high confidence examples and co-training. PeerJ Preprints 4:e2407v1

Abstract

Enhancers are important regulatory regions located throughout the genome, primarily in non-coding regions. Several experimental methods have been developed in recent years to identify their locations, but the search space is large and the overlap between the putative enhancers identified by these methods tends to be very small. Computational methods for enhancer prediction often use one large set of experimentally identified enhancer regions as input, and therefore rely critically on the correctness of that set. We take a different approach and start with a high-confidence set of 21 enhancers that lie in the intersection of enhancers identified using three completely unrelated experimental approaches: deepCAGE, HiCap and classical enhancer reporter assays. Because this starting set is so small, we use a semi-supervised approach called co-training, rather than a fully supervised approach, to progressively predict enhancers from unlabeled regions. Using this approach, we outperform supervised learning as well as simpler semi-supervised learning methods, and achieve an average area under the ROC curve of 0.84.
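
The co-training idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifiers, the feature views, or the selection rule, so everything below is an assumption. The sketch uses a nearest-centroid scorer as a stand-in classifier, two synthetic feature "views" (`view_a`, `view_b`), and hypothetical parameters `rounds` and `per_round`. In each round, the classifier trained on one view labels the unlabeled examples it is most confident about, and those labels are added to a shared pool that the classifier on the other view then trains on.

```python
import numpy as np

def fit_centroids(X, y):
    """Mean feature vector of the negative (0) and positive (1) class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def score(centroids, X):
    """Signed confidence: positive means closer to the positive centroid."""
    neg, pos = centroids
    return np.linalg.norm(X - neg, axis=1) - np.linalg.norm(X - pos, axis=1)

def co_train(view_a, view_b, labels, rounds=10, per_round=5):
    """Illustrative co-training loop (assumed design, not the paper's).

    labels: 1 = positive, 0 = negative, -1 = unlabeled.
    Returns a copy of labels with some unlabeled entries filled in.
    """
    labels = labels.copy()
    for _ in range(rounds):
        # Alternate between views; both share the same growing label pool.
        for view in (view_a, view_b):
            unlabeled = np.flatnonzero(labels == -1)
            if unlabeled.size == 0:
                return labels
            known = labels != -1
            centroids = fit_centroids(view[known], labels[known])
            s = score(centroids, view[unlabeled])
            # Keep only the examples this view is most confident about.
            confident = np.argsort(-np.abs(s))[:per_round]
            labels[unlabeled[confident]] = (s[confident] > 0).astype(int)
    return labels

# Synthetic demonstration data (not real enhancer data).
rng = np.random.default_rng(0)
n = 200
truth = rng.integers(0, 2, n)
view_a = truth[:, None] * 4.0 + rng.normal(size=(n, 3))
view_b = truth[:, None] * 4.0 + rng.normal(size=(n, 3))

# Start from a very small labeled seed set: 5 positives and 5 negatives.
seed_idx = np.concatenate([np.flatnonzero(truth == 1)[:5],
                           np.flatnonzero(truth == 0)[:5]])
initial = np.full(n, -1)
initial[seed_idx] = truth[seed_idx]

result = co_train(view_a, view_b, initial)
labeled = result != -1
accuracy = (result[labeled] == truth[labeled]).mean()
print(f"labeled {labeled.sum()} of {n} regions, accuracy {accuracy:.2f}")
```

The small seed set here plays the role of a high-confidence starting set; in a real setting the two views would be distinct feature groups computed for each candidate region, and a held-out evaluation would replace the comparison against synthetic ground truth.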

Author Comment

This article has been accepted for the "GCB 2016 Conference".