Predicting enhancers using a small subset of high confidence examples and co-training

Matthew R Huska; Anna Ramisch; Martin Vingron; Annalisa Marsico

doi:10.7287/peerj.preprints.2407v1

Matthew R Huska¹, Anna Ramisch¹, Martin Vingron¹, Annalisa Marsico²

1 Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Berlin, Germany

2 RNA Bioinformatics, Max Planck Institute for Molecular Genetics, Berlin, Berlin, Germany

Published: 2016-09-01
Accepted: 2016-09-01

Subject Areas: Computational Biology, Genomics
Keywords: enhancers, co-training, semi-supervised learning

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Huska MR, Ramisch A, Vingron M, Marsico A. 2016. Predicting enhancers using a small subset of high confidence examples and co-training. PeerJ Preprints 4:e2407v1 https://doi.org/10.7287/peerj.preprints.2407v1

Abstract

Enhancers are important regulatory regions located throughout the genome, primarily in non-coding regions. Several experimental methods have been developed in recent years to identify their locations, but the search space is large and the overlap between the putative enhancers identified by these methods tends to be very small. Computational methods for enhancer prediction often use one large set of experimentally identified enhancer regions as input, and therefore depend critically on the correctness of that set. We take a different approach and start with a high confidence set of 21 enhancers that lie in the intersection of the enhancers identified using three completely unrelated experimental approaches: deepCAGE, HiCap and classical enhancer reporter assays. Because this starting set is so small, we use a semi-supervised approach called co-training, rather than a fully supervised approach, to progressively predict enhancers from unlabeled regions. Using this approach, we outperform supervised learning as well as simpler semi-supervised learning methods and achieve an average area under the ROC curve of 0.84.
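
The general idea of co-training can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which feature views or classifiers were used, so the two single-number "views" per region, the nearest-centroid classifier, the toy values, and every function name below are assumptions chosen only for illustration. One model is trained per feature view; each model in turn labels the unlabeled example it is most confident about and adds it to the shared labeled set.

```python
import math

def fit_centroids(examples, labels, view):
    """Return {label: centroid} for one feature view (0 or 1)."""
    centroids = {}
    for lab in set(labels):
        rows = [x[view] for x, y in zip(examples, labels) if y == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, features):
    """Return (label, margin): nearest centroid and the gap in distances."""
    dists = {lab: math.dist(features, c) for lab, c in centroids.items()}
    label = min(dists, key=dists.get)
    margin = max(dists.values()) - min(dists.values())
    return label, margin

def co_train(labeled, labels, unlabeled, rounds=5, per_round=1):
    """Grow the labeled set by letting each view label its most confident examples."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = fit_centroids(labeled, labels, view)
            scored = [(predict(model, x[view]), i) for i, x in enumerate(pool)]
            scored.sort(key=lambda s: s[0][1], reverse=True)  # largest margin first
            chosen = scored[:per_round]
            for (lab, _margin), i in chosen:
                labeled.append(pool[i])
                labels.append(lab)
            # Remove the newly labeled examples from the pool (highest index first).
            for i in sorted((i for _, i in chosen), reverse=True):
                del pool[i]
    models = [fit_centroids(labeled, labels, v) for v in (0, 1)]
    return models, labeled, labels

def classify(models, x):
    """Label x using whichever view's model has the larger margin."""
    preds = [predict(m, x[v]) for v, m in enumerate(models)]
    return max(preds, key=lambda p: p[1])[0]

# Toy data (hypothetical): each region is (view_0_features, view_1_features).
seed_examples = [([1.0], [1.0]), ([0.0], [0.0])]  # one enhancer, one non-enhancer
seed_labels = [1, 0]
unlabeled_regions = [([0.9], [0.8]), ([0.1], [0.2]), ([0.8], [0.9]), ([0.2], [0.1])]

models, final_examples, final_labels = co_train(
    seed_examples, seed_labels, unlabeled_regions
)
```

In this sketch both views add to one shared labeled set, which is one common formulation of co-training; other variants have each classifier feed only the other's training set.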

Author Comment

This article has been accepted for the "GCB 2016 Conference".