Predicting context specific enhancer-promoter interactions from ChIP-Seq time course data
- Published
- Accepted
- Subject Areas
- Bioinformatics, Computational Biology, Genomics
- Keywords
- Enhancer-promoter interaction, Bayesian classifier, Machine learning, Estrogen receptor, ChIP-Seq
- Copyright
- © 2017 Dzida et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2017. Predicting context specific enhancer-promoter interactions from ChIP-Seq time course data. PeerJ Preprints 5:e3093v1 https://doi.org/10.7287/peerj.preprints.3093v1
Abstract
We have developed a machine learning approach to predict context-specific enhancer-promoter interactions using evidence from changes in genomic protein occupancy over time. The occupancy of estrogen receptor alpha (ERα), RNA polymerase II (Pol II) and histone marks H2AZ and H3K4me3 was measured over time using ChIP-Seq experiments in MCF7 cells stimulated with estrogen. A Bayesian classifier was developed which uses the correlation of temporal binding patterns at enhancers and promoters, together with genomic proximity, as features to predict interactions. This method was trained using experimentally determined interactions from the same system and was shown to achieve much higher precision than predictions based on the genomic proximity of the nearest ERα binding. We use the method to identify a confident genome-wide set of ERα target genes and their regulatory enhancers. Validation with publicly available GRO-Seq data demonstrates that our predicted targets are much more likely to show early nascent transcription than predictions based on genomic ERα binding proximity alone.
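To make the classification idea concrete, the following is a minimal sketch of a naive Bayes classifier over per-pair features. It is not the authors' implementation: the Gaussian class-conditional densities, the two-feature layout (one correlation and a log10 distance), the function names and the toy values are all assumptions introduced for illustration.

```python
import math
from statistics import mean, pstdev


def fit_gaussian_nb(X, y):
    """Estimate a class prior and per-feature (mean, std) for each label.

    Gaussian likelihoods are an assumption of this sketch; the source only
    states that a Bayesian classifier was used.
    """
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        stats = []
        for column in zip(*rows):
            sd = max(pstdev(column), 1e-6)  # floor avoids a zero variance
            stats.append((mean(column), sd))
        model[label] = (len(rows) / len(X), stats)
    return model


def log_gauss(x, mu, sd):
    """Log density of a normal distribution with mean mu and std sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)


def posterior_interacting(model, x):
    """Posterior probability that a pair belongs to the positive class (label 1)."""
    log_joint = {
        label: math.log(prior)
        + sum(log_gauss(v, mu, sd) for v, (mu, sd) in zip(x, stats))
        for label, (prior, stats) in model.items()
    }
    top = max(log_joint.values())  # subtract the max for numerical stability
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[1] - top) / norm


# Toy, made-up feature rows: [temporal correlation, log10(distance in bp)].
X = [[0.9, 4.0], [0.8, 4.3], [0.7, 4.1], [0.1, 5.5], [-0.2, 5.8], [0.0, 5.2]]
y = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(X, y)
near_correlated = posterior_interacting(model, [0.85, 4.1])
far_uncorrelated = posterior_interacting(model, [-0.1, 5.6])
```

In this toy setting a nearby, well-correlated pair receives a higher posterior than a distant, uncorrelated one; ranking pairs by this posterior is what allows predictions to be thresholded.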
Author Comment
This is a submission to PeerJ for review.
Supplemental Information
Process of merging individual MACS-called peaks
The cartoon shows how individual MACS-called peaks are merged to find the approximate locations of time-persistent ER-α binding. MACS-detected, time-varying peaks from the [0], 5, ..., 320 min time points (the 0 min time point is optional and excluded by default) that co-occur in at least two time points are merged by a union operation to produce the approximate consensus location of a single binding. Peaks that occur at only one time point are discarded.
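The merging step described above can be sketched as a sweep over sorted intervals. This is an illustrative reading, not the authors' code: it assumes that "co-occur" means overlapping genomic intervals, that coordinates are half-open, and that peaks are handled one chromosome at a time. The function name and example coordinates are hypothetical.

```python
def merge_persistent_peaks(peaks_by_time, min_timepoints=2):
    """Union overlapping peaks and keep regions seen at several time points.

    peaks_by_time maps a time point (minutes) to a list of half-open
    (start, end) intervals on a single chromosome.
    """
    tagged = sorted(
        (start, end, t)
        for t, peaks in peaks_by_time.items()
        for start, end in peaks
    )
    merged = []
    cur_start = cur_end = None
    cur_times = set()
    for start, end, t in tagged:
        if cur_end is not None and start < cur_end:
            # Overlaps the running cluster: extend the union.
            cur_end = max(cur_end, end)
            cur_times.add(t)
        else:
            # Close the previous cluster if enough time points support it.
            if cur_end is not None and len(cur_times) >= min_timepoints:
                merged.append((cur_start, cur_end))
            cur_start, cur_end, cur_times = start, end, {t}
    if cur_end is not None and len(cur_times) >= min_timepoints:
        merged.append((cur_start, cur_end))
    return merged


# Made-up peaks: three overlapping peaks at 5, 10 and 40 min, two singletons.
example = {
    5: [(100, 200), (1000, 1100)],
    10: [(150, 260)],
    20: [(5000, 5100)],
    40: [(240, 300)],
}
consensus = merge_persistent_peaks(example)  # [(100, 300)]
```

The two peaks that appear at a single time point only are dropped, while the three overlapping peaks collapse into one consensus region.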
Joint clustering of PolII and ER with AP
The figure shows the clustering of the joint time courses of Pol II and ER-α at enhancers with Affinity Propagation. The clustering involves only those time series which individually have a sum of at least 200 tags across all time points.
Histograms of positive and negative features between genes (300bp upstream + gene) and enhancers - all chromosomes
The graphs (a, b, c, d) show positive (green) and negative (yellow) distributions of correlations between time series of 300bp-upstream-extended-gene regions and enhancer bodies for ER-α, PolII, H2AZ and H3K4me3 collected across all 23 chromosomes. The figure (e) shows the distribution of genomic distances between centres of distal enhancers and 300bp-upstream-shifted-TSS of genes. The set of positive and negative pairs was constructed using 300bp-upstream-extended-genes and distal enhancers.
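The two kinds of feature summarised in these histograms can be sketched as follows. This assumes the correlation is a Pearson correlation between occupancy time series and that "upstream-shifted" means a strand-aware shift of the TSS; neither detail is confirmed here, and all function names and values are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length time series (0.0 if one is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def shifted_tss(tss, strand, upstream=300):
    """Move the TSS upstream by `upstream` bp, respecting the gene strand."""
    return tss - upstream if strand == "+" else tss + upstream


def enhancer_gene_distance(enh_start, enh_end, tss, strand, upstream=300):
    """Distance in bp from the enhancer centre to the upstream-shifted TSS."""
    centre = (enh_start + enh_end) / 2
    return abs(centre - shifted_tss(tss, strand, upstream))


# Made-up tag counts for one enhancer and one gene region over time.
enhancer_series = [1, 2, 3]
gene_series = [2, 4, 6]
corr = pearson(enhancer_series, gene_series)
dist = enhancer_gene_distance(10000, 10400, 50000, "+")  # 39500.0
```

Setting `upstream=1500` would give the distance variant used for the 1500bp models.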
Histograms of positive and negative features between genes (1500bp upstream + gene) and enhancers - odd chromosomes
The graphs (a, b, c, d) show positive (green) and negative (yellow) distributions of correlations between time series of 300bp-upstream-extended-gene regions and enhancer bodies for ER-α, PolII, H2AZ and H3K4me3 collected across all odd chromosomes. The figure (e) shows the distribution of genomic distances between centres of distal enhancers and 1500bp-upstream-shifted-TSS of genes. The set of positive and negative pairs was constructed using 1500bp-upstream-extended-genes and distal enhancers.
Performance of the enhancer-gene (300bp upstream + gene) model on odd chromosomes - all combinations
The figure compares the performance of the NB model on odd chromosomes (training data), measured by Precision-TPR and MAP scores. The Precision-TPR curves show the accuracy of the predictions with the highest 10%, 20% and 30% of scores, i.e. posterior probabilities. The second and third rows stratify the predictions at each threshold into those which take place within domains and those involving inter-domain contacts. The set of positive and negative pairs for the first model was constructed using 300bp-upstream-extended-genes and distal enhancers. The correlation-based attributes of the two models were estimated using signals (time series) aggregated over 300bp-upstream-extended-genes and distal enhancer bodies. The separation-based feature was estimated from the 300bp-upstream-shifted TSS to the centres of the ER-α enhancers.
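One way to read the thresholded Precision-TPR evaluation is sketched below: pairs are ranked by posterior probability, and precision and true positive rate are computed for the top 10%, 20% and 30%. This is an interpretation under stated assumptions rather than the authors' evaluation code; the function name, scores and labels are made up.

```python
def precision_tpr_at_top(scores, labels, fractions=(0.1, 0.2, 0.3)):
    """Return {fraction: (precision, TPR)} for the top-scoring predictions.

    scores are posterior probabilities; labels are 1 for experimentally
    supported interactions and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    results = {}
    for fraction in fractions:
        k = max(1, int(round(fraction * len(ranked))))  # size of the top slice
        true_pos = sum(label for _, label in ranked[:k])
        precision = true_pos / k
        tpr = true_pos / total_pos if total_pos else 0.0
        results[fraction] = (precision, tpr)
    return results


# Ten made-up candidate pairs, three of which are true interactions.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
curve = precision_tpr_at_top(scores, labels)
```

Stratifying by domain, as in the second and third rows of the figure, would amount to calling the same function on the subset of pairs within domains and on the subset spanning domains.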
Performance of the enhancer-gene (300bp upstream + gene) model on even chromosomes - all combinations
The figure compares the performance of the NB model on even chromosomes (test data), measured by Precision-TPR and MAP scores. The Precision-TPR curves show the accuracy of the predictions with the highest 10%, 20% and 30% of scores, i.e. posterior probabilities. The second and third rows stratify the predictions at each threshold into those which take place within domains and those involving inter-domain contacts. The set of positive and negative pairs for the first model was constructed using 300bp-upstream-extended-genes and distal enhancers. The correlation-based attributes of the two models were estimated using signals (time series) aggregated over 300bp-upstream-extended-genes and distal enhancer bodies. The separation-based feature was estimated from the 300bp-upstream-shifted TSS to the centres of the ER-α enhancers.
Performance of the enhancer-gene (1500bp upstream + gene) model on odd chromosomes - all combinations
The figure compares the performance of the NB model on odd chromosomes (training data), measured by Precision-TPR and MAP scores. The Precision-TPR curves show the accuracy of the predictions with the highest 10%, 20% and 30% of scores, i.e. posterior probabilities. The second and third rows stratify the predictions at each threshold into those which take place within domains and those involving inter-domain contacts. The set of positive and negative pairs for the first model was constructed using 1500bp-upstream-extended-genes and distal enhancers. The correlation-based attributes of the two models were estimated using signals (time series) aggregated over 300bp-upstream-extended-genes and distal enhancer bodies. The separation-based feature was estimated from the 1500bp-upstream-shifted TSS to the centres of the ER-α enhancers.
Performance of the enhancer-gene (1500bp upstream + gene) model on even chromosomes - all combinations
The figure compares the performance of the NB model on even chromosomes (test data), measured by Precision-TPR and MAP scores. The Precision-TPR curves show the accuracy of the predictions with the highest 10%, 20% and 30% of scores, i.e. posterior probabilities. The second and third rows stratify the predictions at each threshold into those which take place within domains and those involving inter-domain contacts. The set of positive and negative pairs for the first model was constructed using 1500bp-upstream-extended-genes and distal enhancers. The correlation-based attributes of the two models were estimated using signals (time series) aggregated over 300bp-upstream-extended-genes and distal enhancer bodies. The separation-based feature was estimated from the 1500bp-upstream-shifted TSS to the centres of the ER-α enhancers.
Performance on odd chromosomes for alternative MACS parametrisation; (1e-07, λ local off ) vs (1e-05, λ local on)
The first column of the figure shows the performance of the NB model on all odd chromosomes. The model was trained on the stringent, time-persistent merged MACS-called peaks (i.e. distal ER-α bindings) from the scan with a p-value of 1e-07 and the local control switched off, in which case the search is done with λ BG. The second column shows the performance under the alternative peak calling with a p-value of 1e-05 (MACS’ default), no control and the local control flag on. The set of positive and negative pairs for the first model was constructed using 300bp-upstream-extended-genes and distal enhancers. The correlation-based attributes of the model were estimated using pairs of 300bp-upstream-extended-genes and enhancers (merged distal MACS-called peaks). The separation-based feature was estimated from the 300bp-upstream-shifted TSS to the centres of the ER-α enhancers.
Performance on even chromosomes for alternative MACS parametrisation; (1e-07, λ local off ) vs (1e-05, λ local on)
The first column of the figure shows the performance of the NB model on all even chromosomes. The model was trained on the stringent, time-persistent merged MACS-called peaks (i.e. distal ER-α bindings) from the scan with a p-value of 1e-07 and the local control switched off, in which case the search is done with λ BG. The second column shows the performance under the alternative peak calling with a p-value of 1e-05 (MACS’ default), no control and the local control flag on. The set of positive and negative pairs for the first model was constructed using 300bp-upstream-extended-genes and distal enhancers. The correlation-based attributes of the model were estimated using pairs of 300bp-upstream-extended-genes and enhancers (merged distal MACS-called peaks). The separation-based feature was estimated from the 300bp-upstream-shifted TSS to the centres of the ER-α enhancers.
Histograms of positive and negative features between genes (300bp upstream + gene) and enhancers (lambda on, 10e-05) - odd chromosomes
The graphs (a, b, c, d) show positive (green) and negative (yellow) distributions of correlations between time series of 300bp-upstream-extended-gene regions and enhancer bodies (MACS: λ local on, p-value 10e-05) for ER-α, PolII, H2AZ and H3K4me3 collected across all odd chromosomes. The figure (e) shows the distribution of genomic distances between centres of distal enhancers and 300bp-upstream-shifted-TSS of genes. The set of positive and negative pairs was constructed using 300bp-upstream-extended-genes and distal enhancers.