PeerJ Computer Science Preprints: Optimization Theory and Computation
https://peerj.com/preprints/index.atom?journal=cs&subject=10800
Optimization Theory and Computation articles published in PeerJ Computer Science Preprints

GenHap: A novel computational method based on genetic algorithms for haplotype assembly
https://peerj.com/preprints/3246
2017-09-12
Andrea Tangherloni, Simone Spolaor, Leonardo Rundo, Marco S Nobile, Ivan Merelli, Paolo Cazzaniga, Daniela Besozzi, Giancarlo Mauri, Pietro Liò
The process of inferring a full haplotype of a cell is known as haplotyping, which consists of assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. In this work, we propose a novel computational method for haplotype assembly based on Genetic Algorithms (GAs), named GenHap. Our approach can efficiently solve large instances of the weighted Minimum Error Correction (wMEC) problem, yielding optimal solutions by means of a global search process. wMEC consists of computing the two haplotypes that partition the sequencing reads into two unambiguous sets with the least number of corrections to the SNP values. Since wMEC has been proven to be NP-hard, we tackle it with GAs, a population-based optimization strategy that mimics Darwinian processes. In GAs, a population of randomly generated individuals undergoes a selection mechanism and is modified by genetic operators. Each individual takes part in a selection process based on a quality measure (i.e., the fitness value), inspired by Darwin’s “survival of the fittest” principle.
Our preliminary experimental results show that GenHap is able to achieve correct solutions in short running times. Moreover, this approach can be used to compute haplotypes in organisms with different ploidy. The proposed evolutionary technique has the advantage that it could be formulated and extended using a multi-objective fitness function taking into account additional insights, such as the methylation patterns of the different chromosomes or the gene proximity in maps obtained through Chromosome Conformation Capture (3C) experiments.
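To make the wMEC objective described above concrete, here is a minimal sketch of how the cost of one candidate bipartition of reads might be evaluated. The read encoding (`0`/`1` alleles, `None` for uncovered positions), the per-entry weights, and the function name `wmec_cost` are illustrative assumptions, not GenHap's actual data structures. As a simplification, each consensus haplotype is computed independently per read set.

```python
from typing import List, Optional, Sequence, Tuple

Read = List[Optional[int]]  # allele per SNP position: 0, 1, or None if uncovered


def wmec_cost(
    reads: Sequence[Read],
    weights: Sequence[Sequence[float]],
    assignment: Sequence[int],
) -> Tuple[List[List[int]], float]:
    """Return the two consensus haplotypes and the total correction weight.

    assignment[i] is 0 or 1 and says which of the two sets read i belongs to.
    weights[i][j] is the (hypothetical) confidence of read i at SNP j.
    """
    n_snps = len(reads[0])
    total = 0.0
    haplotypes: List[List[int]] = []
    for side in (0, 1):
        members = [i for i, a in enumerate(assignment) if a == side]
        haplotype = []
        for j in range(n_snps):
            # Weight supporting each allele among the reads in this set.
            w0 = sum(weights[i][j] for i in members if reads[i][j] == 0)
            w1 = sum(weights[i][j] for i in members if reads[i][j] == 1)
            haplotype.append(0 if w0 >= w1 else 1)
            # The minority entries are the ones that would need correcting.
            total += min(w0, w1)
        haplotypes.append(haplotype)
    return haplotypes, total


reads = [
    [0, 0, 1, None],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [None, 1, 0, 0],
    [0, 1, 1, 1],
]
weights = [[1.0] * 4 for _ in reads]
haplotypes, cost = wmec_cost(reads, weights, [0, 0, 1, 1, 0])
```

A genetic algorithm of the kind the abstract describes could use such a cost as the fitness of an individual that encodes the `assignment` vector; that encoding is also an assumption here.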

The solution of large-scale Minimum Cost SAT Problem as a tool for data analysis in bioinformatics
https://peerj.com/preprints/2635
2016-12-12
Giovanni Felici, Daniele Ferone, Paola Festa, Antonio Napoletano, Tommaso Pastore
Data mining is one of the main activities in bioinformatics, specifically to extract knowledge from massive data sets related to gene expression measurement, CNV, DNA strings, and others. A wide array of methods is used to perform this task, ranging from the more established parametric statistical analysis to non-parametric techniques, to classification methods that have been developed in knowledge engineering and artificial intelligence. In this paper, we consider a method for extracting logic formulas from data that relies on a large body of literature in integer and logic optimization, originally presented in [1], and that has been widely and successfully applied to different problems in bioinformatics ([2], [3], [4], [5], [6]). This method is based on the iterative solution of Minimum Cost SAT Problems and is able to extract logic formulas in DNF form that possess interesting features for their interpretation. While leaving the discussion of the main features and motivations of this approach to the related literature, in this talk we focus on the problem of efficiently solving very large-scale instances of this well-known logic programming problem and propose a new GRASP approach that, by exploiting the specific structure of the problem, largely outperforms other established solvers for the same problem.
References
[1] G. Felici, K. Truemper. A Minsat Approach for Learning in Logic Domains, INFORMS Journal on Computing 14(1): 20-36 (2002).
[2] P. Bertolazzi, G. Felici, E. Weitschek. Learning to classify species with barcodes, BMC Bioinformatics, 10:1-12 (2009).
[3] M. Arisi, R. D’Onofrio, A. Brandi, S. Felsani, G. Capsoni, G. Drovandi, G. Felici, E. Weitschek, P. Bertolazzi, A. Cattaneo. Gene Expression Biomarkers in the Brain of a Mouse Model for Alzheimer’s Disease: Mining of Microarray Data by Logic Classification and Feature Selection. Journal of Alzheimer's Disease, 24(4) 721-738 (2011).
[4] E. Weitschek, A. Lo Presti, G. Drovandi, G. Felici, M. Ciccozzi, M. Ciotti, P. Bertolazzi. Human polyomaviruses identification by logic mining techniques. BMC Virology Journal, 9:58 (2012).
[5] E. Weitschek, G. Fiscon, G. Felici. Supervised DNA Barcodes species classification: analysis, comparisons and results, BMC BioData Mining, 7:4 (2014).
[6] P. Bertolazzi, G. Felici, P. Festa, G. Fiscon, E. Weitschek. Integer Programming models for Feature Selection: new extensions and a randomized solution algorithm, European Journal of Operational Research, 250: 389–399 (2016).
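For readers unfamiliar with the Minimum Cost SAT problem named in the abstract above, the following sketch states it operationally: among all truth assignments that satisfy a CNF formula, find one that minimises the total cost of the variables set to true. This is an exhaustive search for tiny instances only, not the GRASP approach the authors propose; the DIMACS-style clause encoding and the function name `min_cost_sat` are assumptions made for illustration.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def min_cost_sat(
    clauses: Sequence[Sequence[int]],
    costs: Sequence[float],
) -> Optional[Tuple[float, Tuple[bool, ...]]]:
    """Brute-force Minimum Cost SAT.

    clauses: CNF formula; literal k means variable k is true, -k means false
             (variables are numbered from 1).
    costs:   costs[k - 1] is the cost paid when variable k is set to true.
    Returns (cost, assignment) for a cheapest satisfying assignment,
    or None if the formula is unsatisfiable.
    """
    best: Optional[Tuple[float, Tuple[bool, ...]]] = None
    for bits in product((False, True), repeat=len(costs)):
        satisfied = all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if not satisfied:
            continue
        cost = sum(c for c, b in zip(costs, bits) if b)
        if best is None or cost < best[0]:
            best = (cost, bits)
    return best


# (x1 or x2) and (not x1 or x3) and (x2 or x3), with costs 4, 3, 2.
result = min_cost_sat([[1, 2], [-1, 3], [2, 3]], [4, 3, 2])
```

The search visits all 2^n assignments, which is why heuristics such as GRASP become necessary for the very large instances the abstract refers to.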

Engineering permanence in finite systems
https://peerj.com/preprints/2454
2016-11-12
Daniel Bilar
The man-machine integration era (MMIE) is marked by sensor ubiquity, whose readings map human beings to finite numbers. These numbers, processed by continuously changing, optimizing/learning, finite-precision, closed-loop, distributed systems, are used to drive decisions such as insurance rates, prison sentencing, health care allocations and probation guidelines. Optimization and system parameter tuning are increasingly left to machine learning and applied AI. One challenge we face is thus ensuring the indelibility, the permanence, and the infinite value of human beings as optimization-resistant invariants in such system environments. In this challenge paper, we propose developing safeguards, specifically working towards a 'deontological imprimatur' architecture embedding resilient representations of human beings.

Observation analysis tool for the FREEWAT GIS environment for water resources management
https://peerj.com/preprints/2127
2016-08-17
Massimiliano Cannata, Jakob Neumann, Mirko Cardoso, Rudy Rossetto, Laura Foglia
Time series are an important aspect of environmental modelling; they are becoming more available through the requirements of the Water Framework Directive, and more important with the advancement of numerical simulation techniques and increased model complexity. For this reason, the H2020 FREEWAT project, which aims at facilitating the adoption of modelling for water resource management, planned the integration of a tool for time-series analysis and processing. As a result, the Observation Analysis Tool was developed to enable time-series visualisation, pre-processing of data for model development, and post-processing of model results. The Observation Analysis Tool can act as a pre-processor for calibration observations, and will be expanded to incorporate its processing capabilities directly into the calibration process. The tool consists of an expandable Python library and an interface integrated in the QGIS FREEWAT plug-in, which includes a large number of modelling capabilities, hydro-chemical data management tools and calibration capacity. The tool has been extensively used and tested in different European institutions to collect a number of indications to drive future development.

Detecting periodicities with Gaussian processes
https://peerj.com/preprints/1743
2016-02-15
Nicolas Durrande, James Hensman, Magnus Rattray, Neil D Lawrence
We consider the problem of detecting and quantifying the periodic component of a function given noise-corrupted observations of a limited number of input/output tuples. Our approach is based on Gaussian process regression, which provides a flexible non-parametric framework for modelling periodic data. We introduce a novel decomposition of the covariance function as the sum of periodic and aperiodic kernels. This decomposition allows for the creation of sub-models which capture the periodic nature of the signal and its complement. To quantify the periodicity of the signal, we derive a periodicity ratio which reflects the uncertainty in the fitted sub-models. Although the method can be applied to many kernels, we give special emphasis to the Matérn family, from the expression of the reproducing kernel Hilbert space inner product to the implementation of the associated periodic kernels in a Gaussian process toolkit. The proposed method is illustrated by considering the detection of periodically expressed genes in the Arabidopsis genome.
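As a rough illustration of an additive covariance with a periodic and an aperiodic part, the sketch below sums a standard exp-sine-squared periodic kernel and a Matérn 3/2 kernel. This is a textbook stand-in, not the decomposition derived in the paper (which the abstract describes in terms of the reproducing kernel Hilbert space of Matérn kernels); all function names and default hyperparameters are assumptions.

```python
import math
from typing import Callable, List, Sequence


def periodic_kernel(x: float, y: float, variance: float = 1.0,
                    lengthscale: float = 1.0, period: float = 1.0) -> float:
    """Standard exp-sine-squared kernel: repeats exactly every `period`."""
    s = math.sin(math.pi * abs(x - y) / period)
    return variance * math.exp(-2.0 * s * s / lengthscale ** 2)


def matern32_kernel(x: float, y: float, variance: float = 1.0,
                    lengthscale: float = 1.0) -> float:
    """Matern 3/2 kernel: decays with distance, so it is aperiodic."""
    r = math.sqrt(3.0) * abs(x - y) / lengthscale
    return variance * (1.0 + r) * math.exp(-r)


def sum_kernel(x: float, y: float) -> float:
    """Additive covariance: periodic part plus aperiodic part."""
    return periodic_kernel(x, y) + matern32_kernel(x, y)


def gram(xs: Sequence[float],
         kernel: Callable[[float, float], float]) -> List[List[float]]:
    """Covariance (Gram) matrix of the inputs under the given kernel."""
    return [[kernel(a, b) for b in xs] for a in xs]


xs = [0.0, 0.25, 0.5, 1.0]
K = gram(xs, sum_kernel)
```

Because the covariance is a sum, a fitted model of this form splits into one sub-model per term, which is the property that makes a comparison between the periodic and aperiodic parts possible.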

A computational framework for colour metrics and colour space transforms
https://peerj.com/preprints/1700
2016-02-02
Ivar Farup
An object-oriented computational framework for the transformation of colour data and colour metric tensors is presented. The main idea of the design is to represent the transforms between spaces as compositions of objects from a class hierarchy providing the methods for both the transforms themselves and the corresponding Jacobian matrices. In this way, new colour spaces can be implemented on the fly by transforming from any existing colour space, and colour data in various formats as well as colour metric tensors and colour difference data can easily be transformed between the colour spaces. This reduces what normally requires several days of coding to a few lines of code without introducing a significant computational overhead. The framework is implemented in the Python programming language.
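The design idea described above, transforms as composable objects that supply both the mapping and its Jacobian, can be sketched in a few lines. The class names (`Transform`, `Linear`, `Power`, `Compose`) and the plain-list matrix representation are hypothetical and chosen only for this example; they are not the framework's actual API.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for small colour matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


class Transform:
    """Base class: a mapping between colour spaces plus its Jacobian."""

    def apply(self, x: Vector) -> Vector:
        raise NotImplementedError

    def jacobian(self, x: Vector) -> Matrix:
        raise NotImplementedError


class Linear(Transform):
    """Matrix transform, e.g. between two linear tristimulus spaces."""

    def __init__(self, m: Matrix) -> None:
        self.m = m

    def apply(self, x: Vector) -> Vector:
        return [sum(row[k] * x[k] for k in range(len(x))) for row in self.m]

    def jacobian(self, x: Vector) -> Matrix:
        return [row[:] for row in self.m]


class Power(Transform):
    """Element-wise power law, e.g. a gamma-like non-linearity."""

    def __init__(self, gamma: float) -> None:
        self.gamma = gamma

    def apply(self, x: Vector) -> Vector:
        return [v ** self.gamma for v in x]

    def jacobian(self, x: Vector) -> Matrix:
        n = len(x)
        return [[self.gamma * x[i] ** (self.gamma - 1) if i == j else 0.0
                 for j in range(n)] for i in range(n)]


class Compose(Transform):
    """Chain of transforms applied left to right; Jacobian by the chain rule."""

    def __init__(self, *steps: Transform) -> None:
        self.steps = steps

    def apply(self, x: Vector) -> Vector:
        for step in self.steps:
            x = step.apply(x)
        return x

    def jacobian(self, x: Vector) -> Matrix:
        n = len(x)
        jac = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for step in self.steps:
            jac = matmul(step.jacobian(x), jac)  # evaluate at the current point
            x = step.apply(x)
        return jac


new_space = Compose(Linear([[2.0, 0.0], [0.0, 3.0]]), Power(2.0))
point = new_space.apply([1.0, 1.0])
jac = new_space.jacobian([1.0, 1.0])
```

With the Jacobian of a composed transform available, quantities such as metric tensors can be carried from one space to another; that step is not shown here.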

The set of \((N/2^i)\)-distance graphs of \(C_N\) and its application to efficient information broadcast among \(N\) nodes over \(K_N\)
https://peerj.com/preprints/1372
2015-09-16
Jaderick P Pabico
The set \(S\) of \(\left(\frac{N}{2^i}\right)\)-distance graphs of order-\(N\) cycle graphs \(\mathbb{C}_N\) is defined here as \[S = \left\{ s_i= \bigcup_1^{\frac{N}{2^i}} \mathbb{C}_{2^i} \quad\Bigg\vert\quad i=1,2, \dots, \log N\right\},\] where w.l.o.g. \(N=2^k\), \(\forall k=2, 3, \dots\), and \(\log\) is base two. The utility of the computation of \(S\) is demonstrated by a \(\mathcal{O}(|S| = \log N)\)-step implementation of various information broadcasts and their corresponding duals (i.e., reductions) among \(N\) completely-connected nodes \(\mathbb{K}_N\) exchanging messages under a realistic \((1, 1, 2)\) communication model (i.e., concurrent one in-port and one out-port over a duplex connection). Information broadcast over \(\mathbb{K}_N\) under \((1,1,2)\) is currently implemented with \(\mathcal{O}(N)\) steps using a series of \(N\) circular 1-shift operations (or one circular \(N\)-shift). Algorithmically, the \(i\)th element \(s_i\in S\) partially coincides with the \(i\)th dimension of a \(\mathbb{K}_N\)-embedded \((\log N)\)-cube.
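The \(\log N\)-step bound can be illustrated with the classic recursive-doubling exchange on an embedded hypercube, where in step \(i\) every node swaps what it knows with the partner whose index differs in bit \(i\). This is the standard hypercube schedule, offered as an assumption-laden sketch; it is not claimed to be the exact schedule the paper derives from the distance graphs \(s_i\). The function name `hypercube_allgather` is hypothetical.

```python
from typing import List, Set, Tuple


def hypercube_allgather(n: int) -> Tuple[List[Set[int]], int]:
    """Simulate an all-to-all broadcast among n = 2^k nodes.

    Each node starts knowing only its own message (its index). In every step
    a node talks to exactly one partner, sending once and receiving once,
    which respects a one in-port / one out-port duplex model.
    Returns the final knowledge of every node and the number of steps used.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    known: List[Set[int]] = [{p} for p in range(n)]
    steps = 0
    distance = 1
    while distance < n:
        # Partner of p along this hypercube dimension is p XOR distance.
        known = [known[p] | known[p ^ distance] for p in range(n)]
        distance <<= 1
        steps += 1
    return known, steps


known, steps = hypercube_allgather(8)
```

For `n = 8` the loop runs three times, against the eight circular 1-shift steps of the linear scheme mentioned in the abstract.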

An optimization approach to detect differentially methylated regions from Whole Genome Bisulfite Sequencing data
https://peerj.com/preprints/1287
2015-09-02
Nina Hesse, Christopher Schröder, Sven Rahmann
Whole genome bisulfite sequencing (WGBS) is the current method of choice to obtain the methylation status of each single CpG dinucleotide in a genome. The typical analysis asks for regions that are differentially methylated (DMRs) between samples of two classes, such as different cell types. However, even with current low sequencing costs, many studies need to cope with few samples and medium coverage to stay within budget. We present a method to conservatively estimate the methylation difference between the two classes. Starting from a Bayesian paradigm, we formulate an optimization problem related to LASSO approaches. We present a dynamic programming approach to efficiently compute the optimal solution, and its implementation, diffmer. We discuss the dependency of the resulting DMRs on the free parameters of our approach and compare the results to those obtained by other DMR discovery tools (BSmooth and RADMeth). We show that our method discovers DMRs that are missed by the other tools.

Matrix compression methods
https://peerj.com/preprints/849
2015-02-23
Crysttian A. Paixão, Flávio Codeço Coelho
The biggest cost of computing with large matrices on any modern computer is related to memory latency and bandwidth. The average latency of modern RAM reads is 150 times greater than a clock step of the processor (Alted, 2010). Throughput is a little better but still 25 times slower than the CPU can consume. The application of bitstring compression allows for larger matrices to be moved entirely to the cache memory of the computer, which has much better latency and bandwidth (average latency of L1 cache is 3 to 4 clock steps). This allows for massive performance gains as well as the ability to simulate much larger models efficiently. In this work, we propose a methodology to compress matrices in such a way that they retain their mathematical properties. Considerable compression of the data is also achieved in the process, thus allowing for the computation of much larger linear problems within the same memory constraints when compared with the traditional representation of matrices.
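One way to picture bitstring compression that keeps arithmetic usable is to pack several small non-negative integers into fixed-width fields of a single word. The sketch below is only an assumption about what such a scheme might look like, not the authors' method; the names `pack` and `unpack` and the 8-bit field width are chosen for illustration. It shows that adding two packed rows with one integer addition matches element-wise addition, provided no field overflows.

```python
from typing import List, Sequence


def pack(values: Sequence[int], bits: int) -> int:
    """Pack non-negative integers into consecutive `bits`-wide fields."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError(f"value {v} does not fit in {bits} bits")
        word |= v << (i * bits)
    return word


def unpack(word: int, bits: int, count: int) -> List[int]:
    """Recover `count` fields of width `bits` from a packed word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]


row_a = [1, 2, 3, 4]
row_b = [5, 6, 7, 8]

# One integer addition adds all four entries at once, as long as every
# per-field sum stays below 2**8 so that no carry crosses a field boundary.
packed_sum = pack(row_a, 8) + pack(row_b, 8)
row_sum = unpack(packed_sum, 8, 4)
```

Python integers have arbitrary precision, so this example sidesteps the fixed 64-bit word size that a cache-oriented implementation would have to respect.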