The solution of large-scale Minimum Cost SAT Problem as a tool for data analysis in bioinformatics
Abstract
Data mining is one of the main activities in bioinformatics, where it is used to extract knowledge from massive data sets related to gene expression measurements, copy number variation (CNV), DNA strings, and other sources. A wide array of methods is used to perform this task, ranging from the more established parametric statistical analysis to nonparametric techniques, to classification methods developed in knowledge engineering and artificial intelligence. We consider a method for extracting logic formulas from data that relies on a large body of literature in integer and logic optimization. The method was originally presented in [1] and has been widely and successfully applied to different problems in bioinformatics ([2], [3], [4], [5], [6]). It is based on the iterative solution of Minimum Cost SAT Problems and extracts logic formulas in disjunctive normal form (DNF) whose features make them easy to interpret. Leaving the discussion of the main features and motivations of this approach to the related literature, we focus here on the problem of efficiently solving very large-scale instances of this well-known logic programming problem. We propose a new GRASP (greedy randomized adaptive search procedure) approach that exploits the specific structure of the problem and substantially outperforms other established solvers for the same problem.
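As a rough illustration of the kind of procedure the abstract refers to, the following minimal sketch applies a generic GRASP scheme (greedy randomized construction followed by local search) to a toy Minimum Cost SAT instance. It is not the authors' algorithm, whose problem-specific features are not described here. The function names, the greedy score, the `alpha` parameter, and the toy instance are all assumptions introduced for illustration only.

```python
import random

# Clauses use DIMACS-style integers: +i stands for x_i, -i for NOT x_i.
# costs[i] >= 0 is the (illustrative) cost paid when x_i is set to True.

def is_satisfied(clauses, assign):
    """True if every clause contains at least one literal that holds."""
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

def total_cost(assign, costs):
    """Sum of the costs of the variables set to True."""
    return sum(costs[v] for v, value in assign.items() if value)

def construct(clauses, costs, alpha, rng):
    """Greedy randomized construction; returns None if a clause is falsified."""
    remaining = [set(c) for c in clauses]
    assign = {}
    while remaining:
        # Assumed greedy score: clauses a literal satisfies, discounted by cost.
        score = {}
        for clause in remaining:
            for lit in clause:
                weight = 1.0 / (1.0 + (costs[abs(lit)] if lit > 0 else 0.0))
                score[lit] = score.get(lit, 0.0) + weight
        g_max, g_min = max(score.values()), min(score.values())
        threshold = g_max - alpha * (g_max - g_min)
        # Restricted candidate list: literals whose score is close to the best.
        rcl = sorted(l for l, g in score.items() if g >= threshold)
        lit = rng.choice(rcl)
        assign[abs(lit)] = lit > 0
        next_remaining = []
        for clause in remaining:
            if lit in clause:
                continue  # clause is now satisfied
            reduced = clause - {-lit}
            if not reduced:
                return None  # clause falsified: discard this construction
            next_remaining.append(reduced)
        remaining = next_remaining
    for v in costs:
        assign.setdefault(v, False)  # unassigned variables cost nothing if False
    return assign

def local_search(clauses, costs, assign):
    """Flip costly True variables to False while the formula stays satisfied."""
    improved = True
    while improved:
        improved = False
        for v in sorted(assign, key=lambda v: -costs[v]):
            if assign[v] and costs[v] > 0:
                assign[v] = False
                if is_satisfied(clauses, assign):
                    improved = True
                else:
                    assign[v] = True  # undo the flip
    return assign

def grasp_min_cost_sat(clauses, costs, iterations=50, alpha=0.3, seed=0):
    """Repeat construction + local search and keep the cheapest model found."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(iterations):
        candidate = construct(clauses, costs, alpha, rng)
        if candidate is None:
            continue
        candidate = local_search(clauses, costs, candidate)
        cost = total_cost(candidate, costs)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# Toy instance: (x1 OR x2) AND (NOT x1 OR x3) AND (x2 OR x3)
toy_clauses = [[1, 2], [-1, 3], [2, 3]]
toy_costs = {1: 5, 2: 1, 3: 2}
best, best_cost = grasp_min_cost_sat(toy_clauses, toy_costs)
print(best, best_cost)
```

On this toy instance the cheapest model sets only x2 to True, at cost 1. Like any heuristic of this kind, the sketch can fail to find a model even when one exists, and it makes no claim of optimality on larger instances.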
References
[1] G. Felici, K. Truemper. A Minsat Approach for Learning in Logic Domains, INFORMS Journal on Computing 14(1): 20-36 (2002).
[2] P. Bertolazzi, G. Felici, E. Weitschek. Learning to classify species with barcodes, BMC Bioinformatics, 10:1-12 (2009).
[3] M. Arisi, R. D’Onofrio, A. Brandi, S. Felsani, G. Capsoni, G. Drovandi, G. Felici, E. Weitschek, P. Bertolazzi, A. Cattaneo. Gene Expression Biomarkers in the Brain of a Mouse Model for Alzheimer’s Disease: Mining of Microarray Data by Logic Classification and Feature Selection, Journal of Alzheimer’s Disease, 24(4): 721-738 (2011).
[4] E. Weitschek, A. Lo Presti, G. Drovandi, G. Felici, M. Ciccozzi, M. Ciotti, P. Bertolazzi. Human polyomaviruses identification by logic mining techniques. BMC Virology Journal, 9:58 (2012).
[5] E. Weitschek, G. Fiscon, G. Felici. Supervised DNA Barcodes species classification: analysis, comparisons and results, BMC BioData Mining, 7:4 (2014).
[6] P. Bertolazzi, G. Felici, P. Festa, G. Fiscon, E. Weitschek. Integer Programming models for Feature Selection: new extensions and a randomized solution algorithm, European Journal of Operational Research, 250: 389-399 (2016).
Cite this as
2016. The solution of large-scale Minimum Cost SAT Problem as a tool for data analysis in bioinformatics. PeerJ Preprints 4:e2635v1 https://doi.org/10.7287/peerj.preprints.2635v1
Author comment
This is an abstract presented at the BBCC2016 conference.
Additional Information
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Giovanni Felici conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper.
Daniele Ferone performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work.
Paola Festa conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper.
Antonio Napoletano performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work.
Tommaso Pastore performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work.
Data Deposition
The following information was supplied regarding data availability:
The research in this article did not generate, collect or analyse any raw data or code.
Funding
The authors received no funding for this work.