Biological networks and, in particular, biological pathways are composed of thousands of nodes and edges, which poses several challenges for analysis and storage. The primary format used to represent pathway data is BioPAX (http://biopax.org). BioPAX is a standard language that aims to enable the integration, exchange, visualization and analysis of biological pathway data. It is an open, collaborative effort by the community of researchers, software developers and institutions, and it specifically supports data exchange between pathway data groups. BioPAX is defined in OWL and is represented in the RDF/XML format. OWL (Web Ontology Language) is a W3C standard designed for applications that need to process the content of information rather than just present it to humans. RDF is a standard model for data interchange on the Web. Although OWL allows a standard representation of pathways, its XML-based serialization is verbose and redundant, so stored pathways can become very large, which hinders efficient transmission and sharing of the data. The typical size of a pathway dataset depends on the organism; for example, the Homo sapiens pathways from the Reactome database occupy nearly 200 MB on disk. Moreover, integrating pathway data from different data sources may require gigabytes of space. A second problem concerns integrating information from different data sources so that up-to-date information is available in a centralized way. Several pathway databases exist, each emphasizing different aspects of the same pathway; it is therefore useful to integrate and annotate pathways from different databases to obtain centralized, more informative pathway data.
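The redundancy of an XML-based serialization can be illustrated with a minimal sketch. The record template below is a hypothetical BioPAX-like fragment, not taken from any real pathway file, and general-purpose zlib compression is used only as a stand-in to show how repetitive such markup is; it is not the compression method of this work.

```python
import zlib

# Hypothetical BioPAX-like RDF/XML record; element names are illustrative only.
RECORD = (
    '<bp:Protein rdf:ID="Protein{i}">'
    '<bp:displayName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'
    'protein {i}</bp:displayName>'
    '</bp:Protein>\n'
)

def compression_ratio(n_records: int) -> float:
    """Return raw size divided by zlib-compressed size for n synthetic records."""
    raw = "".join(RECORD.format(i=i) for i in range(n_records)).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

# The repeated tags and datatype URIs make the synthetic file highly compressible.
ratio = compression_ratio(1000)
```

Real pathway files are less uniform than this synthetic input, so the ratio obtained here should not be read as an estimate for actual BioPAX data.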
The principal obstacle to integrating, storing and exchanging such data is the sharp growth in size when data from several pathways are merged, which poses challenges for both computation and archiving. Pathway data can readily be classified as big data because they meet all of the 5V characteristics typical of big data (Volume, Velocity, Variety, Veracity, Value); hence the need to integrate and compress pathway data efficiently. The methodology for pathway data integration is based on the following steps: i) local aggregation and validation of data coming from several pathway databases, ii) identification and normalization of compound and reaction identifiers, and iii) integration. Integration occurs at the level of physical entities, such as proteins and small molecules. It is accomplished by linking interaction and pathway records that use the same physical entities (for example, proteins identified by UniProt) and by adding annotation data from UniProt or GeneOntology.
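The identifier normalization and entity-level linking steps can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the `prefix:accession` identifier form, the normalization rule and the function names `normalize` and `integrate` are all hypothetical, and the sample records are invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (source database, record id, participant identifiers).
RECORDS = [
    ("reactome", "R-1", ["uniprot:P04637", "UniProt:Q00987"]),
    ("other_db", "X-9", ["UNIPROT:p04637 "]),
    ("other_db", "X-10", ["chebi:15422"]),
]

def normalize(identifier: str) -> str:
    """Assumed rule: trim whitespace, lowercase the prefix, uppercase the accession."""
    prefix, _, accession = identifier.strip().partition(":")
    return f"{prefix.lower()}:{accession.upper()}"

def integrate(records):
    """Group records by the normalized physical entities they refer to."""
    by_entity = defaultdict(set)
    for source, record_id, participants in records:
        for participant in participants:
            by_entity[normalize(participant)].add((source, record_id))
    return by_entity

linked = integrate(RECORDS)
# Entities referenced by more than one source database are the linking points.
shared = {
    entity: refs
    for entity, refs in linked.items()
    if len({source for source, _ in refs}) > 1
}
```

In this toy input, only the protein written as `uniprot:P04637` and `UNIPROT:p04637 ` appears in both sources after normalization, so it is the single entity through which the two databases' records are linked.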