Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining

Department of Biochemistry, University of Turku, Turku, Finland
University of Turku Graduate School, University of Turku, Turku, Finland
Department of Future Technologies, University of Turku, Turku, Finland
Department of Life Sciences, Imperial College London, London, United Kingdom
DOI
10.7287/peerj.preprints.3472v1
Subject Areas
Computational Biology, Microbiology, Plant Science
Keywords
Cyanobacteria, Synechocystis sp. PCC 6803, Network analysis, Text-mining, Systems Biology, Metabolism
Copyright
© 2017 Kreula et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Kreula SM, Kaewphan S, Ginter F, Jones PR. 2017. Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining. PeerJ Preprints 5:e3472v1

Abstract

The increasing move towards open access full-text scientific literature enhances our ability to use advanced text-mining methods to construct information-rich networks that no human could grasp simply from 'reading the literature'. The utility of text-mining for well-studied species is obvious, but its utility for less-studied species, or those with no prior track record at all, is unclear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less-studied species, and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and the first case study for the methodology. The text-mining network was merged with networks generated from species-specific experimental data, and this network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm was constructed to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already 'known', and (2) demanding multiple independent lines of evidence for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and rule-based algorithm to (i) discover novel candidate associations between different genes or proteins in the network, and (ii) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open source resource.
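
The two rules of the candidate search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script (that is provided as supplemental file supp-3); the edge list, the gene names, the evidence-type labels and the function `find_candidates` are all hypothetical and chosen only for illustration. The sketch assumes each edge carries a single evidence type and that text-mining edges represent relationships already 'known' from the literature.

```python
from collections import defaultdict

# Hypothetical merged network: (gene_a, gene_b, evidence_type).
# Gene names and evidence-type labels are illustrative assumptions.
EDGES = [
    ("geneA", "geneB", "text_mining"),
    ("geneA", "geneB", "coexpression"),
    ("geneA", "geneC", "coexpression"),
    ("geneA", "geneC", "protein_interaction"),
    ("geneA", "geneD", "coexpression"),
]


def find_candidates(edges, targets, known_type="text_mining", min_evidence=2):
    """Return, per target gene, the neighbours that pass both rules."""
    # Collect the set of evidence types supporting each undirected gene pair.
    evidence = defaultdict(set)
    for a, b, etype in edges:
        if a != b:  # ignore self-loops
            evidence[frozenset((a, b))].add(etype)

    result = {t: [] for t in targets}
    for pair, types in evidence.items():
        # Rule 1: skip relationships already established in the literature.
        if known_type in types:
            continue
        # Rule 2: require multiple independent evidence types.
        if len(types) < min_evidence:
            continue
        for t in targets:
            if t in pair:
                (other,) = pair - {t}
                result[t].append(other)
    return {t: sorted(c) for t, c in result.items()}


print(find_candidates(EDGES, ["geneA"]))
```

In this toy network, `geneB` is dropped because its link to `geneA` is already supported by text-mining, and `geneD` is dropped because only one evidence type supports it, leaving `geneC` as the sole candidate.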

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Individual and merged networks

Cytoscape file containing independent and merged networks. Opens with Cytoscape 3.1

DOI: 10.7287/peerj.preprints.3472v1/supp-1

Annotations

Text-file describing the annotations in the Cytoscape files

DOI: 10.7287/peerj.preprints.3472v1/supp-2

Script for identifying candidate genes

Python file containing candidate gene script

DOI: 10.7287/peerj.preprints.3472v1/supp-3

Script for identifying hypothetical genes

Python file containing hypothetical gene script

DOI: 10.7287/peerj.preprints.3472v1/supp-4

Annotations from EVEX

Text-file containing annotations from EVEX/Cyanobase

DOI: 10.7287/peerj.preprints.3472v1/supp-5

First neighbor GBA and script-based clusters

Cytoscape file containing the first neighbour GBA and script-based clusters used in the case studies. Opens with Cytoscape 3.1

DOI: 10.7287/peerj.preprints.3472v1/supp-6

Cytoscape file with identified motifs

Cytoscape file containing all genes in the genome of Synechocystis 6803 without an annotation that form a motif with at least two other nodes via at least two different data-types (i.e. edges), of which one is direct and the second is indirect, and where at least one of the members of the motif has an existing annotation. Opens with Cytoscape 3.1

DOI: 10.7287/peerj.preprints.3472v1/supp-7

List of candidate genes

Text-file containing a list of possible candidates among the hypothetical genes

DOI: 10.7287/peerj.preprints.3472v1/supp-8
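
The motif criterion given for the Cytoscape file with identified motifs (supp-7) can be read in more than one way. The following Python sketch shows one possible reading and is not the authors' script (supp-4). It assumes that an unannotated gene qualifies when it has at least two distinct neighbours, its edges include at least one direct and one indirect data type, and at least one neighbour is annotated. The classification of data types into `DIRECT_TYPES` and `INDIRECT_TYPES`, the gene names and the function `genes_in_motifs` are hypothetical.

```python
from collections import defaultdict

# Hypothetical classification of data types; the real grouping is not
# specified here and would need to follow the paper.
DIRECT_TYPES = {"protein_interaction"}
INDIRECT_TYPES = {"coexpression"}

# Hypothetical edges: (gene_a, gene_b, data_type).
EDGES = [
    ("hypo1", "geneX", "protein_interaction"),
    ("hypo1", "geneY", "coexpression"),
    ("hypo2", "geneX", "coexpression"),
    ("hypo2", "geneY", "coexpression"),
    ("hypo3", "geneX", "protein_interaction"),
]
ANNOTATED = {"geneX", "geneY"}


def genes_in_motifs(edges, annotated):
    """List unannotated genes that satisfy the assumed motif criterion."""
    # gene -> neighbour -> set of data types linking the two
    neighbours = defaultdict(lambda: defaultdict(set))
    for a, b, dtype in edges:
        if a == b:
            continue
        neighbours[a][b].add(dtype)
        neighbours[b][a].add(dtype)

    hits = []
    for gene, nbrs in neighbours.items():
        if gene in annotated:
            continue  # only genes without an annotation are of interest
        if len(nbrs) < 2:
            continue  # needs at least two other nodes
        types = set().union(*nbrs.values())
        if not (types & DIRECT_TYPES and types & INDIRECT_TYPES):
            continue  # needs one direct and one indirect data type
        if not any(n in annotated for n in nbrs):
            continue  # at least one motif member must be annotated
        hits.append(gene)
    return sorted(hits)


print(genes_in_motifs(EDGES, ANNOTATED))
```

Here `hypo1` qualifies, `hypo2` fails because both of its edges are indirect, and `hypo3` fails because it has only one neighbour.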