Seqenv: linking sequences to environments through text mining

Department of Ecology and Genetics, Limnology, Uppsala Universitet, Uppsala, Sweden
Environmental bioinformatics consultants, Envonautics Ltd., Göteborg, Sweden
Infrastructure and Environment Research Division, School of Engineering, University of Glasgow, Glasgow, United Kingdom
The Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
Western Australia Organic and Isotope Geochemistry Centre (WA-OIGC), Department of Chemistry, Curtin University of Technology, Bentley, WA, Australia
Institute of Biological & Environmental Sciences, University of Aberdeen, Aberdeen, United Kingdom
Institute of Soil Biology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
Institute of Marine Biology Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion Crete, Greece
Bioinformatics Group, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus
Department of Molecular Ecology, Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
Hawkesbury Institute for the Environment, University of Western Sydney, Hawkesbury, Sydney, Australia
Warwick Medical School, University of Warwick, Warwick, United Kingdom
DOI
10.7287/peerj.preprints.2317v1
Subject Areas
Bioinformatics, Ecology, Environmental Sciences, Microbiology
Keywords
bioinformatics, ecology, microbiology, genomics, sequence analysis, text processing, statistics, pipeline, open source software
Copyright
© 2016 Sinclair et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Sinclair L, Ijaz UZ, Jensen L, Coolen MJ, Gubry-Rangin C, Chroňáková A, Oulas A, Pavloudi C, Schnetzer J, Weimann A, Ijaz A, Eiler A, Quince C, Pafilis E. 2016. Seqenv: linking sequences to environments through text mining. PeerJ Preprints 4:e2317v1

Abstract

Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies often focus on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of existing legacy sequence data with associated metadata can be harnessed to place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the "nt" nucleotide database provided by NCBI and, for every hit, extracts the textual "isolation source" metadata field, if it is available. After collecting the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environment Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv: a survey of ammonia oxidizing archaea and a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS data and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install, go to: https://github.com/xapple/seqenv
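
The idea of tagging isolation sources and then summarizing a sample by weighted summation can be illustrated with a minimal Python sketch. This is not seqenv's actual implementation or API: the toy vocabulary, the function names (`tag_terms`, `sequence_profile`, `sample_profile`), the simple keyword matching, and the choice of sequence abundance as the weight are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary mapping keywords to term labels.
# These are NOT real EnvO identifiers; they only stand in for them.
TOY_VOCAB = {
    "soil": "soil",
    "lake": "lake",
    "sediment": "sediment",
    "seawater": "sea water",
}


def tag_terms(isolation_source):
    """Count toy vocabulary terms whose keyword appears in the free text."""
    words = isolation_source.lower().replace(",", " ").split()
    return Counter(TOY_VOCAB[w] for w in words if w in TOY_VOCAB)


def sequence_profile(isolation_sources):
    """Normalized term frequencies over all hits of one sequence."""
    counts = Counter()
    for text in isolation_sources:
        counts.update(tag_terms(text))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def sample_profile(abundances, profiles):
    """Sum sequence profiles, weighting each by its abundance in the sample."""
    summary = defaultdict(float)
    for seq_id, abundance in abundances.items():
        for term, freq in profiles.get(seq_id, {}).items():
            summary[term] += abundance * freq
    return dict(summary)


# Made-up isolation sources for two hypothetical sequences.
hits = {
    "otu1": ["forest soil", "agricultural soil"],
    "otu2": ["lake sediment"],
}
profiles = {seq_id: sequence_profile(texts) for seq_id, texts in hits.items()}
sample = sample_profile({"otu1": 30, "otu2": 10}, profiles)
```

In this sketch a sequence whose hits all mention soil contributes its full abundance to the "soil" term, while a sequence with mixed hits splits its abundance across the matched terms. A real text mining step would need to handle multi-word terms, synonyms, and the ontology hierarchy, none of which is attempted here.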

Author Comment

This is a submission to PeerJ for review.