Background: Technological advances in sequencing, assembly, and the segregation of the resulting contigs into species-specific bins have enabled the reconstruction of individual genomes from environmental metagenomic data sets. Although the technique is powerful, it is difficult to determine whether assembly and binning are accurate, specific, and sensitive, because of a lack of complete reference genome sequences against which to check the data. Errors in genome reconstruction, such as missing or mis-attributed activities, can have a detrimental effect on downstream metabolic and ecological modeling, so it is important to assess the accuracy of the process.
Methods: We compared genomes reconstructed from metagenomic data to the complete genome sequences of 10 organisms isolated from the same community to identify regions not captured by typical binning techniques. Nucleotide content, measured as %G+C and tetranucleotide frequencies, and sequence redundancy, both within the genome and across the metagenome, were determined for the captured and uncaptured regions. This direct comparison allowed us to evaluate the efficacy of nucleotide composition and coverage profiles as elements of binning protocols and to look for biases in sequence characteristics and gene content in regions missing from the reconstructions.
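The two composition measures named above, %G+C and tetranucleotide frequencies, can be illustrated with a minimal Python sketch. This is an assumption-laden illustration rather than the procedure used in the study: the function names `gc_percent` and `tetranucleotide_frequencies` are hypothetical, only one strand is counted (no reverse-complement merging), and windows containing ambiguous bases such as N are simply ignored.

```python
from collections import Counter
from itertools import product


def gc_percent(seq):
    """Return %G+C over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq):
    """Return relative frequencies of all 256 tetranucleotides.

    Assumption: overlapping windows on the given strand only; windows
    containing any character other than A, C, G, T are excluded.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Count every overlapping 4-base window in the sequence.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Normalize only over windows made of unambiguous bases.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}
```

Applied separately to captured and uncaptured regions, such profiles could then be compared, for example by a distance between the two 256-element frequency vectors; the choice of distance is left open here.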
Results: We found that repeated sequences were frequently missed in the reconstruction process, as were short sequences with variant nucleotide composition. Genes encoded in the missing regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions, and genes of unknown function.
Conclusions: Our observation of increased mis-binning of repeated regions and of short regions, especially those with variant nucleotide content, implies that factors that affect assembly efficiency also affect binning accuracy. To a large extent, mis-binned regions appear to derive from mobile elements. Our results support genome reconstruction as a robust process and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function.