To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Both reviewers confirm that the contribution merits publication in PeerJ and I concur.
The manuscript has been extensively reviewed and English use is clear and unambiguous.
In its current form, the manuscript's results are relevant to the hypothesis, and it is now clear how the work fits into the broader field of knowledge.
The findings are appropriately stated, and connected to the original question investigated.
I thank the authors for addressing my concerns and making the manuscript much more clear. I have no further comments.
Please address all comments of the two reviewers.
The authors show how pangenomes and metagenomes can be linked and provide a proof of concept of how this metapangenomics provides unique insights.
The English should be improved to ensure text is clearly understood. For example:
1/ Line 27 to 32. In the abstract, the authors give two statements, “Rapidly growing number of ... …of populations across microbial genomes.” The first statement is a general one, followed by a second statement that is supposed to clarify which aspects of the general statement are the key focus of this manuscript. However, the current phrasing makes comprehension difficult.
2/ Line 65 to 68. Also rephrase these statements to make comprehension easy.
3/ Line 274. Rephrase these statements to make comprehension easy.
4/ Line 411 to 415. Rephrase these statements to make comprehension easy.
5/ The entire document needs to be proofread.
Most of the references used are from the Nature Journal, but some references are old, and newer published manuscripts with impactful findings have not been included. For example:
1/ Line 69: include after reference (Lorenz & Eck 2005; Thies, Stephan, et al. "Metagenomic discovery of novel enzymes and biosurfactants in a slaughterhouse biofilm microbial community." Scientific reports 6 (2016): 27035.)
2/ Line 70: include after reference (Tringe et al., 2005; Al-Amoudi, Soha, et al. "Metagenomics as a preliminary screen for antimicrobial bioprospecting." Gene 594.2 (2016): 248-258)
3/ Line 71-72: include after reference (Tyson et al., 2004; Haroon, Mohamed F., et al. "A catalogue of 136 microbial draft genomes from Red Sea metagenomes." Scientific data 3 (2016): 160050; Delmont et al., 2017)
Overall, I commend the authors for the thorough data analyses and the concise writing style. If there is a weakness, it is merely in making comprehension easier (as I have noted above).
Research question well defined and meaningful.
Conclusions are well stated, linked to original research question & limited to supporting results.
I am putting my entire review in this section, as nearly all, if not all, of my comments are related to basic reporting.
This is a nice contribution by Delmont and Eren describing the utility of a new software pipeline in the existing Anvi’o tool, along with a few new insights into Prochlorococcus ecology. The pipeline links genes from isolate genome sequences to their abundances in the environment via metagenomic read mapping, and it can identify specific genes (or protein clusters) that exist in isolate genomes but may be very uncommon in the environment. I have few, if any, scientific criticisms, but I found a lot of the text confusing, mostly due to undefined terminology and some long, confusing sentences. I think that this is a relatively straightforward study that would benefit from some streamlining of the text. Specific comments are below.
-The abstract is somehow both well written and deeply confusing. Please simplify the language. I am left not really understanding what the main question(s), methods, and results are. I know intuitively what both metagenomics and pangenomics (or at least pangenomes) are, but it would be helpful for the authors to explain these terms in the context of how they were considered for this study. Is the sentence starting with “While pangenomics offers …” meant to define both terms? If so, please restructure it along the lines of “The pangenome of a population (or genus?) consists of both core (shared) and accessory genes and genomic features …” or however you want to define it, and please similarly define metagenomics and its use in this study. If metagenomics is being used for abundance estimates (abundance estimates of what – SNPs, populations, and/or genes within pangenomes?), then consider calling it something more direct (metagenomic abundance estimates?). To me, the term metagenomics primarily evokes community predicted functional profiling and/or population genome assembly and metabolic reconstruction. Even though I have personally also made abundance estimates from metagenomes in much the same way as the authors, that would not be what first comes to mind. [edit: that is almost exactly how this is described by the authors in ln 69-74, so this needs to be much more clear in the abstract]
-Along those lines, I do not think of metagenomics and pangenomics as inherently different. I would assume that metagenomics could be used both to define the pangenome (i.e., to find core and accessory genes in metagenomic assemblies) and to determine the abundances of subpopulations and/or specific genes or regions of the pangenome (through read mapping to metagenomic assemblies, SAGs, isolate genomes, or any combination of the above). Which, if any, of these possibilities apply to this study is not clear.
-Metapangenomics is not defined, and I do not find it to be a helpful term. Consider removing it from the manuscript, including the title. Based on my reading of the abstract alone, it looks like metapangenomics is meant to describe an abundance-informed pangenome, and if so, why not call it something like that, with some useful meaning in the term itself?
-ln 32: complementary
-ln 38-42: What does this sentence mean? Consider splitting it into two sentences. How does a metapangenome correlate with something? What is it correlating with? I am not sure that “traits” is an appropriate word – are the authors describing core and accessory genes here? What are “sub-clade demarcations”? How would these results differ from phylogenetic analyses (wouldn’t phylogenetics by definition separate sub-clades?)? Do the authors mean some specific phylogenetic analyses that are typically performed at a coarser resolution? If so, please provide more context on the phylogenetic analyses.
-ln 54: “have been” should be “has been”
-Throughout: consider changing “shared” to “core” in the context of core genes across pangenomes, as this is more common in the literature and therefore more intuitive. If the authors mean shared among some but not all populations, then that should be explicitly mentioned, but I assume that they mean core genes shared among all populations.
-ln 74-76: What does this mean? Wouldn’t metagenomic assembly + read mapping do that too? It might help if you explain the particular utility of isolate sequences here.
-ln 77-78 (or earlier): Please define functional traits in the context of a microbial genome or pangenome. I think that you also mean isolate genomes here, so please change the end to “… mapping of closely related isolate genomes.”
-ln 80-83: This sentence is a bit long and confusing and can probably be broken down. The authors can rework it for clarification as they see fit, but here are some examples of what I find confusing: What are “well-established practices in pangenomics”? Please give a few examples. What are “emerging opportunities from metagenomic data”? Is this just using metagenomic read mapping for abundance data? If so, just say that. What is a genome-centric framework (I assume that this involves the use of closed, isolate genomes), and how does that differ from what you would get from a combination of metagenomic assembly and binning to identify populations + read mapping across a number of metagenomes to get abundance estimates? What are “pangenomic traits” and how do you define which ones are “key”? Are “key” traits just those that are linked to “niche partitioning” and “population fitness”, and if so, how do you determine that?
-ln 85: please state exactly what you mean by integrating pangenomic and metagenomic data. Again, what is “pangenomic data” and what is “metagenomic data”? Are there better terms for these types of data in this context, for example, “… integrating population pangenomes from multiple isolate genome sequences with their abundance profiles across environmental samples from metagenomic read mapping?” That might not even be correct, but the point is that I do not understand.
-I think that the focus of both the abstract and introduction and maybe even the title should be on the need for and development of this software pipeline, as that seems to be the key novel result of the study, tested on Prochlorococcus as an example, right? [[later edit: wait, but the tool is Anvi’o, which is fabulous but not new; please use the introduction to very clearly walk the reader through what is known vs. what is new in this study, both in terms of the visualization software pipeline and the Prochlorococcus biology]]. It seems dangerous to imply that metagenomics has never been used to identify the ecological niches of specific subpopulations (for example, the Banfield lab has worked in that general area, at least in AMD systems; how would isolate pangenomes add further information in that context?) and much safer to say that your software and visualization pipeline can help to identify and show these differences more clearly.
-ln 93: How many genomes? That number seems important if all of these genomes are going into your downstream analyses.
-ln 95: Were these 16S rRNA gene amplicon surveys or otherwise not metagenomic studies? That seems like a worthwhile point of clarification to help make the case for the current study that links isolate genomes to metagenomic data.
-ln 96: dynamics plural
-ln 97-98: Correlations between the “genomic traits” of isolates … are these just groups of genes that correlate with environmental variables? Were there correlations to variables other than HL and LL? If so, maybe call these “other environmental variables.”
-ln 98-104: This is a long sentence, and I got lost halfway through. Do “these two groups” refer to the core and accessory genes? How many metagenomes? What does “independently” mean here, and what is “their differential occurrence”? Does “in Prochlorococcus populations” mean in the same 12 as at the beginning of the sentence? If so, change to “in the 12 Prochlorococcus populations,” otherwise define these populations. I don’t understand the last part of the sentence at all. Maybe summarize these three studies in three separate sentences and explain which part(s) of each are being included in the current analyses, and then explain clearly how the current study will expand on what is already known from these previous studies.
-ln 104-105: This seems like an important distinction. To this point, the general implication is that this is the first time that anybody has thought of exploring niche partitioning in pangenomes or metagenomes, yet here the authors say that the difference is that previous studies have not had resolution at the level of protein clusters. Again, please dispense with the fancy sentences and terms and use the introduction to tell the reader what has been done in the past that is relevant, both in terms of biology and visualization/software, and then explicitly state what knowledge gap will be filled by this study. For example, it would be useful to explicitly say why monitoring protein clusters is useful.
-ln 106: Again, what are “pangenomic traits”? Maybe I am just stuck on “traits” as an ecological term and the authors just mean similarities and differences across populations?
-ln 106: How do these 31 Prochlorococcus isolates relate to the 12 (or more?) populations described from previous studies above?
-ln 108: Please give an exact number for billions
-ln 110: Define ecological niche; is this just HL vs. LL here?
-ln 109-115: These are results that do not belong in the Introduction. Consider either reworking to frame these as hypotheses (or similar) that will be explored in this study, or remove this.
-ln 113 and 117: These are the first mentions of SAGs. This seems to be of abstract-level importance in how you are defining your pangenomes (i.e., a combination of SAGs and isolates). Or am I not understanding how your pangenomes were defined? After reading more of the results, I do not see much in the way of SAGs there, so how did you decide when to use isolates and when to use SAGs in your analyses? SAGs are presumably less complete genomes, so if a particular gene is not detected in a SAG, it does not necessarily mean that it is actually absent. The authors know this, I am sure, but if this is part of the rationale for using only isolates for some of the analyses, it should be mentioned.
-I think that Anvi’o can also be called out specifically in the Abstract and/or Introduction. It is not clear to me what aspects of the Anvi’o workflow are new in this study, though the Introduction suggests that this is a novel pipeline. Based on the text to this point, I was expecting the presentation of a novel workflow, and this needs to be made more clear. I think that a paragraph in the Intro with Anvi’o background would be appropriate – how has Anvi’o been used in the past, and what specifically is the new application here? It seems like more than just plugging new data into the software, so maybe a flowchart figure would help? There is a section of the methods dedicated to this, which is good, but I wonder if at least some of that should be moved to the main text, given that the pipeline is one of the key outputs of the study and not just an ancillary method.
-ln 134: Has phylogenomics not been done on these 31 genomes before?
-ln 176-199: Is this the new part? If so, you could start with something along the lines of “The Anvi’o pangenomic workflow developed for this study consists of …”
-ln 179: What is a “genome of interest”? Is this just every genome that will be considered for a given analysis, i.e., 31 Prochlorococcus isolates for this study? [I see later that this is the case, so please rephrase to make this more clear]
-ln 234 and 264: What about the SAGs?
-ln 234-243: Has this been done already for Prochlorococcus in any of the TARA Oceans publications? I would guess so, but maybe not all 31 isolates were included. It would be worth clarifying what parts of this analysis are new vs. what just needed to be done again here to feed into the Anvi’o pipeline.
-ln 244-256: These specific clades have not been described anywhere. I realize that a description of each could get tedious, but is there something general that you could say along the lines of, e.g., “All LL lineages come from low-light niches and include subclades I-IV defined by x, y, z” Otherwise, the description of these clades is not particularly useful. The figures just say that these are “literature-defined” lineages, which is fine, but the authors could briefly elaborate on these clade distinctions in the text.
-ln 277-285: Cool!
-ln 298-299: ECGs and EDGs -- do we really need more acronyms? I saw these again a couple of pages later and had to dig back to this section to remind myself of what they are. [and again when I came back to the manuscript after a break] When these acronyms appear again a couple pages later (ln 356), the next sentence has four different acronyms occurring eight times …
-ln 313: Okay, metapangenomics is finally defined! I still don’t fully understand the utility of this term. Maybe it is just me, but I do not find the introduction of new terms and acronyms in nearly every new manuscript in this field to be helpful.
-ln 313-320: What is the result here? The result cannot just be the figure; there has to be some interpretation or guidance for what the reader should be seeing.
-ln 354-356: This seems like an important contribution, and it is buried near the end of the Results.
-ln 366-367: Change to plural
-ln 370-371: Please rephrase this sentence for clarity. What is “they”?
-ln 420: Is this really only a little effort?!
-ln 464-466: Okay, the authors have confirmed the obvious application that I mentioned above, which is that this can also be applied to metagenome-assembled genomes. Why were those not considered here? I don’t think that this is a hole-in-the-paper offense, but I am puzzled, as it seems like a relatively easy addition that would boost the size of the available pangenome significantly.
These are nice. If the authors insist on keeping the ECG and EDG acronyms, please define them in each figure legend.
See above (Basic reporting section) for a few specific comments related to better defining the research question and knowledge gap(s) filled by this study in the Introduction.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.