Peer Review #1 of "The comparative population genetics of Neisseria meningitidis and Neisseria gonorrhoeae (v0.2)"

Neisseria meningitidis (Nm) and N. gonorrhoeae (Ng) are closely related pathogenic bacteria. To compare their population genetics, we compiled a dataset of 1145 genes found across 20 Nm and 15 Ng genomes. We find that Nm is seven times more diverse than Ng in their combined core genome. Both species have acquired the majority of their diversity by recombination with divergent strains; however, Nm has acquired more of its diversity by recombination than Ng. Linkage disequilibrium declines rapidly across the genomes of both species. Several observations suggest that Nm has a higher effective population size than Ng: it is more diverse, the ratio of non-synonymous to synonymous polymorphism is lower, and linkage disequilibrium declines more rapidly to a lower asymptote in Nm. The two species share a modest amount of variation, half of which seems to have been acquired by lateral gene transfer and half from their common ancestor. We investigate whether diversity varies across the genome of each species and find that it does. Much of this variation is due to different levels of lateral gene transfer. However, we also find some evidence that the effective population size varies across the genome. We test for adaptive evolution in the core genome using a McDonald-Kreitman test and by considering the diversity around non-synonymous sites that are fixed for different alleles in the two species. We find some evidence for adaptive evolution using both approaches.

In most analyses we treated genes independently. However, to detect hLGT we ran ClonalFrameML (Didelot & Wilson 2015) on a concatenation of the protein-coding sequences from the core genome of both species. Genes were concatenated randomly without respect for synteny. For some analyses we masked those regions inferred to be due to hLGT in the strains affected.

We investigated whether linkage disequilibrium (LD) declines with the distance between sites by measuring the LD between all pairs of polymorphisms within each gene; we did not concatenate the genes or align whole genomes because, with the gain and loss of genes, the distance between sites differs depending on the strains being analysed. We measured LD using the r² statistic (Hill & Robertson 1968). LD values were then assigned to bins based on the distance between the two sites: 10 bp bins between 1 and 100 bp, a single bin from 101 to 200 bp, and then 200 bp bins between 201 and 800 bp. We took the average LD and distance between sites for each bin in a manner that weighted each gene equally: we estimated the average LD and distance for pairs of sites in each bin for each gene and then averaged those values across genes. To estimate the approximate half-life of LD, we found the distance between sites at which LD was approximately halfway between its value in the 1-10 bp bin and its asymptotic value.
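The binning and gene-weighted averaging described above can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function names, the input layout (one dictionary per gene mapping a site position to the alleles observed across strains), and the exact half-life rule (first bin at or below the halfway value, with the asymptote supplied by the caller) are all assumptions.

```python
from collections import defaultdict
from statistics import mean

def r_squared(site_a, site_b):
    """r^2 between two bi-allelic sites, each given as alleles across the same strains."""
    n = len(site_a)
    a_ref, b_ref = site_a[0], site_b[0]  # arbitrary reference allele at each site
    p_a = sum(x == a_ref for x in site_a) / n
    p_b = sum(y == b_ref for y in site_b) / n
    p_ab = sum(x == a_ref and y == b_ref for x, y in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:  # one of the sites is monomorphic, so r^2 is undefined
        return None
    return (p_ab - p_a * p_b) ** 2 / denom

def distance_bin(dist):
    """Lower edge of the bin: 10 bp bins for 1-100, one bin for 101-200, 200 bp bins for 201-800."""
    if dist < 1 or dist > 800:
        return None
    if dist <= 100:
        return (dist - 1) // 10 * 10 + 1
    if dist <= 200:
        return 101
    return (dist - 201) // 200 * 200 + 201

def binned_ld(genes):
    """genes: list of {position: alleles}. Returns {bin: (mean distance, mean r^2)}."""
    per_bin = defaultdict(list)
    for sites in genes:
        in_gene = defaultdict(list)
        positions = sorted(sites)
        for i, pos_a in enumerate(positions):
            for pos_b in positions[i + 1:]:
                b = distance_bin(pos_b - pos_a)
                r2 = r_squared(sites[pos_a], sites[pos_b])
                if b is not None and r2 is not None:
                    in_gene[b].append((pos_b - pos_a, r2))
        # One averaged value per gene and bin, so every gene carries equal weight.
        for b, pairs in in_gene.items():
            per_bin[b].append((mean(d for d, _ in pairs), mean(r for _, r in pairs)))
    return {b: (mean(d for d, _ in v), mean(r for _, r in v))
            for b, v in sorted(per_bin.items())}

def ld_half_life(curve, asymptote):
    """Mean distance of the first bin whose r^2 is at or below the halfway value (assumed rule)."""
    bins = sorted(curve)
    target = (curve[bins[0]][1] + asymptote) / 2
    for b in bins:
        dist, r2 = curve[b]
        if r2 <= target:
            return dist
    return None
```

Averaging within each gene first and then across genes keeps a gene with many polymorphic sites from dominating a bin, which is the weighting the text describes.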

Because r² is constrained to be non-negative, the expected value of r² is greater than zero even when there is no LD. To calculate the expected value of r² when there is no LD, we considered two bi-allelic loci with alleles at frequencies p1 and p2. The expected frequencies of the four haplotypes are p1p2, p1(1 − p2), etc., from which we generated four random variates from a multinomial distribution for a sample size of N chromosomes using Mathematica version 11; for each sample of haplotypes we calculated r². We repeated this procedure 10,000 times and calculated the mean to estimate the expected value of r². We found that the expected value of r² is independent of the allele frequencies.
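The simulation just described was run in Mathematica; the following is a rough Python equivalent offered only as a sketch. The function name, the haplotype labels, the fixed seed, and the choice to skip samples in which either locus is monomorphic (where r² is undefined) are assumptions, since the text does not say how such samples were handled.

```python
import random

def expected_r2_no_ld(p1, p2, n_chrom, reps=10_000, seed=1):
    """Mean r^2 over simulated samples of n_chrom haplotypes drawn with no LD."""
    rng = random.Random(seed)
    haplotypes = ["AB", "Ab", "aB", "ab"]
    # Under linkage equilibrium, haplotype frequencies are products of allele frequencies.
    weights = [p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)]
    total, used = 0.0, 0
    for _ in range(reps):
        # Drawing n_chrom haplotypes with these weights is one multinomial sample.
        sample = rng.choices(haplotypes, weights=weights, k=n_chrom)
        f_a = sum(h[0] == "A" for h in sample) / n_chrom
        f_b = sum(h[1] == "B" for h in sample) / n_chrom
        f_ab = sample.count("AB") / n_chrom
        denom = f_a * (1 - f_a) * f_b * (1 - f_b)
        if denom == 0:  # assumption: skip samples where a locus is monomorphic
            continue
        total += (f_ab - f_a * f_b) ** 2 / denom
        used += 1
    return total / used if used else float("nan")
```

With monomorphic samples skipped, the simulated mean should sit close to 1/(N − 1) whatever values of p1 and p2 are supplied, which agrees with the observation that the expectation does not depend on the allele frequencies.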

To investigate the relationship between the non-synonymous, πN, and synonymous, πS, nucleotide diversity we used a variation of the method of James et al. to combine data from different genes. If the distribution of fitness effects of new mutations is a gamma distribution (assuming most mutations are deleterious) then log(πN) is expected to be linearly correlated to log(πS) if there is variation in Ne (Welch et al. 2008). However, for many genes either πN or πS is zero, hence we need to combine genes together. We can do this by splitting the synonymous polymorphisms into two groups according to whether they were in an odd or even numbered codon and then using the two groups to estimate two synonymous nucleotide diversities that have independent sampling errors, πS1 and πS2. One of these, πS1, was used to rank and group genes, and the other was averaged across genes in the group to give an unbiased estimate of πS for the group. πN was also averaged across the genes in the group.
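The grouping step above can be illustrated with a short sketch. It assumes that per-gene values of πS1 (odd codons), πS2 (even codons), and πN have already been computed; the function name, the tuple layout, and the use of near-equal-sized groups are hypothetical choices, as the text does not give the number or size of the groups.

```python
from statistics import mean

def group_genes(genes, n_groups):
    """genes: list of (pi_s1, pi_s2, pi_n) per gene.

    Ranks genes by pi_s1, cuts the ranking into n_groups near-equal groups,
    and returns one (mean pi_s2, mean pi_n) pair per group, lowest rank first.
    """
    ranked = sorted(genes, key=lambda g: g[0])  # rank on pi_s1 only
    size, extra = divmod(len(ranked), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        chunk = ranked[start:end]
        if chunk:
            # pi_s2 was not used for ranking, so its group mean is not biased
            # by the sampling error that influenced group membership.
            groups.append((mean(g[1] for g in chunk), mean(g[2] for g in chunk)))
        start = end
    return groups
```

Because individual genes with zero πN or πS are averaged with their neighbours in the ranking, the group means can be log-transformed as long as each group contains some polymorphism of both kinds.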

To investigate the diversity around sites that are fixed between Nm and Ng for different alleles we focused on genes that had at least one synonymous polymorphism and one fixed difference between the two species. For each fixed difference, we identified all the synonymous polymorphisms that were within 1 kb and we grouped them by windows of 100 bp. Since, […]

[…] homologous recombination with divergent strains, of the sort detected by ClonalFrameML, generates LD because it simultaneously introduces many polymorphisms that are initially linked to each other. However, homologous recombination amongst a set of closely related strains breaks up LD. To investigate how these two forces play out, we calculated the LD between all pairs of sites within each gene and plotted these as a function of the distance between sites. As expected, we observe a decline in LD with distance (Figure 1A). Both species show similar patterns with LD declining rapidly; in Nm the approximate half-life is 30 bp and in Ng it is 100 bp. The decline could be due to two processes. If most hLGT fragments tend to be short, with decreasing numbers of long fragments, then LD will be greater between closely linked sites. However, we also expect a decline due to recombination between closely related strains, and in fact we observe a decline even when we focus on those parts of the genome which do not appear to have undergone hLGT (Figure 1B).

Nucleotide diversity is known to vary across the genomes of many organisms. This is largely thought to be driven by variation in the mutation rate or variation in the effects of linked selection. However, in bacteria, and particularly Nm and Ng, it could also be due to variation in the frequency of hLGT. All of these processes are expected to affect synonymous and non-synonymous diversity to greater or lesser extents, and indeed we observe a positive correlation between non-synonymous and synonymous diversity, demonstrating that both vary across the genome in concert. At least part of this pattern is driven by hLGT because genes with hLGT show higher πN and πS values than genes without any evidence of hLGT (Figure 2).

However, to investigate whether there is also variation in the effective population size across the genome we removed sequences inferred to be due to hLGT by ClonalFrameML from our data. This reduces our data substantially and so to reduce statistical sampling issues we used […]

[…] Bennett et al. 2007), that Nm is substantially more diverse than Ng, but that the two species share a moderate amount of diversity in the genes that they have in common. This shared diversity could have been a consequence of ancestral polymorphism that has been inherited by both species, or due to hLGT transferring variation between the two. We find a substantial fraction is indeed due to hLGT, since if we remove the fraction of the genome that […]

Manuscript to be reviewed

Correlation between non-synonymous and synonymous diversity across the genome. The correlation between the log of the non-synonymous nucleotide diversity and the log of the synonymous diversity for core genes in A) Nm and B) Ng. Points in green are genes with evidence of hLGT and red are those genes without evidence of hLGT. Note that some genes are excluded because they have either no non-synonymous or synonymous diversity.