Assessing the impact of high-throughput sequencing strategy and model complexity on ABC-inferred demographic history in mussels

Genome-scale diversity data are increasingly available in a variety of biological systems, and can be used to reconstruct the past evolutionary history of species divergence. However, extracting the full demographic information from these data is not trivial and requires inferential methods that account for the diversity of coalescent histories throughout the genome. Here, we evaluate the potential and limitations of one such approach. We reexamine a well-known system of mussel sister species, using the coding joint site frequency spectrum (jSFS), in an approximate Bayesian computation (ABC) framework. We first assess the best sampling strategy (number of: individuals, loci, and classes in the jSFS), and show that the number of individuals and loci have little effect on model selection. In contrast, different decompositions of the joint site frequency spectrum strongly affect the results: including classes of low and high frequency shared polymorphisms can more effectively reveal recent migration events. We then take advantage of the flexibility of the ABC to compare more realistic models of speciation including variation in migration rates through time (i.e. periodic connectivity) and across genes (i.e. genome-wide heterogeneity in migration rates). We show that these models consistently outperform the simpler alternatives. This argues that methods that are restricted to simpler models may fail to reconstruct the true speciation history.


Introduction
The biodiversity we inherited from the Quaternary was shaped by the process of species formation (Hewitt 2000).A long-standing question concerns the timing and rate of gene exchange that occurred while populations diverged, during the incipient stages of speciation.
Model-based inferences from genetic data have been used to investigate the history of gene flow (Beaumont et al. 2010).Special attention has been paid to the distinction between recent divergence in a strict isolation model, and older divergence with continuous migration (Nielsen & Wakeley 2001), although of course, more complex scenarios are also possible (Marino et al. 2013, Sousa & Hey 2013).
With next-generation sequencing technologies, thousands of SNPs throughout the genome can be used to infer the demographic histories of non-model species pairs (Sousa & Hey 2013).One way of summarizing the information in these data is the unfolded joint site frequency spectrum (jSFS), i.e. the number of copies of derived alleles found in each of the two species.A recent and fast maximum-likelihood method based on the jSFS (Gutenkunst et al. 2009) has proven useful for distinguishing continuous migration from strict isolation (e.g. in ragworts, Chapman et al. 2013, andbeach mice, Domingues et al. 2012).The method can also evaluate more complex scenarios (e.g., in sea bass, Tine et al. 2014;poplars, Christe et al. 2017;and whitefish, Rougeux et al. 2017), but it struggles to explore the parameter space in these cases.In addition, the method is not well suited for transcriptome data as model comparison by log-likelihood ratio tests assumes independence of SNPs.As a consequence simulations need to be conducted to evaluate competing models, and the computational speed advantage is lost.
As an alternative, Approximate Bayesian Computation (ABC) is a method based on simulations that avoids the need to explicitly compute the likelihood (Beaumont et al. 2002).
As such, histories of speciation characterized both by periods of strict isolation and periods of gene exchange can easily be investigated, e.g. the scenarios of ancient migration and secondary contact.These scenarios can be extended by including two cycles of "isolation / gene exchange", following climatic changes in the Pleistocene (Figure 1).Methods have also been developed to include genome-wide heterogeneity in migration rates (Sousa et al. 2013;Roux et al. 2013).This is consistent with the "genic view" of speciation (Wu 2001), whereby barriers to gene flow are often semi-permeable, varying in strength across the genome due to linked selection and recombination (Barton & Bengtsson 1986).A major challenge in ABC, as compared to explicit likelihood methods, is the selection of summary statistics, which involves a trade-off between loss of information and reduction of dimensionality.Several methods have been suggested to select the most appropriate statistics for a given dataset and a set of models (e.g., Wegmann et al. 2009;Nunes & Balding 2010;Aeschbacher et al. 2012); but most of these summarize the site frequency spectrum (but see, e.g.Boitard et al. 2016) leading to a loss of information.Recently, several ABC studies have used the site frequency spectrum directly to reconstruct the history of single populations (Boitard et al. 2016), or multiple populations (Xue & Hickerson 2015;Smith et al. 2017).However the number of statistics, i.e. the number of classes in the spectrum, increases quadratically with the number of haploid genomes sampled in the case of a two-dimensional spectrum, and even faster if more than two populations are considered.For this reason, recent studies have tested different ways of binning the site frequency spectrum to circumvent the curse of dimensionality (e.g., Smith et al. 2017).While accumulating new data and developing new methods, studies that evaluate the impact of the sampling strategy and the inferential method (e.g.Li & Jakobsson 2012;Robinson et al. 2014;Shafer et al. 2015;Cabrera & Palsbøll 2017;Smith et al. 2017) to reconstruct species divergence history will be increasingly valuable.
The copyright holder for this preprint (which was not peer-reviewed) is the .https://doi.org/10.1101/259135doi: bioRxiv preprint Mytilus edulis (Linnaeus, 1758) and Mytilus galloprovincialis (Lamarck, 1819) are two closely-related species that currently hybridize where their ranges overlap along the Atlantic French coasts (Bierne et al. 2003a) and the British Isles (Skibinski et al. 1983).Their interspecific barrier to gene flow is semi-permeable, and it has been shown to involve multiple isolating mechanisms, both pre-zygotic (e.g.assortative fertilization and habitat choice, Bierne et al. 2002Bierne et al. , 2003b)), and post-zygotic (hybrid fitness depression, Simon et al.

2017
).Other evidence of ongoing gene flow between M. edulis and M. galloprovincialis comes from footprints of local introgression of edulis-derived alleles into a population of M. galloprovincialis enclosed within the Atlantic hybrid zone (Fraïsse et al. 2014).Another study (Fraïsse et al. 2016) revealed that the Atlantic population of M. galloprovincialis was more introgressed than the Mediterranean population on average.At some specific loci, however, the Mediterranean population was found to be fixed for edulis alleles while the Atlantic population was not introgressed at all, suggesting that an ancient contact between M. edulis and M. galloprovincialis occurred during glacial periods and allowed adaptive introgression.Finally, direct model comparisons have been conducted with the IMa method of Hey & Nielsen (2007), and with an ABC framework, and shown that M. edulis and M. galloprovincialis have experienced a complex history of divergence punctuated by periods of gene flow in Europe (IMa: Boon et al. 2009, ABC: Roux et al. 2014, 2016).
Here, we analysed coding sequence datasets from this well-known pair of sister species in Europe, and systematically reconstructed its speciation history by ABC for different configurations of the data.We varied (i) the number of individuals sampled (2, 4, vs. 8), (ii) the number of SNPs (which were obtained by different sequencing techniques, "capture" vs. "rna-seq") and (iii) the methods of binning the jSFS (4, 7 or 23 classes).We then evaluated the influence of these choices on model selection, using eleven distinct scenarios of speciation.Our results show that the influence of the sampling strategy on inferences is The copyright holder for this preprint (which was not peer-reviewed) is the .https://doi.org/10.1101/259135doi: bioRxiv preprint surprisingly limited, while the methods of binning the jSFS strongly affect the results.Moreover, we find that an history of periodic connectivity, with both ancient and contemporary introgression, and a semipermeable barrier to gene flow, best fit these data, arguing that methods that are restricted to simpler models may fail to identify the true history of speciation.

Sampling, sequencing, mapping and calling
Two datasets were analysed for demographic inferences.They both consist in patterns of molecular polymorphism and divergence obtained for the pair M. edulis and M. galloprovincialis.Although the adopted sequencing strategies were different among datasets, the surveyed populations were similar.

Data set 1: "capture"
We used the dataset already published in Fraïsse et al. (2016) and available on http://www.scbi.uma.es/mytilus/index.php.Briefly, a set of 890 EST contigs was used as a reference for a pre-capture multiplex DNA enrichment in samples of eight individuals from two geographical populations in each species (M.edulis: North Sea and Bay of Biscay; M. galloprovincialis: Brittany and Mediterranean Sea, Table 1).In addition, we used a sample of four individuals of M. trossulus to serve as an outgroup (Table 1).Each DNA library was sequenced twice to increase the per-base coverage (Miseq or GA2X followed by HiSeq2000).
After trimming and quality-filtering, reads of each individual were aligned against the same EST reference sequences using the BWA program (bwa-mem, Li & Durbin 2009).Because of the relative divergence between the two species (~2%, Table 2) We used a maximum-likelihood method, implemented in the program read2snps (Tsagkogeorga et al. 2012;Gayral et al. 2013), to call genotypes directly from read numbers at each position.The method computes the probability of each possible genotype after estimating the sequencing error rate.To limit bias in the site frequency estimation (Han et al. 2013), a minimum of 10X coverage, i.e. ten reads per position and per individual, was required to call a genotype.Only genotypes supported at 95% were retained; otherwise missing data was applied.Moreover, paralogous positions were filtered-out using a likelihood ratio test based on explicit modelling of paralogy.

Data set 2: "rna-seq"
The dataset 2 is made of sequenced transcriptomes (RNA-seq) previously generated in a wide meta-analysis study comparing levels of polymorphism across 76 animal species (Romiguier et al. 2014).It includes the transcriptomes of four individuals sampled in the same populations as described above and one individual of M. trossulus (Table 1).Briefly, for each individual, cDNA libraries were prepared with total RNA extracted from whole body and sequenced on HiSeq2000.Illumina reads (100 bp, paired-end) were mapped with the BWA program on de novo transcriptomes, independently assembled for each species with a combination of the programs Abyss and Cap3, following the strategy B and D in Cahais et al. (2012).Contigs with a per-individual average coverage below ×2.5 were discarded.Genotype calling was performed as described in the first data set using identical filters.Open reading frames were predicted with the Trinity package and sequences carrying no ORF longer than 200 bp were discarded.Full methods are described in Romiguier et al. (2014).

Data analysis
We restricted our analysis to loci assembled in all individuals and longer than 300 bp after filtering positions containing missing data or more than two segregating alleles when the three species were aligned (the total number of loci retained in each dataset is given in Table 2).

Site Frequency Spectrum
We first computed the jSFS for each dataset (Figure 2 and Figure S1).Each derived allele, oriented with the outgroup sequence, was assigned to one cell of the jSFS depending on its frequency in each of the two populations.From the full spectrum, different classes of polymorphism were extracted and used as summary statistics (Tellier et al. 2011).
Specifically, we used the four Wakeley-Hey's classes (jsfs=4 in Figure 2): fixed differences, Sf; private polymorphisms for each species Sx1 and Sx2; and shared polymorphisms, Ss (Wakeley & Hey 1997).We also considered a version in which Sf and Sx classes were split to distinguishing whether the derived allele was fixed or absent in the other species (jsfs=7 in Figure 2; (Ramos-Onsins et al. 2004).The third decomposition of the jSFS contains twentythree classes of polymorphisms because singletons and doubletons in each population were included as new classes (jsfs=23 in Figure 2).This corresponds to the full spectrum with n=2 diploid individuals.

Estimators of polymorphism and divergence
We then computed a set of genetic statistics across loci to make use of the coalescent information contained within each sequence, instead of considering each SNP separately as independent.Following previous studies (e.g., Fagundes et al. 2007;Ross-Ibarra et al. 2008;Roux et al. 2011), we used the following statistics: (1) nucleotide diversity, π1 and π2 (Tajima 1983); (2) Watterson's θW1 and θW2 (Watterson 1975); (3) total and net interspecific divergence, div and netdiv; (4) between-species differentiation, FST, computed as 1-πS/πT, where πS is the average pairwise nucleotide diversity within species and πT is the total pairwise nucleotide diversity of the pooled sample across species.We also included the four Wakeley-Hey's classes as explained above (Sf, Sx1, Sx2 and Ss).Finally, we assessed departure from mutation/drift equilibrium using Tajima's D1 and D2 (Tajima 1989a,b).The effective size NA into two populations of constant sizes N1 and N2.The Strict Isolation scenario (SI) assumed that divergence occurred without gene exchange between the two populations.The other models differed by their temporal pattern of migration which occurred at a rate M12 from population 1 to population 2 and M21 in opposite direction.Ancient Migration (AM) and Periodic Ancient Migration (PAM) scenarios both assumed that migration was restricted to the early period of divergence.In the AM scenario, the two populations experienced a single period of strict isolation (Tiso) while in the PAM scenario migration was stopped twice with an intermediate period of isolation of Tiso/2 generations.In the Isolation Migration (IM), Secondary Contact (SC) and Periodic Secondary Contact (PSC) scenarios, gene exchange was currently ongoing between the two populations.In the SC and PSC scenarios, the two populations first evolved in strict isolation and then experienced a period of gene exchange (Tsc).In the SC scenario, there was a single period of recent migration whereas in the PSC scenario a period of ancient migration also occurred after 1-Tsc/2 generations of strict isolation.The last scenario is the standard isolation with migration (IM) scenario in which migration occurred continuously over time since the two species started to diverge.For models including migration (IM, AM, PAM, SC and PSC), we compared two alternative models in which the effective migration rate was either homogeneous ("homo") or heterogeneous ("hetero") among loci (Roux et al. 2013(Roux et al. , 2014)).

Coalescent simulations
For each of the five sampling strategies, we performed one million multilocus simulations under the 11 scenarios of speciation using the coalescent simulator Msnsam (Hudson 2002;Ross-Ibarra et al. 2008).Simulations fitted to the characteristics of each data set ("capture" data with n = 2, 4, 8 individuals and "rna-seq" data with n = 2, 4 individuals).We assumed free recombination between contigs and we fixed the intra-contig population recombination rate to be equal to the population mutation rate.Previous studies have shown that methods which take intra-locus recombination into account remain valid when rates of recombination are low (Becquet & Przeworski 2007).Moreover, our method does not rely on haplotypic data, and so not estimating exact rates of recombination should not affect our results.To account for errors in identifying the ancestral allele in the unfolded jSFS, we explicitly modeled a misorientation rate in our coalescent simulations.We assumed that a proportion e of SNPs, which was a parameter to be inferred, were misoriented and changed ei, their frequency in population i, into 1-ei.
Prior distributions for θA/θref, θ1/θref and θ2/θref were uniform on the interval 0-20 with θref=4*Nref*.The effective size of the reference population, Nref, used in coalescent simulations was arbitrarily fixed to 100,000.The mutation rate, , was set to 2.763*10 -8 per bp per generation (Roux et al. 2014).The Tsplit/4Nref ratio was sampled from the interval 0-25 generations, conditioning the parameters Tiso and Tsc to be uniformly chosen within the 0-Tsplit interval.For the scenarios including migration, we used the scaled effective migration rates M=4*N*m, where m is the fraction of the population made up of effective migrants from the other population at each generation.In the homogeneous model, a single effective migration rate, shared by all loci but differing in each direction of introgression, was sampled from an uniform distribution in the interval 0-40.For the heterogeneous model, we assumed two categories of loci occurring in proportion p and (1-p).The parameter p was sampled from an uniform distribution in the interval 0-1.The first category of loci are neutral genes introgressing at a migration rate sampled from an uniform distribution in the interval 0-40.
The second category comprises loci affected by the barrier to gene flow, so that their effective migration rate was reduced by a gene flow factor, gff, compared to neutral loci (Barton & Bengtsson 1986).The gff was sampled from a Beta distribution with two shape parameters (alpha chosen in the interval 0.001-10 and beta chosen in the interval 0.001-5).Prior

Model choice and checking
To choose the best supported model, we followed the methods previously described in Roux et al. (2013,2014).Briefly, posterior probabilities for each of the eleven speciation scenario were estimated with a neural network using the R package abc (Csilléry et al. 2012).It implements a nonlinear multivariate regression by considering the model itself as an additional parameter to be inferred.The 0.01% replicate simulations nearest to the observed values of the summary statistics were selected.Moreover, to evaluate the relative probability of the heterogeneous model of migration rates across loci, we compared the alternative models ("homo" vs. "hetero") within each scenario including gene exchange.We calculated bayes factors, BF, as the posterior probability of the best supported model divided by that of the model with the highest posterior probability from the remaining candidates.The posterior probability of each model calculated among the eleven models of speciation are detailed in Table S1; and those of the best model between heterogeneous and homogeneous migration rates are given in Table S2.
Model checking was performed by randomly sampling 100 replicates from the one million simulations performed for each model, and used them as pseudo-observed datasets.We then applied the same model choice procedure as described above to compute the posterior probability of the pseudo-observed data set given a model, and we repeated the procedure for all tested models.The accuracy rate for model M was calculated as the proportion, among pseudo-observed data inferred to correspond to model M, of those actually generated under model M. The ambiguity rate was computed as the proportion of pseudo-observed data generated under model M whose best model was not strongly supported, i.e. its posterior probability was below an arbitrary threshold (Pmin) set to be one third above the expected value given the total number of models compared (1/11 for eleven models, 1/6 for six models or 1/2 for two models).The accuracy and ambiguity rates for the capture data with n=2 individuals are provided in Table S3.

Patterns of polymorphism
We obtained polymorphism data for two mussels species, M. edulis and M. galloprovincialis, and one outgroup (M.trossulus) in samples of increasing size (n=2, n=4 and n=8) and for the two datasets ("capture" vs. "rna-seq").The jSFS of each data set is shown in Figure S1, while Table 2 gives summary statistics of genetic polymorphism.
The data produced by the two techniques differed in the total number of SNPs (nsnp=3,993 (n=2); nsnp=5092 (n=4); nsnp=5000 (n=8) in the "capture" data; nsnp=17,275 (n=2); nsnp=17,902 (n=4) in the "rna-seq" data).The substantial lower number of SNPs in the "capture" data reflects the lower number of loci retained for the analysis due to a reduced and more heterogeneous sequencing depth compared to the "rna-seq" data (while applying identical coverage filters to call SNPs).However, the jSFS calculated from the two techniques had a similar proportion of sites in the different classes (Pearson's correlation between jSFS: R 2 =0.97, p<0.0001 for n=2 individuals; and R 2 =0.99, p<0.0001 for n=4 individuals).
Globally, the mean level of genetic differentiation was quite low (FST~10% in the "capture" data; FST~18% in the "rna-seq" data, see Table 2), but highly variable across loci (standard deviation was ~16 % and ~24 %, respectively).A low proportion of sites (<5% in the "capture" data and <10% in the "rna-seq" data) were fixed differences, while two to three times more SNPs were polymorphic and shared between the two species (see Figure S1).The level of intraspecific nucleotide diversity was elevated (πedu=πgal=0.016 in the "capture" data; πedu=0.038and πgal=0.036 in the "rna-seq" data for n=2, Table 2) and not significantly different between the two species (non-significant Wilcoxon signed-rank test).Polymorphic sites were mainly private to each species (~80% of the sites), and mainly corresponded to low frequency classes.Moreover, the jSFS was remarkably symmetric suggesting limited differences in population size and/or migration rates between the two species.These patterns were consistent across sample sizes, but there were some differences comparing the two techniques.Specifically, the "capture" data set showed significantly lower level of divergence (Wilcoxon signed-rank test, p<0.0001 between netdivcapture=0.004and netdivrna-seq=0.02for n=2 individuals, Table 2) and average number of fixed differences between species per locus (Wilcoxon signed-rank test, p<0.0001 between Sfcapture=0.322and Sfrna-seq=0.810for n=2 individuals, Table 2).These discrepancies were most likely due to the use of a single reference in the "capture" data resulting in the problematic mapping of highly divergent alleles from the two species.

Effects of sampling strategy on model selection
We carried out model selection between the various scenarios of speciation shown in Figure 1, and asked whether the sampling strategy (number of individuals and number of SNPs) had an effect on model selection.Results are shown in Table 3a which reports the posterior probability of the best supported scenario for each configuration of the data; and in Table 3b which compares the posterior probability of homogeneous vs. heterogeneous migration for the best supported scenario (see also Table S1 and S2 for full details).Firstly, we compared the "capture" and the "rna-seq" data which differ in the number of SNPs sampled (3,993 SNPs and 17,275 SNPs, respectively); and we found that the best supported scenario was the same for both data sets (Table 3a).For example, when considering twenty-three classes in the jSFS (jsfs=23), the best supported scenario always involved recent and genome-wide heterogeneous migration between the two species.The heterogeneous periodic secondary contact (PSC.hetero) was the most supported scenario; and the next best models were almost identical: BF=PPSC.hetero/PSC.hetero=1.21with the "capture" data, BF=PPSC.hetero/PSC.hetero=1.07with the "rna-seq" data, in the case of n=2 individuals; those numbers were PPSC.hetero/PIM.hetero=1.09and PPSC.hetero/PIM.hetero=1.38with n=4 individuals.The same patterns were found when using different subsets of the jSFS (jsfs=4 and jsfs=8, Table 3a), except that the best supported scenario was then the periodic ancient migration (PAM).
Regarding genome-wide heterogeneity of migration rates (Table 3b), the "rna-seq" data gave more support to the heterogeneous model (e.g., BF=PAM.hetero/PAM.homo=1.27 with n=2 and jsfs=4) compared to the "capture" data (BF=PAM.homo/PAM.hetero=1.27).This is consistent with the higher heterogeneity of the jSFS in the "rna-seq" data, involving a higher proportion of fixed differences and shared polymorphic sites (Table 2 and Figure S1).
Secondly, we evaluated the effect of the number of individuals sampled.As with the number of SNPs, it is clear that sample size had little effect.The best supported scenario remained consistent across the different sampling size (n=2, 4 or 8 individuals; Table 3a).For example, when considering twenty-three classes in the jSFS (jsfs=23), the heterogeneous periodic secondary contact scenario was the best supported model whatever the sampling size in both datasets.

Effects of jSFS-binning on model selection
We next investigated the effects of binning the jSFS on model choice (Figure 2 and Table 3) and our ability to discriminate between different speciation scenarios (Table S3).Given the limited effects of the sampling strategy, the ABC performance results are presented for the "capture" data with n=2 individuals only.This allowed us consider the full spectrum when using twenty-three classes of polymorphism.

Heterogeneity of migration rates
The scenarios with recent migration (SC, PSC and IM) all strongly supported heterogeneity of migration rates; and this support tended to increase with the number of polymorphic classes considered (Table S2).For example in the "capture" data with n=2 individuals, the relative probability of the heterogeneous model in the periodic secondary scenario was PPSC.hetero=0.53with jsfs=4, PPSC.hetero=0.77with jsfs=7 and PPSC.hetero=0.99with jsfs=23.In contrast, the model of homogeneous migration rates was the most supported, though marginally, in the scenarios of ancient migration (PAM and AM): e.g., PPAM.homo=0.56 with jsfs=4, PPAM.homo=0.69 with jsfs=7 and PPAM.homo=0.63 with jsfs=23.Concordant patterns were obtained using the other data configurations (Table S2).
We then assessed the performance of the method in identifying the correct model, using pseudo-observed datasets when homogeneous and heterogeneous models were compared (Table S3a).The correct model was always recovered (i.e., an accuracy rate of 1) for the different binnings in all speciation scenarios including gene flow; however, the ambiguity rate did strongly decrease when more information from the jSFS was included.With only four classes (jsfs=4), none of the replicates showed a posterior probability higher than 0.83 (the threshold set for the 2-model comparison), which corresponds to an ambiguity rate of 1.
Similarly, all ambiguity rates were above 0.97 with seven classes (jsfs=7).In contrast, when considering the full jSFS (jsfs=23), we could correctly recover with a strong support the different speciation models (e.g., 63% of the PSC.hetero replicates and 43% of the PAM.hetero replicates were above the threshold, Table S3a).These results suggest that the

Scenarios of speciation
It is clear from Table 3a (see details in Table S1) that binning the jSFS to four or seven classes leads to a loss of information.Specifically, when considering the full jSFS (jsfs=23), only the scenarios involving recent migration were supported (PPSC+PSC+PIM=0.96,Table S1); while the contrary was true when fewer classes of polymorphism were used (PPAM+PAM=0.96with jsfs=4 and 0.93 with jsfs=7, Table S1).Remarkably the strict isolation scenario was never supported (PSI<0.06,Table S1) suggesting that gene flow must have occurred between the two mussels species during the divergence.Moreover, the fact that the most supported scenario, using full information (jsfs=23), was the heterogeneous periodic secondary contact (PPSC.hetero=0.39)suggests a complex history of speciation, including periods of isolation alternating with ancient and recent migrations.The discrepancies that appear when not distinguishing low and high frequency shared variants (in the case of jsfs=4 and jsfs=7), confirm their importance for the identification of recent migration events (Alcala et al. 2016).
In general, the best supported model fitted the data well for each binning strategy (Figure S2).
All observed statistics were in the 95% simulated posterior distribution, except for the class "ssfB_2" (derived polymorphism fixed in species B, but polymorphic (doubletons) in species A) which was slightly overestimated by the model PSC.hetero (jsfs=23, Figure S2a); and the class "sfB" (derived polymorphism fixed in species B, and absent in species A) which was slightly underestimated by the model PAM.hetero (jsfs=7, Figure S2b).No statistics were found significantly out of the simulated distribution under model PAM.hetero with jsfs=4 (Figure S2c).characterize recent demographic events rather than the past divergence history.Second, we showed that inferences from the two sequencing techniques ("capture" vs. "rna-seq") were qualitatively the same, despite the very different number of SNPs that were sampled, again supporting the periodic secondary contact scenario (Table 3).This implies that neither the sequencing technique, nor the number of informative sites have a substantial effect on the inference.In fact, the jSFS calculated from the data sets were very similar (Figure S1); we only found a deficit of divergent SNPs in the "capture" data that may be due to the difficulty of mapping highly divergent alleles onto a single reference (on the contrary, the "rna-seq" reads from the different species were independently assembled and mapped).From a theoretical perspective, adding more loci (or longer loci) provide information about deep coalescent events which are important for shedding light on the divergence history of closelyrelated species (e.g., Wang & Hey 2010).In fact, previous simulation studies showed an influence of locus length and locus number on the ABC performance in model choice, and highlighted a threshold effect above which adding more loci did not significantly improve inferences (e.g., Li & Jakobsson 2012;Robinson et al. 2014;Shafer et al. 2015).Thereby, we argue that the number of SNPs sampled in our study was sufficient to accurately represent the diversity of coalescent histories in the Mytilus genome, and consistently support the same speciation model.
In most studies, functions of the jSFS are used as summaries of the data.For example, the likelihood method of Nielsen & Wakeley (2001) uses four classes of polymorphisms to estimate migration rates and divergence times in the isolation with migration scenario.In the ABC approach, choosing a suitable set of summary statistics is difficult because it implies a trade-off between loss of information and reduction of dimensionality.Accordingly, practical methods to identify approximately sufficient statistics have been developed (e.g., Wegmann et al. 2009;Nunes & Balding 2010;Aeschbacher et al. 2012).Here, we compared ABC inferences based on 23-binned jSFS (jsfs=23) with inferences using a subset of polymorphic classes (jsfs=4 and jsfs=7).Our results pointed out the loss of information when only four or seven classes in the spectrum were considered; particularly for the inference of recent migration events.By the jSFS into twenty-three classes, we could reveal the excess of shared polymorphisms that are at high frequency in one species and low frequency in the other, a pattern produced by recent migrants (Table 3).This is in agreement with the simulation study of Tellier et al. (2011) that showed a significant improvement in the estimation of the timing of gene flow when these additional polymorphic classes were considered; and the study of Alcala et al. (2016) that showed an excess of high frequency derived alleles is the characteristic footprint of secondary contacts.Across all models, we showed that the probability to correctly infer the true model (accuracy rate) increases with the number of classes considered, and that the ambiguity rate correspondingly decreases (Table S3).Smith et al. (2017) similarly found that the statistical power of their ABC model selection increases with the number of classes until reaching a plateau of error rates (but above which computation efforts continues to increase).
Inferences based on the 23-binned jSFS constantly support a model of periodic secondary contact (PSC) with a high accuracy rate and one of the lowest ambiguity rate (Table S3).
Moreover, the method has substantial power to distinguish PSC from the other models with migration (SC and IM; Figure S3a).These results are in agreement with previous ABC-based studies revealing clearly a secondary contact history between the two mussel species (Fraïsse et al. 2014;Roux et al. 2014), although they relied on eight nuclear loci and on a different set of summary statistics (those that we called "mscalc" here).The periodic connectivity models (PSC and PAM) were not included in these previous studies because of the lack of power to test for intermittent gene exchange since secondary contact.In the present study, we directly compared the use of the jSFS vs. "mscalc" statistics by performing additional ABC inferences on the "mscalc" statistics presented in Table 2. Results were similar to those obtained with jsfs=23 (Table S1), i.e. the best supported models included one or two secondary contacts, except for two data configurations ("capture" data with n=2 and n=8 individuals) for which strict isolation was chosen with a posterior probability of 37% and 34%, respectively.However, the goodness-of-fit of the strict isolation model was quite poor for the standard deviation of the number of fixed sites, "sf_std", and very poor for the standard deviation of the FST, 'fst_std' ("capture" data with n=2 individuals, Figure S2d).Both statistics were underestimated by the model suggesting that a history without gene flow cannot produce the observed variation of genetic divergence along the genome.Across models, inferences based on the "mscalc" statistics showed similar accuracy rate to those based on jsfs=23; however, the ambiguity rates for models with current migration were somewhat higher (PSC = 0.52 vs. 0.30, SC = 0.61 vs. 0.31, IM = 0.40 vs. 0.25, respectively).These supplementary analyses suggest that extracting summary statistics from the jSFS can lead to a substantial loss of information.

Conclusion
Genome-wide data offer us the possibility to test for more complex scenarios of divergence by providing a comprehensive picture of the gene histories across the genome.In this work, we incorporate heterogeneity of migration rates among loci to account for the semipermeability of the barrier to gene flow between recently diverged species (Barton & Bengtsson 1986).A reduced effective migration rate is expected in chromosomal regions linked to incompatible genes; while free introgression is expected in loosely linked regions.This variation in the rate of effective migration results in variation of genetic divergence along the genome.As shown elsewhere (Roux et al. 2013(Roux et al. , 2014)), avoiding to account for this heterogeneity can mislead inferences.In a similar way, the effect of background selection (i.e.purifying selection at linked loci) and genetic hitchhiking (i.e.positive selection at linked loci) in regions of low recombination can be incorporated in demographic inferences by including an heterogeneity of effective sizes among loci (Sousa & Hey 2013).This is of particular interest in the debate concerning the "genomic island of divergence" for which the Resour., 12(5), 834-845.
• Chapman MA, Hiscock SJ, Filatov DA (2013) Genomic divergence during speciation driven by adaptation to altitude.Mol.Biol. Evol.,30(12), 2553-2567.The following statistics were calculated for each locus.Their average (in black) and standard deviation (in grey) across all loci are given.
distributions were computed using a modified version of the program Priorgen (Ross-Ibarra et al. 2008) as described in Roux et al. (2013).