Dispersal is a major contributor to population processes, and can strongly influence genetic diversity, local adaptation, speciation, levels of inbreeding, sexual selection and many other biologically important processes (Clobert et al., 2012). Knowledge of dispersal patterns can improve evaluation of the maintenance of source–sink population dynamics (Walsh et al., 2012), the identification of habitat corridors or barriers to movement (Dzialak et al., 2005; Kozakiewicz et al., 2009), the determination of biological limitations to dispersal (e.g., sex biased dispersal, Goudet, Perrin & Waser, 2002), and our understanding of how the environment affects dispersal behaviour (Pruett-Jones & Lewis, 1990). Such knowledge can contribute to a theoretical understanding of ecological processes and be applied to improve conservation management practices (Fahrig & Merriam, 1994; Selonen, Hanski & Painter, 2010; Keller et al., 2010). For over a century, direct measures of dispersal have been collected using mark-recapture data (Sandercock, 2003), and more recently using telemetry techniques (Harris et al., 1990) and Geographic Position Systems (GPS) tracking (Cagnacci et al., 2010). Unfortunately, these methods have practical constraints, including expense and the difficulty of deploying and retrieving a large number of units. Alternatively, genetic methods that identify individual contemporary dispersers are commonly used because they are relatively cheap and easy to implement (Broquet & Petit, 2009). These methods include Bayesian statistics to identify the probability that a particular genotype arises from each of a group of populations (Pritchard, Stephens & Donnelly, 2000; Piry et al., 2004; Jombart et al., 2008). If an individual’s genotype has alleles that are more common to a population other than where it was sampled, the individual is considered to be a potential disperser (Rannala & Mountain, 1997).
The accuracy of genetic assignment methods depends on the demographic parameters of the species being investigated (e.g., rates of dispersal; Bohonak, 1999; Whitlock & McCauley, 1999), experimental design (e.g., proportion of populations sampled, number of individuals sampled; Cornuet et al., 1999; Paetkau et al., 2004; Schwartz & McKelvey, 2008), and the markers chosen (e.g., levels of polymorphism, number of loci; Berry, Tocher & Sarre, 2004). Given the number of factors that can influence the correct identification of dispersers using genetic assignment techniques, it is important that studies robustly test if the marker system and population structure allow for reliable identification and assignment of dispersers to the populations being studied. For example, Berry, Tocher & Sarre (2004) were able to determine the number of loci needed to accurately assign dispersers of grand skink (Oligosoma grande) populations by directly comparing genetically identified dispersers with mark-recapture data collected over a seven year period. Ideally, estimates of dispersal between populations should be carried out using multiple approaches to confirm congruence between estimation methods; however, most ecological studies are restricted to using a single approach either due to logistical or budget constraints (or both). Genetic data may permit the identification of dispersers and assignment to their source populations, but also provide the ability to estimate a range of other population parameters often of interest to ecological researchers (e.g., estimates of genetic diversity and heterozygosity, levels of inbreeding, multiple paternity, and historical patterns of connectivity; Excoffier, Laval & Schneider, 2005; Peakall & Smouse, 2006; Porras-Hurtado et al., 2013; Waser & Hadfield, 2011; Waser, Paetkau & Strobeck, 2001).
Here we perform a Simulated Disperser Analysis, using microsatellite data from three Australian common starling (Sturnus vulgaris) populations to test the power of a marker system to accurately identify ‘dispersers’ (individuals that are not native to their collection locality) using genetic assignment tests and genetic data only. This approach simulates dispersal between populations and then tests a number of parameters (number of loci, levels of variability of loci, and effect of disperser genotype) on the ability to identify dispersers and to correctly assign these to their collection locality.
The collection of all samples followed the strict guidelines outlined in the ethics application (permit number: 05/011A) approved by The University of New South Wales ethics committee.
Sample collection and DNA methods
We investigated dispersal in Australian invasive starlings (Sturnus vulgaris), a highly vagile species that maintains an extensive distribution across southeast Australia. Samples were taken from three populations previously shown to have low levels of genetic differentiation (Rollins et al., 2009; FST = 0.02–0.07): Orange (New South Wales, 33°17′S, 149°06′E, N = 32); Mallala (South Australia, 34°27′S, 138°30′E, N = 32); and Munglinup (Western Australia, 33°42′29″S, 120°51′54″E, N = 30).
DNA was extracted using Gentra PureGene DNA extraction kit (Qiagen) following the manufacturer’s instructions. Microsatellites were developed using next generation sequencing on the GS-FLX 454 platform (Roche, Manheim, Germany) following methods described by Abdelkrim et al. (2009). QDD v 0.9.0.0 Beta (Meglécz et al., 2010) was used to identify microsatellites from the sequencing data and primers were designed using the program PRIMER 3 (Rozen & Skaletsky, 2000). A panel of 20 polymorphic markers was chosen (see Table S1). PCR reactions were multiplexed using universal fluorescently labelled primers (Neilan, Wilton & Jacobs, 1997). A step-down PCR protocol was used for each reaction consisting of ten cycles each at the following annealing temperatures: 70 °C, 64 °C, 58 °C, 54 °C, 50 °C. Reactions containing DNA from the same individual but with differently labelled universal markers (e.g., PET, NED, 6-FAM and VIC) were combined in equivalent amounts, so that all loci for each individual could be multi-loaded into three tubes prior to fragment analysis (see Table S1). Samples were genotyped using an ABI 3730 (Applied Biosystems, Foster City, CA, USA) using GS-500 (Liz) in each capillary as a size standard. Allele sizes were estimated on GENEMAPPER version 3.7 (Applied Biosystems). These data were combined with data from 11 microsatellite loci for the same individuals as detailed in Rollins et al. (2009) giving us a total of 31 loci.
Microsatellite markers were checked for departures from Hardy-Weinberg equilibrium (Arlequin version 18.104.22.168, Excoffier, Laval & Schneider, 2005) and linkage disequilibrium (GenePop version 4.0.10, Rousset, 2008). PIC was calculated for each locus using PICcalc (Nagy et al., 2012). Arlequin was used to calculate Pairwise FST values. Pairwise values for Shannon’s mutual information index (I, formerly called SHUA, Sherwin et al., 2017) were calculated in GenAlEx (Peakall & Smouse, 2006) because mutual information is better than FST at handling a range of population sizes and dispersal rates (Sherwin et al., 2006; Dewar et al., 2011).
Ninety-four samples were successfully genotyped for 31 microsatellite loci. Of the 20 new species-specific loci developed here, two showed significant deviation from Hardy-Weinberg equilibrium (Svu002 and Svu010) and were removed from further analyses (see Table S3). None of the remaining 18 loci showed significant departures from linkage equilibrium after Bonferroni correction. Data from the other 11 loci previously have been shown to be in Hardy-Weinberg equilibrium in these populations by Rollins et al. (2009).
Initially, we used GeneClass2 (Piry et al., 2004) to determine the probability of each individual originating in the population where it was sampled (see method below). One individual from each population was randomly assigned to be a ‘simulated disperser’; individuals with low probabilities of population membership were excluded from being simulated dispersers because they were potentially real dispersers. Simulated dispersers were moved in every possible combination relative to other simulated dispersers creating a total of 27 (33) treatments (including a treatment of no movement, Table 1).
Loci were ranked by their Polymorphic Information Content (PIC) (see statistical analysis below). Each treatment was analysed using 57 genetic datasets. The first analysis started with the locus with the highest PIC and then consecutively adding the next highest PIC locus until all loci were added (N = 29, including the dataset having all markers; referred to as highest PIC hereafter). The second analysis started with the locus with the lowest PIC and then consecutively adding the next lowest PIC locus until all loci were added (N = 28, because the dataset with all markers was analysed in the previous analysis; referred to as lowest PIC hereafter).
The partial Bayesian method (Rannala & Mountain, 1997) implemented in GeneClass2.0 was used to detect simulated dispersers. We ran 10,000 MCMC simulations per population using the Lh∕Lmax likelihood computation (Paetkau et al., 2004). Individuals were considered first-generation dispersers if they had an Lh∕Lmax p-value that was below 0.01. An individual can incorrectly be identified as a disperser in two ways: if they are identified as a disperser when they are not one (False Positive), or, when they fail to be identified as a disperser when they are one (False Negative). Once identified, GeneClass2.0 assigns dispersers to the most likely source population.
In our assessment of the ability of GeneClass2.0 to detect simulated migrants, assignment tests were repeated for treatments and individuals, resulting in the non-independence of data. To account for this, we used Generalised Linear Mixed Models (GLMMs), specifying treatment and individual as random effects, and predictor variables (number of loci, highest/lowest PIC loci used, location the simulated disperser was moved to, and I) as fixed effects (Zuur et al., 2009). The response variable ‘dispersal status correctly identified’ was a binary indicator of the ability of GeneClass2.0 to correctly identify whether a simulated disperser was a disperser or not (0 = incorrectly identified, 1 = correctly identified) and was specified as having a binomial distribution of errors with a logit link function. For example, if an individual was identified as a resident in the population from which it was sampled, it was classed as ‘dispersal status correctly identified’. The response variable (dispersal status correctly identified) was modelled as a function of (a) the number of loci, (b) whether the highest/lowest PIC loci were used, (c) the population in which the simulated disperser was located (either its collection locality or one of the other two populations) and (d) the pairwise I value between populations when moved. The number of loci was a discrete numeric variable, while highest / lowest PIC loci used (PIC range = 0.251–0.862) was specified as a categorical variable with two levels, location was specified as a categorical variable where individuals were coded as either being from their collection locality or one of the other two populations (collection locality was used as the reference group with which other groups were compared), and pairwise I was specified as a continuous numeric variable. Model fit was estimated from marginal () and conditional () coefficients of determination, following Nakagawa & Schielzeth (2013). estimates model fit using fixed effects only, while estimates model fit including both fixed and random effects. By comparing these estimates it is possible to compare the contribution that random effects and fixed effects have on the response variable.
To assess how studies have addressed parameters that affect power to detect dispersers (number of individuals, number of loci, levels of genetic differentiation) we used Google Scholar to conduct a literature search (on 19/12/2016) for articles that cited the publication that originally presented the GeneClass2 analysis (Piry et al., 2004). We used the key term ‘dispersal’ to narrow the search and of 1,280 articles identified, we randomly chose 160 articles to assess. If the article did not use GeneClass2.0’s migrant detection analysis we disregarded it; if the article did use this method we collected information on the number of loci used, global FST value, and general article meta-data such as author(s), year published, common name and species name. Of the 160 papers that were assessed, 132 matched the selection criteria, including a total of 136 datasets (Table S2).
Generalised Linear Models (GLMs) were used to determine the relationship between the global FST values of studies using genetic assignment tests and the number of loci that were used by these studies to identified dispersers. The number of loci used was specified as a continuous response variable, with a Poisson distribution, which was determined graphically and the global FST value was a fixed effect.
To demonstrate the continuing relevancy of this work (see De Barba et al., 2016), we searched within the 1,280 articles we identified above for those containing the search term ‘microsatellite’ by year since this technique was published (2005–2016) to determine whether the number of papers using this approach was decreasing over time.
Regression models and graphics were generated using the statistical programming environment R (version 2.15.2, R Development Core Team, 2012). GLMMs were generated using the function glmer within the package lme4 (version 0.999999-2, Bates, Maechler & Bolker, 2012). Graphics were generated using the package ggplot2 (version 0.9.3.1, Wickham, 2009).
Sample inclusion and genetic analysis
Populations from Orange (New South Wales) and Mallala (South Australia) showed the lowest levels of differentiation from each other (FST = 0.026, P value < 0.001; I = 0.081), while the population from Munglinup (Western Australia) displayed higher levels of genetic differentiation from Orange (FST = 0.082, P value ≤ 0.001; I = 0.152) and Mallala (FST = 0.081, P value ≤ 0.001; I = 0.139) indicating that Munglinup is the most genetically distinct area sampled.
One of 94 individuals was identified as a potential disperser within the original dataset, which included all loci and no simulated dispersal. We avoided using this individual as a simulated disperser.
The simulation results showed that the ability to correctly identify a simulated disperser was strongly associated with an increase in the number of loci used and when the highest PIC loci were used rather than the lowest (Table 2, Fig. 1). For a subset of the simulation data (treatments where simulated dispersers were moved outside their collection locality), I values showed a strong positive relationship with correct identification of simulated dispersers. When a simulated disperser was moved to a location that had a relatively high pairwise I to its collection locality, it was more likely to be correctly identified as a disperser (Table 2). For two of our simulated dispersers, no False Positive errors were identified. However, False Negative errors were common.
|Response variable||Predictor variable||C||SE||Z|
|Simulated disperser status correctly identified by genetic assignment test||Number of loci||0.30||0.01||26.98|
|(; )||Highest/lowest PIC loci used||1.32||0.11||11.62|
|Moved to Mallala||−4.75||0.24||−19.91|
|Moved to Munglinup||−3.18||0.20||−15.71|
|Moved to Orange||−5.75||0.25||−23.31|
|Simulated disperser status correctly identified by genetic assignment test||Number of loci||0.38||0.01||25.39|
|(R; )||Highest/lowest PIC loci used||2.18||0.14||15.07|
Simulated disperser A (SD-A), artificially moved from its collection locality at Munglinup
SD-A was always identified as a resident when located in its collection locality, whether using the highest or lowest PIC loci. We were able to identify SD-A to its collection locality when using the three highest PIC loci available, whereas 11 loci were needed when using those having the lowest PIC values. The minimum number of loci required to identify SD-A to its collection locality 100% of the time did not vary between locations (Fig. 2).
Simulated disperser B (SD-B), moved from Mallala
SD-B was incorrectly identified as a disperser when located in its collection locality when using the highest PIC loci but was always identified as a resident when using the lowest PIC loci. When using the lowest PIC, 20 or more loci were needed to assign SD-B to its collection locality 100% of the time, while 27 loci were needed when using the highest PIC loci. SD-B was identified as a disperser after using the 18 highest PIC loci; however, detection broke down between 21 to 24 loci. When using the lowest PIC loci SD-B was not correctly assigned 1.66% of the time, and only when located in Orange. SD-B was not correctly assigned 14.94% of the time when using the highest PIC loci (Fig. 2).
Simulated disperser C (SD-C), moved from Orange
SD-C was always identified as a resident when located in its collection locality whether using the highest or lowest PIC loci. Fifteen loci were required to identify SD-C to its collection locality 100% of the time, whether using the highest or lowest PIC loci. When using the highest PIC loci, 21 out of 27 treatments correctly identified SD-C with 10 or more loci. For treatments from 1 to 15 loci where SD-C was incorrectly identified, the location of SD-A contributed to the incorrect identification. SD-C was not correctly assigned to Mallala 2.81% of the time when using the highest PIC loci and was not correctly assigned only when located in Munglinup (Fig. 2).
The studies we evaluated were published between 2005 and 2016. Studies that implemented first-generation migrant detection in GeneClass2.0 to identify dispersers used on average 11.2 ±5.1 loci. For a subset of these data (N = 72), where global FST was available, the average global FST was, 0.120 ±0.119. There was no relationship between the global FST of these studies and the number of loci used when conducting genetic assignment tests (GLM, t71 = − 1.312; Fig. 3).
Figure 4 demonstrates that, despite the recent increase in population genetic papers based on next-generation sequencing data, microsatellite data sets continue to be used for contemporary disperser identification at a similar rate over the past eight years (more than 100 papers per year during this period).
Many studies have emphasised the need to have sufficient numbers of highly variable loci in order to carry out population assignment tests with a high degree of confidence (Manel, Berthier & Luikart, 2002; Berry, Tocher & Sarre, 2004; Evanno, Regnaut & Goudet, 2005). Here we artificially force dispersal between a set of populations and test the power of the marker system (i.e., number of loci and levels of variability) to identify dispersers. Our results show that we should be validating genetic assignment tests by testing the power of the marker systems we use. Without validation of genetic data sets we cannot be confident in the inferences we make about dispersal. Our results confirm that the number of loci is one of the best predictors that a marker system has adequate power to detect real dispersal events (Smouse & Chevillon, 1998; Bernatchez & Duchesne, 2000). However, both false positive (identified as a disperser when they are not one) and false negative (not identified as a disperser when they are one) assignments can result even when using many loci with high levels of polymorphism (e.g., simulated disperser B). The approach we present highlights the number of false positive and false negative assignments (see Fig. 2).
Similar to previous studies (Bernatchez & Duchesne, 2000; Berry, Tocher & Sarre, 2004), our data indicated that the number of loci used to identify dispersers strongly influences the results of contemporary genetic disperser analysis. As more loci are added to an analysis the total information available for discriminating between the genotype of an individual and the background population increases, which increases the detectability of dispersers. Others have suggested that approximately 10 loci are required to correctly identify dispersers when using assignment tests with microsatellite data (Cornuet et al., 1999; Bernatchez & Duchesne, 2000); however, for our study system, ideally 27 loci were required (Fig. 1). In our study, false negatives (i.e., not identifying an individual as a migrant when they were) were more common than false positives (i.e., when an individual was identified as a migrant when they were not). This difference highlights the fact that every study requires an individual validation of the marker system in order to be confident of identifying dispersers using genetic assignment tests.
Not surprisingly, when using loci with low PIC more loci are usually required to reach the same level of discrimination than loci with high PIC because low PIC loci provide less discriminatory power for the analysis. However, this is not always true as seen with individual B when using 21 to 24 of the highest PIC loci. A single locus can result in inconsistent assignment of dispersers even when using a large number of loci. It is possible that a single locus can play a comparatively large role in distinguishing the multi-locus genotype of an individual. For example, if an individual has alleles at a locus that are rare in its population but more common in another population, inclusion of this locus may increase the incorrect assignment of that individual. Post-hoc analyses, where locus 21 was removed, resulted in correct assignment of SD-B for all locus combinations. This demonstrates how our approach can help to identify problematic loci (i.e., those that yield inaccurate assignments). Although spurious results might occur with any combination of loci, it is reasonable to assume that such results would generally become less likely as more discriminatory information becomes available with an increased number of loci.
Our results confirm those of previous studies (Cornuet et al., 1999; Berry, Tocher & Sarre, 2004) by showing that it was easier to identify a disperser when they were moved into a population that had a relatively high pairwise differentiation (e.g., I) from their collection locality. Although Berry, Tocher & Sarre (2004) clearly showed this effect in a natural system over a decade ago, there has been surprisingly little response to these findings. If researchers were to take into account the relationship between population differentiation (e.g., I, FST ) and the number of loci required and the ability to correctly identify dispersers, studies that use genetic assignment tests should use more loci when their global FST is lower. As shown by our analysis of the literature (Fig. 3), studies with lower FST values are not consistently using more loci to compensate for detection ability. By not dealing with this relationship, published studies may have incorrectly identified dispersers, potentially leading to misleading conclusions. To be confident of research outcomes, we recommend that studies only using genetic assignment test to identify dispersers should employ an approach similar to the Simulated Disperser Analysis that we describe here to validate their marker panel.
For a quick and stringent method capable of determining the number of loci necessary to accurately identify dispersers (e.g., an approach that is able to distinguish individuals that are easy to identify and hard to identify, such as SD-A and SD-B respectively) a ‘worst-case scenario’ is necessary. This study showed that a worst-case scenario can be expected to occur when individuals are moved to the most genetically similar population and the lowest PIC loci are used. Therefore, for future studies to validate their disperser analysis, they could move simulated dispersers to the most genetically similar population, start with the lowest PIC locus and continue by adding the next lowest PIC locus (Fig. 5). Such a simulated dispersal scenario makes biological sense given that genetic populations that have the lowest pairwise FST values are likely to be geographically closest to one another, and dispersers are more likely to move to a population closer to their own rather than further away (Sutherland et al., 2000). If spurious results are observed (false positive), such as when using the 21st locus to detect SD-B caused consistently incorrect results in our study, the offending loci may require greater scrutiny because this could result from mis-scoring.
In studies where some information on levels of genetic differentiation and levels of loci polymorphism are known, a preliminary “true” simulation study (i.e., based on synthetic samples) can potentially provide guidance on sample sizes needed to correctly identify dispersers. However, a simulated dispersal approach is applicable in cases where there is no prior information on levels of genetic differentiation and levels of loci polymorphism. It is also important to note that a simulated dispersal approach will only be valuable when estimates of the underlying allele frequencies are gained from a robust sampling protocol.
An increase in the number of loci and the pairwise differentiation (e.g., I, FST) between a disperser’s collection locality and its final destination increases the accuracy of genetically assigning the disperser (Cornuet et al., 1999; Bernatchez & Duchesne, 2000; Berry, Tocher & Sarre, 2004). Our study showed similar patterns to previous research on natural systems (Berry, Tocher & Sarre, 2004), however, the magnitude of the effect varied. Although basic patterns may be discernible when determining the number of loci necessary to accurately identify dispersers, the unique nature of natural systems means that every study system will be different. By validating our genetic disperser analysis, we were able to determine the number of loci required to increase the accuracy of identifying dispersers in a wild population without the need for direct measures of dispersal (e.g., tracking data, etc). Although next-generation sequencing data sets are becoming more prevalent in population genetic studies, we demonstrate here that microsatellite data are still commonly used to estimate contemporary dispersal and believe they will continue to do so (De Barba et al., 2016). Although our demonstration of this approach used microsatellite data, the process is also applicable analyses of next-generation sequencing data sets.
Microsatellite accession numbers
Characterisation of microsatellite loci in Sturnus vulgaris (N = 94) including locus name, GenBank accession number, primer sequences, repeat motif, number of alleles and allele size range.
Papers included in the literature search. The number of loci and the FST value (global or average pairwise FST) are given
Hardy-Weinberg Equilibrium data for the markers used
Hardy-Weinberg Equilibrium testing for the 20 markers described in this study (Table S1). Loci in bold show significant departures from Hardy-Weinberg equilibrium in the given population. Loci Svu002 and Svu010 have diverged from Hardy-Weinberg equilibrium in two populations and were removed from further analysis. Included for each population are: the population name locus name, the number of individuals genotyped for that locus, the observed and expected heterozygosity with standard deviations. Significant departures from Hardy-Weinberg Equilibrium are highlighted in bold.
Order of simulated movement treatments
Visual order of simulated movement treatments that correspond to the individual heat maps of each simulated disperser. The ‘heat map row’ corresponds to the visual row (1–27) on each heat map, while the number under Plot A, B and C (the plots representing results for Simulated Disperser A, B and C) specifies which treatment results are shown in the row. For example, heat map row 2 for simulated disperser A (A plot) corresponds to treatment 12 in Table 1.