Generalized Linear Models outperform commonly used canonical analysis in estimating spatial structure of presence/absence data
 Published
 Accepted
 Received
 Academic Editor
 Jason Blackburn
 Subject Areas
 Biogeography, Ecology, Statistics, Spatial and Geographic Information Science
 Keywords
 Redundancy Analysis (RDA), Statistical modelling, Spatial analysis, Spatial ecology, Beta diversity, Moran’s Eigenvector Maps (MEMs)
 Copyright
 © 2020 CarlosJúnior et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 2020. Generalized Linear Models outperform commonly used canonical analysis in estimating spatial structure of presence/absence data. PeerJ 8:e9777 https://doi.org/10.7717/peerj.9777
Abstract
Background
Ecological communities tend to be spatially structured due to environmental gradients and/or spatially contagious processes such as growth, dispersion and species interactions. Data transformation followed by usage of algorithms such as Redundancy Analysis (RDA) is a fairly common approach in studies searching for spatial structure in ecological communities, despite recent suggestions advocating the use of Generalized Linear Models (GLMs). Here, we compared the performance of GLMs and RDA in describing spatial structure in ecological community composition data. We simulated realistic presence/absence data typical of many βdiversity studies. For model selection we used standard methods commonly used in most studies involving RDA and GLMs.
Methods
We simulated communities with known spatial structure, based on three real spatial community presence/absence datasets (one terrestrial, one marine and one freshwater). We used spatial eigenvectors as explanatory variables. We varied the number of nonzero coefficients of the spatial variables, and the spatial scales with which these coefficients were associated and then compared the performance of GLMs and RDA frameworks to correctly retrieve the spatial patterns contained in the simulated communities. We used two different methods for model selection, Forward Selection (FW) for RDA and the Akaike Information Criterion (AIC) for GLMs. The performance of each method was assessed by scoring overall accuracy as the proportion of variables whose inclusion/exclusion status was correct, and by distinguishing which kind of error was observed for each method. We also assessed whether errors in variable selection could affect the interpretation of spatial structure.
Results
Overall GLM with AICbased model selection (GLM/AIC) performed better than RDA/FW in selecting spatial explanatory variables, although under some simulations the methods performed similarly. In general, RDA/FW performed unpredictably, often retaining too many explanatory variables and selecting variables associated with incorrect spatial scales. The spatial scale of the pattern had a negligible effect on GLM/AIC performance but consistently affected RDA’s error rates under almost all scenarios.
Conclusion
We encourage the use of GLM/AIC for studies searching for spatial drivers of species presence/absence patterns, since this framework outperformed RDA/FW in situations most likely to be found in natural communities. It is likely that such recommendations might extend to other types of explanatory variables.
Introduction
Ecological communities tend to be spatially structured in response to environmental gradients that are themselves organized in space, or to spatially contagious processes such as growth, dispersion, and species interactions (Legendre & Legendre, 2012; PeresNeto & Legendre, 2010). Thus, disentangling the causes of spatial structure and identifying spatial variability and different scales of organization in natural communities is a central question in ecology (Legendre, 1993). Answering this question requires the construction of explanatory variables based on spatial relationships among sites (Dray, Legendre & PeresNeto, 2006). One approach extensively used to create spatial variables and/or control for spatial autocorrelation in residuals is an eigenvectorbased method, called Moran’s eigenvector maps (MEMs, Dray, Legendre & PeresNeto, 2006). This method creates spatial explanatory variables representing structure on a range of spatial scales from the spatial relationships among sampling sites. These variables can be used for a broad range of goals, from controlling for phylogenetic autocorrelation in ecological data (DinizFilho et al., 2012) to searching for spatial structure in natural communities, even when irregularly sampled (e.g., Bauman et al., 2016; Neves et al., 2015).
In many studies the response variables for which ecologists seek to find spatial structure are community composition datasets containing either abundances or presence/absence information (here, we focus on the latter). For community ecology studies, Redundancy Analysis (RDA) is one of the most popular strategies due to its versatile framework, wellestablished literature and abundant toolkits available for implementation (see (Blanchet et al., 2014) ; Borcard, Legendre & Drapeau, 1992; Eisenlohr & De OliveiraFilho, 2015; Saiter et al., 2015). The RDA algorithm searches for optimal linear combinations (in the leastsquares sense, see Legendre & Legendre, 2012) of the explanatory variables that best explain the variation in the transformed community composition data (Legendre & Gallagher, 2001; Borcard, Gillet & Legendre, 2011; (Blanchet et al., 2014)). The usual approach then consists of establishing the global significance of the relationship between the response matrix and all the explanatory variables, after which a subset of explanatory variables is usually selected by stepwise procedure such as Forward Selection (FW, sensu Blanchet, Legendre & Borcard, 2008a) The most common approach uses two thresholds for variable selection: a significance level α and the adjusted R^{2} (see below and Blanchet, Legendre & Borcard, 2008a for details). This whole framework will hereafter be called RDA/FW for brevity. A statistic related to the Akaike Information Criteria (AIC, Akaike, 1973) has also been suggested for RDA model selection (GodínezDomínguez & Freire, 2003), but it has been shown to perform poorly and will not be further explored here (Bauman et al., 2018a).
However, methods based on leastsquares such as RDA are unlikely to perform well when applied to data that violate the assumption of constancy in the mean–variance relationship. This assumption is usually violated by datasets containing many zeros including abundance (count or semiquantitative) and presence/absence (binary) data. Data transformation does not always solve such problems (O’Hara & Kotze, 2010; Warton, 2018), although leastsquares can give reasonably robust tests of the significance of regression coefficients (Ives, 2015). In general, algorithmic methods such as RDA do not take into account the statistical properties of the response variable, such as the distribution of variances and how the response changes along spatial/environmental gradients (Ferrier et al., 2007; Warton, Wright & Wang, 2012; Warton et al., 2015; Warton, 2018). More recently, Generalized Linear Models (GLMs) have been proposed as an alternative modelbased approach to the analysis of presence/absence or count data (Wang et al., 2012; Warton et al., 2015; Yee, 2006). The use of GLMs has long been established for univariate analyses and related approaches for multivariate count data are now available (O’Hara & Kotze, 2010; Warton, 2018). The usual approach to selection of explanatory variables in this approach is Akaike’s Information Criterion (AIC: Akaike, 1973; Wagenmakers & Farrell, 2004). This framework will hereafter be named GLM/AIC.
Here, we compared the performance of the RDA/FW and GLM/AIC approaches to selecting spatial explanatory variables for community presence/absence data by measuring the proportion of spatial patterns contained in simulated communities they could correctly retrieve. There have been some studies of simulated multivariate count data (Warton, Wright & Wang, 2012), but presence/absence data are particularly important in spatial studies because they are often the only data that can be collected consistently over large spatial extents. We therefore compare the performance of RDA/FW and GLM/AIC methods for the selection of MEM spatial variables (including one special case, the asymmetric eigenvector maps or AEM) from realistic simulated presence/absence data. We used spatial variables as our predictors since we were interested in discovering whether varying the spatial scales in which communities were structured would affect model performance. We generated simulated data sets with predefined spatial structure based on three real data sets, under two different ecological interpretations of presence/absence data. First, we assumed that species are truly present at some sites and absent at others, and are detected if present (simulated presence method, SPM). Alternatively, absences may represent failure to detect species that are truly present. In this case, we simulated species abundances, followed by a simulated sampling step to obtain presence/absence data (simulated abundances method, SAM).
Materials & Methods
Baseline datasets
We compared the two approaches to spatial variable selection using simulated community data based on three real community composition datasets with a range of properties:

Presence/Absence of 110 marine benthic macroalgae species from a Rapid Assessment Program for biodiversity of 42 sample sites spanning roughly 2,000 km^{2} at Ilha Grande Bay, Rio de Janeiro, Brazil (southwest Atlantic) (CarlosJúnior et al., 2019), permit number IBAMA/RJ:031/04);

Presence/Absence of 588 plant species from grassland covering 500 km^{2} of Scotland’s coast. Data were collected from 3639 5 × 5 m quadrats from 94 sites. We used sites as our sample units, treating species as present when they occurred in at least one quadrat at a site, and absent otherwise (see Lewis, Pakeman & Marrs, 2014) for more information);

Presence/Absence of 47 freshwater aquatic insect species collected from 30 sample sites in five tributaries of the Guapiaçú River basin, Brazil which covers about 40 km^{2} (FeijóLima in prep, permit number INEARJ: 0192014).
For each of the datasets we used the geographical coordinates (maps and sampling sites in Fig. S1) to calculate spatial explanatory variables for regression (Fig. 1). We chose MEMs as our spatial variables since they are commonly used to describe spatial structure in ecological studies. Moreover, in contrast to coarser methods such as trendsurface analysis, MEMs are a flexible method, capable of describing all spatial scales provided by the sampling design (Borcard, Gillet & Legendre, 2011). They are also more flexible and powerful than the method of principal coordinates of neighbor matrices (PCNMs, a special case of distancebased MEMs) (Bauman et al., 2018a; Bauman et al., 2018b; Borcard & Legendre, 2002; Dray, Legendre & PeresNeto, 2006). One needs two matrices to build the MEM variables for a given set of site coordinates: matrix B describing the connectivity among the geographical sampling sites and matrix A describing the weights of such connections. The Hadamard product of these two matrices generates the spatial weighting matrix (matrix W), which is then doubly centred and diagonalized, yielding eigenvectors to be used as spatial variables. For ecological studies, the processes of interest are usually those generating positive autocorrelation, and it is therefore common to use only MEMs associated with positive eigenvalues (as in this study). For studies in which negative spatial autocorrelation is also of interest (e.g., where negative interactions such as competitive exclusion, predation, etc are suspected), the eigenvectors associated with negative eigenvalues can also be separately used (Bauman et al., 2018a). We made decisions about B and A for each dataset based on our ecological knowledge of the spatial structure of these regions, since our goal was to simulate communities with ecologically sensible spatial structures. Therefore, for dataset 1 we chose the minimum spanning tree (B) with Euclidian linear distances as weights (A). Our decision was based on the shape of the bay and the fact that the main water movements make the sampling sites geographically compartmentalised in subregions where sites are likely to be minimally connected (CarlosJúnior et al., 2019). Similarly, spatial organisation in dataset 2 could be sensibly described in terms of Delaunay triangulation (B) with Euclidian weights (A). Despite some degree of connectivity among all sites, pairs of sites could be mostly associated not to their immediate neighbours but rather as a function of their distances. This is due to cultural differences in land management. For example, northern and western islands share cultural histories, which is reflected in species composition (Lewis, Pakeman & Marrs, 2014). Directional spatial processes in ecological data, such as those observed in rivers, are well described by a special case of MEMs called asymmetric eigenvector maps (AEM, Blanchet, Legendre & Borcard, 2008b), which were used for constructing variables for dataset 3. In MEMs, larger eigenvalues are associated with broaderscale spatial structures while smaller eigenvalues represent finescale spatial structures. This allowed us to control the spatial scale of variation in community structure. Dataset 1 had 16 positive MEMs from 42 sites, dataset 2 had 30, and dataset 3 had 12 AEM variables with positive autocorrelation. For computation of the MEMs for the three datasets we used the packages adespatial (version 0.3–7, Dray et al., 2019) and spdep (version 0.7–4, Bivand & Piras, 2015; Bivand, Hauke & Kossowski, 2013).
Simulating communities with chosen spatial drivers
We simulated realistic communities with known spatial structure, based on the three datasets. We used spatial eigenvectors as explanatory variables. We varied the number of MEMs with nonzero coefficients and created new binary (presence/absence) communities (with the same number of sites and same expected number of species as the real ones) using two different modelling scenarios. These simulated communities reflected the effect of those MEMs with nonzero coefficients. By varying the number and ordering of the nonzero coefficients, we could therefore control the spatial structure and scale of the simulated community data (see scheme in Fig. 1 and Table 1).
Scaling  

1 (only broad)  2 (only fine)  3 (mixed)  
(A)  0  None  –  –  
⌊m∕6⌋  $\left\{1,2\right\}$  $\left\{15,16\right\}$  {1, 16}  
⌊m∕3⌋  $\left\{1,2,3,4,5\right\}$  $\left\{12,13,14,15,16\right\}$  $\left\{1,2,3,15,16\right\}$  
⌊m∕2⌋  $\left\{1,2,\dots ,8\right\}$  $\left\{9,11,\dots ,16\right\}$  $\left\{1,2,3,4,13,14,15,16\right\}$  
⌊3m∕4⌋  $\left\{1,2,\dots ,12\right\}$  $\left\{5,\right.7,\dots \left.,16\right\}$  $\left\{1,2,\dots ,6,11,12,\dots ,16\right\}$  
m  $\left\{1,2,\dots ,16\right\}$  –  –  
(B)  0  None  –  –  
⌊m∕6⌋  $\left\{1,2,3,4,5\right\}$  $\left\{26,27,28,29,30\right\}$  {1, 2, 3, 29, 30}  
⌊m∕3⌋  $\left\{1,2,\dots ,10\right\}$  $\left\{21,22,\dots ,30\right\}$  $\left\{1,2,\dots ,10,21,22,\dots ,30\right\}$  
⌊m∕2⌋  $\left\{1,2,\dots ,15\right\}$  $\left\{16,17,\dots ,30\right\}$  $\left\{1,2,\dots ,8,24,25,\dots ,30\right\}$  
⌊3m∕4⌋  $\left\{1,2,\dots ,22\right\}$  $\left\{6\right.,7,\dots ,\left.30\right\}$  $\left\{1,2,\dots ,11,21,22,\dots ,30\right\}$  
m  $\left\{1,2,\dots ,30\right\}$  –  –  
(C)  
0  None  –  –  
⌊m∕6⌋  $\left\{1,2\right\}$  $\left\{11,12\right\}$  {1, 12}  
⌊m∕3⌋  $\left\{1,2,3,4\right\}$  {9, 10, 11, 12}  {1, 2, 11, 12}  
⌊m∕2⌋  {1, 2, …, 6}  {7, 8, …, 12}  {1, 2, 3, 10, 11, 12}  
⌊3m∕4⌋  {1, 2, …, 9}  {4, 5, …, 12}  {1, 2, 3, 4, 5, 9, 10, 11, 12}  
m  {1, 2, …, 12}  –  – 
In order to simulate new binary communities under the simulated presence method (SPM, in which species are always detected if present), we first estimated a coefficient matrix C of size (m variables + 1 (first) row with intercepts) × p species from each real data set. This was achieved using the manyglm function with binomial errors in R package mvabund (version 3.11.9, Wang et al., 2012), with explanatory matrix X (n sites × m positive MEMs + an initial column of 1′s). The matrix C gives the effect of each explanatory variable on the logittransformed probabilities of presence. The mvabund package provides a GLM framework for multivariate response data.
We then created new hypothetical scenarios by generating a new coefficient matrix C^{∗}, of the same size as C, whose elements ${c}_{kj}^{\ast}$ are given by (1)$\left\{\begin{array}{c}{c}_{kj}^{\ast}={c}_{1j},\text{if}\phantom{\rule{1em}{0ex}}k=1,\phantom{\rule{1em}{0ex}}j=1,2,\dots ,p,\left(\text{intercepts}\right)\phantom{\rule{10.00002pt}{0ex}}\hfill \\ {c}_{kj}^{\ast}\sim {\stackrel{\u02c6}{F}}_{b},\text{if}\phantom{\rule{1em}{0ex}}k1\in K,j=1,2,...,p\phantom{\rule{10.00002pt}{0ex}}\hfill \\ {c}_{kj}^{\ast}=0,\text{otherwise},\phantom{\rule{10.00002pt}{0ex}}\hfill \end{array}\right.$ where ${\stackrel{\u02c6}{F}}_{b}$ is the empirical distribution function of c_{kj} (k =2, 3, …, m +1, j = 1, 2, …, p) (Evans, Hastings & Peacock, 2000), and the ${b}_{kj}^{\ast}$ are sampled with replacement. The set K defines to which rows of C^{∗} the nonzero coefficients were allocated: we studied 14 such sets (see below and Table 1(A–C)). In other words, we used the originallyestimated intercepts in each simulation (first row of Eq. (1)), and drew those coefficients assigned to nonzero values (second row of Eq. (1)) from the empirical distribution of all the originallyestimated explanatory variable coefficients. We sampled the values of the nonzero coefficients from the empirical distribution in order to simulate plausible but not fixed spatial structures. Table 1 depicts for each dataset how the nonzero coefficients were assigned for each dataset and simulation scenario (see below).
We then calculated predicted probabilities of presence ${\stackrel{\u02c6}{p}}_{ij}$ for the jth species at the ith site. Given the matrix $\stackrel{\u02c6}{\mathbf{Y}}=\mathbf{X}{\mathbf{C}}^{\ast}$ (n sites × p species) of predicted logit probabilities of presence, the predicted probability of presence is (2)${\stackrel{\u02c6}{p}}_{ij}=\frac{exp\left({\stackrel{\u02c6}{y}}_{ij}\right)}{1+exp\left({\stackrel{\u02c6}{y}}_{ij}\right)}.$
The simulated presence/absence value for species j at site i was sampled from a Bernoulli distribution with success probability ${\stackrel{\u02c6}{p}}_{ij}$. The result is a community matrix with the same number of sites and the same expected number of species as the real community, and with realistic coefficients for spatial eigenvectors. As in the maximum likelihood estimation done by manyglm (Wang et al., 2012), species and sites were assumed conditionally independent when generating simulated presence/absence data, given the values of the explanatory variables. Our simulated communities correspond to the simple case in which presence/absence patterns are affected by environmental variables but not interspecific interactions. Nevertheless, interspecific interactions could well be relevant to real world systems and other models (Godsoe & Harmon, 2012; Anderson, 2017).
Since GLMs are specified correctly for presence/absence data generated this way, we would expect them to perform well. We therefore devised a second ecologically meaningful simulation method in which absences arise from the sampling protocol, called the simulated abundance method (SAM). The two simulation methods differ in whether they assume we have true absences or samplingrelated absences. Note that it is not possible to simulate binary data directly using RDA, because RDA does not generate predicted probabilities of presence. Instead, we treated $\stackrel{\u02c6}{\mathbf{Y}}$ as log expected abundances and exponentiated each element to get expected abundances λ. Then we calculated the probability of detecting the species under Poisson sampling (i.e., the probability of drawing a value of at least 1 from a Poisson distribution with parameter λ), which is (3)${\stackrel{\u02c6}{p}}_{ij}=1{e}^{\lambda}.$
Finally, we generated a Bernoulli random variable with success probability ${\stackrel{\u02c6}{p}}_{ij}$ to produce a simulated presenceabsence observation. Both GLM and RDA are misspecified for data generated in this way. Codes for both the SPM and SAM simulation frameworks and all the datasets used in our simulations are available as supplemental information (Data S1–Data S3).
We compared GLM and RDA variable selection under up to 14 different scenarios, differing in the number of nonzero coefficients (nVar) and whether these coefficients were associated with fine or broad spatial scales. We simulated up to six different choices of the number of MEM variables creating the spatial structure in the data (i.e. having nonzero coefficients): none, approximately one sixth, approximately one third, approximately half, approximately threequarters, and all (Table 1(A–C), rows). We also simulated three different spatial scales of the patterns. As mentioned above, MEMs associated with larger eigenvalues represent broader spatial scales. We ordered the MEMs in descending order of eigenvalues and arranged the nonzero coefficients within matrix C^{∗} in three different ways (Table 1 (A–C), columns): only broadscale MEMs with nonzero coefficients (scaling 1); only finescale MEMs with nonzero coefficients (scaling 2); half broadscale, half finescale (scaling 3). Because not every combination of number of nonzero coefficients and spatial scaling is possible (e.g., it is not possible to assign one nonzero coefficient in scaling 3), there were 14 possible combinations overall for each dataset (Table 1). The main steps of the simulation scheme are summarized in Fig. 1.
RDA and GLM
We used the default RDA function from the R package vegan (version 2.56, Oksanen et al., 2019), with simulated community composition as the response variable, and MEMs associated with positive eigenvalues generated from geographical coordinates of the sample sites as explanatory variables. In order to perform a transformationbased RDA (Borcard, Gillet & Legendre, 2011; Blanchet et al., 2014) we used the Ochiai coefficient, which is the Hellinger transformation analogue for binary data, as recommended by Legendre & Gallagher (2001) and Borcard, Gillet & Legendre (2011).
Binomial GLMs were fitted to the same data using the manyglm function in R package mvabund (Wang et al., 2012). We fitted our models using a logistic regression (logit link function for binomial response), with species compositional data as the multivariate response variable and MEMs as predictors. No interaction terms were included, following common practice in spatial modelling of community data.
Comparing model selection between RDA and GLM frameworks
We compared the results of model selection between the approach usually taken in the RDA and a somewhatsimilar approach for GLMs. For RDA, we used the forward selection with double stopping criterion following (Bauman et al., 2018a; Bauman et al., 2018b), beginning with a global test of significance (model with all spatial predictors) and carrying on with the variable selection if the global model was significant. The forward selection itself consists of a stepwise procedure including in the model the variable contributing the most to the adjusted R^{2}. The procedure stops either when the next variable with the highest contribution is not significant (first stopping criterion) or causes the adjusted R^{2} to be bigger than that of the global model (i.e., containing all variables; second criterion). This is implemented in the function ordiR2step in the vegan package (Oksanen et al., 2019). For GLM, we used forward selection with a stopping rule based on minimum Akaike Information Criterion (AIC) (Akaike, 1973; Wagenmakers & Farrell, 2004). The selection procedure started from a model with intercept only and added one explanatory variable at a time, until no further improvement in the sum of AIC over each of the response variables was possible. We used this approach because the usually large number of MEMs makes it difficult to compare the AIC sum over all possible GLMs.
The performance of each method on simulated data was mainly assessed by two criteria. First, we assessed how many MEMs with zero coefficients were incorrectly included in the final model. Second, we assessed how many MEMs with nonzero coefficients were incorrectly excluded from the final model. Also, we assessed overall accuracy (score) as the percentage of MEMs whose inclusion/exclusion status was correct. The goals of ecological studies are usually not directly related to the inclusion/exclusion of individual MEM variables, but instead to identify spatial pattern, represented by a linear combination of MEMs. However, since the MEMs form a basis for the space spanned by the transformed spatial weighting matrix, such a linear combination is unique (Fraleigh & Beauregard, 1995, pages 197198). Furthermore, the MEMs are orthogonal, so that each represents a qualitatively distinct aspect of spatial pattern. Therefore, if an individual MEM is incorrectly included or excluded, the estimated spatial pattern is qualitatively wrong.
We further explored the ability of each method to capture spatial pattern using a graphical approach (Article S1). For each real dataset and each method, we haphazardly picked one simulated data set. We plotted the MEM decompositions of both the true and estimated spatial patterns. We chose the scenarios in which each method had the worst performance in terms of correctly including/excluding variables, in order to determine whether in such cases, overall spatial pattern would still be captured.
Finally, we calculated how much of the variation in response variables was explained by each method using the adjusted R^{2} for the linear model in RDA and its analogue for GLMs, the Dvalue (Tjur, 2009). These two values cannot be directly compared since they are not exactly equivalent, but their results could yield interesting insights and are made available as supplemental information (see table results in Data S4).
For each of the combinations of conditions in Table 1, 1,000 simulated data sets were generated under each of SPM and SAM. For each simulated data set, spatial explanatory variables were selected using both GLM/AIC and RDA/FW.
Results
Overall, GLM/AIC outperformed RDA/FW in selecting spatial explanatory variables when data were simulated under either SPM or SAM in all three scaling patterns (Fig. 2).
In general, GLM/AIC had fairly predictable performance: it performed nearly perfectly when few or none of the available variables had nonzero true coefficients (i.e., nVar = 0, [m∕6], [m∕3] or [m∕2]), but was less accurate when many or all the variables had nonzero true coefficients (nVar = [3m∕4] or nVar = m) (blue lines in Figs. 2A–2E). There was also some discernible pattern in RDA/FW’s scores: it performed best at nVar = 0 and nVar = m, with intermediate values showing a considerable decrease in selection success. The loss of accuracy for intermediate values of nVar (drop in red lines across different nVar values in Fig. 2 AE) varied substantially among datasets, making general inferences about results more difficult. There was little difference between the results from the SPM (Figs. 2A, 2C, 2E) and SAM simulations (Figs. 2B, 2D, 2F).
It is also noteworthy that when the model had a smaller number of variables to select from (River dataset 3 with 12 MEMs), scores in GLM/AIC were higher, with virtually no incorrect inclusion of variables, and incorrect exclusion of variables occurring on average in only approximately 6% of all 14000 simulations over the whole set of replicates (Fig. 3E). Under the same conditions, RDA/FW’s rate of success was approximately 81%, incorrectly including variables at a rate of 18% (incorrect exclusions represented less than 1%) as depicted in Fig. 3E.
Under both the SPM and SAM simulation methods, GLM/AIC differed substantially from the RDA/FW framework in regard to the type of errors it most often produced. GLM/AIC had virtually no incorrect inclusion of variables (Fig. 3, blue). However, when nVar = [3m∕4] or nVar = m some variables that should be included in the final model were left out. Nevertheless, GLM/AIC never had less than around 90% accuracy over all three datasets (overall mean = 96 ± 1.3% against 71 ± 1.7% from RDA/FW). On the other hand, RDA/FW often included more variables than it should in the model (Fig. 3, red). Such errors especially occurred when 0 < nVar ≤ [3m∕4]. Under some conditions, up to one third of the variables selected by RDA/FW had zero coefficients.
MEM decompositions of true and estimated spatial structure provided a visual assessment of the extent of the misspecification yielded by each method (Article S1). In all three datasets, the worst performance of GLM/AIC corresponded to those models in which it should have included all MEM variables (Fig. 2). Those scenarios represented communities structured at all spatial scales (broad, intermediate and fine). Despite incorrectly excluding several individual variables, GLM/AIC was capable of selecting subsets of variables that corresponded to all those scaling categories ( Articles S1.2–S1.7). In contrast, RDA/FW performed worse when there were few spatial variables (nVar = 5, nVar = 10 and nVar = 2 for datasets 1, 2 and 3, respectively). Under those conditions, incorrect inclusion of variables also resulted in the inclusion of incorrect spatial scales. For example, in one simulation from dataset 1 (Article S1.8) the true spatial structure contained only five MEMs describing finer spatial scale patterns (scaling 2 = MEMs 1216). However, the final model selected by RDA/FW included 13 variables describing both broad (MEMs 16) and intermediate spatial scales (MEMs 9, 11), along with the correct ones (Article S1.9). Similar results were found in all three datasets (Articles S1.10–S1.13). Moreover, these incorrect inclusions of individual variables by RDA/FW resulted in the inclusion of MEM variables associated with eigenvalues substantially different from the correct ones, representing spatial scales much larger than those actually present in the data (Article S1.14). For matters of space, we only plotted one failure example from each dataset for both GLM/AIC and RDA/FW. However, the correct spatial structures within simulated communities and those structures retrieved by both methods in all our simulations scenarios are available as supplemental data (Data S5).
Under SPM simulations, the scale of spatial pattern (fine, broad or mixed: scaling 1, 2 and 3, respectively) had negligible effect on GLM/AIC performance (Figs. 4A, 4C, 4E). A slight difference in variable selection scores between scaling 1 to 2 and 3 was only found in one modelling condition (Fig. 4, nVar = [3m∕4]). On the other hand, scaling often affected the performance of RDA/ FW, although there was no obvious general pattern across different conditions and datasets (Figs. 4A, 4C, 4E). Under SAM simulations, both frameworks performed similarly to what was observed under SPM (Figs. 4B, 4D, 4F).
Discussion
Here, we showed that a GLM/AICbased method for finding spatial structure in communities outperformed an RDA/FWbased method, for presenceabsence data simulated under two different ecologically plausible scenarios about how absences arise. We based our simulated datasets on real datasets from marine, terrestrial and freshwater data. Notably, differences in assumptions about how absences arise made little difference to performance. This might be due to the structure of our community presence/absence datasets, which (like most ecological datasets) had many rare species and, therefore, many expected abundances close to zero. In such cases, the relationship between the community data and explanatory variables could be approximated by a binomial GLM with a logit link function, even if this was not the correct model (as in the SAM simulations). We therefore focus below on general patterns that apply equally to both assumptions about absences, rather than on the details of these assumptions.
In selecting spatial explanatory variables, GLM followed by AICbased model selection (GLM/AIC) performed better than the widelyused approach of RDA followed by forward selection (RDA/FW). Not only did GLM/AIC have better performance overall, but its performance varied little between simulation conditions (Fig. 2). In contrast, RDA/FW performed unpredictably, but often retained too many explanatory variables (Fig. 3).
The problems arising from data with nonGaussian error distributions, such as classic community presence and absence data, in a linear modelling framework are not new to science (Legendre & Gallagher, 2001; McCullagh & Nelder, 1989; Wolda, 1981). Classical linear models such as RDA (Legendre & Anderson, 1999; Legendre & Legendre, 2012) make assumptions regarding constancy of variance in the data (Ter Braak & Prentice, 1988) that cannot be true for presenceabsence data, even after data transformation (O’Hara & Kotze, 2010; Warton, 2018; Warton, Wright & Wang, 2012). The problem may be negligible in some hypothesis testing situations (Ives, 2015). Regardless, incorrectly assuming linearity (and constant variance) may lead to serious problems. Unfortunately, RDA is an algorithmic method that makes implicit decisions about the distribution of variances (Ter Braak & Prentice, 1988; Warton, Wright & Wang, 2012) and does not provide the flexibility to separate systematic variation from random variation in the way that statistical models such as GLMs do (Warton et al., 2015; and see O’Neil & Schutt, 2013) for differences between algorithms and statistical models). New frameworks, such as using GLMs with spatiallystructured random effects (followed by variation partitioning to find environmental and spatial components) have also been specifically proposed as a modelbased alternative to MEMs (Ovaskainen et al., 2017). Despite recent advances showing that better estimates could be obtained by using sensible selection procedures, manipulating the data appropriately and/or by splitting the analysis of the response data over shorter spatial/environmental gradients (Bauman et al., 2018a; Ives, 2015; Vieira et al., 2019), employing statistical models that match the distribution of the response data is better practice in most cases (Ferrier et al., 2007; Warton, 2018; Warton et al., 2015).
Another relevant aspect of the general performances of the two methods concerns the peaks of performance in detecting spatial structure. The scores in the GLM/AIC framework were close to ideal across datasets when the number of variables that should be selected was none or was small relative to the number of variables available. The performance only decayed when many or all of the available variables should have been retained in the final model. Thus, if a few variables are responsible for most of the spatial structure in community composition, GLM/AIC will usually outperform RDA/FW (Fig. 2). Considering that the majority of effects could be derived from a small number of causes (Sullivan, 2019) in many biological systems, GLM/AIC could presumably perform well on many real systems. On the other hand, RDA/FW worked best precisely in situations thought unlikely in real systems, when no spatial structure is present among communities (where GLM/AIC also performed equally well), or when composition is structured at all possible spatial scales (i.e., nVar = 0 and nVar = m, respectively). Moreover, when the model had a small number of variables to select from (River dataset, Figs. 3E–3F), performance of RDA/FW was very variable.
The two approaches also differed in the ways they failed. GLM/AIC more often included too few variables, while RDA/FW more often included too many. This was consistent among all three datasets under SPM and SAM simulations (Fig. 3) and is in contrast with results from previous studies where GLMs produced higher Type I error rates compared to a linear model (Ives, 2015). For beta diversity studies, where the aim is to identify the most important variables associated with differences in community composition, leaving out a few variables that affect composition is better, in our opinion, than including many variables whose effects are not important. On the contrary, in other scenarios such as when one tries to select pivotal attributes that could be important for the conservation of a population or community, it might be better to accept a higher risk of including spurious variables. Furthermore, model selection problems involve a tradeoff between bias and variance, with inclusion of unnecessary variables inflating the uncertainty in parameter estimates (Miller, 1990). Using AIC is often a good way to deal with this tradeoff (Anderson, Burnham & Thompson, 2000), and in our simulations, an AICbased approach worked well. Thus, we suggest that GLM/AIC will usually outperform RDA/FW in selecting spatial explanatory variables for presence/absence community composition data. Unfortunately, AIClike statistics are not recommended for constrained ordination methods such as RDA, and therefore its use cannot be trusted (see below and Bauman et al., 2018a for details). When different RDAbased procedures were systematically compared, the commonly (mis)used combination of RDA and AIC model selection produced the worst results, yielding inflated Type I errors rates (Bauman et al., 2018a). Therefore, the benefits from AIC in dealing with the bias and variance tradeoff do not apply to RDA or related ordination methods. Despite our interest in some attributes of the MEMs for our simulations, such as differences in model performance under varying spatial scales, we hypothesize that the results demonstrated here hold true for other types of explanatory variables (e.g., environmental) not tested here.
The spatial scale represented by the MEMs had a negligible effect on GLM/AIC’s performance, with only one condition in one dataset slightly differing in results between different scales (see Fig. 4 when the number of nonzeros is ⌊3m∕4⌋). In contrast, RDA/FW’s performance was strongly affected by spatial scale (Fig. 4). In real systems, where the spatial scale at which community composition varies is not known a priori, the performance of RDA/FW could therefore be unpredictable. The uncertainty around RDA/FW performance over differing spatial scales could be especially troublesome for analyses involving processes that may not be constant along spatial/environmental gradients, as commonly observed for rates of species turnover, for example (Ferrier et al., 2007; Fitzpatrick et al., 2013).
Conclusions
We discourage the use of traditional RDA/FW to search for spatial descriptors of variation in multivariate presence/absence data sets of moderate size, although large datasets could potentially overcome the issues found here. Instead, we recommend the GLM/AIC framework, in which the relationship between the response and its predictors is modelled in a way that respects the nature of the response. Similar recommendations are likely to apply to other forms of community abundance data with nonnormal error distributions (e.g., count data with many zeros or proportional data, Bolker et al., 2009; Warton, Wright & Wang, 2012; Warton et al., 2016).
Supplemental Information
Maps of sampling sites in all three baseline datasets used for simulations
The maps depict the sampling sites (black dots in the three panels of the bottom) for all three real baseline datasets that were used for simulating new communities as described in the main text (for dataset A N= 42, B= 94, C= 30). The lettering of each dataset follow those described in the main text.
Baseline presence/absence data and geographic coordinates of sample sites
Original presence/absence data and geographic coordinates of Brazilian marine algae, Scottish grassland plants and Brazilian freshwater insect groups, as described in main text.
R script for running th simulated presences method (SPM) using dataset A
This file contains the R script for the simulating species presence/absence data with predefined spatial structure under different conditions using realistic coefficients.
R script for running th simulated abundances method (SAM) using dataset A
This file contains the R script for the simulating species presence/absence data from abundances with predefined spatial structure under different conditions using realistic coefficients.
Summary Table of Results
Performance of GLM/AIC and RDA/FW under both SPM and SAM after 1000 repetitions. The table depicts the results of variable selection by GLM/AIC and RDA/FW for each one of 1000 repetitions made for all three datasets (column ’dataset’) and varying the number of nonzero explanatory variables (column ’ev’) under different scaling scenarios (column ’scaling’, where 1= only broadscale variables with nonzero coefficients, 2= only finescale variables and 3= half broad and half fine scale variables). Two modelling strategies were used and are specified in column ’modelling’. They differ in the assumptions on how presence/absence data arise. The simulated presence method (SPM) assume true presences whereas simulated abundances method (SAM) treats absences as flaws in the sampling protocol. Columns ’scores’ provide the total number of correclty identified variables out of total number of variables for each dataset (16 for dataset A, 30 for dataset B and 12 for dataset C). Finally, column ’coef.det’ provides the coefficient of determinations Dvalue and Rsquared for each method, GLM/AIC and RDA/FW, respectively.
Simulated spatial patterns against the patterns retrieved by GLM/AIC and RDA/FW across three datasets and two modelling strategies (SPM and SAM)
This tables represent the presence (TRUE) or absence (FALSE) of nonzero coefficients assigned to each explanatory variables (MEMs, columns) in a certain simulation scenario (rows, see main text). Each tab corresponds to a dataset (subtidal algae, Scotland grasslands or river data), modelling strategy (SPM and SAM, see main text) and a method (GLM/AIC or RDA/FW,) compared to the ”real” spatial structure simulated for any given scenario (14 simulation scenarios, see main text for details). For each simulation scenario,1000 simulated data sets were generated.