SDMtoolbox 2.0: the next generation Python-based GIS toolkit for landscape genetic, biogeographic and species distribution model analyses

SDMtoolbox 2.0 is a software package for spatial studies of ecology, evolution, and genetics. The release of SDMtoolbox 2.0 allows researchers to use the most current ArcGIS and MaxEnt software, and reduces the time spent developing common solutions. The central aim of this software is to automate complicated and repetitive spatial analyses in an intuitive graphical user interface. One core tenet is careful parameterization of species distribution models (SDMs) to maximize each model's discriminatory ability and minimize overfitting; this includes careful processing of occurrence data and environmental data, and careful model parameterization. The program directly interfaces with MaxEnt, one of the most powerful and widely used species distribution modeling programs, although SDMtoolbox 2.0 is not limited to species distribution modeling or restricted to modeling in MaxEnt: many of the SDM pre- and post-processing tools have 'universal' analogs for use with any modeling software. The current version contains a total of 79 scripts that harness the power of ArcGIS for macroecology, landscape genetics, and evolutionary studies. For example, these tools allow biodiversity quantification (such as species richness or corrected weighted endemism), generation of least-cost paths and corridors among shared haplotypes, assessment of the significance of spatial randomizations, and enforcement of dispersal limitations on SDMs projected into future climates, to name only a few of the functions contained in SDMtoolbox 2.0. Lastly, dozens of generalized tools exist for batch processing and conversion of GIS data types and formats, which are broadly useful to any ArcMap user.


Downloading
The latest version of the toolbox is available for download at www.sdmtoolbox.org. This software requires ArcMap 10.1-10.5 with an active Spatial Analyst license (www.ESRI.com). The toolbox is programmed specifically for ArcMap 10.1 and above; due to a series of improvements in these versions, it is not backwards compatible with older releases of ArcMap (e.g. ArcMap 9.2). The software consists of an ArcGIS toolbox and associated Python scripts.

A First Run
Several issues that have nothing to do with the toolbox itself can prevent SDMtoolbox from running. Upon first use, perform the following steps: 1. Open ArcMap 10.X and activate the ArcToolbox window (if not already visible) by clicking the ArcToolbox icon.

Ensure Spatial Analyst is Enabled in ArcMap
2. If using an equal-area projection and your input data are WGS-1984 coordinates, try the SDMtoolbox tool that performs all of the above steps: SDM Tools → 2. MaxEnt Tools → Correcting Latitudinal Background Selection Biases → Solution 2: Project Input Data to Equal-Area Projection (EAP) → 1. CSV to EAP, MaxEnt format output.
3. Check that all your input data are projected and share the same projection. This is the most common error; reference ASCII files are often not projected at all.
4. Last, visualize all GIS files to confirm they are properly projected (all maps should overlap). Files sometimes report the same projection when in reality something is incorrect.

To check the projection of each GIS file
Import the GIS files, then right-click each file and select 'Properties'. Select the 'Source' tab and page down until you see 'Projection'. Here the projection is Lambert Azimuthal Equal-Area. To accompany this guide I have provided example data available at www.sdmtoolbox.org. Download the example data, the latest version of SDMtoolbox, and this guide before beginning.
This guide does not cover all the tools in SDMtoolbox; however, as an overview, it covers tools from all the major groups. For many tools lacking a guide, I have included example data to execute each analysis. The example data are contained in folders that mirror the hierarchy of the toolbox. For example, for the tool 'SDM toolbox: Biodiversity Measurements → Input: Point Data → Calculate Richness and Endemicity (WE and CWE)', the example data are in the folder '…example_data\biodiversity_measurements\biodiversity_points'. Lastly, each tool is annotated, and instructions are contained within each tool's help file in ArcGIS. The major groups of the toolbox follow; the guide treats each one as a separate chapter.
The 10 commandments of SDMtoolbox 2.0

Tool Overview
These tools estimate three common biodiversity metrics: species richness, weighted endemism and corrected weighted endemism. There are two sets of analyses here: analyses that utilize point occurrence data and analyses that use binary SDMs.
The three diversity metrics are: species richness (the number of species in each grid cell), weighted endemism (WE; the sum, across the species present in a cell, of the inverse of each species' range size) and corrected weighted endemism (CWE; WE divided by species richness).
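In code, the three metrics reduce to per-cell species counts and inverse range sizes. The sketch below is plain Python run outside ArcGIS; the function name and the simple cell-indexing scheme are my own, not the toolbox's, and range size is taken as the number of occupied grid cells:

```python
from collections import defaultdict

def biodiversity_metrics(occurrences, cell_size):
    """Compute richness, weighted endemism (WE) and corrected weighted
    endemism (CWE) per grid cell from point occurrences.

    occurrences: list of (species, x, y) tuples in projected coordinates.
    cell_size:   grid resolution in map units (e.g. 80000 m).
    """
    # Assign each occurrence to a grid cell.
    cells = defaultdict(set)                     # cell -> set of species
    for sp, x, y in occurrences:
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell].add(sp)

    # Range size of each species = number of cells it occupies.
    range_size = defaultdict(int)
    for species in cells.values():
        for sp in species:
            range_size[sp] += 1

    metrics = {}
    for cell, species in cells.items():
        richness = len(species)
        we = sum(1.0 / range_size[sp] for sp in species)  # inverse range sizes
        metrics[cell] = {"richness": richness, "WE": we, "CWE": we / richness}
    return metrics
```

Because CWE divides WE by richness, it highlights cells whose species are mostly range-restricted, rather than cells that are merely species-rich.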

Estimate Richness and Endemicity (WE and CWE)
ARCGIS STEP-BY-STEP GUIDE: 4. Here I used a resolution of 80,000 m (80 km). The data used here are in meters; your data, however, might be in feet or degrees. The distance value should be large enough to capture landscape processes, but not so large that regional differences are lost. I recommend starting with a value equivalent to 50 or 100 km. Note 1: your actual values will be in the map's units (likely meters, feet or decimal degrees). Note 2: 100 km = ~0.8983 decimal degrees at the Equator. 5. I prefer TIFF files as a raster output format because they allow longer file names (vs. ESRI grid files, which are limited to 13 characters) and do not consist of many component files. There is, however, a slight reduction in performance (vs. ESRI grid files); if processing thousands of rasters, take this into consideration. 6. A polygon mask of a country outline. This will clip the edges of pixels by the boundary of the mask, producing a much more visually pleasing output. All biodiversity metrics will be appended to the shapefile table and can be visualized via the file's symbology.

NOTES:
-This MUST be in the same projection as other input data (e.g. all projections must be WGS 1984).
-A good source for clipping masks (country boundaries etc.) is http://www.diva-gis.org/Data. Again, be sure to project the file to match your input rasters.
7. Often, due to a lack of occurrence data, not all species can be modeled. This option includes the occurrences of species that were not modeled in the biodiversity estimates.
8. Select the table field corresponding to species ID.
9. Buffer distance. The input points will be buffered to this distance. Here I chose 25,000 m (25 km), meaning a circle with a 25 km radius will be created around each point.

Results
Outputs from the analyses, left to right: estimated species richness, estimated species richness (low resolution), WE and CWE. To change the appearance of the output shapefile, right-click the file and select 'Properties', then select the 'Symbology' tab. In the left column select 'Quantities → Graduated colors', and for 'Fields: Value' select any of the three biodiversity fields. Here I chose 10 classes defined by 'Natural Breaks (Jenks)'. To keep consistent with the other color schemes, I used the RGB color ramp and inverted it so that the highest values are red; to do this, right-click one of the colors (see above) and select 'Flip Symbols'. The small image to the right is the output from these settings.

Extra bit:
For a little extra pizazz, place a digital elevation model below the biodiversity layer (using a black-to-white color ramp) and make the biodiversity layer 10-20% transparent. This gives the map a little texture corresponding to topography.

Tool Overview
This tool facilitates a quantitative method for locating hotspots of endemism called Categorical Analysis of Neo- and Paleo-Endemism (CANAPE; Mishler et al. 2014) on grids output from Biodiverse (http://shawnlaffan.github.io/biodiverse/). These analyses classify areas of neo-endemism and paleo-endemism: concentrations of young range-restricted taxa and of old range-restricted taxa, respectively. The method assesses the significance of branch lengths among taxa, identifying areas where branches are either significantly shorter (neo) or significantly longer (paleo) than in other areas of the landscape.
For more information regarding this analysis and running it: http://biodiverse-analysissoftware.blogspot.com.au/

Tool Overview
This tool creates a raster of the sum of least-cost corridors and a polyline shapefile of least-cost paths between populations that share haplotypes. Often a single LCP between sites oversimplifies landscape processes. By using categories of cost paths that include paths with slightly more costly path lengths (relative to the LCP), you can better depict habitat heterogeneity and its varying role in dispersal. For each comparison you can classify the lowest cost paths into three categories. Lastly, a density analysis will produce a raster depicting the frequency that LCPs traverse the same path.

Landscape Connectivity
ARCGIS STEP-BY-STEP GUIDE: recode as an alphanumeric ID (e.g. change '9' to '9a'). 4. A friction layer is a raster that depicts the ease of dispersal from each locality through the landscape. In this analysis, the friction layer also defines the output extent and spatial resolution of the analysis. If you want a larger extent or a lower spatial resolution, change both in your friction layer; however, the larger both are, the longer the analysis will take.
Creating a friction layer. One prevalent way to create a friction layer that doesn't suffer from applying weights to habitat types (often associated with contributing to biases in results) is the use of SDMs. An SDMtoolbox tool (location below) will invert a SDM for use as a friction surface. Using this method, areas of high suitability will be converted to areas of low dispersal cost.
Tool Path: SDM Tools 1. Universal Tools  Create Friction Layer  Invert SDM 5. Input desired output name. Note when this category is highlighted the Help Box displays the text in the box to the right. I input 'Oplurus_cuvieri', this means that the output files will be named: 'Oplurus_cuvieri_Dispersal_Network', 'Oplurus_cuvieri_LCPs.shp', 'Oplurus_cuvieri_LCPs_Line_Density' 6. Select output folder location. This should be a new empty folder. If not empty this can cause the analysis to fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early). 7. Output file type. Here I selected the 'Erdas Imagine (.img)' raster format.
Tip. I prefer TIFF files as output format because they allow for longer file names (vs. ESRI grid files that are limited to 13 characters) and don't have too many raw parts to each file. There is, however, a slight reduction in performance (vs. ESRI grid files), thus, if processing thousands of rasters this should be taken into consideration.
8. Create least-cost path lines. Least-cost path (LCP) analysis finds the 'cheapest' way to connect two locations within a cost surface (i.e. a friction layer). 9. There are two methods for calculating the least-cost corridors (LCCs): A. Percentage of LCP value, which is based on the LCP between each site. For example, if the LCP cost was 5.0 and a 1% LCC class cutoff was selected, cost paths with values between 5.0 and 5.05 would be included in that class. This formula depends only on the LCP value and is not affected by larger cost-path values.
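Conceptually, an LCP is a shortest path over the friction raster in which each step is weighted by friction rather than geographic length. Below is a minimal plain-Python sketch using Dijkstra's algorithm with 4-neighbour moves; the real tool uses ArcGIS cost-distance surfaces (which also allow diagonal moves), so this is illustrative only:

```python
import heapq

def least_cost(friction, start, goal):
    """Dijkstra least-cost path over a friction grid (4-neighbour moves).
    Cost of a step = average friction of the two cells it connects.
    Returns (accumulated cost, path as a list of (row, col) cells)."""
    rows, cols = len(friction), len(friction[0])
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue                      # stale queue entry
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + (friction[r][c] + friction[nr][nc]) / 2.0
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(pq, (nd, (nr, nc)))
    # Reconstruct the path from goal back to start.
    path, cell = [goal], goal
    while cell != start:
        cell = prev[cell]
        path.append(cell)
    return dist[goal], path[::-1]
```

On a grid with a high-friction barrier in the middle, the path routes around the barrier, which is exactly the behaviour the corridor classes above relax by also admitting slightly costlier paths.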

Results
Dispersal Networks. A. Least-cost paths and LCP line densities. Warmer colors depict LCP lines traversed more frequently. B. Haplotype Dispersal Networks. Warmer colors depict cost-paths traversed more frequently and represent likely connections in habitat (due to common ancestry among shared haplotypes).

Tool Overview
This tool creates two pairwise distance matrices: the least-cost path distance and the along-path cost of the least-cost path. The least-cost path distance is simply the length of the LCP. The along-path cost is the total sum of the friction values along the least-cost path. Each output is an n × n symmetric matrix written as a CSV table. Note that some of the code used here is adapted and updated from T.R. Etherington's Landscape Genetics toolbox. If you use this tool, cite both SDMtoolbox and the following: Etherington, T.R. (2011) Python based GIS tools for landscape genetics: visualising genetic relatedness and measuring landscape connectivity. Methods in Ecology and Evolution, 2(1): 52-55.
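The output format can be sketched directly: for n localities, each metric fills an n × n symmetric table with zeros on the diagonal. The helper below is illustrative only (the function name and value encoding are mine, not the toolbox's):

```python
import csv
import io

def symmetric_matrix_csv(labels, values):
    """Render an n x n symmetric matrix as CSV text.
    values: dict mapping frozenset({a, b}) -> pairwise value;
    the diagonal (a == a) is written as 0."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + labels)            # header row of locality IDs
    for a in labels:
        writer.writerow([a] + [0 if a == b else values[frozenset((a, b))]
                               for b in labels])
    return buf.getvalue()
```

The same helper would be called twice, once with the along-path costs and once with the path distances, to produce the two CSV outputs described above.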

Create Pairwise Distance Matrix
ARCGIS STEP-BY-STEP GUIDE: recode as an alphanumeric ID (e.g. change '9' to '9a'). 3. A friction layer is a raster that depicts the ease of dispersal from each locality through the landscape. In this analysis, the friction layer also defines the output extent and spatial resolution of the analysis. If you want a larger extent or a lower spatial resolution, change both in your friction layer; however, the larger both are, the longer the analysis will take.
Creating a friction layer. One prevalent way to create a friction layer that doesn't suffer from applying weights to habitat types (often associated with contributing to biases in results) is the use of SDMs. An SDMtoolbox tool (location below) will invert a SDM for use as a friction surface. Using this method, areas of high suitability will be converted to areas of low dispersal cost.
Tool Path: SDM Tools 1. Universal Tools  Create Friction Layer  Invert SDM 4. This should be a new empty folder. If not empty this can cause the analysis to fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early). 5. Input desired output name. Note when this category is highlighted the Help Box displays the text in the box to the right. I input 'Oplurus_cuvieri', this means that the output files will be named: 'Oplurus_cuvieri_LCP_cost.csv', 'Oplurus_cuvieri_ LCP_distance.csv', 'Oplurus_cuvieri.shp' Select output folder location.

Results
Pairwise distance matrices: A. path cost and B. path distance (not pictured).

TOOL OVERVIEW
If you are using data in a geographic coordinate system (such as degrees-minutes-seconds or decimal degrees) for MaxEnt analyses (and most other background- and pseudo-absence-based SDM methods), then you are biasing your selection of background points (or pseudo-absence points) and unique observed localities toward the poles. The level of bias depends on the breadth of latitudes your analyses cover. This is because the ground area occupied by these angular units decreases with latitude (see Table 1): cells are largest at the equator and smallest at the poles. The inequality results from the convergence of the meridians (lines of longitude) toward the poles.
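The magnitude of the effect is easy to compute: the east-west width of a one-degree cell scales with the cosine of latitude. A quick sketch, assuming a spherical Earth of radius 6371 km:

```python
import math

EARTH_RADIUS_KM = 6371.0

def cell_width_km(lat_deg, cell_deg=1.0):
    """Approximate east-west width (km) of a grid cell spanning `cell_deg`
    degrees of longitude at latitude `lat_deg`. The meridians converge
    toward the poles, so width shrinks with cos(latitude)."""
    km_per_deg_at_equator = math.pi * EARTH_RADIUS_KM / 180.0  # ~111.2 km
    return cell_deg * km_per_deg_at_equator * math.cos(math.radians(lat_deg))
```

At 60° latitude a 'one-degree' cell is only half as wide as at the equator, so an unprojected analysis samples roughly twice as densely there per unit of actual area.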
There are two solutions to this issue. The first corrects the biased sampling problem by changing how background values and unique occurrence localities are selected. The second fixes the problem by projecting all the data into an equal-area projection (EAP). The latter is the preferred method; however, for many modelers this requires additional data preparation.

Clip BFCD by Background Selection Bias File tool interface
SDMTOOLBOX STEP-BY-STEP GUIDE:

Solution 2
Best practice: projecting all data into an equal-area projection. Tool: 1. CSV to EAP, MaxEnt format output (runs both 1a and 1b). ARCGIS STEP-BY-STEP GUIDE:

Help with Equal Area Projections:
Global projections: Equal-Area Cylindrical is best for equatorial studies covering several continents; Mollweide is great for global studies with resolution >25 km; Lambert Azimuthal Equal-Area is great for most regional studies.
Continental: the Albers Equal-Area projection is best for studies at mid-latitudes (between 20° and 50° N, or 20° and 50° S). For regions centered on the poles, use Lambert Azimuthal Equal-Area.
Reminder. Use the same projection for all rasters and CSVs in the study

Project Climate Data (Raster) to Equal-Area Projection (Folder) tool interface
SDMTOOLBOX STEP-BY-STEP GUIDE: 1. Input data must be non-ASCII raster files, and all files should be in a single folder. If you have ASCII files, first use the '2b. ASCII to Raster (folder)' tool (part of the Basic Tools → Raster Tools group). 2. Select the output folder location. This should be a new, empty folder; if not empty, the analysis can fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early). 3. Select an appropriate equal-area projection. Use the same projection selected for the CSV species file (pg. 26, step 6). 4. Run the tool. Then use the '2a. Raster to ASCII (folder)' tool, part of the Raster Tools group, to convert the newly projected rasters to ASCII files (the input format for MaxEnt).

Distribution Changes Between Binary SDMs
A common use of species distribution models is to predict distributional changes due to climate change.
Here I created two tools that help summarize distributional changes. The first calculates the distributional changes between two binary SDMs (e.g. current and future SDMs); output is a table depicting predicted contraction, expansion, and areas of no change in the species' distribution. The second tool also compares two binary SDMs, but focuses on summarizing the core distributional shifts of the ranges of many species. This analysis reduces each species' distribution to a single central point (a centroid) and creates a vector file depicting the magnitude and direction of predicted change through time.
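The centroid-shift idea reduces to two means and a subtraction. A minimal sketch on binary grids held as nested lists (plain Python; cell-centre coordinates, row index used as the y axis for simplicity, and function names mine):

```python
def centroid(binary_grid, cell_size=1.0, origin=(0.0, 0.0)):
    """Centroid (mean x, mean y) of all presence (value 1) cells
    in a binary SDM, using cell-centre coordinates."""
    x0, y0 = origin
    xs, ys, n = 0.0, 0.0, 0
    for r, row in enumerate(binary_grid):
        for c, v in enumerate(row):
            if v == 1:
                xs += x0 + (c + 0.5) * cell_size
                ys += y0 + (r + 0.5) * cell_size
                n += 1
    return (xs / n, ys / n)

def centroid_shift(current_grid, future_grid, cell_size=1.0):
    """Vector (dx, dy) from the current to the future range centroid."""
    cx, cy = centroid(current_grid, cell_size)
    fx, fy = centroid(future_grid, cell_size)
    return (fx - cx, fy - cy)
```

Run per species, these vectors are what the tool draws as lines of magnitude and direction of predicted range shift.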

Distribution Changes Between Binary SDMs
Tool: Centroid Changes (Lines). ARCGIS STEP-BY-STEP GUIDE: 5. Select an appropriate equal-area projection. If your extent is centered on a single continent, select that continent's projection.

Help with Equal Area Projections:
Global projections: Equal-Area Cylindrical is best for equatorial studies covering several continents; Mollweide is great for global studies with resolution >25 km; Lambert Azimuthal Equal-Area is great for most regional studies.
Continental: the Albers Equal-Area projection is best for studies at mid-latitudes (between 20° and 50° N, or 20° and 50° S). For regions centered on the poles, use Lambert Azimuthal Equal-Area.
Reminder. Use the same projection for all rasters and CSVs in the study.
6. Check to save the raster files of the results (see below).
7. Output file type. Here I selected the 'Tiff (.tif)' format.
Tip. I prefer TIFF files as output format because they allow for longer file names (vs. ESRI grid files that are limited to 13 characters) and don't have too many raw parts to each file. There is, however, a slight reduction in performance (vs. ESRI grid files), thus, if processing thousands of rasters this should be taken into consideration.

Results
Results: a CSV file of area changes and, to the right, an optional raster of the results.

Overprediction Correction: Clip Models by Buffered Minimum Convex Polygons
To limit over-prediction of SDMs, a common problem when modeling species distributions, two tools were created that clip SDMs by a buffered minimum convex polygon (MCP) generated from the input point data of each species, following the approach of Kremen et al. (2008). This method produces models that represent suitable habitat within an area of known occurrence (based on a buffered MCP), excluding suitable habitat far outside the observed range and unsuitable habitat throughout the landscape. (Location: …\example_data\sdm_analyses\overprediction_correction\binary_SDMs) 5. Select the output folder location. This should be a new, empty folder; if not empty, the analysis can fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early). 6. Output file type. Here I selected the 'Tiff (.tif)' format. 7. Climate data sized to the extent of MaxEnt modeling. Here use the imported 'Bio_1.asc' layer: select one of your climate files sized to your modeling extent (e.g. Bio1.asc). This file will be used to match the bias file to the proper extent and resolution (no change will be made to the file itself).
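The geometric core of this correction, computing the MCP from the occurrence points and testing whether a model cell falls inside it or within the buffer distance of its boundary, can be sketched in plain Python (Andrew's monotone chain for the hull; the actual tool performs the equivalent with ArcGIS geometry operations, and the function names here are mine):

```python
import math

def _cross(o, a, b):
    return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

def convex_hull(points):
    """Minimum convex polygon via Andrew's monotone chain (CCW order)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def _seg_dist(p, a, b):
    """Distance from point p to segment ab."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px-ax)*dx + (py-ay)*dy) / (dx*dx + dy*dy)))
    return math.hypot(px - (ax + t*dx), py - (ay + t*dy))

def in_buffered_mcp(p, occurrences, buffer_dist):
    """True if p lies inside the MCP of the occurrences, or within
    buffer_dist (map units) of its boundary."""
    hull = convex_hull(occurrences)
    if len(hull) >= 3 and all(
            _cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0
            for i in range(len(hull))):
        return True
    edges = [(hull[i], hull[(i + 1) % len(hull)]) for i in range(len(hull))]
    return min(_seg_dist(p, a, b) for a, b in edges) <= buffer_dist
```

Clipping the SDM then amounts to zeroing every cell for which `in_buffered_mcp` is false.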

Background Selection via Bias Files
A subset of python scripts create bias files used to fine-tune background and occurrence point selection in Maxent. Bias files control where background points are selected and the density of background sampling. Proper use of bias files can avoid sampling habitat greatly outside of a species' known occurrence or can account for both collection sampling biases and latitudinal biases associated with coordinate data.
Background points (and similarly pseudo-absence points) are meant to be compared with the presence data and help differentiate the environmental conditions under which a species can potentially occur. Typically, background points are selected within a large rectilinear area, and within this area there often exists habitat that is environmentally suitable but was never colonized. When background points are selected within such habitats, commission errors (false positives) increase. As a result, the 'best' performing model tends to be over-fit, because the selection criteria favor models that fail to predict the species in un-colonized but climatically suitable habitat. To circumvent this problem, many researchers have begun using more regional background point and pseudo-absence selection methods. SDMtoolbox contains two tools to facilitate more sophisticated background selection for use in Maxent. The Sample by Distance from Obs. Pts. tool (see: SDM Tools → 2. MaxEnt Tools → Background Selection via Bias Files) uses a common method that samples backgrounds within a maximum radial distance of known occurrences (see Thuiller et al. 2009). The Sample by Buffered MCP tool restricts background selection to a buffered minimum convex polygon based on known occurrences (see the following guide).
One limitation of presence-only SDM methods is the effect of sample selection bias, from some areas of the landscape being sampled more intensively than others (Phillips et al. 2009). Maxent requires an unbiased sampling of occurrence data, and spatial sampling biases can be reduced by using the Gaussian kernel density of sampling localities tool. This method produces a bias grid that up-weights presence-only data points with fewer neighbors in the geographic landscape. To do this, the tool creates a Gaussian kernel density surface from the sampling localities (Fig. 1n). Output bias values of 1 reflect no sampling bias, whereas higher values represent increased sampling bias. Depending on the study, the input points could be all sampling localities for a larger taxonomic group or simply the input sampling localities of a focal species. For example, if I were studying a single species of frog from Madagascar, I could use either: i) only the occurrence points from that species, or ii) all sampling points from all amphibians in Madagascar. The former focuses on sampling biases in the focal species, whereas the latter captures widespread spatial sampling biases and the likelihood of your species being detected in any survey (e.g. sampling only near roads).
2. The distance outside of the minimum convex polygon included in background selection. 3. Output name. 4. Select the output folder location. This should be a new, empty folder; if not empty, the analysis can fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early). 5. Climate data sized to the extent of MaxEnt modeling. Here use the imported 'Bio_1.asc' layer: select one of your climate files sized to your modeling extent (e.g. Bio1.asc). This file will be used to match the bias file to the proper extent and resolution (no change will be made to the file itself).
2. Field corresponding to species ID. 3. Field corresponding to latitude. 4. Field corresponding to longitude. 5. The distance outside of the polygon(s) included in background selection. 6. The alpha parameter is the search distance used to define the convex-hull shape and size. Larger values result in areas of background selection more similar to a buffered MCP; smaller values give outputs more similar to those of the distance-from-observed-localities tool.
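The bias-grid idea can be sketched outside ArcGIS: sum a Gaussian kernel over all sampling localities at each cell, then rescale so the least-sampled cell gets a bias of exactly 1. The toolbox's actual kernel, bandwidth handling and rescaling may differ; names and parameters below are illustrative:

```python
import math

def bias_grid(localities, rows, cols, cell_size, bandwidth):
    """Gaussian kernel density of sampling localities, shifted so the
    minimum bias value is 1 (1 = no sampling bias; larger = more bias)."""
    grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Cell-centre coordinates in map units.
            x, y = (c + 0.5) * cell_size, (r + 0.5) * cell_size
            density = 0.0
            for lx, ly in localities:
                d2 = (x - lx) ** 2 + (y - ly) ** 2
                density += math.exp(-d2 / (2.0 * bandwidth ** 2))
            grid[r][c] = density
    lo = min(min(row) for row in grid)
    # Shift so the least-sampled cell has bias exactly 1.
    return [[v - lo + 1.0 for v in row] for row in grid]
```

Supplied to Maxent as a bias file, such a grid makes background sampling mirror the spatial bias of the presence records rather than assuming uniform survey effort.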

Background Selection via Bias Files
Note. This value is directly linked to the buffer distance (it is multiplied by that value). Thus, if you want the same local adaptive convex-hull shape for different buffer distances, you need to adjust the value accordingly. For example, a buffer distance of 50 km and α = 4 would result in the same adaptive convex-hull shape (prior to buffering) as a buffer distance of 100 km and α = 2 (50 × 4 = 100 × 2). 7. Select the output folder location. This should be a new, empty folder; if not empty, the analysis can fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early). 8. Climate data sized to the extent of MaxEnt modeling. Here use the imported 'Bio_1.asc' layer: select one of your climate files sized to your modeling extent (e.g. Bio1.asc). This file will be used to match the bias file to the proper extent and resolution (no change will be made to the file itself).

Create Friction Layer: Invert SDM
The use of least-cost paths and along-path distances often dramatically improves the calculation of geographic distance for testing hypotheses (such as isolation by distance).
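The inversion itself is a one-liner per cell. A sketch, assuming the suitability raster is held as nested lists and using a rescaled inversion (friction = max + min − suitability, so high suitability maps to low dispersal cost; the toolbox's exact formula may differ), with NoData cells passed through unchanged:

```python
def invert_sdm(suitability, nodata=-9999):
    """Invert a suitability raster into a friction surface:
    friction = (max + min) - suitability, so the most suitable cells
    become the cheapest to cross. NoData cells are left untouched."""
    vals = [v for row in suitability for v in row if v != nodata]
    hi, lo = max(vals), min(vals)
    return [[v if v == nodata else (hi + lo) - v for v in row]
            for row in suitability]
```

For a 0-1 MaxEnt output this reduces to friction = 1 − suitability.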

Spatially Rarefy Occurrence Data
Most SDM methods require input occurrence data to be spatially independent to perform well. However, it is common for researchers to introduce environmental biases into their SDMs through spatially autocorrelated occurrence points. The elimination of spatial clusters of localities is important for model calibration and evaluation: when spatial clusters exist, models are often over-fit toward environmental biases (reducing the model's ability to predict spatially independent data) and model performance values are inflated (Hijmans et al. 2012). The spatially rarefy occurrence data tool addresses this issue by spatially filtering locality data by a user-input distance, reducing occurrence localities to a single point within the specified Euclidean distance. The tool also allows users to spatially rarefy their data at several distances according to habitat, topographic or climate heterogeneity (Table 1d). For example, occurrence localities could be spatially filtered at 5 km², 10 km² and 30 km² in areas of high, medium and low environmental heterogeneity, respectively. This graduated filtering method is particularly useful for studies with limited occurrence points and can maximize the number of spatially independent localities.
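The single-distance case reduces to a greedy filter: walk the points and keep one only if it is at least the rarefaction distance from every point already kept. A minimal sketch in plain Python (the toolbox's exact point-selection order may differ, which can change which point in a cluster survives):

```python
import math

def rarefy(points, min_dist):
    """Greedy spatial filter: keep a point only if it lies at least
    `min_dist` (projected map units) from every point already kept."""
    kept = []
    for p in points:
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept
```

The graduated variant described above simply applies this filter with a different `min_dist` inside each heterogeneity class.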

SDMTOOLBOX STEP-BY-STEP GUIDE:
1. Input table with occurrence data, species ID, and clade ID (as a number: 2-10). The current script only supports 2 to 10 clade groups: if only a single clade is identified, the binary SDM will not be split, and species with more than 10 clade IDs will be skipped. 2. Column with species (or species group) name. 3. Column with longitude values. 4. Column with latitude values. 5. Column with clade ID. 6. Select the folder with the binary species model (the name must perfectly match the species name input in the table). 7. Output folder location. This should be a new, empty folder; if not empty, the analysis can fail, particularly if temporary files from a previous analysis were not properly removed (e.g. this can happen if another SDMtoolbox analysis is terminated early).

Split binary SDM by input clade relationship results
Left: final distribution split by clade relationship. Below: overview of the method, which splits the landscape into Thiessen (Voronoi) polygons around each locality. Clade membership is then assigned to these polygons, which are used to divide the input binary SDM into clade groups. Outputs are individual rasters for each clade and a single raster containing all clades.
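The splitting logic is a nearest-locality (Voronoi-style) assignment. Sketched below for a binary grid held as nested lists, with locality positions given as row/column indices (plain Python; names are mine, and the tool's Thiessen polygons are the continuous-space equivalent of this per-cell nearest-neighbour rule):

```python
def split_by_clade(binary_grid, localities):
    """Assign each presence cell of a binary SDM to the clade of its
    nearest locality (a Thiessen/Voronoi-style split of the range).
    localities: list of ((row, col), clade_id) pairs."""
    out = [[0] * len(row) for row in binary_grid]
    for r, row in enumerate(binary_grid):
        for c, v in enumerate(row):
            if v == 1:
                # Squared distance is enough for choosing the nearest.
                nearest = min(localities,
                              key=lambda loc: (loc[0][0] - r) ** 2
                                            + (loc[0][1] - c) ** 2)
                out[r][c] = nearest[1]
    return out
```

Writing each clade value to its own raster then yields the per-clade outputs described above.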

Chapter 5. Running a SDM in MaxEnt: from Start to Finish
Below is a brief overview of what I view as best practices for correlative species distribution modeling and how SDMtoolbox facilitates achieving them. This overview focuses on modeling in MaxEnt, but many steps are applicable to all types of distribution modeling. For an overview of major assumptions and other considerations, see the table at the end of this document.
Species distribution modelling (SDM) occurs in two phases: 1) Data compilation and 2) Model creation, calibration, and validation

1A. Preparing Worldclim Climate Data: Clip the raster to area of species' extent
Tools: Extract by Mask (Folder) and Raster to ASCII ARCGIS STEP-BY-STEP GUIDE: 1. Open a fresh ArcMap document 2. Download ESRI grid climate data (e.g. the 30 arc-second bioclim) from worldclim.org 3. Open one of the newly downloaded layers in ArcMap

Data compilation
This step includes collecting occurrence records of the focal species and environment data for its habitats.

Occurrence Data
The single most important component of any SDM is the input occurrence records, and extra care should go into selecting, and then processing, these points. The quality, distribution and number of points are directly related to the accuracy of the model. Use as many high-quality locality points as possible (e.g. GPS data collected with confident taxonomic identification), try to collect occurrence records that are evenly sampled throughout the species' range, and avoid biases in the sampling method (e.g. sampling only from road transects). It is better to have a limited number of points that satisfy these conditions than many points of questionable provenance (e.g. be skeptical of points downloaded from internet databases, particularly those georeferenced from locality descriptions) (Chan et al. 2011).

Environment Data
The environment data provide the landscape-level variables used to quantify the focal species' ecological tolerances. Include variables that are likely to be directly relevant to the species being modeled; however, do not add all available climate data without regard to redundancy. Many environmental variables are tightly correlated, making some redundant, and this makes interpreting the influence of each variable in the model difficult. If not included in your model, consider the effects of the following on the present distribution of your species: fire history, glaciations, contagious diseases, anthropogenic factors, recent geological changes, the species' movement potential through the landscape, and biotic interactions.
4. There are several ways to define the area to which the climate data are clipped. One of the easiest is to simply zoom the display window to the desired extent (image below, where I wanted to reduce the climate data to Madagascar). Other ways include defining the max-min XY coordinates (a bounding box) or using another GIS layer as a template (such as a country's boundary). Select one of these methods and continue. Note the area should extend about 50-100 km (ca. 0.5-1 degree) beyond the total distribution of all your focal species; we will then use bias files to limit each species' background selection to meaningful areas within this extent.

Zoom tool
Zoom to desired area

Optional Step. Which variables should I use? Testing Autocorrelations of Environmental Data
Tool: Explore Climate Data: Remove Highly Correlated Variables. ARCGIS STEP-BY-STEP GUIDE: 1. Double-click 'SDM Tools → Universal SDM Tools → Explore Climate Data → Remove Highly Correlated Variables'. 2. Continue to the tool interface instructions (below). NOTE: the tool will fail if the raster files are too large. This bug is associated with SciPy and is nothing I can fix. If you are really interested in performing this analysis, I suggest you reduce the spatial scale of the environmental rasters to 10-20 km² and run the tool; given the very high spatial autocorrelation in these layers, resolutions below this typically won't greatly affect the correlations. SDMTOOLBOX STEP-BY-STEP GUIDE: 1. Select all the clipped Worldclim data ('control+shift' will allow you to select all items in a folder). Layers that you wish to retain (vs. the other correlated layers) should be first in the list; all correlated layers that occur after them will be excluded. For interpreting the influence of environmental layers in the SDM, I prefer to place first the layers that depict metrics frequently used in non-SDM ecology and evolution studies [such as BIO1 = Annual Mean Temperature, BIO2 = Mean Diurnal Range (mean of monthly (max temp - min temp)), BIO12 = Annual Precipitation]. Further, for simplicity, these layers often best represent the original input climate data (as they directly reflect the actual measurements) and are not derived from several layers or a subset of the data. 2. Maximum correlation allowed. Multiple values can be input, separated by semicolons (';'). Input a value between 0 and 1: the absolute value of the correlation coefficient ranges from 0 to 1, where 1 implies that a linear equation describes the relationship between X and Y perfectly and 0 implies no linear correlation between the variables. 3. Input NoData value. Note this must be the same for all layers or the correlations will not be accurate.
Since we used only Worldclim here, this should not be an issues (as all values are the If you are interested interpreting how each input environmental variable contributes to your species distribution model, then you need to reduce autocorrelation of your input environmental data by removing highly correlated variables. It is widely known that many climate variables are highly correlated with each other. While including all these will not affect the predictive quality of your MaxEnt model, it does seriously limit any inference of the contribution of any correlated variables (i.e. the MaxEnt outputs from 'Analysis of variable contributions' and to some degree 'Jackknifing environmental variables'). This is mainly because when a model is built in MaxEnt, if a highly correlated variable is included in the model, this often excludes all other highly correlated variables from being incorporated. This is because these variables likely would contribute similarly to the models. Since they are not included, they will not be properly represented in the output 'Analysis of variable contributions'. same). To check NoData values, import layers into ArcGIS and right click the layer and select 'Properties' and then go to the 'Source' tab. Alternatively, you can simply open your ".asc" files in a text editor and at the top of header will be the NoData value. 4. Select output folder location. Output will be two tables with the correlation coefficients among all comparisons and a table with the final list of rasters to include in your model.
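The filtering logic of this tool can be sketched outside ArcGIS. The function below is a hypothetical stand-in, not SDMtoolbox's actual code: it walks an ordered list of layers (NoData cells already removed, each layer flattened to a 1-D array) and keeps a layer only if its absolute Pearson correlation with every already-retained layer is at or below the threshold. As in the tool, earlier layers in the list win over later correlated ones.

```python
# Hypothetical sketch of the correlation-filtering step. 'layers' is an
# ordered list of (name, 1-D numpy array) pairs with NoData values masked out.
import numpy as np

def remove_correlated(layers, max_r=0.9):
    kept = []
    for name, values in layers:
        # keep this layer only if it is not too correlated with any kept layer
        if all(abs(np.corrcoef(values, kv)[0, 1]) <= max_r for _, kv in kept):
            kept.append((name, values))
    return [name for name, _ in kept]

layers = [("bio1", np.array([1.0, 2, 3, 4, 5])),
          ("bio2", np.array([2.0, 4, 6, 8, 10])),   # perfectly correlated with bio1
          ("bio12", np.array([5.0, 1, 4, 2, 3]))]
print(remove_correlated(layers, 0.9))  # ['bio1', 'bio12']
```

Because bio2 is perfectly correlated with bio1 and listed after it, bio2 is dropped while bio1 and bio12 survive, mirroring the ordering rule in step 1 above.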

3A. Preparing Occurrence Data: Import Species Occurrence Records
Tool: CSV, TXT, XLS to shapefile

ARCGIS STEP-BY-STEP GUIDE:
1. Open a fresh ArcMap document
2. Import the CSV, TXT or XLS file with occurrence records.
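For reference, the import step can be mimicked outside ArcMap with a few lines of Python. The column names 'species', 'longitude' and 'latitude' below are assumptions, not a requirement of the tool; match them to your own file's header.

```python
# Minimal sketch: parse a CSV of occurrence records into (species, x, y)
# tuples. Column names are illustrative assumptions.
import csv, io

def read_occurrences(fileobj):
    reader = csv.DictReader(fileobj)
    return [(row["species"], float(row["longitude"]), float(row["latitude"]))
            for row in reader]

sample = io.StringIO("species,longitude,latitude\nRana_sp,-71.5,42.1\n")
print(read_occurrences(sample))  # [('Rana_sp', -71.5, 42.1)]
```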

Background Selection via Bias Files
A subset of python scripts create bias files used to fine-tune background and occurrence point selection in Maxent. Bias files control where background points are selected and the density of background sampling. Proper use of bias files can avoid sampling habitat greatly outside of a species' known occurrence, or can account for collection sampling biases in coordinate data.
Background points (and similar pseudo-absence points) are meant to be compared with the presence data and help differentiate the environmental conditions under which a species can potentially occur. Typically, background points are selected within a large rectilinear area; within this area there often exists habitat that is environmentally suitable but was never colonized. When background points are selected within these habitats, commission errors (false positives) increase. As a result, the 'best' performing model tends to be overfit, because the selection criterion favors a model that fails to predict the species in the uncolonized, climatically suitable habitat. To circumvent this problem, many researchers have begun using background point and pseudo-absence selection methods that are more regional. SDMtoolbox contains three tools to facilitate more sophisticated background selection for use in Maxent. The Sample by Distance from Obs. Pts. tool (see: SDM Tools → 2. MaxEnt Tools → Background Selection via Bias Files) uses a common method that samples backgrounds within a maximum radial distance of known occurrences (see Thuiller et al. 2009). The Sample by Buffered MCP tool restricts background selection to a buffered minimum convex polygon based on known occurrences (see following guide).
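The 'sample by distance' idea can be sketched in a few lines. This is a simplified, hypothetical version of the concept, not SDMtoolbox's implementation: it uses planar distances, so coordinates in decimal degrees would need geographic distances instead.

```python
# Sketch: retain candidate background points that fall within max_dist of
# at least one occurrence point (planar distances for simplicity).
import numpy as np

def background_within_distance(candidates, occurrences, max_dist):
    cand = np.asarray(candidates, float)   # shape (n, 2)
    occ = np.asarray(occurrences, float)   # shape (m, 2)
    # pairwise distances, shape (n, m)
    d = np.hypot(cand[:, None, 0] - occ[None, :, 0],
                 cand[:, None, 1] - occ[None, :, 1])
    return cand[(d <= max_dist).any(axis=1)]

kept = background_within_distance([(0, 0), (10, 10), (1, 0)], [(0, 0)], 2)
print(kept.tolist())  # [[0.0, 0.0], [1.0, 0.0]]
```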

I. Spatial Jackknifing
Spatial jackknifing (or geographically structured k-fold cross-validation) evaluates model performance using spatially segregated, spatially independent localities. SDMtoolbox automatically generates all the GIS files necessary to spatially jackknife your MaxEnt models. The script splits the landscape into 3-5 regions based on spatial clustering of occurrence points (e.g. if 3: A, B, C). Models are calibrated with k-1 spatial groups and then evaluated with the withheld group. For example, if k=3, models would be run with the following three subgroups:
1. Model is calibrated with localities and background points from region AB and then evaluated with points from region C
2. Model is calibrated with localities and background points from region AC and then evaluated with points from region B
3. Model is calibrated with localities and background points from region BC and then evaluated with points from region A
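The procedure above can be sketched conceptually: cluster the occurrence coordinates into k spatial groups, then iterate, training on k-1 groups and evaluating on the withheld one. SDMtoolbox derives its regions from spatial clustering of points; plain k-means stands in for that here, and none of the names below come from the toolbox itself.

```python
# Conceptual sketch of geographically structured k-fold CV.
import numpy as np

def kmeans_groups(points, k, iters=50, seed=0):
    pts = np.asarray(points, float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([pts[labels == i].mean(axis=0)
                            if (labels == i).any() else centers[i]
                            for i in range(k)])
    return labels

def jackknife_folds(points, k=3):
    labels = kmeans_groups(points, k)
    for held in range(k):
        train = [p for p, g in zip(points, labels) if g != held]
        test = [p for p, g in zip(points, labels) if g == held]
        yield train, test
```

Each yielded (train, test) pair corresponds to one calibration/evaluation run in the subgroup listing above.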

II. Independent Tests of Model Feature Classes and Regularization Parameters
Equally important, this tool allows testing different combinations of the five model feature class types (FC) and regularization multiplier (RM) values to optimize your MaxEnt model performance. For example, if a single RM value of 5 was input, this toolkit would run MaxEnt models for each species with each of the five feature class combinations at that RM.

III. Automatic Model Selection
Finally, the script chooses the best model by evaluating each model's: 1. omission rate (OR)*, 2. AUC**, and 3. model feature class complexity. It does this in order, first choosing the models with the lowest omission rate on the test data. If several models have an identically low OR, it then selects the model with the highest AUC. Lastly, if several models share the same low OR and high AUC, it chooses the model with the simplest feature class parameters, in the following order: 1. linear; 2. linear and quadratic; 3. hinge; 4. linear, quadratic, and hinge; and 5. linear, quadratic, hinge, product, and threshold. Once the best model is selected, SDMtoolbox runs the final model using all the occurrence points. If desired, at this stage models will be projected into other climates, environmental variables will be jackknifed to measure importance, and response curves will be created.
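The selection cascade just described (lowest OR, then highest AUC, then simplest feature class) can be written out as a small sketch. The candidate tuples and the complexity ranking below are illustrative, not SDMtoolbox internals; the FC ordering follows the list in the text.

```python
# Sketch of the automatic model selection cascade. Each candidate model is a
# (feature_class, omission_rate, auc) tuple.
FC_COMPLEXITY = {"L": 1, "LQ": 2, "H": 3, "LQH": 4, "LQHPT": 5}

def best_model(models):
    # sort key: OR ascending, AUC descending, feature complexity ascending
    return min(models, key=lambda m: (m[1], -m[2], FC_COMPLEXITY[m[0]]))

candidates = [("LQHPT", 0.05, 0.91), ("LQ", 0.05, 0.91), ("H", 0.08, 0.95)]
print(best_model(candidates))  # ('LQ', 0.05, 0.91)
```

Here the H model loses despite its higher AUC because its omission rate is higher, and the LQ model beats LQHPT on the complexity tie-break.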
Tool: Run MaxEnt: Spatially Jackknife

ARCGIS STEP-BY-STEP GUIDE:
1. Double-click the 'SDM Tools → 2. MaxEnt Tools → Modeling in MaxEnt → Run MaxEnt: Spatially Jackknife' tool
2. Continue to tool interface instructions (below)

SDMTOOLBOX STEP-BY-STEP GUIDE:
IMPORTANT NOTE 1: none of the input and output file names/file paths can have spaces in them. If there are any spaces, the output batch scripts will likely fail to work properly.

IMPORTANT NOTE 2:
Upon first use of this tool, due to its unique syntax, you need to specify the location of the menu file. If you do not do this, the menu presented by ArcGIS will not make complete sense. For a detailed overview of how to do this (it will take 30 seconds), go to: http://www.sdmtoolbox.org/menu-fix-spatial-jackknife

The default value is 1.
Remember, the more values input, the more SDMs will be created and the more computation time required. For each regularization multiplier (RM), 15-25 models will be run (5 feature class groups × 3-5 spatial jackknife groups: 5×3=15 to 5×5=25). This number is multiplied for each replicate and each species modeled. Thus, if you have 2 species, 5 RMs and 2 replicates, this would result in 300 or 500 models run for 3 and 5 spatial jackknife groups, respectively (2 species × 2 replicates × 5 RMs × 15-25 models per run).
19. Apply a threshold to make a binary model. This will generate a binary model in addition to the continuous model. If none is supplied, SDMtoolbox will use '10 percent minimum training presence' to calculate omission rates. If you prefer another threshold, please select it here.
20. Projection climate layers. Folders containing environmental data for projecting the MaxEnt models (often future or past climates). To select multiple folders at once, hold shift. The layers MUST match the input environmental layers, e.g. if Bio27 is used to build the model then each projection folder must contain an analogous variable with the identical name. Here, resolution and spatial extent do not need to match the input environmental layers.
21. Apply clamping when projecting.
22. If checked, will predict areas of climate space outside of the limits encountered during training.
23. Number of CPUs to use for modeling.
24. This will not display the MaxEnt GUI when running models (preferred).
25. Check this box to perform all the analyses described in the information window. If not checked, this will run the modeling as if executed from the MaxEnt GUI (no spatial jackknifing or independent evaluation of RM values or feature classes).
26. This disables the use of the threshold feature class in the fifth group of feature class comparisons. This is the default setting for the latest version of MaxEnt.
27. This is the minimum number of points required to execute spatial jackknifing.
If below this value, the models will be trained and evaluated using either cross-validation, bootstrapping or subsampling (as specified below in steps 30-32). Each non-spatially-jackknifed group is optimized with independent tests of different combinations of the five model feature class types (FC) and the input regularization multiplier (RM) values.
28. Replicates of each model parameter class in spatial jackknife runs.
29. Number of groups to subdivide the landscape into. The higher the number, the more models run, but also the more training points included in each model run.
30. If selected: groups will be spatially segregated and the numbers of occurrences within groups may not be equal. This analysis is more focused on natural spatial groups. This method is best if projecting models into other climates (e.g. future or past) and is particularly useful for training and evaluating model performance in potentially non-analogous climates. If not selected: spatial jackknife groups will be assigned spatially at random and the numbers of occurrences within groups will be equal (±1, due to unequal group sizes for some combinations of occurrence records and group numbers).
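The model-count arithmetic from the regularization multiplier note above can be written out as a small helper (the function and parameter names are illustrative, not part of the toolbox):

```python
# Total MaxEnt runs = species × replicates × RM values × feature class
# groups × spatial jackknife groups, as described in the note above.
def total_models(species, replicates, rms, k_groups, fc_groups=5):
    return species * replicates * rms * fc_groups * k_groups

print(total_models(2, 2, 5, 3))  # 300
print(total_models(2, 2, 5, 5))  # 500
```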

Frequently asked questions and misconceptions regarding SDMtoolbox & Maxent:
Should I remove co-correlated variables?
As stated by Elith in her seminal paper, A statistical explanation of MaxEnt: "MaxEnt has an inbuilt method for regularization… that is reliable and known to perform well (Hastie et al., 2009). It implicitly deals with feature selection (relegating some coefficients to zero) and is unlikely to be improved, and more likely degraded, by procedures that use other modelling methods to preselect variables (e.g., Wollan et al., 2008). In particular, it is more stable in the face of correlated variables than stepwise regression, so there is less need to remove correlated variables (unless some of them are known to be ecologically irrelevant), or preprocess covariates by using PCA and selecting a few dominant axes."

If variables are highly co-correlated, their mathematical relationships with the training values might be quite similar, and the MaxEnt algorithm can potentially view them as equally good at characterizing the spatial patterns observed in the training values. This does not result in overfitting. The two instances where co-correlated variables are of concern in MaxEnt models are: (1) when the primary goal of the study is to understand the explicit role of each environmental variable included in the model, and (2) when projecting to other climates/landscapes with non-analogous climates. In the first case, given that several variables may equally explain the spatial patterns in the training data, the final 'important' variable will be selected randomly from that sub-group of highly correlated variables. This then gives the appearance that the others are not important. However, they are merely redundant, not unimportant. In the case of projecting to non-analogous climates, the addition of variables increases the likelihood of non-analogous climates (NAC).
Further, if highly correlated variables are included that have NAC, this can cause the MESS plots (how NACs are assessed in Maxent) to display a NAC for each co-correlated layer, suggesting that inferences based on these areas should be regarded with extreme care (more so than would be with a single uncorrelated layer).

Why use SDMtoolbox? How exactly does SDMtoolbox address model parameterization, discriminatory ability & overfitting? Why is SDMtoolbox awesome?
Low overfitting and high discriminatory ability are two prime desired qualities of niche models. Overfitting is the tendency of a model to fit the random error (or any bias in the sample) rather than the true relationship between the calibration records and predictor variables. Often, overfit models predict the calibration data very well but perform poorly on other data sets.
Overfitting is typically assessed with the false negative rate, also called the omission error rate (OR henceforth). With an appropriately selected threshold converting a continuous prediction into a binary one, the OR indicates the proportion of presences incorrectly classified as falling into unsuitable areas (basically because the prediction is too tightly fit to the conditions at calibration localities; Anderson et al. 2003). The top model output from SDMtoolbox is the one with the lowest OR. AUC values are then used to assess performance post hoc, independent of model tuning; you should report both of these values in your results section.

Model complexity depends largely on the feature classes allowed during calibration. For instance, a model built with L features is less complex than one built with L and Q features. Hinge features model a piece-wise linear response to the environmental variable. This allows parts of the response curve to be defined by a linear relationship while other parts are defined by a more complex, non-linear relationship (Phillips and Dudík, 2008). Thus, L features represent a special (restrictive) case of H features and result in less complex models (Phillips and Dudík, 2008). Note that even if multiple feature classes are allowed for model building, not all classes will necessarily be incorporated in the final model. The default MaxEnt setting for feature classes, called "auto features," applies the class or classes estimated to be appropriate for the particular sample size of occurrence records, according to a previous extensive tuning experiment (Phillips and Dudík, 2008). Phillips and Dudík (2008) selected the following feature classes for continuous variables as defaults for the corresponding sample sizes: all feature classes for at least 80 occurrence records; L, Q and H for 15 to 79 records; L and Q for 10 to 14 records; and only L for fewer than 10 records.
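The omission rate calculation described above is simple enough to state directly. This sketch (not SDMtoolbox code) assumes you already have the model's continuous suitability score at each evaluation presence and a chosen threshold:

```python
# Omission rate: the fraction of evaluation presences whose predicted
# suitability falls below the binarizing threshold (i.e. presences the
# binary model classifies as unsuitable).
def omission_rate(suitability_at_presences, threshold):
    omitted = sum(1 for s in suitability_at_presences if s < threshold)
    return omitted / len(suitability_at_presences)

print(omission_rate([0.8, 0.6, 0.2, 0.9], 0.5))  # 0.25
```

With the '10 percent minimum training presence' style thresholds used by SDMtoolbox, the threshold would itself be derived from the training presences before being applied here.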

Using independent tests of model feature classes
While using complex feature settings allows Maxent to produce a model that is more sensitive to the details of a species' environmental tolerances, complex feature classes can lead to overfit models. Using SDMtoolbox, we repeat the methods of Shcheglovitova and Anderson (2013) to minimize model overfitting.

Carefully control background selection using a bias files
Bias files can control where background points are selected, thereby avoiding habitats greatly outside of a species' known occurrence. Background points are meant to be compared with presence data to help identify the environmental conditions under which a species can potentially occur. Typically, background points are selected within a large rectilinear area. Within such areas, environmentally suitable but uncolonized or biogeographically isolated habitat often exists. The selection of background points within these habitats increases commission errors (false positives). As a result, the 'best' performing model tends to be overfit, because the selection criterion favors a model that fails to predict the species in the uncolonized, climatically suitable habitat.

Spatially rarefy/filter input data
To perform well, most SDM methods require input occurrence data to be spatially independent. However, researchers often introduce environmental biases into their SDMs from spatially autocorrelated occurrence points. It is important to eliminate spatial clusters of localities for model calibration and evaluation. When spatial clusters of localities exist, models are often overfit towards environmental biases (reducing the model's ability to predict spatially independent data) and model performance values are inflated (Hijmans et al. 2012). This can be done in SDMtoolbox using the spatially rarefy occurrence data tool.
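One common way to rarefy points, sketched here with planar distances and a simple greedy rule (an illustration of the idea, not SDMtoolbox's implementation, which operates on projected GIS data):

```python
# Greedy spatial rarefaction: keep a point only if it lies at least
# min_dist from every point already kept.
import math

def rarefy(points, min_dist):
    kept = []
    for p in points:
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept

print(rarefy([(0, 0), (0.1, 0), (5, 5), (5, 5.2)], 1.0))  # [(0, 0), (5, 5)]
```

The two tight pairs collapse to one representative each, removing the spatial clusters that would otherwise inflate evaluation scores.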

Summary of some basic considerations when generating SDMs
Data compilation (occurrence records): Are species presence (and absence) records representative of the actual distribution?
▪ effects of species' natural history
▪ geographic/environmental bias
▪ intraspecific variability
▪ positional uncertainty
▪ sample size
▪ sampling bias (e.g. towards more accessible areas)
▪ taxonomic accuracy (e.g. subspecies or races)
▪ temporal coverage in relation to environmental data

Data compilation (environmental variables): Do environmental variables accurately capture the association between species subsistence and the environment at the relevant scale?
▪ data quality and biases
▪ effect on species distribution (direct vs. indirect)
▪ resolution in space and time
▪ spatial autocorrelation
▪ spatial extent
▪ temporal coverage and stability
▪ type (categorical vs. continuous)

Model generation and calibration
Is the modelling algorithm appropriate given the data available and research question?
▪ algorithm assumptions
▪ algorithm performance under different scenarios
▪ input data type (e.g. presence-only vs. presence/absence)
▪ output generated (e.g. presence/absence vs. continuous prediction)
▪ sensitivity to model parameters

Model generation and calibration
Is the model appropriately calibrated for the data available and research question?
▪ model complexity
▪ model selection procedure
▪ setting of model parameters
▪ variable selection strategy

Model validation
Is validation performed on truly independent data and under appropriate settings?
▪ assumptions/limitations of the accuracy measurement
▪ importance of using multiple metrics
▪ sensitivity to model parameters
▪ threshold transformation of continuous predictions

Model projection
Is the species-environment relationship likely to be maintained in space and/or time?
▪ availability of validation data in projected regions
▪ likelihood of niche shifts
▪ model uncertainty
▪ model transferability
▪ risks of interpolation and extrapolation