The Mahalanobis distance (Mahalanobis, 1936) is a statistical technique that can be used to measure how distant a point is from the centre of a multivariate normal distribution. Consider a data matrix A with m rows of observations and n columns of measured variables.
The Mahalanobis distance D2 for each observation vector xm = [xm1, xm2, xm3, …, xmn] is calculated as a function of an n-dimensional vector containing the means for each column of variables, and a variance–covariance matrix S of dimensions n × n that contains variances for each column along the main diagonal and pair-wise column covariances values elsewhere (Manly, 2005). (1)
When applied to m = 50 points in n = 2 dimensions, the calculated D2 values follow a characteristic elliptical pattern with D2 radiating out from the central location of the distribution (Fig. 1A).
While D2 can be calculated for any n-dimensions, the values of D2 are not comparable when n varies, as D2 increases as n increases (Figs. 2A–2C). However, the and S−1 used to calculate D2 (Eq. (1)) transform the values from each n into independent standard normal distributions, with centering and S−1 scaling and rotating each variable distribution. This means that D2 is essentially the sum of n independent standard normal variables, and as such follows a chi-squared distribution with degrees of freedom equal to the number of dimensions n (Manly, 2005).
For a chi-squared random variable with n degrees of freedom denoted as , the probability density function f of when x ≥ 0 is (2)
where Γ is a gamma function (as the chi- squared distribution is actually a special case of the gamma distribution), and the associated cumulative distribution function is (Johnson, Kotz & Balakrishnan, 1994). When the variables in A are normally distributed, the association between D2 and can be clearly seen (Figs. 2A–2C).
By converting D2 into probabilities using we can put D2 from any number of dimensions on a common 0–1 scale that indicates the probability that a location has a D2 that is greater than that would be expected by chance (Figs. 2D–2F). By specifying a significance level α this process is commonly used as an outlier detection method as it is: parameter free, computational efficient, accounts for collinearity between variable dimensions, and is scale independent (Aggarwal, 2017). Returning to our earlier example, having transformed the values from D2 to we can see that there are two points that are very likely to be outliers, and would be classified as outliers with α = 0.975 (Fig. 1B).
The potential of D2 for use in habitat modelling was first identified by Clark, Dunn & Smith (1993) and then discussed further in the context of niche modelling by Farber & Kadmon (2003)—which is how I will continue the discussion. The premise here is that given a data matrix A of m species observations for which various environmental variables n are measured, D2 can be used as a measure of niche suitability from an optimum location in environmental space. Having defined a niche in this way, by measuring the D2 for each location on a landscape a map of niche suitability can then be produced. The key advantages of using D2 over other methods are that the D2 method needs only presence information, and so does not require either absences or a background definition, and that independence of explanatory variables is not required (Clark, Dunn & Smith, 1993; Farber & Kadmon, 2003). Studies that have compared Mahalanobis distance to other modelling approaches have also shown that while the optimum method tends to vary by the species in question, the Mahalanobis distance approach performs well against a variety of other presence-only, presence-background, and presence-absence modelling approaches (Dettmers, Buehler & Bartlett, 2002; Johnson & Gillingham, 2005; Tsoar et al., 2007).
While introducing D2 into the ecological modelling domain Clark, Dunn & Smith (1993) also highlighted that D2 will follow a chi-squared distribution, and therefore the potential to convert the distances into probabilities. However, as ecological niche models aim to describe probability of belonging to the niche, rather than use the chi-squared cumulative distribution function that has 0 at the optimum, we use the inverse chi-squared cumulative distribution function that has 1 at the optimum. This means that the probabilities indicate locations with a D2 that is less than that would be expected by chance, and hence are more likely to be within the niche.
Unfortunately, when describing the use of a chi-squared distribution to convert D2 into probabilities, Clark, Dunn & Smith (1993) p.522 state that “Assuming multivariate normality, Mahalanobis distances are approximately distributed as Chi-square with n − 1 degrees of freedom, where n equals the number of habitat characters.”, but this is incorrect. D2 values follow a chi-squared distribution with degrees of freedom equal to n (Manly, 2005) as has already been clearly shown (Fig. 2). This error has been repeatedly described in the literature (Knick & Rotenberry, 1998; Farber & Kadmon, 2003; Hellgren et al., 2007), and has even permeated into software such as the R package adehabitat (Calenge, 2006).
To demonstrate this error and to examine its implications, I present an experiment based on a virtual ecology approach (Zurell et al., 2010)—which allows us to examine methodologies in a controlled system uncomplicated by the uncertainties of the real world!
Materials and Methods
The virtual ecology experiment began by defining, and therefore knowing truthfully, the fundamental niche of the imaginary species Mimbulus mimbletonia (Rowling, 2003). This fundamental niche N was defined using a multivariate normal distribution describing the niche in relation to two environmental variables of temperature and rainfall. (3)
A sample of observations was then created by randomly sampling the niche space ranging 15–35 °C of temperature and 0–200 mm of rainfall. At each randomly selected location the species was considered detected using a probability equal to the fundamental niche. This process was continued until a sample of 200 observations was generated.
The values of temperature and rainfall at each of the 200 sampled locations were then used to estimate the fundamental niche using Mahalanobis distances. The D2 was calculated for each of the sampled locations, and the resulting distribution of D2 was compared against a range of inverse cumulative chi-squared distribution functions based on differing degrees of freedom.
Finally, the D2 values for the samples were converted into probabilities of belonging to the fundamental niche, and these predictions were compared against the known fundamental niche values for the locations. Probabilities were calculated using chi-squared distributions with n and n − 1 degrees of freedom to examine any differences.
The fundamental niche defined using the multivariate normal distribution (Eq. (3)) produced an elliptically shape niche with positive correlation between rainfall and temperature (Fig. 3A). The random sampling resulted in a set of 200 samples that followed this elliptical niche pattern, with a greater concentration of samples towards the centre of the niche (Fig. 3B).
The D2 values calculated on the basis of these samples also followed an elliptical pattern (Fig. 3C), and when the calculated D2 values for each sample were plotted against the actual fundamental niche value for each sample, a trend that clearly follows the inverse cumulative chi-squared distribution when n = 2 can be seen (Fig. 3D). This is as we would expect as in this example the fundamental niche is based on the two environmental variables of temperature and rainfall.
When the D2 values were converted to probabilities using n = 2 degrees of freedom we see a near-perfect linear fit between the estimated niche suitability and the actual known niche suitability (Fig. 3E). This proves quite clearly that to get a truthful estimate of our known fundamental niche, the probabilities need to be based on an inverse cumulative chi-squared distribution with n degrees of freedom. In contrast, when the D2 values are converted to probabilities using n − 1 degrees of freedom we see a badly-fitting curvilinear relationship that underestimates niche suitability (Fig. 3F).
The results of the virtual ecology experiment (Fig. 3) clearly shows the erroneous under prediction of niche suitability when using a chi-squared distribution with n − 1 degrees of freedom. Fortunately most previous D2 studies simply rescaled the D2 into quantiles or ranks of increasing suitability (Knick & Dyer, 1997; Knick & Rotenberry, 1998; Johnson & Gillingham, 2005; Hellgren et al., 2007; Etherington et al., 2009) or binary classifications based on a threshold (Farber & Kadmon, 2003; Thatcher, Van Manen & Clark, 2006; Tsoar et al., 2007) and so are not affected by this problem. Also, although I could not find any examples of this, those studies that did create chi-square probabilities using n − 1 degrees of freedom, but then converted to categories based on quantiles or a predictive threshold would not have impacted the conclusions of the study as while the shape of the relationship becomes curved with n − 1, the trend is still one of monotonic increase, therefore the quantiles would be the same. However, those studies that have created chi-squared probabilities with n − 1 as some form of suitability index (Clark, Dunn & Smith, 1993) will have underestimated the niche suitability.
Given that D2 values are unitless and unbounded and are not directly comparable for different dimensions, I would argue that anyone using the Mahalanobis distance method should present their results as chi-squared probabilities. This will put Mahalanobis distance models on a 0–1 scale that enables models based on differing numbers of n to be directly comparable, and is consistent with most other types of ecological niche models that also use a 0–1 scale. As such, it will be very important that the chi-square probabilities are calculated correctly and hopefully the methodological description and experimental evidence presented here will enable that to be achieved.