Discrete natural neighbour interpolation with uncertainty using cross-validation error-distance fields

PeerJ Computer Science

Introduction

Spatially continuous geographic phenomena are often only measured at point locations. Interpolation techniques provide a method to convert such point data into a continuous estimate of the phenomenon, and have become a fundamental computational technique of spatial and geographical analysts with key texts devoting large sections to interpolation methods (Burrough & McDonnell, 1998; O’Sullivan & Unwin, 2010; Slocum et al., 2014).

Natural neighbour (or Sibson) interpolation is an interpolation technique that was first presented by Sibson (1981). The method is based upon a Voronoi (or: Dirichlet, Thiessen) diagram that partitions space to identify those areas that are closest to a set of points (Okabe et al., 2000). Previous authors (Sambridge, Braun & McQueen, 1995; Watson, 1999) have noted several useful properties of natural neighbour interpolation: (i) the method is an exact interpolator, in that the original data values are retained at the reference data points; (ii) the method creates a smooth surface free from any discontinuities; (iii) the method is entirely local, as it is based on a minimal subset of data locations that excludes locations that, while close, are more distant than another location in a similar direction; and (iv) the method is spatially adaptive, automatically adapting to local variation in data density or spatial arrangement. To this list I would add: (v) there is no requirement to make statistical assumptions; (vi) the method can be applied to very small datasets as it is not statistically based; and (vii) the method is parameter free, so no input parameters that will affect the success of the interpolation need to be specified.

These properties make natural neighbour interpolation particularly well suited for the interpolation of continuous geographic phenomena from data points that have a highly irregular spatial distribution. While the choice of an appropriate interpolation method will always vary on a case by case basis, studies comparing interpolation methodologies with climate and land surface data demonstrate that natural neighbour interpolation is a highly competitive and sometimes optimal technique (Abramov & McEwan, 2004; Bater & Coops, 2009; Hofstra et al., 2008; Lyra et al., 2018; Yilmaz, 2007).

Unfortunately, natural neighbour interpolation can be relatively slow in comparison to other methods (Abramov & McEwan, 2004). The high computational cost arises from the need to insert a new point into the Voronoi diagram for every cell that will make up the interpolation field, and this geometric process becomes increasingly difficult in higher dimensions (Park et al., 2006). This has led to the development of discrete (or digital) natural neighbour interpolation that is significantly quicker than traditional approaches (Park et al., 2006) and has been applied successfully in a geographical context (Keller et al., 2015).

While natural neighbour interpolation has various useful properties, and the discrete form is computationally scalable, there is a great deal of uncertainty associated with any interpolation. Therefore, being able to associate interpolation estimates with some form of uncertainty would be highly desirable. Previous efforts for natural neighbour interpolation have been based upon fitting statistical uncertainty models (Bater & Coops, 2009; Ghosh, Gelfand & Mølhave, 2012), but this approach is contrary to natural neighbour interpolation’s useful properties (v), (vi), and (vii). Therefore, for those researchers who decide that for their data and objectives natural neighbour interpolation is the best interpolation option, I present an approach to associate the interpolation with a measure of uncertainty that is consistent with all the useful properties of natural neighbour interpolation.

Materials & Methods

Discrete natural neighbour interpolation

In the 2-dimensional planar context that is most relevant to geographical applications, discrete natural neighbour interpolation begins by calculating a discrete Voronoi diagram. First, a raster spatial domain $C$ of cells $c$ is defined such that $c \in C \subset \mathbb{R}^2$, and hence each $c$ has coordinate attributes $x, y$ for its centre, so all $c_i = \{x_i, y_i\}$.

The data points are then used to define a set $P$ of $n$ data cells $P = \{p_1, p_2, p_3, \ldots, p_n\}$ where $P \subset C$, and each data cell has coordinate attributes $x, y$ for its cell centre and a value $z$, so $p_i = \{x_i, y_i, z_i\}$. When multiple data points occur within a raster cell, the resulting data cell has a value $z$ that is the mean of all the data point values.

The discrete Voronoi polygon $V(p_i)$ that contains all the cells that are closest to each data cell can then be defined as $V(p_i) = \{c \in C \mid d(c \rightarrow p_i) < d(c \rightarrow p_j)\ \forall j \neq i\}$, where $d(c \rightarrow p)$ is the Euclidean distance between the centres of the cells $c$ and $p$. When $c$ is equally distant from more than one $p$, for convenience $c$ is assigned to the $p$ with the smallest index. The set of $n$ discrete Voronoi polygons then creates the discrete Voronoi diagram $V(P) = \{V(p_1), V(p_2), V(p_3), \ldots, V(p_n)\}$ that identifies which raster cells are closest to which data cells (Fig. 1A) (Okabe et al., 2000). In the process of calculating $V(P)$, another set $D(P \rightarrow C)$ is created that records the Euclidean distance from each raster cell in $C$ to its closest data cell in $P$ (Fig. 1B). As each data cell $p_i$ has an associated value $z_i$, $V(P)$ can be used to interpolate the data cell values across the raster to produce $Z(P)$, which in a geographic information system (GIS) context is equivalent to nearest neighbour interpolation (Burrough & McDonnell, 1998; Tomlin, 1990) (Fig. 1C).
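As a minimal sketch of this step, assuming a brute-force NumPy approach rather than whatever the supplementary code actually does, the three rasters V(P), D(P → C) and Z(P) can be derived from one stacked distance array. The function name `discrete_voronoi` and the example data cells are hypothetical. `argmin` returns the first minimum along an axis, which happens to match the rule that ties go to the data cell with the smallest index.

```python
import numpy as np

def discrete_voronoi(shape, rows, cols, z):
    """Return V(P), D(P -> C) and Z(P) for data cells at (rows, cols)."""
    rr, cc = np.indices(shape)  # row and column index of every raster cell
    # Euclidean distance from each data cell to every raster cell,
    # stacked into an array of shape (n, n_rows, n_cols).
    d = np.hypot(rr[None] - rows[:, None, None],
                 cc[None] - cols[:, None, None])
    V = d.argmin(axis=0)  # index of the closest data cell; ties go to the smallest index
    D = d.min(axis=0)     # distance from each raster cell to its closest data cell
    Z = z[V]              # nearest neighbour interpolation of the data cell values
    return V, D, Z

# Hypothetical example: three data cells on a 6 x 7 raster.
rows = np.array([1, 1, 4])
cols = np.array([1, 5, 3])
z = np.array([10.0, 20.0, 30.0])
V, D, Z = discrete_voronoi((6, 7), rows, cols, z)
print(V)
```

The stacked array needs memory proportional to n times the raster size, so this form is only sensible for small illustrations.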

Figure 1: Discrete natural neighbour interpolation.

(A) For a set $P$ of $n$ data cells $p$ the discrete Voronoi diagram $V(P)$ defines which raster cells are closest to which data cells and (B) $D(P \rightarrow C)$ gives the distance to the closest data cell. (C) $V(P)$ is used to interpolate the values $z$ of the data cells to produce $Z(P)$. (D) For an interpolation cell $c_i$ the distance to all raster cells $C$ is calculated as $D(c_i \rightarrow C)$, and (E) comparing $D(c_i \rightarrow C)$ to $D(P \rightarrow C)$ identifies $Z(c_i)$, which are those cells of $Z(P)$ that are as close or closer to $c_i$ than to any data cell $p$. The mean value of $Z(c_i)$ is the natural neighbour interpolation estimate $\hat{z}$ for $c_i$, and (F) repeating this process for all raster cells produces the natural neighbour interpolation.

To interpolate the data cell values using natural neighbour interpolation, the set of Euclidean distances from an interpolation cell $c_i$ to all raster cells $D(c_i \rightarrow C)$ is calculated (Fig. 1D). Then the discrete Voronoi polygon for the interpolation cell $V(c_i)$ is defined as $V(c_i) = \{c \in C \mid D(c_i \rightarrow C) \leq D(P \rightarrow C)\}$, that is, the set of raster cells that are as close or closer to the interpolation cell than to any data cell. The set $V(c_i)$ can then be used to find the set of relevant data cell values $Z(c_i) = \{c \in Z(P) \mid c \in V(c_i)\}$ that will form the basis of the interpolation to that cell (Fig. 1E). The natural neighbour interpolation estimate $\hat{z}$ is then calculated as $\hat{z}(c_i) = \frac{\sum Z(c_i)}{\# Z(c_i)}$, where $\sum Z(c_i)$ is the sum of the cell values in $Z(c_i)$ and $\# Z(c_i)$ is the number of cells in the set $Z(c_i)$; hence $\hat{z}(c_i)$ is simply the mean of $Z(c_i)$. By calculating $\hat{z}(c_i)$ for all raster cells the natural neighbour interpolation is produced (Fig. 1F).
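The per-cell procedure can be sketched as a plain loop over the raster. This is an illustrative, unoptimised version written for this text: the function name `natural_neighbour_interpolation` and the example data are assumptions, and faster discrete algorithms such as that of Park et al. (2006) organise the computation differently.

```python
import numpy as np

def natural_neighbour_interpolation(shape, rows, cols, z):
    """Discrete natural neighbour estimate for every raster cell."""
    rr, cc = np.indices(shape)
    d = np.hypot(rr[None] - rows[:, None, None],
                 cc[None] - cols[:, None, None])
    D_PC = d.min(axis=0)        # D(P -> C): distance to the closest data cell
    Z_P = z[d.argmin(axis=0)]   # Z(P): nearest neighbour interpolation
    z_hat = np.empty(shape)
    for i, j in np.ndindex(*shape):
        D_ci = np.hypot(rr - i, cc - j)  # D(c_i -> C)
        V_ci = D_ci <= D_PC              # cells as close or closer to c_i than to any data cell
        z_hat[i, j] = Z_P[V_ci].mean()   # mean of Z(c_i); V_ci always contains c_i itself
    return z_hat

# Hypothetical example: three data cells on a 6 x 7 raster.
rows = np.array([1, 1, 4])
cols = np.array([1, 5, 3])
z = np.array([10.0, 20.0, 30.0])
z_hat = natural_neighbour_interpolation((6, 7), rows, cols, z)
print(np.round(z_hat, 1))
```

Because each estimate is a mean of data cell values, every value in `z_hat` lies between the smallest and largest data value.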

Calculating uncertainty

Cross-validation error

Global error estimation is a traditional approach to measure the uncertainty of geographic models (Zhang & Goodchild, 2002). Given a set of $n$ paired observed $o$ and modelled $m$ values, the absolute error $e_i$ for each pair is $e_i = |m_i - o_i|$, and a global estimate of error using a method such as the mean absolute error (MAE) is calculated as $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} e_i$ (Eq. 6), which is simply the mean of all the absolute errors (Willmott & Matsuura, 2005).

However, there is little point in doing this for the data cells of natural neighbour interpolation: given property (i), that it is an exact interpolator, the estimated value $\hat{z}_i$ for the data cells will always be the same as the actual value $z_i$, so the absolute errors will always be zero. Therefore, MAE needs to be applied in conjunction with a cross-validation approach that iteratively withholds each data cell $p_i$ from the set of data cells $P$ to produce the set $\{P - p_i\}$, and then uses interpolation to estimate the value $\hat{z}_i$ at the withheld data cell $p_i$ on the basis of a discrete Voronoi diagram $V(\{P - p_i\})$ that is developed without the withheld data cell. The absolute error $e_i$ for each data cell $p_i$ is then calculated as $e_i = |\hat{z}_i - z_i|$ and the cross-validation MAE can be calculated using Eq. (6).
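The leave-one-out procedure can be sketched as follows. The helper names `nn_estimate` and `cross_validation_errors` are hypothetical, and the sketch assumes that every data cell occupies a distinct raster cell and that there are at least two data cells.

```python
import numpy as np

def nn_estimate(shape, rows, cols, z, target):
    """Discrete natural neighbour estimate at a single target cell (row, col)."""
    rr, cc = np.indices(shape)
    d = np.hypot(rr[None] - rows[:, None, None],
                 cc[None] - cols[:, None, None])
    D_PC, Z_P = d.min(axis=0), z[d.argmin(axis=0)]
    V_ci = np.hypot(rr - target[0], cc - target[1]) <= D_PC
    return Z_P[V_ci].mean()

def cross_validation_errors(shape, rows, cols, z):
    """Absolute error at each data cell when it is withheld in turn."""
    n = len(z)
    e = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # the set {P - p_i}
        z_hat_i = nn_estimate(shape, rows[keep], cols[keep], z[keep],
                              (rows[i], cols[i]))
        e[i] = abs(z_hat_i - z[i])
    return e

# Hypothetical example: three data cells along a 1 x 5 raster.
rows = np.array([0, 0, 0])
cols = np.array([0, 2, 4])
z = np.array([0.0, 10.0, 4.0])
e = cross_validation_errors((1, 5), rows, cols, z)
mae = e.mean()  # cross-validation MAE
print(e, mae)
```

With so few data cells the withheld estimates are poor, which is why the errors in this toy example are large relative to the data values.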

Even with cross-validation, the MAE, like all global error estimates such as the commonly used root-mean-square error (RMSE), is not an ideal measure of uncertainty for a spatial interpolation (Zhang & Goodchild, 2002). As non-spatial methods that average errors across space, they cannot indicate if errors are consistent across space or if higher errors in one region are balanced out by lower errors in another region. This is a critical limitation of global error estimation methods, as for application purposes it could be very useful to know where the interpolation uncertainty is higher or lower.

Cross-validation error field

One way to communicate the spatial uncertainty of geographical information is to map estimates of error (Zhang & Goodchild, 2002). This has been attempted before for natural neighbour interpolation (Bater & Coops, 2009; Ghosh, Gelfand & Mølhave, 2012), but as already noted these statistical modelling approaches are contrary to natural neighbour interpolation’s useful properties (v), (vi), and (vii).

Another way to map estimates of error that is consistent with the properties of natural neighbour interpolation is the cross-validation error field (Willmott & Matsuura, 2006). This process begins in a similar manner to the cross-validation MAE, but once $e$ has been calculated for each data cell, rather than averaging the errors using Eq. (6), the errors are assumed to be spatially autocorrelated and interpolation is used to estimate an absolute error field $\hat{e}$ from $e$. This use of localised absolute errors is highly advantageous as it is consistent with property (iii) of natural neighbour interpolation and allows for error estimates to reflect local changes in the spatial-autocorrelation of the phenomenon being interpolated, with lower errors in more autocorrelated areas and higher errors in less autocorrelated areas.

However, while the cross-validation error field does indicate where interpolation errors are likely to be higher, it cannot be used directly as a measure of uncertainty for natural neighbour interpolation. Ultimately the interpolation is calculated using all n data cells, and given property (i), that natural neighbour interpolation is an exact interpolator, we know there will be zero error and hence zero uncertainty at the data cells.

On the basis of Tobler’s first law of geography that “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970), Zhang & Goodchild (2002) recognise that distance is an important component of uncertainty as locations nearer to data should have less uncertainty. This relationship of increasing error with increasing distance to data has even been demonstrated for natural neighbour interpolation (Keller et al., 2015). Therefore, I propose to extend the cross-validation error field idea by incorporating distance to produce a cross-validation error-distance field that will better represent the uncertainty associated with natural neighbour interpolation.

Natural neighbour distances

A positive relationship between natural neighbour interpolation absolute errors and the minimum distance to a data cell has been shown (Keller et al., 2015), so this relationship could be used to predict absolute error as a function of distance from the nearest data point. However, minimum distance to a data cell is a simplistic metric that does not account for the number and spatial configuration of the data cells (Keller et al., 2015). In addition, using the minimum distance from data cells D(P → C) produces a field that has discontinuities along the edges of the discrete Voronoi polygons (Fig. 1B), which is contrary to property (ii) of natural neighbour interpolation, that the method creates surfaces free of any discontinuities. Therefore, the natural neighbour distance δ is presented as a more appropriate measure of distance that incorporates information about the number, spatial distances, and relative positions of the data cells forming the interpolation.

The method to calculate δ follows a very similar approach to that of calculating the interpolation, and therefore recycles various data structures that are used for the interpolation. For each interpolation cell ci the Euclidean distances to all data cells are calculated dj = d(ci → pj), and then using the Voronoi diagram V(P) these distances are interpolated via nearest neighbour interpolation to produce D(P) that is the distance to the data cells mapped into the discrete Voronoi polygons (Fig. 2A).

Figure 2: Computation of the natural neighbour distance.
(A) For an interpolation raster cell ci the Euclidean distance to all data cells dj is calculated, and the discrete Voronoi diagram V(P) is used to produce D(P) that interpolates the distances by the discrete Voronoi polygons. (B) the cells of D(P) that are closer to ci than any data cell defines the set D(ci) and the mean value of this set gives the natural neighbour distance δ for ci. (C) When repeated for all raster cells a natural neighbour distance field is produced.

The set $V(c_i)$ can be used again to find the set of relevant data cell distances $D(c_i) = \{c \in D(P) \mid c \in V(c_i)\}$ that will form the basis of the interpolation to that cell (Fig. 2B). The natural neighbour distance is then calculated as $\delta(c_i) = \frac{\sum D(c_i)}{\# D(c_i)}$, that is simply the mean value of the distances for the cells in $D(c_i)$. With $\delta$ calculated for all raster cells it becomes evident that unlike minimum distance that contains spatial discontinuities (Fig. 1B) the natural neighbour distance creates a smooth surface free of any discontinuities (Fig. 2C). Also, the minimum distance is an optimistic measure of distance as it only accounts for the closest data cell, whereas by comparison the distances for $\delta$ are larger as they recognise that the other data cells involved in the interpolation are further away.
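A minimal sketch of the natural neighbour distance, reusing the same Voronoi rasters as the interpolation. The function name `natural_neighbour_distance` and the example data are assumptions made for illustration.

```python
import numpy as np

def natural_neighbour_distance(shape, rows, cols):
    """Return the natural neighbour distance field and the minimum distance field."""
    rr, cc = np.indices(shape)
    d = np.hypot(rr[None] - rows[:, None, None],
                 cc[None] - cols[:, None, None])
    D_PC = d.min(axis=0)     # minimum distance to a data cell, D(P -> C)
    V_P = d.argmin(axis=0)   # discrete Voronoi diagram V(P)
    delta = np.empty(shape)
    for i, j in np.ndindex(*shape):
        d_j = np.hypot(rows - i, cols - j)  # distance from c_i to every data cell
        D_P = d_j[V_P]                      # distances mapped into the Voronoi polygons
        V_ci = np.hypot(rr - i, cc - j) <= D_PC
        delta[i, j] = D_P[V_ci].mean()      # mean of D(c_i)
    return delta, D_PC

# Hypothetical example: three data cells on a 6 x 7 raster.
rows = np.array([1, 1, 4])
cols = np.array([1, 5, 3])
delta, D_PC = natural_neighbour_distance((6, 7), rows, cols)
print(np.round(delta - D_PC, 2))  # how much larger delta is than the minimum distance
```

Every value averaged for a cell is a distance from that cell to some data cell, so none can be smaller than the minimum distance. The printed difference is therefore never negative, in line with the description of minimum distance as the more optimistic measure.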

Cross-validation error-distance field

To incorporate $\delta$ into the estimate of error to produce a cross-validation error-distance field, the first step is still a cross-validation process in which each data cell is iteratively withheld and an estimate of the value of the withheld data cell is made with the remaining $n - 1$ data cells. However, the absolute error $e_i = |\hat{z}_i - z_i|$ is now divided by the natural neighbour distance $\delta_i$ to calculate a rate of error $r$ for each data cell, $r_i = \frac{|\hat{z}_i - z_i|}{\delta_i}$, with these rates of error stored so that each data cell becomes $p_i = \{x_i, y_i, z_i, r_i\}$. Then, when conducting the natural neighbour interpolation, while estimating the value $\hat{z}$ an estimate of the rate of error $\hat{r}$ can be simultaneously produced (Fig. 3A) and used to produce an error estimate $\hat{e}_i = \hat{r}_i \times \delta_i$ that, when estimated for all cells, produces a cross-validation error-distance field (Fig. 3B). The cross-validation error-distance field clearly captures information from the rate of error field (Fig. 3A) and the natural neighbour distance field (Fig. 2C), with lower error estimates in areas that have either low rates of error or low natural neighbour distances, and higher error estimates in areas that have higher rates of error and/or natural neighbour distances. Therefore, the cross-validation error-distance field captures uncertainty information relating to local variation in both the autocorrelation of the underlying phenomenon field being interpolated and the spatial distribution of the data cells providing data for the interpolation.
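Putting the pieces together, the whole procedure might be sketched as below. This is an illustration under stated assumptions, not the supplementary implementation: the names `nn_interpolate` and `error_distance_field` are hypothetical, data cells are assumed to occupy distinct raster cells, and the natural neighbour distance used for each rate of error is assumed to be the one measured at the withheld cell from the remaining n − 1 data cells.

```python
import numpy as np

def nn_interpolate(shape, rows, cols, values, targets=None):
    """Natural neighbour estimate of `values` and the natural neighbour
    distance at each target cell (every raster cell when targets is None)."""
    rr, cc = np.indices(shape)
    d = np.hypot(rr[None] - rows[:, None, None],
                 cc[None] - cols[:, None, None])
    V_P, D_PC = d.argmin(axis=0), d.min(axis=0)
    Z_P = values[V_P]
    if targets is None:
        targets = list(np.ndindex(*shape))
    est, delta = np.empty(len(targets)), np.empty(len(targets))
    for k, (i, j) in enumerate(targets):
        V_ci = np.hypot(rr - i, cc - j) <= D_PC
        est[k] = Z_P[V_ci].mean()
        delta[k] = np.hypot(rows - i, cols - j)[V_P][V_ci].mean()
    return est, delta

def error_distance_field(shape, rows, cols, z):
    """Return the interpolated values and the error-distance field."""
    n = len(z)
    r = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # withhold data cell i
        z_hat_i, delta_i = nn_interpolate(shape, rows[keep], cols[keep],
                                          z[keep], [(rows[i], cols[i])])
        # Rate of error; delta_i > 0 because the withheld cell is not a data cell.
        r[i] = abs(z_hat_i[0] - z[i]) / delta_i[0]
    z_hat, delta = nn_interpolate(shape, rows, cols, z)  # values and distances
    r_hat, _ = nn_interpolate(shape, rows, cols, r)      # interpolated rate of error
    e_hat = r_hat * delta                                # estimated error
    return z_hat.reshape(shape), e_hat.reshape(shape)

# Hypothetical example: three data cells on a 6 x 7 raster.
rows = np.array([1, 1, 4])
cols = np.array([1, 5, 3])
z = np.array([10.0, 20.0, 30.0])
z_hat, e_hat = error_distance_field((6, 7), rows, cols, z)
print(np.round(e_hat, 1))
```

The sketch interpolates the rate of error in a second pass for clarity; as the text notes, both estimates can be produced in the same pass because they share the same set of cells for each interpolation cell.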

Figure 3: Computation of the cross-validation error-distance field.
(A) The rate of absolute error for each data cell $r_i$ is calculated through cross-validation, and then an estimated rate of absolute error field $\hat{r}$ is produced by natural neighbour interpolation of $r$. (B) The cross-validation error-distance field $\hat{e}$ is the product of $\hat{r}$ and the natural neighbour distance $\delta$ for each interpolation cell.

Virtual geography experiments

The discrete natural neighbour interpolation and cross-validation error-distance field algorithms described here were implemented using a Python computational framework (Pérez, Granger & Hunter, 2011) using the NumPy (Van der Walt, Colbert & Varoquaux, 2011), SciPy (Virtanen et al., 2020), and Matplotlib (Hunter, 2007) packages. Having proposed a new method, it is sensible to provide an evaluation of how performance varies under different conditions. However, in doing so it is important to remember that interpolation errors result not only from the efficacy of the interpolation method, but also from the distribution of data points and the real (but unknown) distribution of the phenomenon field being interpolated (Willmott & Matsuura, 2006), which will be unique to each study. What constitutes an acceptable level of interpolation error will also vary between studies. Therefore, the objective here is to identify simple trends in performance, both to verify that the methods work as expected and to provide some basic information that will help an analyst to make a more detailed assessment of whether interpolation is feasible or not.

To evaluate the effectiveness of the proposed interpolation methods, a series of in silico virtual geography experiments were conducted. Virtual geographies are a very useful approach for methodological evaluation as the conditions can be tightly controlled and explored fully. Virtual geographic phenomena fields for grids of 100 × 120 cells were created using the NLMpy package (Etherington, Holland & O’Sullivan, 2015) implementation of the mid-point displacement fractal algorithm that produces fields representing natural phenomena such as land surfaces (Fournier, Fussell & Carpenter, 1982). The spatial-autocorrelation of the values produced by the mid-point displacement method can be controlled by varying the h parameter to produce fields with spatial-autocorrelation that varies from low to high (Fig. 4).

Figure 4: Examples of virtual geographic phenomena fields created by the mid-point displacement fractal algorithm.

The spatial-autocorrelation varies from low to high and is controlled by the h parameter that in these examples has been set to (A) h = 0, (B) h = 1, and (C) h = 2.

The underlying premise of the experiments was that with random sampling of a virtual geographic phenomenon with actual values $z$ (Fig. 5A), natural neighbour interpolation can be used to produce estimated values $\hat{z}$ (Fig. 5B). The absolute difference between the actual values and the estimated values is the value error $e_{\hat{z}} = |\hat{z} - z|$ (Fig. 5C) that will indicate how well the natural neighbour interpolation method works. The value error is also estimated by the cross-validation error-distance field $\hat{e}$ (Fig. 5D), and the absolute difference between the value error $e_{\hat{z}}$ and the estimated error $\hat{e}$ is the error of errors $e_{\hat{e}} = |\hat{e} - e_{\hat{z}}|$ that indicates how well the proposed cross-validation error-distance field performs (Fig. 5E).

Figure 5: The natural neighbour interpolation virtual geography experimental process.

(A) A virtual geography phenomenon field $z$ with spatial-autocorrelation of h = 2 and n = 20 random sampling points, (B) the resulting natural neighbour interpolation $\hat{z}$ from the sampling points, and (C) value error $e_{\hat{z}} = |\hat{z} - z|$. (D) The cross-validation error-distance field estimated error $\hat{e}$ that is also produced during interpolation is then compared to the value error $e_{\hat{z}}$ to produce (E) the error of errors $e_{\hat{e}} = |\hat{e} - e_{\hat{z}}|$. Interpolation performance as a function of $e_{\hat{z}}$ and $e_{\hat{e}}$ was summarised for cells within and outside the convex hull of the sampling points. The same experimental process in (A–E) is replicated in (F–J) for a virtual geography phenomenon field with spatial-autocorrelation of h = 1 and n = 10 random sampling points, demonstrating a reduction in interpolation performance at lower levels of spatial-autocorrelation and sampling.

To summarise the performance of both natural neighbour interpolation and the cross-validation error-distance field, the MAE (Eq. (6)) was calculated for the cells inside and outside of the convex hull of the sampling points for both $e_{\hat{z}}$ (Fig. 5C) and $e_{\hat{e}}$ (Fig. 5E). The MAE was chosen as the error statistic as it expresses error in the same units as the variable of interest and is insensitive to the number of cells in the sample (Willmott & Matsuura, 2006), which was important here as the convex hull area would vary as a result of the random sampling.
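One way this summary could be computed is sketched below, assuming SciPy's `Delaunay.find_simplex` is used to decide which raster cells fall inside the convex hull; how the supplementary code identifies the hull is not stated here. The function names and the placeholder fields are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay

def convex_hull_mask(shape, rows, cols):
    """Boolean raster that is True for cells inside the convex hull of the data cells."""
    rr, cc = np.indices(shape)
    cells = np.column_stack([rr.ravel(), cc.ravel()]).astype(float)
    points = np.column_stack([rows, cols]).astype(float)
    tri = Delaunay(points)  # needs at least three non-collinear points
    # find_simplex returns -1 for query points outside the triangulation.
    return (tri.find_simplex(cells) >= 0).reshape(shape)

def summarise_errors(z, z_hat, e_hat, inside):
    """MAE of the value error and of the error of errors, inside and outside the hull."""
    e_z = np.abs(z_hat - z)    # value error
    e_e = np.abs(e_hat - e_z)  # error of errors
    return {
        "value_error_inside": e_z[inside].mean(),
        "value_error_outside": e_z[~inside].mean(),
        "error_of_errors_inside": e_e[inside].mean(),
        "error_of_errors_outside": e_e[~inside].mean(),
    }

# Hypothetical example with placeholder fields on a 10 x 10 raster.
shape = (10, 10)
inside = convex_hull_mask(shape, np.array([1, 1, 8]), np.array([1, 8, 4]))
z = np.zeros(shape)             # stand-in for the actual values
z_hat = np.ones(shape)          # stand-in for the interpolated values
e_hat = np.full(shape, 1.5)     # stand-in for the error-distance field
summary = summarise_errors(z, z_hat, e_hat, inside)
print(summary)
```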

When the spatial-autocorrelation and number of sample points are reduced we would expect a reduction in performance of both the natural neighbour interpolation and the cross-validation error-distance field (Figs. 5A–5E versus Figs. 5F–5J). Therefore, to examine how the natural neighbour methods performed under varying conditions, 500 experiments were conducted in which h was drawn uniformly at random between 0.0 and 2.0 and n was drawn uniformly at random between 10 and 100. The cross-validation MAE was also calculated for each experiment to assess if the cross-validation MAE could be used as an indicator of expected interpolation performance.

Results

The results from the virtual geography experiments demonstrate that, as would be expected for the cells within the convex hull of the sampling points, the MAE of the value errors $e_{\hat{z}}$ from the natural neighbour interpolation (Fig. 6A) and error of errors $e_{\hat{e}}$ from the cross-validation error-distance field (Fig. 6B) reduced as the number of data points n and the spatial-autocorrelation h of the underlying virtual phenomena fields increased. The effect of h was more important, as when h was either low or high, n did not have much effect on the performance. The importance of h is to be expected as all interpolation methods work on the assumption that the phenomenon being interpolated has sufficient levels of spatial-autocorrelation.

Figure 6: Performance of natural neighbour interpolation and cross-validation error-distance fields from 500 virtual geography experiments.

The mean absolute error (MAE) of cells within the convex hull around sampling points for different experimental combinations of the number n of random sampling points and the spatial-autocorrelation h of virtual phenomena fields for (A) the value errors $e_{\hat{z}}$ from the natural neighbour interpolations and (B) the error of errors $e_{\hat{e}}$ from the cross-validation error-distance fields that (C) were highly correlated. (D) Comparison of $e_{\hat{z}}$ and the cross-validation MAE derived from the sampling points. Comparison of interpolation performance inside and outside the convex hull around the sampling points for (E) $e_{\hat{z}}$ and (F) $e_{\hat{e}}$.

There was also a very strong correlation between $e_{\hat{z}}$ and $e_{\hat{e}}$ (Fig. 6C), and this similarity of behaviour under different conditions indicates that the cross-validation error-distance field meets the objective of providing a measure of uncertainty that is consistent with all the useful properties of natural neighbour interpolation.

While the results of the virtual geography experiments (Figs. 6A and 6B) indicate that lower average errors can be expected when n ≳ 20 and h ≳ 1.0 (Fig. 4B), such criteria cannot be easily applied by an analyst because, while n is known, h is unknown and in many situations will be hard to guess. Fortunately, while the cross-validation MAE, which can always be calculated by an analyst, is generally slightly higher than $e_{\hat{z}}$, there is still a strong correlation between the two variables (Fig. 6D). This correlation is extremely useful as it indicates to an analyst the likely levels of $e_{\hat{z}}$ and therefore $e_{\hat{e}}$ too.

A comparison of $e_{\hat{z}}$ and $e_{\hat{e}}$ inside and outside of the convex hull around the sampling points clearly shows that, while the performance follows a similar trend, $e_{\hat{z}}$ and $e_{\hat{e}}$ can be expected to be higher outside of the convex hull (Figs. 6E–6F).

Discussion

The virtual geography experiments indicate that under suitable conditions the natural neighbour interpolation field and the cross-validation error-distance field should provide useful estimates of a geographic phenomenon field with associated uncertainty. The fact that the cross-validation error-distance field reflects localised changes in the spatial distribution of both the underlying phenomenon and the point data is particularly useful, and contrasts with other spatial interpolation uncertainty methods such as MAE and RMSE that estimate error using a global approach.

The virtual geography experiments demonstrated that the performance of natural neighbour interpolation will be lower outside of the convex hull around the data points, as is expected (Watson, 1999)—although this is also likely to be true of all spatial interpolation techniques as beyond the convex hull interpolation becomes extrapolation. However, we do not suggest that interpolation should be restricted to within the convex hull as there may be occasions where the area of interest may occur slightly outside the convex hull. For example, when interpolating rainfall data from weather stations that are usually sited in settlements, there are likely to be areas of coastline along peninsulas and headlands that will not fall within a convex hull around the weather stations (Lyra et al., 2018). Therefore, it is logistically useful that discrete natural neighbour interpolation can produce estimated values beyond the convex hull of the available data points. What is helpful in this context is that the cross-validation error-distance field incorporates information on distance from data points, therefore as interpolations move further beyond the convex hull the error-field should increase to help to guard against erroneous estimates.

However, the responsibility of appropriate use of natural neighbour interpolation still belongs with the spatial analyst who must make decisions about whether interpolation is useful based on their knowledge of: the expected spatial-autocorrelation of the phenomenon being interpolated, the number and distribution of data points, the location of the areas for which interpolations are required, and the magnitude of the estimated errors in relation to the magnitude of the value estimates. And of course, the cross-validation error-distance field only captures uncertainty in the interpolation itself and does not incorporate any uncertainty that may arise from the data itself. While I have argued against the use of the cross-validation MAE as a measure of uncertainty, I would recommend that analysts continue to calculate the cross-validation MAE given its strong correlation with the performance of the natural neighbour interpolation, and therefore the performance of the cross-validation error-distance field too. Analysts can then use the cross-validation MAE as a helpful guide when deciding if interpolation is advisable or not. When doing so it is important to remember that as the cross-validation MAE is based on the use of n − 1 data cells, the error estimates may be slightly higher than the real errors that would be based on all n data that is ultimately used in the interpolation (Willmott & Matsuura, 2006). Therefore, the cross-validation MAE should be seen as a slightly conservative indication of likely interpolation performance.

Conclusion

For those researchers for whom natural neighbour interpolation is the best interpolation option, the cross-validation error-distance field method presented provides a way to assess the uncertainty associated with natural neighbour interpolations that is consistent with the useful properties of natural neighbour interpolation. While the cross-validation error-distance method has been described here in the context of discrete natural neighbour interpolation, there is no reason why this same approach could not be applied to geometric natural neighbour interpolation as well. Discrete natural neighbour interpolation has been implemented here in two-dimensional space for ease of visualisation, but the method will generalise to higher dimensions (Park et al., 2006) and in principle I cannot see any reason why the uncertainty method presented could not also be applied in higher dimensions by those who wish to do so. The approach could easily be adapted to other interpolation methods, as all that is required is a measure of weighted distances to the data points creating the interpolation. Given the promise of the algorithm, and to encourage its use and development, the Python code used to generate the examples presented is freely available under the permissive MIT License as supplementary material.

Supplemental Information

Python code to reproduce the examples and figures

DOI: 10.7717/peerj-cs.282/supp-1