Visible-near infrared (vis-NIR) spectroscopy has attracted growing interest as a rapid method for soil characterisation. Soil absorbance spectra from the visible-infrared range can be calibrated using regression models to predict a set of soil properties. The accuracy of these regression models relies heavily on the calibration set. Optimising the sample size and the overall representativeness of the calibration set could further improve model performance. However, there is no guideline on which sampling method should be used for datasets of different sizes.

Here, we show that different sampling algorithms perform differently depending on data size and regression model (Cubist regression tree and Partial Least Squares Regression (PLSR)). We analysed the effect of three sampling algorithms, Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM), against random sampling on the prediction of up to five different soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets have different coverages: a European continental dataset (LUCAS,

Overall, the PLSR gives a better prediction than the Cubist model for the various soil properties. It is also less sensitive to the choice of sampling algorithm. The KM algorithm is more representative in the larger dataset up to a certain calibration sample size. The KS algorithm appears to be more efficient (as compared to random sampling) in small datasets; however, the prediction performance varied considerably between soil properties. The cLHS sampling algorithm is the most robust sampling method for multiple soil properties regardless of the sample size.

Our results suggest that the optimum calibration sample size depends on how much generalization the model has to achieve. Sampling algorithms are more beneficial for larger datasets than for smaller datasets, where only small improvements can be made. KM is suitable for large datasets, KS is efficient in small datasets but results can be variable, while cLHS is less affected by sample size.

In the last few decades, there has been growing interest in rapid soil characterisation. Infrared spectroscopy has gained interest for various soil analyses over the conventional ‘wet chemistry’ methods because the latter are laborious, costly and time-consuming. Furthermore, multiple soil properties can be predicted from a single soil spectrum (

In the mid-infrared region (MIR), the absorption is due to fundamental vibrations of organic and inorganic molecules in the soil; while in the vis-NIR region, absorption is due to overtones and the combinations of the fundamental vibrations found in the MIR region (

Spectroscopy in conjunction with these chemometric techniques has been shown to predict various chemical and physical properties of soil, such as pH, cation exchange capacity (CEC), carbonate content, organic carbon content, and soil texture (

There are various sampling algorithms available to select calibration samples in infrared spectroscopy, such as the Kennard-Stone (KS) algorithm, the conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM). One of the most common sampling algorithms used in the infrared spectroscopy literature is the KS algorithm (

The red circles represent the samples selected by a particular sampling algorithm. (A) represents the sample population, (B) represents random sampling, (C) represents the Kennard-Stone (KS) algorithm, (D) represents the conditioned Latin Hypercube sampling (cLHS) algorithm, and (E) represents the k-means clustering (KM) algorithm.

^{2+}) were comparable regardless of the sampling method. This study warrants further research as it only considers two properties for a field (5 km^{2}) and regional scale (<500 km^{2}) with a calibration sample size of up to 380 samples for each dataset.

In this study, we compared three sampling algorithms (KS, cLHS, and KM) against random sampling on three different datasets at continental, regional, and local scale with various calibration sample sizes using two different regression methods: PLSR and Cubist regression modelling. The performance of the models is evaluated based on the average prediction accuracies of up to five different soil properties (sand, clay, carbon content, cation exchange capacity and pH). Thus, the objective of this paper is to investigate the effect of calibration sample size, the efficiency of sampling algorithms, and regression methods to predict various soil properties on soil samples from three different spatial extents.

Three datasets were used in this study. The first dataset is from Europe which represents a continental database. The second is a regional database from southern New South Wales (NSW) and northern Victoria (VIC), and the third is a local database from the locality of Hillston in south-west NSW, Australia.

Dataset 1 was obtained from the Land Use/Land Cover Area Frame Survey (LUCAS) database (^{2}). This database is a collection of composite soil samples from 0–20 cm depth. All samples were scanned with a FOSS CDS Rapid Content Analyzer (NIRSystems, INC.) operating within the 400–2,500 nm wavelength range with 0.5 nm spectral resolution. Each spectrum is composed of 4,200 wavelengths. Only one-third of the database was considered for this study to reduce computational time, resulting in a subset of 5,639 observations. All samples had been analyzed for particle size distribution (clay and sand content), pH (in CaCl_{2}), organic carbon (g/kg), and cation exchange capacity (CEC; cmol/kg), among other properties.

Dataset 2 consists of 379 soil samples of 68 different soil profiles from the wheat-belt of southern NSW and northern VIC covering approximately a 5,000 km^{2} area (_{2} (1:5), total carbon (%) and CEC (cmol/kg).

Dataset 3 consists of soil samples from different soil cores extracted to 1.5 m from the cotton-growing district of Hillston in south-west NSW (^{2} in size. The samples were collected in a survey conducted in 2002, consisting of 384 samples from 87 different sites. The soils in this area are mainly Vertisols, with some soils of sandier texture derived from Aeolian parent material (_{2}O), and CEC (cmol/kg) (

The summary statistics for all datasets are included in

Soil property | Calibration n | Min | Median | Mean | Max | SD | Skewness | Validation n | Min | Median | Mean | Max | SD | Skewness |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dataset 1 (continental) | | | | | | | | | | | | | | |
pH_CaCl2 | 3,639 | 2.66 | 5.89 | 5.79 | 9.25 | 1.34 | −0.27 | 1,000 | 3.11 | 5.78 | 5.72 | 8.01 | 1.32 | −0.2 |
CEC (cmol/kg) | | 0 | 11.8 | 13.87 | 78.5 | 9.55 | 1.31 | | 0 | 11.35 | 13.87 | 59.9 | 9.96 | 1.36 |
Clay (%) | | 0 | 17 | 19.21 | 79 | 13.12 | 0.9 | | 1 | 17 | 19.18 | 79 | 13.00 | 0.92 |
Sand (%) | | 1 | 41 | 42.35 | 98 | 26.05 | 0.23 | | 1 | 42 | 41.91 | 98 | 26.01 | 0.19 |
Organic Carbon (g/kg) | | 0 | 18.9 | 24.96 | 99.5 | 18.67 | 1.63 | | 0 | 19.5 | 25.6 | 99.5 | 18.78 | 1.49 |
Dataset 2 (regional) | | | | | | | | | | | | | | |
pH_CaCl2 | 284(51)^{*} | 3.84 | 5.31 | 5.43 | 8.03 | 0.89 | 0.6 | 95(17)^{*} | 3.76 | 5.45 | 5.7 | 8.23 | 1.17 | 0.53 |
CEC (cmol/kg) | | 0.4 | 7.08 | 8.62 | 28.21 | 5.12 | 1.15 | | 1.6 | 8.87 | 10.88 | 36.43 | 7.21 | 1.33 |
Clay (%) | | 5 | 20 | 26.06 | 70 | 16.23 | 1 | | 7 | 21 | 29.09 | 74 | 17.28 | 0.96 |
Sand (%) | | 14 | 60 | 57.12 | 91 | 16.42 | −0.46 | | 17 | 59 | 55.82 | 81 | 16.47 | −0.7 |
Total Carbon (%) | | 0.06 | 0.83 | 1.19 | 12.74 | 1.48 | 4.3 | | 0.11 | 0.93 | 1.16 | 5.9 | 1.04 | 2.2 |
Dataset 3 (local) | | | | | | | | | | | | | | |
pH | 298(66)^{*} | 5.8 | 8.83 | 8.61 | 10.06 | 0.86 | −0.8 | 86(21)^{*} | 6.33 | 8.87 | 8.68 | 9.92 | 0.85 | −0.82 |
CEC (cmol/kg) | | 3.19 | 28.67 | 26.88 | 50.71 | 9.18 | 0.76 | | 2.65 | 27.89 | 26.84 | 53.84 | 9.04 | −0.41 |
Clay (%) | | 8.7 | 53.7 | 49.47 | 64.4 | 12.56 | −1.79 | | 4.4 | 51.85 | 46.9 | 63.7 | 13.19 | −1.51 |
Sand (%) | | 19.73 | 35.55 | 39.28 | 90.26 | 13.97 | 1.98 | | 23.81 | 38.41 | 42.21 | 94.73 | 13.53 | 1.7 |

The number in parentheses represents the number of different sites where the samples originated from.

The PCA was performed on the pre-processed vis-NIR spectra.

To ensure that all the spectra from the different datasets underwent the same spectral pre-processing treatment, spectra from the LUCAS dataset were resampled every 1 nm to have the same sampling interval, resulting in 2,100 points. The 350–499 nm and 2,451–2,500 nm ranges were removed due to their low signal-to-noise ratio, resulting in 1,951-point spectra for all datasets. The resulting spectra were transformed to absorbance, log(1/R), and pre-processed by Savitzky-Golay (SG) transformation (
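The trimming and absorbance steps described above can be illustrated with a minimal Python sketch. It is not the R code used in the study: the function name `trim_and_to_absorbance`, the flat toy spectrum, and the choice of a base-10 logarithm are assumptions made for illustration, and the Savitzky-Golay step is omitted because its settings are not given here.

```python
import math

def trim_and_to_absorbance(wavelengths_nm, reflectance, lo_nm=500, hi_nm=2450):
    """Keep bands within [lo_nm, hi_nm] and convert reflectance R to log10(1/R)."""
    kept = [(w, math.log10(1.0 / r))           # base 10 is an assumption
            for w, r in zip(wavelengths_nm, reflectance)
            if lo_nm <= w <= hi_nm]
    kept_wavelengths = [w for w, _ in kept]
    absorbance = [a for _, a in kept]
    return kept_wavelengths, absorbance

# Hypothetical flat spectrum sampled every 1 nm from 350 to 2,500 nm.
wavelengths = list(range(350, 2501))
reflectance = [0.5] * len(wavelengths)

kept_wl, absorbance = trim_and_to_absorbance(wavelengths, reflectance)
print(len(kept_wl))  # 1951 bands remain (500 to 2,450 nm inclusive)
```

Dropping 350–499 nm and 2,451–2,500 nm from a 1 nm grid leaves 2,450 − 500 + 1 = 1,951 bands, which matches the spectrum length reported above.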

Three different sampling algorithms were tested in this study against random sampling: Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS), and k-means clustering (KM). Each sampling method is based on a different principle for selecting samples from the available spectral data to be used for model calibration. Unlike random sampling, the three sampling algorithms were used to optimize the selection of representative samples from the spectra. Ideally, the samples selected for model calibration should explain the variability in the original samples and ultimately provide reliable predictions on the validation dataset (

This is the simplest way of selecting samples. It creates a subset that follows the statistical distribution of the original dataset. While this is an unbiased method, it is not efficient as more samples are required to achieve the representativeness of the data (

This algorithm was developed initially to create a response surface of experimental design (

where _{o} is the candidate sample to be chosen. Here, the Euclidean distance is used (
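The max-min selection rule usually associated with the Kennard-Stone algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the R implementation used in the study: the function name `kennard_stone` and the toy points are hypothetical, and seeding the selection with the two most distant samples is the common convention assumed here. Euclidean distance is used, as stated in the text.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by a Kennard-Stone style max-min rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Assumed starting rule: begin with the two most distant points.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: dist[ij[0]][ij[1]])
    selected = [i0, j0]
    # min_dist[c] is the distance from candidate c to its nearest selected point.
    min_dist = [min(dist[c][i0], dist[c][j0]) for c in range(n)]
    while len(selected) < n_select:
        chosen = set(selected)
        # Pick the candidate that is farthest from everything already selected.
        best = max((c for c in range(n) if c not in chosen),
                   key=lambda c: min_dist[c])
        selected.append(best)
        min_dist = [min(min_dist[c], dist[c][best]) for c in range(n)]
    return selected

# Hypothetical scores on two principal components.
toy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
print(kennard_stone(toy, 3))  # [0, 3, 4]
```

Because no random numbers are involved, repeated calls on the same data return the same subset, which is why the selection favours the extremes of the data cloud and does not vary between runs.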

Conditioned Latin Hypercube sampling has its origins in Latin Hypercube sampling (LHS), which was first proposed by _{1}_{2}_{k}, the _{1} are combined randomly (or in some order to maintain its correlation) with the _{2}, and so on until
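The stratify-then-combine idea behind plain Latin Hypercube sampling can be sketched in Python as below. This sketch covers only unconditioned LHS on the unit hypercube; it does not implement cLHS, which has to pick existing observations rather than generate new points. The function name `latin_hypercube`, the seed, and the uniform marginals are assumptions made for illustration.

```python
import random

def latin_hypercube(n_samples, n_vars, seed=0):
    """Draw n_samples points in [0, 1)^n_vars with one point per stratum per variable."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_vars):
        # One draw inside each of the n equal-probability strata of [0, 1).
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are combined randomly across variables.
        rng.shuffle(strata)
        columns.append(strata)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

design = latin_hypercube(n_samples=5, n_vars=2)
for point in design:
    print(point)
```

Each variable ends up with exactly one value in each of its five strata, so the marginal distributions are covered evenly even for a small number of samples.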

K-means is a method to group data that are similar to each other into clusters. First, the data are allocated to a pre-defined number of centroids (cluster centres). The allocation is then optimized by minimizing the distance between each data point and its designated centroid while maximizing the distances among the centroids. In this case, we utilized the Euclidean distance. Each data point is reassigned to the cluster with the nearest centroid, and the new cluster means become the new centroids. This process continues until no change in cluster membership is observed (

All spectra derivation and calculation were performed with R statistical language and open-source software (^{2} values for the prediction of the various soil properties on the validation dataset. Other accuracy parameters (bias and RPIQ) are included in the Supplementary Material.
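The accuracy measures named in this study (R^{2}, RMSE, bias and RPIQ) can be computed as in the Python sketch below. It is an illustration only, since the study used R. The function name `prediction_metrics` is hypothetical, R^{2} is taken here as 1 − SSE/SST (the squared correlation is another common definition), bias as the mean of predicted minus observed, and RPIQ as the interquartile range of the observed values divided by the RMSE; the quartile method is whatever `statistics.quantiles` uses by default.

```python
import math
import statistics

def prediction_metrics(observed, predicted):
    """Return RMSE, bias, R2 and RPIQ for paired observed and predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    sse = sum(e * e for e in errors)
    rmse = math.sqrt(sse / n)
    bias = sum(errors) / n                       # mean error (predicted - observed)
    mean_obs = sum(observed) / n
    sst = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - sse / sst                         # assumed definition of R2
    q1, _, q3 = statistics.quantiles(observed, n=4)
    rpiq = (q3 - q1) / rmse                      # interquartile range over RMSE
    return {"rmse": rmse, "bias": bias, "r2": r2, "rpiq": rpiq}

# Hypothetical validation values with a constant over-prediction of 0.5.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.5, 3.5, 4.5, 5.5]
print(prediction_metrics(obs, pred))
```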

Each dataset was first randomly split into a calibration and a validation set (∼75% and ∼25% respectively). For the continental dataset, 1,000 samples were retained as the validation set, and the rest of the samples were utilized as a calibration set. In the smaller datasets (regional and local), the topsoil and subsoil samples were paired prior to data splitting. The datasets were split based on the unique profile location as suggested by
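Splitting by profile location, so that all samples from one profile fall on the same side of the split, can be sketched in Python as below. This is a hedged illustration rather than the procedure actually coded for the study: the function name `split_by_profile`, the profile labels, and the seed are hypothetical.

```python
import random

def split_by_profile(profile_ids, validation_fraction=0.25, seed=0):
    """Split sample indices so that every profile is entirely in calibration or validation."""
    rng = random.Random(seed)
    profiles = sorted(set(profile_ids))
    rng.shuffle(profiles)
    n_val = round(len(profiles) * validation_fraction)
    val_profiles = set(profiles[:n_val])
    cal_idx = [i for i, p in enumerate(profile_ids) if p not in val_profiles]
    val_idx = [i for i, p in enumerate(profile_ids) if p in val_profiles]
    return cal_idx, val_idx

# Hypothetical profiles, each with a topsoil and a subsoil sample.
ids = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
cal, val = split_by_profile(ids)
print(len(cal), len(val))  # 6 2
```

Keeping paired topsoil and subsoil samples together avoids validating a model on samples from a profile it has already seen during calibration.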

To reduce the computational time, all the sampling strategies were applied to the principal component (PC) space of the pre-processed vis-NIR spectra. First, principal component analysis was performed on each dataset to determine how many principal components were needed to explain 99% of the variance within the dataset. Nine, six and five PCs were retained for the continental, regional and local datasets respectively. The R package ‘base’ was used to select the random samples (
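The rule for choosing the number of PCs, keeping the smallest number whose cumulative explained variance reaches 99%, can be written as a short Python sketch. The function name `n_components_for_variance` and the eigenvalues in the example are hypothetical; the PCA itself is assumed to have been computed elsewhere.

```python
def n_components_for_variance(eigenvalues, threshold=0.99):
    """Return the smallest number of components whose cumulative variance share meets the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical PCA eigenvalues (variance explained by each component).
print(n_components_for_variance([90.0, 6.0, 3.5, 0.3, 0.2]))  # 3
```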

The calibration sample sizes were set at 50, 100, 150, 200, 250, 300, 400, 500, 1,000, 2,000 and 3,000 for the continental dataset, and 50, 100, 150 and 200 samples for both the regional and local datasets. The models built from calibration sets of each size were validated with the same validation set from the respective dataset. All sampling algorithms except KS were repeated fifty times, and the average performance is reported in this study. KS was run once because it produces the same samples at each iteration, which removes the need for multiple repetitions. The methodology flow chart is illustrated in

For each calibration set, the modelling required using R implementations of PLSR (

To investigate the effect of different types of regression models on prediction accuracy, the two models (PLSR and Cubist) were generated for each soil property and different calibration sample size for each dataset. This results in more than three thousand realizations and models for each dataset. The performance of the PLSR and Cubist regression model was evaluated on five soil properties for the continental and regional dataset and four soil properties for the local dataset. All results presented here are based on the validation set.

The boxplots comparing the two regression models (PLSR and Cubist) using various sampling algorithms with various calibration sample sizes for the different datasets are included in ^{2} value of various properties for that dataset using a given calibration sample size and sampling algorithm. For a comparison between the effects of regression models, only the performance of random sampling method is discussed in this section. The effect of sampling algorithm will be discussed later in the paper.

Each boxplot represents the results for the 50 repetitions of the various soil properties predicted. cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

Each boxplot represents the results for the 50 repetitions of the various soil properties predicted. cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

Each boxplot represents the results for the 50 repetitions of the various soil properties predicted. cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

For the continental dataset, pH was predicted best using the PLSR model with a calibration sample size of 3,000 (R^{2} = 0.81), followed by clay content (R^{2} = 0.73), CEC (R^{2} = 0.68), OC (R^{2} = 0.59) and sand content (R^{2} = 0.53). For the Cubist model with a calibration sample size of 3,000, the model performance for each of the soil properties was: pH (R^{2} = 0.83), clay content (R^{2} = 0.70), CEC (R^{2} = 0.61), OC (R^{2} = 0.58) and sand content (R^{2} = 0.52). More detailed results are included in the

For the regional dataset with a calibration sample size of 200, the ranking from highest to lowest R^{2} using the PLSR model was CEC (R^{2} = 0.82), pH (R^{2} = 0.79), clay (R^{2} = 0.75), sand content (R^{2} = 0.74) and total C (R^{2} = 0.72). Using the Cubist model with a calibration sample size of 200, the ranking in terms of R^{2} was CEC (R^{2} = 0.80), pH (R^{2} = 0.73), clay content (R^{2} = 0.72), sand content (R^{2} = 0.71) and total carbon (R^{2} = 0.70).

For the local dataset with a calibration sample size of 200, the models using PLSR were ranked in terms of R^{2} as clay (R^{2} = 0.77), pH (R^{2} = 0.72), CEC (R^{2} = 0.71) and sand content (R^{2} = 0.70). With the Cubist model and a calibration sample size of 200, the best-fitted model was for clay (R^{2} = 0.73), followed by pH (R^{2} = 0.72), CEC (R^{2} = 0.69) and sand content (R^{2} = 0.68).

In general, the PLSR provided better prediction than the Cubist regression, regardless of the calibration sampling size and sampling algorithm (see

cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

In the continental dataset using the PLSR model, there was a steady increase in model performance (lower RMSE) as calibration sample size increased (see

cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

As the number of samples for calibration increased, the prediction became more accurate following the general pattern of a learning curve (see

Regardless of the sampling algorithm, the PLSR model for the continental dataset yielded similar performance at sample sizes greater than 1,000 (

With the Cubist model in the continental dataset, the cLHS and KS algorithm converged to the performance of the random sampling at a sample size of 2,000. However, when using the KM algorithm, the predictions became worse with an increasing number of samples. No plateau had been reached in the smaller datasets (regional and local) using the Cubist model, with the KM algorithm performing worse as the calibration sample size increased. This means that for a large number of samples, the KM algorithm does not partition the data effectively, and should not be used.

The performance of the KS algorithm increased as sample size increased in the regional dataset, except for the prediction of pH. In the local dataset, only the pH prediction improved as calibration sample size increased to 200 (

Firstly, we evaluate the sampling algorithm that produced the lowest error. For the continental dataset with the PLSR model, overall the KM algorithm performed best for clay, sand, pH and organic carbon (giving the lowest RMSE) for sample sizes <1,000 (

To quantify the effectiveness of a sampling algorithm, its performance is compared against that of random sampling using the ratio between the RMSE obtained with the sampling algorithm and the RMSE obtained with random sampling. The average prediction performance for the various soil properties was then plotted as boxplots, illustrated in
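The comparison against random sampling reduces to a simple ratio, sketched below in Python. The function names and the RMSE values are hypothetical and serve only to show the arithmetic: a ratio below 1 means the sampling algorithm gave a lower RMSE than random sampling.

```python
def rmse_ratio(rmse_algorithm, rmse_random):
    """Ratio of the RMSE from a sampling algorithm to the RMSE from random sampling."""
    return rmse_algorithm / rmse_random

def percent_rmse_reduction(rmse_algorithm, rmse_random):
    """Percentage by which the sampling algorithm lowers RMSE relative to random sampling."""
    return 100.0 * (1.0 - rmse_ratio(rmse_algorithm, rmse_random))

# Hypothetical RMSE values for one soil property at one calibration sample size.
ratio = rmse_ratio(4.6, 5.0)
reduction = percent_rmse_reduction(4.6, 5.0)
print(round(ratio, 2), round(reduction, 1))  # 0.92 8.0
```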

Each boxplot represents the average of 50 repetitions of the five different soil properties predicted. The solid black line represents the average performance of the random sampling. cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

Each boxplot represents the average of 50 repetitions of the five different soil properties predicted. The solid black line represents the average performance of the random sampling. cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

Each boxplot represents the average of 50 repetitions of the four different soil properties predicted. The solid black line represents the average performance of the random sampling. cLHS, conditioned Latin Hypercube sampling; KM, k-means clustering; KS, Kennard-Stone.

For the continental dataset, the combination of an effective sampling algorithm with PLSR model could improve the overall model performance. The KS algorithm was able to provide a calibration subset dataset that improved the model performance in comparison to the random sampling up to sample size of 500 (see

In the regional dataset, the KS algorithm with PLSR model performed best with a median RMSE reduction of 2–8% (ranging from 0–19%). The KM algorithm provided a subset of calibration dataset that contributed to better model outcomes when compared to the random sampling, starting at calibration sample size of <150 (see

In the local dataset, the KS algorithm also provided samples with the lowest RMSE prediction. However, note that the variation in the RMSE was quite large (see

The choice of regression model clearly affected the model performance. In general, the PLSR model performed better than the Cubist model. This could be due to the un-optimized hyperparameters used in the Cubist model in this study. Increasing the number of committees or neighbours in the Cubist model could make the resulting model more robust. However, caution should be taken when tuning these hyperparameters, as overfitting could be introduced when the calibration sample set is small.

Sample size and sample representativeness affected the performance of the regression model. As calibration sample size increased, the model performance improved, following the pattern of a learning curve. Increasing the sample size alone could improve the model prediction only up to a certain point, beyond which adding further calibration samples would not lead to a better model. The optimum calibration sample size depended on how much generalization the model had to achieve. Once the model performance has reached this plateau, it is unnecessary to add more calibration samples.

Since the choice of sampling algorithm also affects the model performance, the selection thereof from a soil spectral modelling perspective requires due consideration. In particular, we found that combining regression models with a sampling algorithm that better represents the sample population (cLHS) gives higher accuracy than combining them with one that tends to pick outliers in the sample population (KS), which is to be expected. Although the KM algorithm performed well on the larger continental dataset and the KS algorithm performed best on the smaller regional and local datasets, the cLHS algorithm was the most robust sampling algorithm. However, the efficiency of the sampling algorithms in improving predictions was greater in the larger dataset. This suggests that sampling algorithms were not as effective in smaller datasets, where random sampling itself should be sufficient. Furthermore, the combined use of a sampling algorithm with certain regression models should be done with caution, as we showed earlier. The use of the KS algorithm in conjunction with Cubist models yielded large variations in model performance.

We noted that in this study, the sampling algorithms (cLHS, KM and KS) selected samples based on the principal components of the spectra, while the calibration models used the pre-processed spectra. Thus, the use of principal components in the sampling algorithms may not be optimal, and perhaps that leads to the low performance of the cLHS method. Although similar results are expected, future research should compare the performance of the sampling algorithms when applied to the PCs and to the pre-processed spectra.

We explored the effect of three different sampling algorithms in comparison to random sampling on different calibration sample sizes using two different regression models on three different datasets.

For the datasets we evaluated, generally, the PLSR model gives better performance in comparison to the Cubist model. It generated much more robust models regardless of the sampling algorithm. A future study could assess the optimization of Cubist hyperparameters.

The Cubist tree model is highly affected by the choice of sampling algorithm, especially KS. The KS sampling technique is not recommended for use in rule-based or tree models.

Although an increase in calibration set size could increase the performance of the model, we found that in a continental dataset, calibration sample size ≥1,000 does not provide much improvement to model prediction. This also means that only 25% of the samples need to be fully analysed to provide a good calibration set.

The KM algorithm was suitable for selecting calibration samples from larger datasets up to a point (∼1,000 samples); however, its performance deteriorated with increasing sample size, and KM was the worst for smaller datasets.

Conversely, the KS algorithm performed better on the smaller datasets and worse in large datasets. As the algorithm picks extreme spectra, KS can result in a good calibration for certain soil properties, but poor calibration in other properties.

The cLHS algorithm provided more robust sampling regardless of sample size.

Overall, the efficiency of the sampling methods (in comparison to random sampling) is more significant in the larger dataset in comparison to the smaller datasets.

The authors acknowledge the Sydney Informatics Hub and the University of Sydney’s high performance computing cluster Artemis for providing the high performance computing resources that have contributed to the research results reported within this paper.

Budiman Minasny is an Academic Editor for PeerJ.

The following information was supplied regarding data availability:

Ng, Wartini; Minasny, Budiman; Malone, Brendan; Filippi, Patrick (2018): Optimum sampling algorithm for the prediction of soil properties from the infrared spectra. figshare. Fileset.