Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on September 19th, 2024 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on November 1st, 2024.
  • The first revision was submitted on February 11th, 2025 and was reviewed by 3 reviewers and the Academic Editor.
  • A further revision was submitted on April 25th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on May 21st, 2025 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on May 23rd, 2025.

Version 0.4 (accepted)

· May 23, 2025 · Academic Editor

Accept

Thanks for the revisions; the manuscript is now accepted for publication.

[# PeerJ Staff Note - this decision was reviewed and approved by Jörg Oehlmann, a PeerJ Section Editor covering this Section #]

Version 0.3

· May 15, 2025 · Academic Editor

Minor Revisions

Please revise as suggested and resubmit. (L923-929)

Reviewer 1 ·

Basic reporting

no comment

Experimental design

no comment

Validity of the findings

no comment

Additional comments

All of my previous points have been addressed.

Minor comments:

l921-929: “This may be attributed to the inherently more dynamic nature of climate variables, which tend to vary more over time compared to vegetation. […] In contrast, vegetation indices are often available as long-term averages, which may reduce the importance of their dynamic versions.”

This section needs a bit more explanation to avoid confusion. By definition, climate parameters summarize long-term weather patterns, usually over 30-year periods. To say that they are more dynamic than vegetation indices does not make sense in this context, especially considering cropland vegetation dynamics (e.g., growing seasons, harvest, etc.).

Reviewer 3 ·

Basic reporting

The manuscript meets the necessary standards in its current form, and I recommend acceptance for publication.

Experimental design

idem

Validity of the findings

idem

Additional comments

idem

Version 0.2

· Apr 2, 2025 · Academic Editor

Major Revisions

Please revise and resubmit.

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

Reviewer 1 ·

Basic reporting

Yes.

Experimental design

Methods:

L172-182: It is not clear why the authors decided to apply the model to regions without soil (e.g. bare rock) instead of removing/masking them from the maps:
• “soil surveyors (unintentionally, but for practical purposes) not collecting any samples in areas where soil is non-existent”: What is the added value of including these regions when neither soil nor samples are present?
• “Without adding such a small number of pseudo-zeros, the ML model could potentially overestimate SOCD across the continent.”: I’m not sure what is meant by “overestimate”. Does it mean that regions without soil (e.g. bare rock) are assigned values above 0? Because then you could also argue that in the current state (bare rock = 0 SOCD) the maps underestimate SOCD on a continental level. Either way, it would be most practical for the model evaluation and map user to assign NA to all non-soil areas, like it was done for artificial land.
• “This “pseudo-zero dataset” accounts for only 2% of the entire dataset”: Although the proportion of zeros is small, they can significantly influence the accuracy measures (i.e. R2/CCC) because they are located at the graph origin (Fig. 8); see the sketch after this list. For this reason, I would strongly suggest removing all artificial data which are not based on ground truth/lab measurements from the accuracy assessment.
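
To illustrate this concern, a minimal sketch with made-up values (not the manuscript's data or code); `ccc` is a direct implementation of Lin's concordance correlation coefficient:

```python
import numpy as np

def ccc(obs, pred):
    """Lin's concordance correlation coefficient."""
    mo, mp = obs.mean(), pred.mean()
    cov = ((obs - mo) * (pred - mp)).mean()
    return 2 * cov / (obs.var() + pred.var() + (mo - mp) ** 2)

rng = np.random.default_rng(42)
obs = rng.uniform(20, 100, 500)       # hypothetical measured SOCD values
pred = obs + rng.normal(0, 25, 500)   # noisy model predictions

# Append ~2% pseudo-zeros sitting at the graph origin, predicted (near) perfectly
obs_z = np.concatenate([obs, np.zeros(10)])
pred_z = np.concatenate([pred, np.abs(rng.normal(0, 1, 10))])

print(f"CCC without pseudo-zeros: {ccc(obs, pred):.3f}")
print(f"CCC with pseudo-zeros:    {ccc(obs_z, pred_z):.3f}")
```

The inflation grows with the share of zeros and their distance from the bulk of the data, which is why excluding them from validation gives a fairer picture.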

L198ff: It is unclear if the SOCD samples have been cut off at a certain value. Fig. C shows values up to 400, while Fig. 8 seems to only include values up to ~150.

L202f: “Depth-wise, most samples represent the topsoil (0–20 cm), as shown in Fig. 3, although there are still enough samples to also represent deeper soils”: This statement might be true for 20-100cm, but definitely not the case for 100-200 cm. I think it is very difficult to argue that 245 samples are enough to cover all of Europe.

L207ff: This is very confusing: The land cover class “bare land and lichens or moss” includes 80% pseudo-samples from GLanCE? Why not relabel this class to “bare rock or shifting sand”? Bare land is not the same as shifting sand and I would expect the SOCD to be above 0 in most cases (which is also visible in Fig. 3).

L210ff: “bare land and lichens or moss show the lowest values”: This is unsurprising considering the addition of pseudo-zeros. It should be clarified that the word “samples” only refers to real SOCD measurements.

Figure 3: As shown in this Figure, there seems to be a mismatch between the available soil samples and the extent of the presented SOCD maps. While many LU types and depths are not or only poorly represented in the training data, the maps are produced for all LU/depth combinations, even though no validation is provided (Fig. 9). It is unclear why some classes (artificial land/wetland) were masked from the output maps, while others with similar poor performances and low sample sizes (i.e. bare land/shrubland) were included. Furthermore, it is difficult to follow why the model was extended to a depth (100-200cm) that is poorly represented by all LU types and the samples in general. While it was noted that the map should be used with caution and interpreted with local knowledge, it can be argued that it is the map producer's task to mask out the regions where the model does not apply and no training samples are available (see abstract “Maps, based primarily on topsoil data from cropland, grassland, and woodland, are best suited for applications related to these land covers and depths”).

Figure 3: Also, a mean SOCD of 0.9 (20-50cm) seems to be extremely low for woodland soils in Europe. I assume that most of these samples were collected in regions with very shallow soils (Spain?). Either way, it remains questionable if the model is transferable to deeper woodland soil in the research area.

L390: “bias”

Validity of the findings

Results:

Figure 5: Considering that the paper is called “Spatiotemporal prediction of soil organic carbon density for Europe (2000–2022) based on **Landsat-based spectral indices time-series**”, it appears that the dynamic/annual land indices (NDSI/NDTI/NDVI) contributed only slightly to the results/maps. Taking into account the much higher importance of soil depth and long-term vegetation indices, it is not at all obvious that the inclusion of Landsat time series, and dynamic covariates in general, actually improved the SOCD prediction accuracy. This should be tested, or at least pointed out and discussed in detail.

Figure 5: The truncation of “soil_depth” should be visualized in the bar and y-axis.

Figure 8: Were the measures (R2/CCC) calculated based on the log scale? In this case, the figure would be misleading as the top row shows the original values. The corresponding values (R2/CCC) should be shown for both plots (log-transformed/original). Also, the bottom row highlights why the “pseudo-zero” values should be excluded from the accuracy assessment, as they are not based on measured SOCD values but affect the calculated R2 and CCC.
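
To make the scale question concrete, a minimal sketch (synthetic right-skewed data with multiplicative error, not the authors' data) showing how strongly R2 can differ between the original and log scale:

```python
import numpy as np

def r2(obs, pred):
    return 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(0)
obs = rng.lognormal(mean=3.0, sigma=0.8, size=1000)          # skewed "SOCD"
pred = obs * rng.lognormal(mean=0.0, sigma=0.5, size=1000)   # multiplicative error

print(f"R2, original scale: {r2(obs, pred):.2f}")  # much lower
print(f"R2, log1p scale:    {r2(np.log1p(obs), np.log1p(pred)):.2f}")
```

Reporting the metrics on only one of the two scales while plotting the other would therefore be misleading.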

Figure 9: The “bare land” class seems to be strongly influenced by the pseudo-zero “samples”. I would, again, strongly suggest that this artificial data set is not used for validation. Also, the CCC value seems to be very high for woodland (20-50cm). Considering the low mean SOCD value reported in Figure 3, it remains questionable if this result is representative of the woodland in Europe.

Figure 15: In this Figure, it is very difficult to identify the differences between each of the predicted maps. I would suggest plotting the relative changes of each map instead of the absolute values.

Discussion:

L684ff: I don’t understand the authors’ decision to include the depth interval from 100-200cm in the model and output maps. The sample size (245) seems way too low to cover the research area and provide reliable results (L831ff). Furthermore, the measured SOCD values seem to be inconsistent with the remaining samples, as written in line 205ff. While I agree with the authors’ statement that deep SOC is crucial, I would emphasize that a sound data basis is necessary to provide reliable results/maps.

L692: “Despite their strong correlation with predicted SOCD, vegetation indices such as NDVI, FAPAR, and BSF all rank highly in feature importance, suggesting that they contribute differently to the model.”: This statement needs further justification/proof. Random Forest selects random features for each split and highly correlated features can show similar feature importance, even though they don’t provide additional information.
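
This dilution effect is easy to reproduce; a minimal sketch (synthetic data, feature names hypothetical) in which three near-duplicate features carry one shared signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([
    signal + rng.normal(scale=0.1, size=n),  # stand-in for "ndvi"
    signal + rng.normal(scale=0.1, size=n),  # stand-in for "fapar"
    signal + rng.normal(scale=0.1, size=n),  # stand-in for "bsf"
    rng.normal(size=n),                      # unrelated noise feature
])
y = 2 * signal + rng.normal(scale=0.5, size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# The three correlated features split the same importance roughly evenly,
# so each ranks highly without contributing unique information.
print(rf.feature_importances_.round(2))
```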

Section 4.3: This section lacks clarification as to why some LU/LC classes were masked from output maps and others weren’t. The authors acknowledge that wetlands and artificial land have “poor model performance and limited data availability” and that the resulting maps are not “suitable” for these classes. The same is true, however, for the shrubland (CCC = -0.25) and bare land classes, which have been mostly trained with “pseudo-zero” samples. Instead of highlighting “the need for caution when using the maps for specific land covers and depth intervals”, I suggest extending the masking of the output maps to exclude LU classes/depths which do not provide reliable results and have very limited data availability.

L874-877: “Second, when features were available in both dynamic and static forms, the model often favored long-term representations. The dominance of static features likely reflects the larger spatial variability of SOCD compared to its temporal variability, as noted by Heuvelink et al. (2020).” This is a very important outcome and somewhat contradictory to the overall framing of this study. The spatial variability of SOCD in Europe is very high, which is well-reflected by the importance of the stable features (soil depth, long term vegetation, climate, etc.). However, the paper is framed as “spatiotemporal predictions based on Landsat time-series”, which don’t seem to be very important for the current version of the SOCD model. This trade-off (spatial variability vs ability to detect changes/importance of dynamic features) should be discussed in detail.

L899ff: “the temporal dimension is primarily introduced through dynamic features”: I think that this section would benefit from a more detailed discussion about the selection, and importance, of the dynamic features in the model. What does it mean if the dynamic climate covariates are more important than the dynamic vegetation covariates? How does the high importance of stable vegetation covariates affect the ability to detect LU/SOCD changes?

L878ff: Compared to the previous version of the model, the “annual” vegetation parameters lost importance, in contrast to the “longterm” ones (Fig. 10, old manuscript). As shown in Figure 5, most of the dynamic features are climate covariates (pr/lst), while the annual vegetation indices seem to rank low. With this in mind, and considering the much higher importance of the longterm indices, it would be very hard for the SOCD model to detect and model vegetation changes. This tendency is also visible in the predicted maps (Figure 15), which seem to be much more temporally stable, compared to the previous version of the manuscript.

L901ff: “However, accurately projecting SOCD dynamics into unknown time periods depends on whether dynamic features effectively capture underlying processes.” Considering the previous comments, I’m not sure if this is currently the case. The authors concluded in L705 that “The relationship between climatic variables and SOCD is complex”, and it is not clear if and to which degree the dynamic climate features changed throughout the research period.

L908-911: “Due to the lack of validation data with repeated SOCD measurements, the temporal model uncertainty and the ability to predict SOCD changes were mainly analyzed through estimated PIs, assuming that model predictions are valid and their uncertainty is effectively captured by the PIs. Under this assumption, PIs can serve as a reference to identify “model-detectable” SOCD changes.”: I’m not sure how this statement is in line with the “lagging effect”, described in L884. “Model-detectable” would mean that the *actual* SOCD changed to such a degree that it can be captured by the model/uncertainty. This can, however, only be validated with repeated SOCD measurements because of the lagging effect, described above. Instead, I would argue that the model is only able to detect *potential* SOCD changes, which do not necessarily align with the “model-detectable” changes.
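
For clarity, a minimal sketch of what “model-detectable” amounts to in practice (hypothetical PI bounds; the manuscript's exact PI levels may differ): a change is flagged only where the prediction intervals at two dates do not overlap.

```python
import numpy as np

def detectable_change(lo_t1, hi_t1, lo_t2, hi_t2):
    """Flag pixels whose prediction intervals at two dates do not overlap."""
    return (hi_t1 < lo_t2) | (hi_t2 < lo_t1)

# Hypothetical per-pixel PI bounds for two dates
lo_2000, hi_2000 = np.array([30.0, 50.0]), np.array([60.0, 80.0])
lo_2020, hi_2020 = np.array([65.0, 45.0]), np.array([90.0, 75.0])

print(detectable_change(lo_2000, hi_2000, lo_2020, hi_2020))  # [ True False]
```

This flags only what the model could resolve, which is exactly why it cannot substitute for repeated measurements under a lagging SOCD response.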

Additional comments

The authors provided an extensive revision of the manuscript entitled “Spatiotemporal prediction of soil organic carbon density for Europe (2000–2022) based on Landsat-based spectral indices time-series” and did a good job addressing the paper’s main shortcomings regarding the temporal uncertainty and model performance for different land use classes. Following the changes in model design and evaluation, however, some questions arise that need to be further addressed and clarified.
My main concerns include:

1) The extent of the SOCD map, which includes LC classes (e.g. bare land/shrubland), and depths (e.g. 100-200cm) which are highly underrepresented in the training data.

2) The reasoning behind the inclusion of “pseudo-zero” samples for training and evaluation, extending the model to regions without soil (e.g. bare rock).

3) The very high importance of the stable model features, which don’t seem to align with the paper's title: “Spatiotemporal prediction […] based on **Landsat-based spectral indices time-series** ”

Reviewer 2 ·

Basic reporting

The authors have substantially improved the manuscript and all my concerns and remarks have been sufficiently addressed.

Experimental design

The authors have substantially improved the manuscript and all my concerns and remarks have been sufficiently addressed.

Validity of the findings

The authors have substantially improved the manuscript and all my concerns and remarks have been sufficiently addressed.

Reviewer 3 ·

Basic reporting

The paper has already undergone a previous review round, and I am impressed by the numerous modifications made to address the reviewers' concerns. The responses were clear and detailed. The entire analysis was made with new additional soil data. The manuscript was rewritten to ensure clarity and consistency. However, a minor/moderate revision is still needed before publication.

The title should better reflect the findings. Among the 67 index features, only 18 were from Landsat, and they were not the most important dynamic variables.

In Figure 8, for the bottom panel, consider adding the performance metrics in the titles to facilitate easy comparison.

L419, please correct ‘exlusion’ to ‘exclusion’

L233, (HO et al., 2023) O in lower case

L234, Scorpan, all lower case or UPPER CASE

L174, (Stanimirova et al., 2023) is not the appropriate reference. Please check.

L420, maybe “probability of extrapolation (risk)” is better. L937f, maybe "... avoid extrapolation risk"

The official name of Turkey is now "Türkiye”

Experimental design

Section 2.4.2, could you clarify how the 10% of the data was selected for calibration, as well as for training and testing? Was the selection done randomly? Additionally, did you split the data by considering entire soil profiles, or were individual horizon observations used? Please provide more details on your approach. Furthermore, why did you choose to use only 10% of the data for calibration? Would it not be beneficial to use a portion of the training set for calibration as well?

Similarly, when performing cross-validation (ISIW-CV), was the selection based on entire soil profiles or only individual soil layers? Could you clarify the sampling approach used to ensure consistency in the evaluation process?
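
A minimal sketch of the profile-wise splitting this question refers to (IDs hypothetical): grouping by profile keeps all depth layers of one soil profile in the same fold, avoiding leakage between training and test sets.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)              # 6 layer observations, 2 features
y = np.arange(6.0)                           # layer-level SOCD values
profile_id = np.array([1, 1, 2, 2, 3, 3])    # two depth layers per profile

# All layers of a given profile land in the same fold.
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=profile_id):
    print("test profiles:", np.unique(profile_id[test_idx]))
```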

2.6 Extrapolation probability
In statistical terms, extrapolation probability refers to the likelihood that a prediction falls outside the feature space covered by the training data. This is a well-known limitation of tree-based algorithms like Random Forest, which struggle with making reliable predictions beyond their training domain. A higher extrapolation probability increases the risk of inconsistent predictions.

However, at first glance, the term "higher extrapolation probability" might suggest a desirable ability to generalize well, rather than a potential issue. To avoid confusion and make it clearer for end-users, I would suggest using "extrapolation risk" instead. This term better conveys the potential dangers of making predictions in areas where the model lacks sufficient training data.
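
As a crude illustration of the suggested “risk” framing (not the manuscript's method), one can flag prediction locations by the share of features that fall outside the training-data range:

```python
import numpy as np

def extrapolation_risk(X_train, X_new):
    """Share of features per prediction location lying outside the
    training range (0 = fully inside the training feature space)."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    outside = (X_new < lo) | (X_new > hi)
    return outside.mean(axis=1)

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 10))         # training feature space
X_new = rng.normal(loc=2.5, size=(4, 10))    # strongly shifted locations
print(extrapolation_risk(X_train, X_new))
```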

Please explain how you determined the exclusion probability threshold of 64%.

Figure 16: The extrapolation probability map values may be misleading, as higher values correspond to a lower probability of extrapolation, while lower values indicate a higher probability. To enhance clarity, consider renaming ‘Extrapolation Probability’ to ‘Extrapolation Risk’ or a similar term. Alternatively, explicitly define the meaning of the values in the figure caption. For instance: "Figure 16: Extrapolation probability map [...]. Higher values (red) indicate areas with a lower likelihood of extrapolation, whereas lower values (blue) correspond to areas with a higher likelihood. The color convergence between blue and red is set at 64%."

2.7 Map prediction
The way the spacetime block (STB) method was implemented is somewhat unclear. For example, if the mean SOC (or perhaps SOCD?) for STB-1 was computed as:

mean_SOC_STB1 = (SOC_P1 + SOC_P2 + SOC_P3 + SOC_P4) / 4

where STB-1 corresponds to a specific time range (e.g., 2000–2004) and depth range (e.g., 0–20 cm), my main concern is how the predictions were aggregated between time steps and depth layers. For example: How were predictions for intermediate years (e.g., 2001, 2002) handled? How was depth aggregation performed for layers between 0 and 20 cm, especially given that depth was included as a covariate?

If only four predictions (P1–P4) were generated, this approach would resemble a 2.5D modeling approach rather than a fully 3D approach. In that case, what was the added value of including depth as a covariate?
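
To make the question concrete, a minimal sketch of how one block average could be formed if per-year, per-depth predictions exist (`predict_socd` is a placeholder, not the authors' model); whether intermediate years and depths actually enter this mean is the point needing clarification:

```python
import numpy as np

years = np.arange(2000, 2005)            # annual predictions in the block
depths = np.array([0.0, 10.0, 20.0])     # depth points within 0-20 cm

def predict_socd(year, depth):
    # Placeholder for the fitted model's point prediction at (year, depth)
    return 50.0 - 0.1 * (year - 2000) - 0.5 * depth

preds = np.array([[predict_socd(y, d) for d in depths] for y in years])
mean_socd_stb1 = preds.mean()            # one aggregated value per STB
print(round(mean_socd_stb1, 2))
```

If instead only four corner predictions (P1–P4) are averaged, as in the formula above, the depth covariate adds little within the block, hence the 2.5D question.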

It looks like there may be a labeling error in Figure 4. The embedded text on the right currently states: Mean tree 1, Mean tree 1 (Did you mean Mean tree 2?), …, Mean tree N. Please verify and correct the numbering to ensure consistency. Could you clarify whether the prediction was made for SOC or SOCD? Please verify and ensure consistency in the text (L426f) and Figure 4.

Validity of the findings

The importance of the soil_depth feature should be interpreted with caution, as depth was directly included in the SOCD calculation. Additionally, depth to bedrock naturally varies across locations and is not uniformly distributed. This raises concerns about the validity of SOCD predictions up to 2 meters in areas where surface rocks are prevalent—how can such predictions be considered reasonable in those regions?

Moreover, soil_depth had the highest global Shapley value, 15 times larger than that of the static features. This is problematic because soil_depth was also used in the SOCD calculation, potentially introducing a strong correlation between the predictor and the target variable. Such feature circularity should be avoided to ensure reliable model interpretation.
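
The circularity concern can be made explicit with the common stock formula (the manuscript's Eq. 2 may differ in details such as the coarse-fragment correction): if SOCD is derived using layer thickness and soil depth is then also supplied as a predictor, the model can partly learn the arithmetic of the target rather than a pedological relationship.

```python
def socd_kg_m2(soc_g_kg, bulk_density_g_cm3, thickness_cm, coarse_frac=0.0):
    """SOCD (kg/m2) = SOC content x bulk density x layer thickness,
    corrected for coarse fragments."""
    return soc_g_kg * bulk_density_g_cm3 * thickness_cm * (1 - coarse_frac) / 100

# Identical soil material, different layer thickness -> different target,
# predictable from the depth feature alone (hypothetical values).
print(socd_kg_m2(20, 1.3, 20))   # 5.2 kg/m2 for a 0-20 cm layer
print(socd_kg_m2(20, 1.3, 50))   # 13.0 kg/m2 for a 0-50 cm layer
```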

Additionally, the dynamic features had lower Shapley values than the static ones, which could lead to constant predictions over time, even when climatic conditions fluctuate (e.g., during drought years).

In Figure 6, the observed negative (pseudo-linear) correlation between soil_depth and SOCD may also be influenced by the limited availability of data for deeper soil layers. This data imbalance could reduce the model’s ability to learn accurate relationships at greater depths.

Version 0.1 (original submission)

· Nov 1, 2024 · Academic Editor

Major Revisions

Please revise and re-submit.

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.

Reviewer 1 ·

Basic reporting

Clear and unambiguous, professional English used throughout:
Yes

Literature references, sufficient field background/context provided:
L39-43: Please provide more details and literature on why SOCD is preferred over SOC content when it comes to analyzing SOC dynamics. SOC contents can change without affecting the total SOC stocks in some cases (e.g., after the implementation of reduced tillage measures) and vice versa.
Additionally, it should be mentioned that SOC change is a very slow process that takes place over a long time and is difficult to measure, even with continuous soil samples (e.g., see Poeplau & Don 2013: 10.1016/j.geoderma.2012.08.003; Smith et al. 2020: 10.1111/gcb.14815).

Professional article structure, figures, tables. Raw data shared:
Yes

Self-contained with relevant results to hypotheses:
Yes

Experimental design

Section 2.2.1: Please provide information on the land use/land cover types of the samples that were considered for the model. What are the shares of agricultural and non-agricultural samples for each country and year? Also, since the SOCD distributions and changes are highly related to LU and LU change (mentioned in L269f) and soil depth (mentioned in L150f), it would be helpful to provide descriptive statistics on each combination of LU and soil depth to improve the interpretability of the model results.

L132-136: Please specify how the different criteria affected the assignment of the quality scores and how it was decided which samples were used for the model (e.g. “only samples with scores above xx were used for training/validation”). Also, it would be helpful to provide the scores of the different data sets in Table 1.

L143-147: The introduction states that SOCD is preferred over SOC content, as it is a more comprehensive measure and requires additional BD data (L41ff). This is somewhat contradictory to the implementation of this pedotransfer function, as it doesn’t create additional information beyond the SOC content. Please clarify if all 106,128 valid SOC measurements were used for modeling, even the ones without corresponding BD measurements (L137-144). If this is the case, the soil depth was first used to derive the SOCD (Eq. 2) and later also used as a model feature? This would introduce a strong bias and significantly complicate the evaluation of the model.

L149f: Please specify how the soil depth of the samples was transformed into a model feature. Did you consider the upper/lower limit or a mean value?

L172f: Please specify if the “quality scores” were also considered for the data split. Did you only consider samples with available BD measurements for the test data or also include derived/transferred SOCD values from Eq. 2?

L251f: The harmonized EU lithology map is not mentioned in the results or discussion. Was it selected for the final model and how did it affect the predictions?

L290-301: This section is missing details on what exact Landsat images (5, 7, 8, 9?) were considered for the data cube and how the surface reflectance data of the different sensors was harmonized. Please provide details if any kind of spectral adjustment was conducted to cross-calibrate the different Landsat sensors (TM, ETM, OLI) (e.g. see Roy et al. 2016: https://doi.org/10.1016/j.rse.2015.12.024). If not, it would be necessary to provide further information on the Landsat data consistency from 2000 to 2022, considering the differences in data quality and availability within this time frame.

2.4.3 Model evaluation: Taking into account the large scope of the presented spatiotemporal SOCD model (2000 to 2022, continental scale, across all LU/LC types and soil depths), the model evaluation should provide more details on the temporal consistency (accuracy for different years with available samples) and LU/LC types (accuracy for cropland, forest, peatland, etc.) instead of only reporting the accuracy based on the combined samples.

2.5 Map production: The map production was carried out for different years from 2000 to 2022, even though the training samples only cover a fraction of that time frame (2007–2018) and are concentrated in one year (2018; L148). It therefore needs to be pointed out that the model was temporally transferred and applied to years and Landsat data that were not included in the training data.

2.7 Temporal uncertainty analysis of SOCD predictions: This section needs further details on the assumptions behind the temporal uncertainty analysis. The model has not been trained or verified with any samples before 2007; how can it be assured that the model applies to all years from 2000 to 2022? “If the PIs at two moments in time extensively overlap, it implies that detecting changes within this timeframe using our current model is impossible” (L418f): The model was trained to predict SOCD, not SOCD change, so how can this statement be justified without providing validation based on SOCD measurements? Also, it should be mentioned that a temporal uncertainty analysis based on model predictions (e.g. PIs) is different from a temporal analysis based on repeated samples, as it is not backed by actual SOCD measurements.

Validity of the findings

Figure 5: Please add an explanation of why the value range of “Observed SOCD” is different for the middle row (0 – 150) and the bottom row (0 – 80).

Table 3. Considering the strong dependency between soil depth and SOCD (Figure 2), it is expected that the model yields high R2 and CCC values if the measurements from all depth intervals are combined for the validation. To reduce this dependency and increase the interpretability of the results, it would therefore be helpful to also report the accuracy for each depth interval individually.

L511f: The feature importance of “soil_depth” is very high compared to the other variables, highlighting the strong dependency between depth and SOCD. As the predictions are carried out in four predefined steps (0-20 cm, 20-50 cm, etc.), the feature only influences predictions between but not within depth intervals. This is another argument why the model accuracy should also be reported for each depth interval specifically.
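
A minimal sketch of the requested per-interval reporting (column names hypothetical, values made up):

```python
import pandas as pd
from sklearn.metrics import r2_score

df = pd.DataFrame({
    "depth_interval": ["0-20"] * 3 + ["20-50"] * 3,
    "observed":  [62.0, 48.0, 55.0, 35.0, 28.0, 22.0],
    "predicted": [58.0, 51.0, 60.0, 30.0, 33.0, 25.0],
})

# Accuracy per depth interval instead of one pooled figure
per_depth = df.groupby("depth_interval")[["observed", "predicted"]].apply(
    lambda g: r2_score(g["observed"], g["predicted"])
)
print(per_depth)
```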

Figure 10: Since one of the main goals of this study is to detect SOCD changes, it should be also analyzed and discussed how the importance of the dynamic features (e.g. red_annual, etc.) relates to the static features (e.g. soil_depth, ndvi_longterm, etc.).

Figure 13: To increase the interpretability of the exemplary PIs in this figure (peatland, forest, cropland), please provide additional information on prediction accuracy for the different LC/LU classes, based on the samples provided to the model.

L608ff: “For point X, the position and magnitude of the PI relative to the prediction of SOCD remain stable, allowing a discernible decrease trend despite the high PIW.” This statement needs further evidence and justification. The PIs are influenced by many factors, such as data quality, LU/LU change, vegetation, etc. How can it be assumed that changes in PIs reflect actual changes in SOCD without providing measurements or validation, especially as the training data only covers a fraction of the time frame?

Figure 14. This Figure is a bit misleading as it gives the impression that cropland soils have lost a substantial amount of SOCD (-25%) in a very short amount of time (~20 years). Without a change in land use, e.g. conversion from permanent grassland to cropland, such a trend seems unlikely and needs further explanation/validation. Additionally, it needs to be discussed why the model predicted higher SOCD for cropland than for woodland from 2000 to 2012, which would be uncommon for soils in Germany.

L720ff: Based on the strong dependencies between LU/LC and SOCD, the dominant role of vegetation is unsurprising if the model is trained with samples from soils with permanent vegetation (grassland, forest, peatland, etc.), and cultivated cropland. What is currently missing in the discussion is the fact that the temporal variability of vegetation is fundamentally different from the temporal variability of SOC. While the aboveground vegetation can change within a very short time, the soils and SOC will change and adapt to the new conditions much slower. With this in mind, please discuss the importance of short-term (annual) and long-term vegetation indices on the predictions in more detail and provide evidence that the predicted SOCD changes can be measured.

L751ff: Here, the dependencies between LU/LC and SOCD need to be analyzed and discussed in more detail. How do model accuracy and uncertainty differ between different LC/LU classes (e.g. peatland, forest, cropland)? Is the model able to detect SOCD changes within a LU class (e.g. permanent cropland) or only if a LU change takes place (e.g. permanent grassland to cropland)?

L777ff: “Only areas with significant changes in SOCD or consistent PI patterns aligned with SOCD changes would be detectable”. Again, this assumption needs to be justified in more detail. The analysis of PIs does not replace the validation with measured SOCD changes, especially if the training samples do not cover the full prediction time frame. Figure 14 shows that the PIs were much larger in 2000-2004, which is not covered by the training samples. This could be caused by differences in Landsat data quality and availability and needs to be discussed.

L786f: The statement that the PIs can be used to “generate robust uncertainty characterizations at the regional level” is based on the assumption that the model is temporally transferrable to each year from 2000 to 2022. This needs to be either tested or discussed in detail.

L810f: At this point, it would be necessary to further discuss the effects of using transferred SOCD values (Eq. 2) on the overall uncertainty of the results. The SOC content can change independently from the SOC stock and does not offer the same detail of information.

L842-845: I agree that validation of the temporal model consistency is a difficult task, especially if the model is applied to predict SOC density in a large area and long period (2000-2022), including all LU/LC types and depth intervals. This is a huge task and requires a lot of soil monitoring data that will probably never be available. Because of this, however, it is even more important to thoroughly discuss the assumptions that were made for the model and the accuracy that the users can expect from the results. As the model has been mostly trained with data from 2018 and applied to years without any training or validation data, it has to be pointed out that it is currently not possible to test the temporal model consistency and the accuracy of predicted SOCD trends.

Additional comments

The manuscript is concerned with the ambitious task of modeling the soil organic carbon density (SOCD) across Europe, including all land use types and soil depth intervals from 0 to 200 cm. Using earth observation data and further dynamic and static features, a spatiotemporal model (3D space and time) was trained with soil samples from 2007 to 2018 and used to carry out SOCD predictions from 2000 to 2022. In general, the manuscript is well structured and involves a lot of work regarding the harmonization of EU-wide SOCD sampling data and the preparation of an array of large-scale and high-resolution covariates. In the current state, however, some open questions and gaps regarding the model input, evaluation, and application remain and should be addressed:

• The model evaluation is limited to one test set, combining the SOCD samples from all years, land use classes, and soil depths. Considering the strong dependencies between SOCD and LU/soil depth, and the large scope of the model, additional details and evaluations are necessary to increase the interpretability of the results.

• Most of the SOCD training samples were taken in 2018 and do not match the time frame of the produced maps (2000 to 2022). Additional tests are necessary to evaluate the temporal transferability of the model, especially because the availability and quality of Landsat data, used as the main model input, are very different during this time frame.

• Due to a lack of validation data on repeated SOCD samples, the temporal model uncertainty and ability to predict SOCD changes were mostly analyzed using the prediction intervals (PIs) of the resulting maps from 2000 to 2022. Considering that SOCD changes are a very slow process and difficult to measure, even with continuous soil samples, the ability of PIs to validate the predicted SOCD trends should be carefully justified and discussed.

Reviewer 2 ·

Basic reporting

This study predicts soil organic carbon density and associated uncertainty on a continental scale in Europe between 2000 and 2022 at 30 m resolution at 4 depth intervals. The authors use widely used digital soil mapping methods, such as the use of high-resolution remote sensing data, machine learning (random forest for mean predictions and quantile regression forest for probabilistic predictions) and assessing models using accuracy plots and metrics. Beyond this, the authors go a step further by also using Shapley values to explain model predictions, conducting temporal uncertainty analysis and using change of support and spatial aggregation to reduce prediction errors. However, I have several main concerns and suggestions for improvement:
- The spatio-temporal distribution of the data, which covers only a very small part of the 3D+T prediction area and timeframe and the implications of predicting into “unknown space and time” (e.g. https://doi.org/10.1111/2041-210X.13650)
- The use of blocked spatial cross validation, which is pessimistically biased for clustered data (e.g. https://doi.org/10.1016/j.ecolmodel.2021.109692, https://doi.org/10.1016/j.ecoinf.2022.101665)
- There are many similarities in the methodology, results and discussion points with https://doi.org/10.1038/s43247-024-01293-y yet this study was apparently overlooked.
- Despite choosing a probabilistic model, quantifying and validating prediction uncertainty, and using spatial aggregation, the recommendations and conclusions are overly optimistic and should be more cautious, given the large potential risks to the credibility of soil carbon monitoring (https://doi.org/10.1016/j.jenvman.2022.117142, https://doi.org/10.1111/gcb.16570).

These points are addressed in more detail in my comments directly in the attached PDF. In summary, I strongly recommend the authors to:
1. Apply "area of applicability" (AOA, https://doi.org/10.1111/2041-210X.13650) or a similar approach so that predictions are not made into unknown space
2. Refrain from using spatial cross-validation and instead adjust the cross validation strategy according to https://doi.org/10.1016/j.ecoinf.2022.101665
3. Provide more cautious and scale-dependent recommendations for end-users, also highlighting potential examples of poor/improper use of such space-time maps.

Experimental design

The study fits well within the aims and scope of the journal. The objective is defined in the introduction but is sometimes lost when reading the article because of the many different specific methods used. Some methods and results could perhaps be moved to the supplements (transformation, feature selection, etc.) so that the central thread is clearer throughout the article. The study could be embedded better into space-time DSM research (comparison to other 3D+T approaches).

Validity of the findings

See other comments above, especially the four points listed.

Additional comments

The text is well-written, but further improvements are necessary in the structure and choice of detail per section. Consider moving some aspects to the supplementary information where I indicated. Unnecessary adjectives that are not clearly defined and are often subjective can be removed to improve the writing and make it more scientifically objective (e.g. “best available data”, “highest quality laboratory measurements”, “analysis-ready images”, etc.). The text could be more self-contained.

Sometimes key literature is missing. See other 3D+T approaches, most recently and notably https://doi.org/10.1038/s43247-024-01293-y, but also early 3D+T approaches (https://doi.org/10.1016/j.spasta.2015.04.001) and other space-time SOC mapping studies, e.g. numerous studies in Hungary (e.g. https://doi.org/10.1016/j.geoderma.2021.115356). Not enough information is provided about sampling depth and about how depth and time were incorporated during model calibration and prediction. Usually sufficient context is provided, but the soil science background is often missing or relatively limited (e.g. much more detail on remote sensing than on the soil context).

Figures and tables are sufficient, but the layout could be improved. Check if labelling subplots is necessary. It is great that raw data and code are shared.

Annotated reviews are not available for download in order to protect the identity of reviewers who chose to remain anonymous.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.