Estimating compressive strength of CO2 incorporated concrete with data augmentation and explainable regression modeling

View article
PeerJ Computer Science

Introduction

The escalating threat of global warming, fuelled primarily by greenhouse gas emissions, has emerged as one of the most critical challenges (Mariappan et al., 2023; Rashid et al., 2024; Arif et al., 2025). Among these gases, carbon dioxide ( CO2) accounts for around 75% of total global emissions due to its abundance and persistence in the atmosphere (Sick, Stokes & Mason, 2022; Yaacob et al., 2024). Cement production, a cornerstone of the construction industry, significantly contributes to approximately 8% of global CO2 emissions (Chadha et al., 2024). This is mainly attributable to the calcination of limestone and the energy-demanding procedures involved (Naik, Kraus & Siddique, 2002). With the increasing demand for concrete, the imperative to mitigate its environmental impact intensifies. Innovative strategies, including CO2 sequestration and utilisation, are being explored to transform CO2 from a liability into a valuable asset, thereby reducing its impact on climate change (Lippiatt, Ling & Pan, 2020). CO2 sequestration in concrete entails capturing carbon dioxide and integrating it during the mixing or curing phases (Zhang, Ghouleh & Shao, 2017; Singh et al., 2021). This procedure enables a sequence of chemical reactions that transform atmospheric CO2 into stable calcium carbonate CaCO3, which is embedded within the concrete matrix (Qian et al., 2018). These reactions improve the material’s structural integrity and diminish its porosity (Sharma & Goyal, 2018). The primary carbonation reactions are as follows:

Reaction during hydration:

2(3CaOSiO2)+3CO2+3H2O3CaCO3+3CaO2SiO23H2O

2(2CaOSiO2)+CO2+3H2OCaCO3+3CaO2SiO23H2O

Reaction with hydration products:

Ca(OH)2+CO2CaCO3+H2O

3CaO2SiO23H2O+3CO23CaCO3+2SiO2+3H2O

These reactions densify the microstructure of concrete, thereby enhancing its compressive strength and durability (Šavija & Luković, 2016). Moreover, carbonation minimises the overall carbon footprint of concrete by incorporating CO2 that would otherwise aggravate atmospheric pollution (Adesina, 2020).

Related works

The potential of incorporating CO2 in concrete has been extensively investigated (Chen & Gao, 2020; Suescum-Morales, Fernández-Rodríguez & Jiménez, 2022; Ahmed, Ahmad & Adekunle, 2024). Carbonation has demonstrated the ability to improve both the fresh and hardened characteristics of concrete while minimising its environmental impact (Monkman, Grandfield & Langelier, 2018; Monkman & MacDonald, 2017). The incorporation of CO2 in fresh concrete during mixing often reduces slump owing to rapid water loss resulting from exothermic carbonation reactions (Samniang et al., 2021). The initial and final setting times of carbonated concrete tend to be prolonged due to the free water produced during the reaction between Ca(OH)2 and CO2; however, optimized CO2 dosages can accelerate the setting process, resulting in a reduction of initial set times by 22–25% and final set times by 21–25% (Monkman, MacDonald & Hooton, 2016; Monkman, Grandfield & Langelier, 2018). Carbonation in hardened concrete considerably improves compressive strength by forming nanosized calcium carbonate, which densifies the matrix and fills microvoids (Monkman, 2018; Shah et al., 2018). Optimal CO2 dosages can increase compressive strength by 14% in 1 day and up to 30.9% in 28 days (Wang, He & Yang, 2018; Kumar et al., 2019). Accelerated carbonation curing has also been found to enhance early-age strength by 20–26% in 3–58 days (Monkman, MacDonald & Hooton, 2015). Excessive CO2 dosages might adversely impact compressive strength, highlighting the necessity for precise dosage optimisation (Saikia & Rajput, 2024). Another significant advantage of carbonation is its enhancement of durability. Carbonated concrete exhibits reduced porosity, lower water absorption, and improved resistance to chloride penetration and freeze-thaw cycles (Sharma & Goyal, 2022; Zhang & Shao, 2018). Carbonation refines pore structures, reduces pore sizes, and promotes the formation of stable calcium carbonate and C-S-H gel, resulting in a denser and more durable matrix (Cui et al., 2015; Jang & Lee, 2016). These findings collectively underscore the potential of CO2 utilisation to enhance concrete performance while addressing sustainability objectives.

Conventional methods for assessing the compressive strength of concrete necessitate extensive experimental testing, which is both time-consuming and resource-intensive (Candelaria, Kee & Lee, 2022; Getahun, Shitote & Abiero Gariy, 2018). Regression modelling serves as a viable alternative, enabling the prediction of compressive strength based on key input variables like cement content, water-to-cement ratio, aggregate size, and CO2 dosage (Tam et al., 2022). Regression models utilise statistical techniques and machine learning (ML) algorithms that enable assessing the mechanical properties of concrete efficiently, minimising dependence on physical testing (Singh et al., 2023). Regression modelling has been utilised in various application areas (Haque et al., 2025). Traditional models such as multiple linear regression and polynomial regression have proven effective in predicting the strength of conventional concrete but often encounter difficulties with complicated, nonlinear interactions (Yeh, 1998). Advanced models, such as support vector regression (SVR) and random forest (RF), have demonstrated efficacy with high-dimensional data, attaining prediction errors below 5% in strength assessments (Ly et al., 2021). Artificial Neural Networks (ANN) excel at capturing nonlinearities, with optimised architectures providing precise predictions for diverse concrete types, such as recycled aggregate and self-compacting concrete (Chopra, Sharma & Kumar, 2016).

Getahun, Shitote & Abiero Gariy (2018) developed a three-layer ANN model utilising 15 input parameters and attained a Mean Average Precision Error (MAPE) of around 2.09%, designating cement, water, RHA, and w/c ratio as the most significant factors. In a similar way, Yan et al. (2013) effectively utilised SVR to predict tensile strength from compressive strength, hence validating the efficacy of margin-based learners in property mapping. Hybrid and ensemble models such as Adaptive Neuro-Fuzzy Inference Systems and ensemble techniques such as Gene-Expression Programming and Extreme Gradient Boosting (XGB) improve prediction accuracy by combining algorithmic strengths, outperforming standalone models (Chu et al., 2021; Shamsabadi et al., 2022). Deep neural networks enhance the capabilities using extensive datasets, while hybrid and ensemble approaches perform exceptionally well under varied situations, rendering regression modelling essential for optimising concrete mix designs. The approaches, such as Convolutional Neural Network (CNN)-Bidirectional Long Short-Term Memory (BiLSTM) architectures and RF-optimised back-propagation networks, have been documented to surpass conventional models, especially in predicting strength at various curing ages (Leondes, 2002; Bharathi, Manju & Premalatha, 2017).

To mitigate data scarcity challenges in forecasting the compressive strength of concrete, as experiments are costly and limited, generative models like Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TAVEGAN) have emerged as viable solutions (Xu et al., 2019; Apellaniz, Parras & Zazo, 2024). These models produce synthetic but realistic datasets by assimilating the fundamental distributions of existing data, thereby improving model training. The integration of synthetic data generation with conventional and sophisticated regression models enhances predictive efficacy, alleviating the risk of overfitting and broadening the applicability of ML techniques to specialised concrete types or innovative mix designs. The implementation of Explainable AI (XAI) techniques enhances the interpretability and reliability of predictive models by offering transparent insights into their decision-making processes. XAI methodologies, including SHapley Additive exPlanations (SHAP), facilitate the identification of essential input variables affecting concrete strength predictions, thereby enhancing decision-making in construction practices (Lundberg & Lee, 2017). In the context of CO2 curing and accelerated carbonation curing, maturity-based models have been developed to quantify the synergistic effects of CO2 concentration, relative humidity, and flow rate on early-age strength development, demonstrating that optimal curing conditions can yield strengths comparable to those of 28-day moist-cured samples within hours (Xuan, Zhan & Poon, 2018). These models formalise the influence of CO2 dosage and curing environment through logistic-type functions, incorporating characteristics associated with carbonation kinetics and hydration acceleration. At a material scale, mathematical modelling has shown that carbonation and CO2 uptake are affected by cement chemistry, SCM content, exposure class, and surface-area-to-volume ratio, with high surface-area-to-volume elements demonstrating up to 255% more CO2 sequestration capacity compared to normal specimens (Souto-Martinez et al., 2017). These mechanistic insights underscore the need to integrate CO2 curing characteristics with mix design parameters in ML frameworks to enhance prediction accuracy.

Recent studies in predictive modelling for civil engineering materials increasingly utilise XAI and data augmentation techniques. SHAP has been pivotal in revealing the impact of individual features on model outputs, enhancing transparency and trust. For instance, in the study quantifying compressive strength in limestone powder incorporated concrete, SHAP was employed to understand the influence of variables like water-to-cement ratio and limestone dosage (Mishra, 2025). Additionally, the SHAP analysis revealed that recycled coarse aggregate has an inverse impact on the strength of Fibre-reinforced recycled aggregate concrete (Alsharari, 2025). Furthermore, SHAP analysis revealed cement content and curing age as the most significant factors affecting compressive strength (Abioye et al., 2025). SHAP explanations reveal interactions among features during the prediction process. While cement generally has a positive impact on strength, it also exhibits localised adverse effects. Findings offer valuable insights for guiding the design of high temperature resistant concrete materials (Meng et al., 2025). Utilising literature insights and limitations in current approaches, this study proposes an ML framework for estimating the compressive strength of CO2 incorporated concrete. Advanced regression modelling, synthetic data generation, and model explainability techniques are used to improve accuracy, solve data scarcity, and deliver actionable feature contribution insights. This study has the following objectives.

  • Evaluates the impact of regulated CO2 incorporation during concrete mixing on the development of compressive strength in concrete formulated with varied water-cement ratios and cement types (OPC and PPC) at various curing ages.

  • It examines the use of regression modelling approaches to estimate the compressive strength of CO2-incorporated concrete.

  • It introduces and compares models to generate high-quality synthetic data for concrete strength prediction to reduce data scarcity.

  • To optimise ML models capable of accurately predicting the compressive strength of concrete with Incorporated SHAP to interpret the predictions of the models, providing insights into feature contributions and enhancing the transparency of the models.

The rest of the article is organised as follows: ‘Experimentation and Data Collection’ describes the experimental setup and data collection. ‘System Description’ describes the problem formulation and algorithm, including the problem statement. ‘Results’ covers the data quality, data augmentation, ML model performance, and XAI model interpretability. After discussing the findings, the study concludes in ‘Discussions’.

Experimentation and data collection

Materials

The study utilised two cement types: Ordinary Portland Cement (OPC) and fly ash–based Portland Pozzolana Cement (PPC), adhering to IS: 269-2015 ; equivalent to ASTM C150/C150M and IS: 1489 (Part 1)–1991 equivalent to ASTM C595/C595M, respectively, with specific gravities of 3.15 and 3.06 and standard consistencies of 28% and 34%. River sand, locally sourced and classified as Zone II according to IS: 383-2016; equivalent to ASTM C33/C33M, was used as fine aggregate, exhibiting a specific gravity of 2.65, a fineness modulus of 2.82, and a water absorption of 1.60%. Coarse aggregates comprised crushed stone (20 and 10 mm), assessed in accordance with IS: 383-2016; equivalent to ASTM C33/C33M, exhibiting specific gravities of 2.68 and 2.77, fineness moduli of 6.00 and 6.95, and water absorptions of 0.17% and 0.37%, respectively. Potable tap water was used for both casting and curing. A polycarboxylic ether polymer-based superplasticizer (Fosroc Auramix 200) complying with IS: 9103-1999 ; equivalent to ASTM C494/C494M, was employed as an admixture. Furthermore, 99.9% pure industrial-grade carbon dioxide gas was supplied in high-pressure cylinders (about 300 psi or 20 bar) for the carbonation process.

Mix design and methodology

The study examined the impact of CO2 incorporation during mixing on the compressive strength of concrete. The data encompassed various concrete mix proportions, binder types, and CO2 dosages, facilitating an extensive assessment. The concrete mixes comprised two types of binders, OPC and PPC, with binder quantities of 500, 400, and 350 kg/m3. Additional constituents comprised water (175–193 kg/m3), coarse aggregates (1,200.63 kg/m3), fine aggregates (674.584 kg/m3), and superplasticisers (0.65% by weight of cement). CO2 was injected in concentrations of 0%, 0.05%, 0.10%, 0.15%, and 0.20% by weight of cement, serving as a critical factor in assessing compressive strength. The input parameters for regression modelling were binder type, binder content, water-to-binder ratio, CO2 dosage, and testing age. The concrete mixture proportions are listed in Table 1 range of parameters is listed in Table 2.

Table 1:
The concrete mixture proportions.
Binder type Binder kg/m3 Water kg/m3 Coarse aggregate kg/m3 Fine aggregate kg/m3 Superplasticizer % by cement weight CO2 content % by cement weight
PPC 500 175 1,200.63 674.584 0.65 0
PPC 500 175 1,200.63 674.584 0.65 0.05
PPC 500 175 1,200.63 674.584 0.65 0.10
PPC 500 175 1,200.63 674.584 0.65 0.15
PPC 500 175 1,200.63 674.584 0.65 0.20
OPC 400 182 1,200.63 674.584 0.65 0
OPC 400 182 1,200.63 674.584 0.65 0.05
OPC 400 182 1,200.63 674.584 0.65 0.10
OPC 400 182 1,200.63 674.584 0.65 0.15
OPC 400 182 1,200.63 674.584 0.65 0.20
PPC 400 182 1,200.63 674.584 0.65 0
PPC 400 182 1,200.63 674.584 0.65 0.05
PPC 400 182 1,200.63 674.584 0.65 0.10
PPC 400 182 1,200.63 674.584 0.65 0.15
PPC 400 182 1,200.63 674.584 0.65 0.20
OPC 350 193 1,200.63 674.584 0.65 0
OPC 350 193 1,200.63 674.584 0.65 0.05
OPC 350 193 1,200.63 674.584 0.65 0.10
OPC 350 193 1,200.63 674.584 0.65 0.15
OPC 350 193 1,200.63 674.584 0.65 0.20
PPC 350 193 1,200.63 674.584 0.65 0
PPC 350 193 1,200.63 674.584 0.65 0.05
PPC 350 193 1,200.63 674.584 0.65 0.10
PPC 350 193 1,200.63 674.584 0.65 0.15
PPC 350 193 1,200.63 674.584 0.65 0.20
DOI: 10.7717/peerj-cs.3316/table-1
Table 2:
Range of parameters in the original dataset.
Variables Minimum Maximum
WBRatio 0.35 0.55
OPC 0.00 500
PPC 0.00 500
DAYS 7.00 56.0
CO2 0.00 0.20
DOI: 10.7717/peerj-cs.3316/table-2

In this study, CO2 was injected directly during the mixing process. High-purity CO2 gas (99.9%) was supplied through a cylinder equipped with a dual-stage regulator, pressure gauge, and calibrated rotameter to ensure a regulated flow rate of 3.0 ± 0.2 L/min. The injection duration for each batch (30–45 s) was determined based on the mixer capacity and the required CO2 dosage relative to cement weight, ensuring delivery within ±2% of the target. The gas was injected into the concrete mixer through a 10 mm inner-diameter reinforced hose, with the mixing chamber enclosed by gasket-lined acrylic panels to reduce leakage, leaving only a narrow, sealed opening for the hose connection and a pressure relief vent equipped with a one-way valve. The measured CO2 was injected consistently, enabling its reaction with the hydrating cement and promoting carbonation within the mix. Safety protocols comprised continuous laboratory ventilation to ensure ambient CO2 levels remained below 1,000 ppm, using personal protective equipment (safety goggles, nitrile gloves, and CO2-rated respirators), leak detection via portable monitors, and emergency shut-off valves in compliance with ASTM C1768 standards. This process was essential for achieving the intended interaction between CO2 and the binder. Following the mixing process, the concrete was cast into standard moulds to assess its compressive strength. Compressive strength tests were carried out at 7, 28, and 56 days, in accordance with IS: 516-1959 norms; equivalent to ASTM C39, to evaluate the strength of the CO2-incorporated concrete.

Compressive strength test results

This study examines the impact of CO2 incorporation during mixing on the compressive strength of concrete, assessed at 7, 28, and 56 days to determine both early-age and long-term strength development. The findings demonstrate that the incorporation of CO2 at regulated dosages during mixing significantly improves the compressive strength of concrete, especially at reduced water-to-cement (w/c) ratios. The optimum CO2 concentration ranges from 0.05% to 0.10% for the majority of mixes, beyond which a reduction in strength is observed. This behaviour can be attributed to the carbonation reactions between CO2 and calcium hydroxide ( Ca(OH)2), which leads to the formation of calcium carbonate ( CaCO3). This reaction optimally refines the pore structure and enhances the density of the concrete matrix, consequently augmenting strength. However, excessive CO2 exposure may lead to premature carbonation, hindering the hydration process and reducing the availability of Ca(OH)2 for further strength development (Monkman et al., 2018; Monkman & MacDonald, 2016). For the w/c = 0.45 mixes, OPC-based concrete exhibited enhanced compressive strength with increasing CO2 dosage up to 0.10%. The control mix demonstrated compressive strengths of 24.25, 37.62, and 39.12 MPa at 7, 28, and 56 days, respectively. The highest strength (50.58 MPa at 56 days) was observed at 0.10% CO2, indicating that moderate carbonation positively impacted the microstructure. Nevertheless, beyond 0.10%, a reduction in strength was noted, likely due to excessive carbonation impeding further hydration (Monkman & MacDonald, 2017). Similarly, for PPC mixes, the compressive strength improved up to 0.05% CO2, to a maximum of 55.87 MPa at 56 days, after which the strength began to decrease, especially at 0.15% and 0.20% CO2 concentrations. The presence of pozzolanic materials in PPC would have facilitated early strength development, but prolonged exposure to CO2 would have reduced the efficacy of the long-term pozzolanic reaction (Samniang et al., 2021).

For the w/c = 0.35 mixes, with a reduced water content and inevitably higher strength, a more significant improvement due to CO2 incorporation was observed. The control OPC mix exhibited compressive strengths of 32.63, 50.19, and 53.20 MPa at 7, 28, and 56 days, respectively. The highest strength (60.09 MPa at 56 days) was attained at 0.10% CO2, indicating the improved densification effect resulting from controlled carbonation. A similar trend was seen for PPC, where 0.05% CO2 yielded the highest strength of 59.53 MPa at 56 days. However, beyond this concentration, strength began to decrease, indicating that excessive carbonation would have led to the formation of a dense outer layer that impeded further hydration of the cementitious matrix (Xu et al., 2022). Conversely, the w/c = 0.55 mixes exhibited a comparatively lower strength enhancement due to CO2 incorporation. The control OPC mix had compressive strengths of 18.19, 26.33, and 28.47 MPa at 7, 28, and 56 days, respectively. The compressive strength attained a maximum of 31.25 MPa at 56 days with 0.05% CO2; however, with higher CO2 dosages, the increase in strength was negligible. A similar pattern was noted in PPC, with the highest strength (30.95 MPa at 56 days) obtained at 0.05% CO2, followed by a decrease in strength. This limited strength improvement can be attributed to the higher porosity of w/c = 0.55 mixes, which would have reduced the effectiveness of CO2-induced densification. Furthermore, at high w/c ratios, carbonation may progress more rapidly. However, it does not substantially enhance overall strength development due to the extensive availability of pore water (Wang et al., 2019).

The results validate that the inclusion of CO2 during mixing improves concrete strength at optimal dosages (0.05–0.10%) owing to the beneficial effects of early-age carbonation. The regulated formation of CaCO3 enhances the packing density of the cement matrix and decreases porosity, thereby enhancing strength. However, when CO2 concentration exceeds a critical threshold (beyond 0.10%), the carbonation reaction consumes excessive Ca(OH)2, which may interfere with long-term hydration and strength development (Rashid & Singh, 2023). The impact of the w/c ratio, as shown in Figs. 1A and 1B, is also evident in the findings, as reduced w/c ratios (0.35 and 0.45) demonstrate a significant increase in strength. This is attributed to the denser microstructure of low w/c ratio mixes, which enables regulated carbonation to enhance the pore structure without substantially reducing hydration (Li et al., 2019). Conversely, high w/c ratio mixes (0.55), as shown in Fig. 1C exhibit increased porosity, hence limiting the efficacy of CO2-induced densification. The comparison between OPC and PPC indicates that PPC typically has superior long-term strength due to the pozzolanic reaction. However, PPC appears more sensitive to excessive CO2 exposure, as seen from the more pronounced decline in strength at 0.15% and 0.20% CO2 concentrations. The delayed pozzolanic reaction may be impeded by early carbonation, hence influencing the formation of additional cementitious compounds (Šavija & Luković, 2016). The study indicates an optimal CO2 range of 0.05–0.10% for strength enhancement, beyond which carbonation impacts hydration. The effect is more pronounced at lower water-to-cement ratios (0.35 and 0.45) and with OPC in comparison to PPC. These insights can be valuable for developing sustainable cementitious composites through the integration of regulated CO2 utilisation while ensuring optimal performance.

Compressive strength results of mixes at 7, 28, and 56 days for different w/c ratios: (A) w/c = 0.35, (B) w/c = 0.49, and (C) w/c = 0.55.

Figure 1: Compressive strength results of mixes at 7, 28, and 56 days for different w/c ratios: (A) w/c = 0.35, (B) w/c = 0.49, and (C) w/c = 0.55.

System description

The methodology presented in Fig. 2 outlines a systematic approach to estimating the compressive strength of CO2-incorporated concrete using a combination of data augmentation, ML modelling, and XAI techniques. The process is divided into three key steps:

  • 1.

    The workflow begins with model training on experimental data collected. Initial ML models create baseline performance measures.

  • 2.

    Synthetic data generation and integration: Two different models generate high-quality synthetic datasets. After evaluation, the best synthetic data is mixed with the real dataset to improve model training.

  • 3.

    Advanced ML models predict concrete compressive strength using the combined real and augmented dataset. SHAP provides transparency and understanding of feature contributions in model predictions.

Framework for estimating compressive strength of 
${{\rm{C}}}{{{\rm{O}}}_2}$CO2
-incorporated concrete.

Figure 2: Framework for estimating compressive strength of CO2-incorporated concrete.

This comprehensive approach not only seeks to boost prediction accuracy but also improves model interpretability, leading to improvements in sustainable concrete technology.

Problem formulation

The primary objective of this study is to develop an Artificial Intelligence (AI) model M that can accurately forecast the compressive strength of concrete ( yR) based on its mix proportions. The dataset DRn×m, where n is the number of samples and m represents the features, poses challenges due to its limited size ( n=270) and the complex, non-linear relationships. Moreover, the lack of interpretability in traditional ML models requires methods to explain predictions, ensuring trust and usability in real-world applications. This study addresses these challenges through a comprehensive framework that integrates data augmentation, ML, and model explainability.

The first step in the proposed framework involves addressing data scarcity by generating synthetic samples using both CTGAN and TVAE. The CTGAN model GCTGAN and the TVAE model GTVAE are independently trained on the original dataset D to generate synthetic datasets DsyntheticCTGAN and DsyntheticTVAE respectively, each consisting of kn samples, where k>1 is the augmentation factor. A quantitative and qualitative comparison of the synthetic datasets is then conducted using evaluation metrics and model performance metrics when trained on the synthetic data. The framework selects the synthetic dataset characterised by superior quality and the most representative samples based on this comparison. The final augmented dataset is then constructed as:

Dcombined=DDsyntheticbest,where Dsyntheticbest is the superior synthetic dataset from either CTGAN or TVAE. This approach ensures that only the best-quality synthetic data is utilised, leading to improved model performance and robustness. The ML model M is trained on Dfiltered with the objective of minimising prediction errors for compressive strength. The training process is governed by the following mathematical objective function:

minME(x,y)Dfiltered[L(y,M(x))],where L denotes the loss function, chosen as Mean Squared Error (MSE):

L(y,M(x))=1ni=1n(yiy^i)2,with y^i=M(xi) representing the predicted strength for input xi. The framework incorporates Explainable AI (XAI) techniques to enhance model interpretability. SHAP is used to decompose predictions into feature contributions, providing insights into the relative importance of each feature. The SHAP decomposition is given by:

SHAP(f(x))=ϕ0+i=1mϕi,where ϕ0 is the base value representing the average prediction, and ϕi quantifies the contribution of feature i to the prediction for input x. This interpretability ensures transparency and trust in the model’s predictions, making it suitable for practical applications. Further, the model undergoes iterative optimisation to improve its performance. This includes hyperparameter tuning and retraining to ensure the model achieves the best possible accuracy and generalisation. The optimisation process is reinforced through 10-fold cross-validation, which splits the dataset into 10 subsets, iteratively training the model on nine subsets while validating on the remaining subset. This method mitigates overfitting and yields a reliable assessment of the model’s efficacy. The evaluation metrics used to assess the model’s performance include Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R2, which are defined as follows:

MAE=1ni=1n|yiy^i|,RMSE=1ni=1n(yiy^i)2,R2=1i=1n(yiy^i)2i=1n(yiy¯)2where y¯ is the mean of the true values yi. The final output is an optimised model Moptimised that delivers accurate predictions for compressive strength while providing interpretable insights into the underlying factors. This framework effectively addresses the challenges of data scarcity, complexity, and model transparency in concrete strength forecasting.

Algorithm

The proposed Algorithm 1 integrates synthetic data generation to address data scarcity in concrete strength prediction. It combines real and synthetic data and trains ML models. Explainable AI methods, specifically SHAP, are employed to interpret feature contributions, enhancing model transparency.

Algorithm 1:
Estimating compressive concrete strength using data augmentation.
 Input: Original dataset DRn×m with n=270 rows and m=6 features.
 Output: Optimized model M with explainable insights.
 /* Data Preparation*/
1 Dffill(D).
2 DMinMaxScaler(D).
 /* Data Generation */
3 Train CTGAN and TVAE models GCTGAN and GTVAE on D: GCTGANfit(D), GTVAEfit(D).
4 Generate synthetic datasets DsyntheticCTGAN and DsyntheticTVAE: DsyntheticCTGANGCTGANgenerate(kn),
DsyntheticTVAEGTVAEgenerate(kn), where k is the synthetic-to-real ratio.
5 Compare synthetic datasets and select the best quality dataset:
Dsyntheticbestselect_best(DsyntheticCTGANDsyntheticTVAE).
6 Combine real and selected synthetic data: DcombinedDDsyntheticbest.
 /* Training */
7 Split Dfiltered into training ( Dtrain) and testing ( Dtest) sets.
8 Train ML model M: Mtrain(Dtrain).
 /* Evaluation */
9 Evaluate M on Dtest:
10 Compute metrics RMSE(M), MAE(M), and R2(M).
 /* Optimization */
11 Refine M through iterative retraining and hyperparameter tuning:
12 Moptimizedtune(M).
 /* Explainable AI */
13 Apply SHAP to interpret M:
14     SHAP(f(x))=ϕ0+i=1mϕi,
 where ϕi are feature contributions.
DOI: 10.7717/peerj-cs.3316/table-8

Results

Data preprocessing and analysis

The original dataset consists of 270 rows and includes the crucial factors affecting concrete compressive strength. The dataset comprises variables including the Water-to-Binder Ratio (WBRatio) between 0.350 and 0.550, OPC and PPC with values between 0.000 and 500.0 kg/m3, curing days (DAYS) spanning from 7 to 56 days, and CO2 dose varying from 0.000 to 0.200. Every row in the data shows a unique concrete mix configuration that relates these input variables to the measured compressive strength. The small size of the dataset and the complexity of the relationships among variables emphasise the need for synthetic data augmentation to support the training process of ML models and predictive accuracy. The variable WBRatio exhibits a narrow and consistent distribution, indicating minimal variability across the samples. Similarly, CO2 levels show a tightly clustered distribution, suggesting well-regulated conditions with limited variability. In contrast, DAYS, representing the curing age, displays a broader distribution, indicating diverse sample ages included in the dataset. The cement variables, OPC and PPC, also show consistent distributions, with OPC spanning a slightly wider range compared to PPC. Lastly, STRENGTH, representing material strength, exhibits significant variability between 20 and 60, reflecting the range of structural performance observed.

The correlation matrix in Fig. 3 reveals key relationships among the variables in the dataset. A strong negative correlation (−0.72) between the WBRatio and material strength (STRENGTH) indicates that higher water content reduces structural strength, aligning with known material science principles. Similarly, the strong inverse relationship (−0.96) between OPC and PPC reflects their complementary roles in material composition. The moderate positive correlation (0.54) between curing time (DAYS) and strength highlights the beneficial impact of extended curing on material performance. These findings emphasise the critical role of controlling the water-binder ratio and curing time to optimise material strength while highlighting the interplay between OPC and PPC in mixtures.

Correlation matrix of dataset features.

Figure 3: Correlation matrix of dataset features.

Data augmentation

The original dataset undergoes preprocessing, ensuring consistency and compatibility for subsequent steps. Small datasets often lead to models that are prone to overfitting, where the model learns specific patterns in the data rather than general trends, resulting in suboptimal performance. This can compromise the reliability of insights and predictions. To overcome data scarcity, CTGAN and TAVE are employed to generate synthetic data, which expands the dataset by producing samples that mimic the statistical properties of the original data. The process commenced with metadata detection using the Metadata method from the SDV library. Metadata serves as a critical component in preserving the structural integrity of synthetic data by capturing schema definitions, data types, and interdependencies among variables. This ensures that the generative model adheres to the logical and statistical constraints inherent in the original dataset. Both models were subsequently trained on the original data using 1,000 epochs to ensure convergence and learning of the underlying data distributions. The model generated 10,000 synthetic samples; the samples are filtered, and substantially high-quality data augment the dataset to enable more robust model development. The filtering process utilised Cook’s Distance to ensure the reliability of the generated synthetic data. Following the application of a linear regression model to the integrated real and synthetic dataset, we calculated Cook’s Distance for each synthetic sample to assess its impact on the model fit. Samples exhibiting a Cook’s Distance value exceeding 4/n were classified as high-influence outliers and subsequently excluded. This enabled the removal of synthetic samples that, although not definitive statistical outliers, had the potential to distort learning and diminish model robustness.

Synthetic data quality was evaluated through diagnostic and statistical similarity measures as shown in Table 3. Through data quality assessment reports for CTGAN and TVAE, it is evident that both models attain perfect scores in data validity, data structure, and general basic data quality, each scoring 100%. However, on closer examination of data structure elements, one finds notable variations. In both column shapes (97.42% vs. 85.79%) and column pair trends (94.59% vs. 62.97%), the TVAE model outperforms CTGAN. As a result, TVAE’s (96.01%) general data structure score is rather better than CTGAN’s (74.38%). Although both models preserve fundamental data integrity, TVAE generates synthetic data with a more accurate representation of feature distributions and relationships. Further, Column-wise data quality assessments, as in Table 4, confirm that TVAE consistently outperforms CTGAN across all evaluated criteria. The TVAE model demonstrates exceptional capability in maintaining the distributional integrity of the WBRatio (0.9978), OPC (0.9890), and CO2 (0.9823), achieving nearly perfect scores across the majority of columns. Conversely, CTGAN demonstrates challenges in capturing intricate relationships among these variables, evidenced by its relatively poor performance, particularly in the CO2 (0.7147) and STRENGTH (0.749) columns. The WBRatio exhibits the most significant disparity (0.656 vs. 0.9978), indicating that TVAE provides a more authentic synthetic representation of this crucial attribute. TVAE’s a superior ability to generate high-quality synthetic data that closely mirrors the statistical properties of the actual dataset. However, recognising that not all synthetic samples may accurately represent the original distribution. In such cases, outliers are removed, and only synthetic samples closely aligned with the original data distribution are retained.

Table 3:
Comparison of data quality evaluation for CTGAN and TVAE.
Evaluation metric CTGAN (%) TVAE (%)
Data validity 100.0 100.0
Data structure 100.0 100.0
Overall 100.0 100.0
Column shapes 85.79 97.42
Column pair trends 62.97 94.59
Overall 74.38 96.01
DOI: 10.7717/peerj-cs.3316/table-3
Table 4:
Comparison of column-wise data quality metrics for CTGAN and TVAE.
Column Metric CTGAN score TVAE score
WBRatio KSComplement 0.765777 0.997833
OPC TVComplement 0.985947 0.989000
PPC TVComplement 0.966797 0.978800
DAYS TVComplement 0.969233 0.951067
CO2 KSComplement 0.714700 0.982300
STRENGTH KSComplement 0.744969 0.946000
DOI: 10.7717/peerj-cs.3316/table-4

Note:

TVComplement, Total variation complement; KSComplement, Kolmogorov-Smirnov complement.

Machine learning modelling

Once the data preparation is complete, the final dataset obtained is split into training and testing sets in a (80:20) ratio. We utilised a 10-fold cross-validation strategy during the training phase to ensure robust model evaluation and prevent overfitting. This approach involves dividing the training data into ten subsets of equal size. In each iteration, nine folds are allocated for training, while the remaining fold serves for validation. This procedure is executed ten times, guaranteeing that each data point is utilised for validation precisely once. The resulting data is analysed using regression models, including a diverse range of ML models, including advanced algorithms such as Light Gradient Boosting Machine (LGBM), Gradient Boosting Regressor (GBR), RF, and K-nearest Neighbors Regressor (KNN). Ensemble techniques like AdaBoost Regressor (ADA) and Extra Trees Regressor (ET), along with Extreme Gradient Boosting (XGBoost), were also evaluated for their predictive performance. Traditional tree-based models, such as the Decision Tree (DT), were considered alongside linear models, including Linear Regression (LR) and Elastic Net (EN), to accurately estimate the compressive strength of CO2-incorporated concrete. The models are rigorously evaluated using performance metrics, including MAE, RMSE, and R2.

The model performance using the original dataset is initially assessed to obtain the overall improvement in the model predictability after data augmentation. The Table 5 presents a comprehensive comparison of various regression models. The results highlight LGBM as the best-performing model across metrics, achieving the lowest MAE (2.5954) and RMSE (3.1579), as well as the highest R2 score (0.9356). This suggests LGBM’s superior ability to capture patterns in the data with minimal error. This suggests that LGBM not only reduced errors but also accounted for the greatest percentage of variance in the target variable, thereby illustrating its generalisation and robustness. GBR and RF also demonstrate strong performance, though slightly less optimal compared to LGBM. Ensemble models, such as ADA and ET, show moderate performance. On the other hand, simpler models like LR and EN exhibit significantly higher errors and lower R2 scores, indicating their limited capability. These results suggest that linear models struggled to capture the underlying relationships within the dataset.

Table 5:
Model performance on original dataset.
Model MAE RMSE R2
LGBM 2.5954 3.1579 0.9356
GBR 2.6934 3.2865 0.9299
RF 2.8967 3.5157 0.9216
KNN 2.9203 3.5510 0.9171
ADA 3.0062 3.7472 0.9079
ET 3.1587 3.8622 0.9049
XGBoost 3.2000 3.8744 0.9042
DT 3.2426 3.9587 0.9003
LR 4.7495 5.7305 0.7850
EN 4.9119 6.0493 0.7528
DOI: 10.7717/peerj-cs.3316/table-5

The LGBM model’s robust performance is collectively illustrated by the Fig. 4 that has been presented. The model’s performance is illustrated by the learning curve Fig. 4A, which demonstrates strong generalisation capability with minimal overfitting as the number of training epochs increases. This is evidenced by the convergence of training and validation scores. This equilibrium implies that the model is neither overfitting nor underfitting. The model’s strong predictive accuracy and calibration are further emphasised by the high R2 value (0.913) and the tight clustering around the identity line in the prediction error plot Fig. 4B, which compares the predicted values ( y^) to the actual values ( y). By visualising the distribution of prediction errors, the residuals diagram Fig. 4C provides an additional layer of diagnostic analysis. The model predictions show no significant bias, as evidenced by the random dispersion around the zero line. The histogram’s roughly normal error distribution suggests that the model performs well without systematic deviations. The validation curve Fig. 4D offers a comprehensive understanding of the impact of the max_depth hyperparameter on the model’s performance. It illustrates that the model obtains optimal performance at a max depth of approximately 4–6, the parameters utilised are presented in Table 6.

Performance of LGBM utilising original data (A) learning curve, (B) prediction error plot, (C) residuals plot and (D) validation curve.

Figure 4: Performance of LGBM utilising original data (A) learning curve, (B) prediction error plot, (C) residuals plot and (D) validation curve.

Table 6:
Hyperparameters for LightGBM model.
Parameter Value Description
boosting_type gbdt Boosting algorithm type (Gradient boosted decision trees)
learning_rate 0.4 Step size shrinkage to prevent overfitting
num_leaves 150 Maximum number of leaves per tree
max_depth −1 Maximum tree depth
n_estimators 20 Number of boosting iterations
feature_fraction 0.5 Fraction of features used in each iteration
bagging_fraction 0.9 Fraction of data used for bagging
bagging_freq 3 Frequency of bagging
min_child_samples 6 Minimum number of samples per leaf
min_child_weight 0.001 Minimum sum of instance weight per leaf
min_split_gain 0.3 Minimum loss reduction to make a split
reg_alpha 0.005 L1 regularization term on weights
reg_lambda 0.0005 L2 regularization term on weights
random_state 123 Random seed for reproducibility
folding_strategy KFold(10) 10-fold cross-validation
DOI: 10.7717/peerj-cs.3316/table-6

In comparison to the original dataset, the performance of ML models on the augmented dataset exhibits a significant improvement in all evaluation metrics, as illustrated in Table 7. The augmented dataset improved the R2 of the majority of models and reduced estimation errors. LGBM maintains its lead in performance, with an MAE of 1.1847, an RMSE of 1.3833, and an R2 of 0.9872. In the same way, other ensemble models, including ET, GBR, and RF, exhibit robust performance, with R2 values that are nearly identical, ranging from 0.9872 to 0.9848. Interestingly, KNN also demonstrates improved performance (MAE = 1.2665, RMSE = 1.5335, R2 = 0.9843), which is a significant improvement from its original dataset performance. However, ADA exhibits less improvement, with an increase in R2 from 0.9079 to 0.9616, but it continues to lag behind the dominant models. The weakest performers are linear models, including LR and EN, despite the augmentation of the dataset. Although the R2 of LR has slightly improved from 0.7850 to 0.8458, EN continues to struggle with an R2 of 0.5851. This validates the idea that the data’s complexity still necessitates more advanced, non-linear modelling methods.

Table 7:
Model performance on augmented dataset.
Model MAE RMSE R2
LGBM 1.1847 1.3833 0.9872
GBR 1.2657 1.5108 0.9848
RF 1.1849 1.3843 0.9872
KNN 1.2665 1.5335 0.9843
ADA 1.9324 2.3998 0.9616
ET 1.1850 1.3846 0.9872
XGBoost 1.1848 1.3842 0.9872
DT 1.1850 1.3847 0.9872
LR 3.9731 4.8093 0.8458
EN 5.6742 7.8864 0.5851
DOI: 10.7717/peerj-cs.3316/table-7

The LGBM model exhibits a significant improvement in performance against the original dataset on the augmented dataset, as illustrated in Fig. 5. The learning curve Fig. 5A emphasises a robust generalisation capability as the training and cross-validation scores closely converge. This improved performance is further substantiated by the prediction error plot Fig. 5B, which displays an R2 value of 0.988. This enhancement suggests that the model trained on the augmented data is capable of predicting values with a significantly higher degree of precision. The model calibration is improved by the denser and more consistent distribution of points along the diagonal, which implies a reduced number of outliers. The distribution of residuals is more densely concentrated around the zero line in the residuals plot Fig. 5C than in the original dataset. Further, demonstrates a more concentrated and symmetrical distribution, while the residuals exhibit less variance. The validation curve Fig. 5D suggests that the model’s performance is less susceptible to fluctuations in the max-depth hyperparameter on the augmented dataset. The validation score remains high across a broader range of max-depth values, indicating that the augmented data not only enhanced performance but also increased model robustness and reduced the risk of overfitting.

Performance of LGBM utilising augmented data (A) learning curve, (B) prediction error plot, (C) residuals plot and (D) validation curve.

Figure 5: Performance of LGBM utilising augmented data (A) learning curve, (B) prediction error plot, (C) residuals plot and (D) validation curve.

Explainable artificial intelligence

The SHAP summary graphs from both the original Fig. 6 and augmented data Fig. 7 provide a comprehensive visualisation of the significance of features and their influence on the model’s predictions. Consistency in the model’s interpretation of feature influences is indicated by the identical feature rankings and SHAP value distributions in the analysis of both plots. The most influential features of PPC, OPC, CO2, DAYS, and WBRatio were analysed. The strong predictive power of the model is demonstrated by the broad distribution of SHAP values. OPC and CO2 exhibit substantial impacts, as evidenced by the extensive SHAP value ranges that indicate their critical roles in shaping the model output. In contrast, the SHAP value distributions of DAYS and WBRatio are more restricted, which indicates a more modest impact on predictions and potentially less predictive strength.

SHAP summary plot for the original dataset.

Figure 6: SHAP summary plot for the original dataset.

SHAP summary plot for the augmented dataset.

Figure 7: SHAP summary plot for the augmented dataset.

The uniform influence of features on model predictions in both scenarios is underscored by the consistent SHAP value range of −10 to 10 across plots. The correlation between the magnitude of the feature and the direction of the prediction is facilitated by the colour gradient from low to high feature values. For instance, positive SHAP values, indicated by darker colours, indicate a direct relationship with the model’s output. The vertical dispersion observed in PPC and OPC suggests potential feature interactions or non-linear influences effectively represented by the model, while the balanced SHAP value spread for CO2 suggests a linear or consistently influential relationship across its value range. The model’s feature attributions remain unaltered, as evidenced by the identical nature of both SHAP plots. This consistency suggests that the data augmentation procedure maintained the same feature distributions and relationships.

The interpretability and predictive capacity of the model, as illustrated in Fig. 8A, are considerably enhanced, as demonstrated by the feature importance plot for the augmented dataset. The CO2 feature’s dominance is the most significant observation, as it has a variable importance score of approximately 300, indicating that it has a substantial impact on the model’s predictions. Other features, including DAYS, WBRatio, OPC, and PPC, also influence the model’s decision-making to a lesser extent. The model’s comprehensive visualisation of the contribution of individual features to a specific estimation is illustrated in the SHAP force plot in Fig. 8B. The model anticipates a value of 24.72, which is significantly lower than the base value of approximately 35.25. The estimated value is substantially reduced by the most influential feature in this prediction, DAYS, which has a value of 7. The estimation is pushed toward a lower outcome as a result of the significant negative impact indicated by the blue bar associated with DAYS. This implies that the estimation value is significantly reduced by higher values of DAYS, underscoring the model’s sensitivity to this feature. Furthermore, the estimation is also adversely affected by the CO2 feature, which has a value of 0, although to a lesser extent than DAYS. The predicted value is further reduced by the narrower blue bar associated with CO2, which reinforces its modest yet tangible impact.

(A) Feature importance plot for the dataset and (B) SHAP force plot demonstrating feature contributions to model prediction.

Figure 8: (A) Feature importance plot for the dataset and (B) SHAP force plot demonstrating feature contributions to model prediction.

The plot effectively distinguishes the direction of influence, with blue indicating features that contribute to a reduced estimation error. The absence of red bars indicates that no features in this instance influenced the prediction to a higher value, indicating a cumulative downward influence on the estimated output. The model’s prediction is reduced from the base value of 35.25 to 24.72 as a result of the combined effects of DAYS = 7 and CO2 = 0, which highlights the extent of these features’ influence. The plot improves the interpretability of the model by clarifying the impact of feature values on specific predictions from a practical standpoint. A DAYS value of 7 is indicative of an early-stage curing period, which is typically associated with lower strength values in applications such as concrete strength estimation. This value is consistent with the reduced estimation. Likewise, a CO2 value of 0 indicates that the environmental or compositional conditions are additional factors that contribute to the lower strength outcomes. The insights obtained from this force plot not only enhance comprehension of the model’s decision-making process but also offer practitioners actionable information. For example, strategies that involve changing feature values, such as optimising curing time (DAYS) or managing CO2 exposure, have the potential to enhance predicted outcomes.

Interpreting SHAP for concrete mix design

The SHAP analysis not only identifies the most significant variables in forecasting compressive strength but also offers prescriptive guidance for optimising concrete mixtures. The uniform feature rankings in both datasets indicate that the model is consistent in domain-specific correlations. Specifically, CO2 demonstrates the greatest SHAP value magnitudes, signifying a substantial impact on strength. Positive SHAP results at moderate CO2 concentrations show that regulated CO2 curing can improve strength; however, negative SHAP values at zero or minimal CO2 exposure imply a decline in performance. This suggests that implementing CO2 curing at optimised dosages enhances results, while preventing high amounts that may lead to declining strength. Likewise, curing time (DAYS) has an inverse correlation in the initial stages: substantial negative SHAP values for brief curing durations (e.g., 7 days) signify diminished strength, consistent with cement hydration dynamics. Prolonging curing beyond initial stages, especially for mixtures with significant early age sensitivity, can enhance forecasts favourably, therefore augmenting performance. Binders like OPC and PPC have extensive SHAP value distributions, indicating significant interaction effects with curing time and CO2 dosage.

Discussion

This study presented a framework for predicting the compressive strength, enhanced through synthetic data generation using the TVAE model. The approach addresses data scarcity challenges by augmenting limited experimental datasets with statistically validated synthetic samples. Further, integration of a filtering mechanism using Cook’s Distance to eliminate synthetic outliers that could compromise model reliability. The models demonstrated high predictive accuracy. These findings suggest that the proposed framework can serve as a scalable and efficient tool for early-stage mix design evaluation, potentially reducing the need for extensive laboratory testing in standard design scenarios. Despite the promising results, the study has certain limitations that warrant consideration. While the validity of the synthetic data was evaluated statistically and a small subset of generated mixes was experimentally verified. The experimental compressive strengths of selected synthetic designs were within ±10% of the predicted values. However, such partial validation does not account for more complex phenomena, such as carbonation-induced alterations in hydration kinetics. Furthermore, the models were trained exclusively on OPC and PPC-based concretes with defined water-to-binder ratios and curing regimes. As a result, the models are applicable primarily for interpolation within this domain, and extrapolation to high-performance concrete, ultra-high-performance concrete, self-compacting concrete, or varying environmental exposures (e.g., humidity and temperature) should not be assumed without retraining and independent validation using representative datasets. Future studies should incorporate multi-source experimental data that includes various binder systems, admixtures, and curing methods to improve generalizability.

Conclusion

This study focuses on the incorporation of CO2 during mixing as an effective carbon sequestration strategy, contributing to the reduction of atmospheric CO2. The regulated integration of CO2 during mixing improves concrete compressive strength, with optimal results reported at 0.05–0.10% CO2 dosage. The improvement in strength is more significant in mixes with reduced water-cement ratios (0.35 and 0.45). Beyond 0.10% CO2 dosage leads to a decline in strength owing to carbonation-induced interference with hydration. OPC mixes demonstrate superior sensitivity to higher CO2 dosages compared to PPC, which is also depicted from ExAI results. Further, it successfully demonstrates a robust approach for estimating the compressive strength of CO2 incorporated concrete by integrating advanced ML techniques with synthetic data generation. The proposed framework effectively addresses the challenges associated with limited experimental data by generating and rigorously validating synthetic data. The exceptional predictive performance of the model, with an R2 value of 0.9872 exhibiting 5.52% improvement, MAE of 1.1847 indicating 54.35% improvement, and RMSE of 1.3833 with 56.20% improvement, underscores the efficacy of combining synthetic data augmentation. The incorporation of XAI techniques bridges the gap between the model’s complexity and its interpretability, fostering confidence in its deployment within the construction, materials science and industries. Future research will focus on broadening the applicability of this framework to other domains facing similar data scarcity challenges, exploring its potential in diverse research and industrial applications.

Supplemental Information

Supplemental Figures.

DOI: 10.7717/peerj-cs.3316/supp-1

Original Experimental Dataset.

Each data point speaks about the strength of the concrete.

DOI: 10.7717/peerj-cs.3316/supp-2