Application of machine learning techniques for warfarin dosage prediction: a case study on the MIMIC-III dataset

PeerJ Computer Science


Introduction

Background

Objective

Challenges of missing data in MIMIC-III dataset

Problem statement

Rationale for INR prediction

Organizational structure

Literature review

Methodology

Dataset overview and clinical relevance

Data preprocessing and merging techniques

  • Step 1: Inner merging was used to focus on coagulation and biochemical factors, ensuring that only complete records across key variables were included. Although this reduced error for the ML models, it also excluded rare cases, a necessary trade-off to obtain comprehensive feature sets.

  • Step 2: Outer merging was applied to primary coagulation factors using “Subject ID” and “Chart Time” as keys. This approach retained all patient records, including those with missing data, ensuring that rare cases, which could offer valuable clinical insights, were preserved for analysis.

  • Step 3: Inner merging was re-applied, combining selected coagulation and biochemical factors. This balanced data completeness with the inclusion of critical clinical markers across a broader patient population.
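The merging strategies above can be sketched with pandas; the toy frames and column names (`subject_id`, `charttime`, `pt`, `bicarbonate`) are illustrative stand-ins for the MIMIC-III extracts, not the paper's actual extraction code:

```python
import pandas as pd

# Hypothetical lab extracts keyed on subject and chart time.
coag = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "charttime": ["t1", "t2", "t1", "t1"],
    "pt": [13.2, 14.1, 15.0, 12.8],
})
biochem = pd.DataFrame({
    "subject_id": [1, 2, 2, 4],
    "charttime": ["t1", "t1", "t2", "t1"],
    "bicarbonate": [24.0, 22.5, 23.1, 25.4],
})

keys = ["subject_id", "charttime"]

# Steps 1 and 3: inner merge keeps only rows complete in both tables.
inner = coag.merge(biochem, on=keys, how="inner")

# Step 2: outer merge retains every record, leaving NaN where a
# measurement is missing, so rare cases survive for later imputation.
outer = coag.merge(biochem, on=keys, how="outer")

print(len(inner), len(outer))  # the inner result is a subset of the outer union
```

With these toy tables the inner merge keeps the two key pairs present in both frames, while the outer merge yields the six-row union with NaN gaps awaiting imputation.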

Dimensionality reduction methods

Missing value imputation methods

Methods for missing completely at random

Methods for missing at random

Methods for missing not at random

Methods of outlier detection—isolation forest

Prediction model—random forest

Methodology and performance evaluation

  • Data segmentation and imputation techniques: Step 1: MICE imputation is combined with basic ML models and outer merging strategies. This step retains all patient records, including those with missing data, preserving rare cases but introducing challenges in handling extensive missing data. Step 2: The pipeline advances to more refined preprocessing while continuing with MICE. Inner merging is employed to maintain records with complete data across key variables, leading to a more accurate dataset but excluding some rare patient cases. Step 3: MICE imputation is integrated with ML methods such as DAEs and GANs to address MNAR data. Balanced merging techniques are introduced to preserve critical clinical markers, resulting in a robust dataset for predictive modeling.

  • Dimensionality reduction and outlier detection: After imputation, PCA simplifies the dataset by capturing major patterns, while t-SNE explores non-linear relationships. Isolation Forest is used for outlier detection, removing extreme values that could distort predictive accuracy and producing a more homogeneous dataset for modeling.

  • Predictive modeling and feature importance: RF is employed for predictive modeling due to its ability to handle non-linear relationships and its robustness against overfitting. It predicts INR based on key clinical variables and identifies influential predictors through its feature importance function. Step 1: The initial dataset, processed with basic imputation and outer merging, results in relatively lower accuracy due to the high amount of missing data. Step 2: More sophisticated imputation and inner merging improve predictive accuracy. Key predictors, such as coagulation markers, INR, and genetic factors, are better identified. Step 3: The most refined dataset, imputed with DAEs and GANs, reduced via PCA and t-SNE, and cleaned of outliers, produces the most accurate INR predictions. Critical predictors like CYP2C9, VKORC1, biochemical factors, and INR levels are highly ranked in the feature importance analysis.

  • Performance evaluation and metrics: Imputation effectiveness and predictive model accuracy are evaluated using RMSE, chosen for its sensitivity to larger errors, which is crucial in clinical contexts like INR prediction. The RF regression model with consistent hyperparameters is applied across all steps to ensure fair comparison, attributing performance differences directly to imputation effectiveness. RMSE, preferred over MAE and MAPE, emphasizes outliers and maintains precision within small target ranges, making it the most suitable metric here.
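The evaluation protocol above can be sketched with scikit-learn. The features and "INR" target below are synthetic placeholders; the point is that one RandomForestRegressor with fixed hyperparameters is reused across steps, so RMSE differences reflect imputation quality rather than model tuning:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for an imputed feature matrix (the paper's real
# features are coagulation and biochemical markers).
X = rng.normal(size=(500, 8))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=500)  # mock "INR"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fixed hyperparameters, held constant across all imputation steps.
rf = RandomForestRegressor(n_estimators=200, max_depth=None, random_state=0)
rf.fit(X_tr, y_tr)

# RMSE penalizes large errors quadratically, which matters clinically.
rmse = mean_squared_error(y_te, rf.predict(X_te)) ** 0.5
print(round(rmse, 3))
```

Running the same script on each step's imputed matrix and comparing the printed RMSE values mirrors the comparison described above.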

Results

Feature importance analysis

Methodology

  • Step 1: Initial analysis sought to confirm the prominence of critical factors like Factor II and Factor VII. However, unexpected variables like platelet smear dominated the results, suggesting that the imputation process had introduced significant bias, distorting the natural relationships and compromising the model’s clinical reliability (Luo et al., 2016).

  • Step 2: With a more comprehensive set of biochemical and coagulation markers, the feature importance distribution improved. However, some inconsistencies remained—key predictors such as those related to warfarin’s effects did not retain their expected significance. While the model showed progress, the imputation method required further refinement to fully capture the complex interactions between predictors (Che et al., 2018).

  • Step 3: By refining the imputation approach further, Step 3 aligned well with clinical expectations. Critical factors, such as white blood cell count (WBC), bicarbonate, and gender, were consistently ranked as the most influential predictors (Xu & Wunsch, 2005; Bolón-Canedo, Sánchez-Maroño & Alonso-Betanzos, 2016). This validation demonstrated that the imputation effectively preserved the necessary relationships, making the predictions more reliable and clinically applicable (Nazabal et al., 2020).

Key insights and clinical implications

Clinical application of feature importance

  • Factor II is key to clot formation as a precursor to thrombin, which converts fibrinogen to fibrin. Warfarin inhibits prothrombin synthesis, prolonging PT and elevating INR. Low prothrombin levels increase the risk of over-anticoagulation and bleeding. Our model identifies Factor II as critical for INR prediction, recommending a 10-15% warfarin reduction to prevent excessive anticoagulation and bleeding.

  • Factor VII is highly sensitive to warfarin and an early indicator of its effect on coagulation. A decrease in Factor VII rapidly increases PT and INR, making it crucial for immediate dose adjustments. The model’s reliance on Factor VII highlights its importance in real-time warfarin management, where fluctuations may signal unstable INR control and the need for frequent monitoring. A sharp decline in Factor VII, even with stable INR, may require a preemptive dose adjustment to prevent thrombotic events.

  • Factor X is a central component of the common coagulation pathway and directly facilitates the conversion of prothrombin to thrombin. Reductions in Factor X caused by warfarin delay clot formation, leading to prolonged PT and elevated INR. Our model highlights Factor X as a critical factor for fine-tuning warfarin dosing. When a patient’s Factor X levels drop more than expected, the model may recommend a slight increase in warfarin to maintain INR within the therapeutic range, thereby reducing thrombotic risk. Conversely, elevated Factor X levels may signal the need for a dose adjustment to prevent under-anticoagulation.
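A random forest's feature-importance ranking, the mechanism behind the interpretations above, can be reproduced in miniature with scikit-learn. The column names echo the markers discussed, but the data are synthetic, with only two columns carrying signal by construction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)

# Hypothetical predictors; only factor_ii and factor_vii drive the target,
# so they should surface at the top of the importance ranking.
cols = ["factor_ii", "factor_vii", "factor_x", "wbc", "bicarbonate"]
X = pd.DataFrame(rng.normal(size=(400, len(cols))), columns=cols)
y = 3.0 * X["factor_ii"] - 2.0 * X["factor_vii"] + rng.normal(scale=0.2, size=400)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1.
ranking = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(ranking.head(2).index.tolist())
```

In the paper's setting the same `feature_importances_` attribute is what ranks Factors II, VII, and X against the other clinical variables.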

Dimensionality reduction and outlier detection

Step 1: principal component analysis for initial exploration

Step 2: transition to t-SNE for non-linear structure exploration

Step 3: outlier detection with isolation forest
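The three steps named above chain together in a short scikit-learn sketch; the data are synthetic, and hyperparameters such as `contamination=0.02` and `perplexity=30` are illustrative choices, not the paper's settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)

# Synthetic cohort: a dense cluster plus a few extreme rows mimicking
# the implausible INR outliers discussed later in the paper.
X = rng.normal(size=(300, 10))
X[:5] += 15.0  # inject gross outliers

# Step 1: PCA captures the dominant linear structure.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# Step 2: t-SNE explores non-linear neighborhood structure.
X_tsne = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(X)

# Step 3: Isolation Forest labels outliers -1; keep only inliers (+1).
labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
X_clean = X[labels == 1]

print(X_pca.shape, X_tsne.shape, X_clean.shape)
```

Note that t-SNE is used here only for exploration; the cleaned matrix `X_clean`, not the 2-D embeddings, is what would feed the downstream random forest.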

Inference drawn from missing value imputation methods

  • Deletion methods: Deletion methods handle missing data by removing records with incomplete information. While simplifying the dataset, this introduces significant drawbacks in high-dimensional clinical datasets like MIMIC-III, where missing data in critical variables (e.g., lab results and vital signs) is common. Deletion leads to selection bias, overrepresenting patients with fewer complications and excluding those with irregular monitoring, often more severely ill or with rare conditions (Waljee et al., 2013; Beaulieu-Jones et al., 2017). This bias skews the dataset toward typical cases, reducing variability needed for accurate predictions and oversimplifying clinical responses like INR outcomes (Sohrabi & Tajik, 2017). Consequently, deletion methods can result in inaccurate warfarin dosing recommendations, increasing the risk of complications such as bleeding or thrombosis in atypical patients (Emmanuel et al., 2021). Additionally, deletion reduces dataset size, weakening statistical power and limiting the detection of significant relationships between patient characteristics and INR outcomes, particularly in critical subgroups (Murray, 2018). For instance, in MIMIC-III, 80% of prothrombin time (PT) results and 70% of Glasgow Coma Scale (GCS) scores are missing, skewing predictions toward stable cases. Similarly, 55% of Heparin administration data is missing, directly affecting INR management (Yoon, Jordon & Schaar, 2018).

  • Single imputation methods: Single imputation techniques (mean, median, or mode imputation) applied to the MIMIC-III dataset also fall short in handling missing data, resulting in biased and less reliable INR predictions (Beaulieu-Jones et al., 2017; Waljee et al., 2013). These methods oversimplify by replacing missing values with statistical averages, distorting natural variability and reducing patient heterogeneity (Rubin, 2004). Patients with more complete data often have severe conditions due to intensive monitoring, biasing imputation towards these cases, while patients with milder conditions and sparser data are underrepresented (Madley-Dowd et al., 2019). Mean imputation centralizes predictions around the average, overlooking individual differences and leading to inaccurate dosing recommendations (Beaulieu-Jones et al., 2017). Median imputation, though more robust to outliers, still oversimplifies patient-specific responses. Mode imputation is unsuitable for continuous measures like INR, further obscuring critical differences in patient responses and increasing the risk of adverse outcomes (Waljee et al., 2013). These methods fail to capture the complexity of clinical data, highlighting the need for more sophisticated imputation approaches for accurate INR predictions (Nazabal et al., 2020).

  • Multiple Imputation by Chained Equations (MICE): was applied to the MIMIC-III dataset for imputing missing data using an iterative process, where each variable is predicted based on observed values from other variables. MICE worked effectively for common conditions like sepsis and heart failure with MCAR or MAR. However, it struggled with MNAR, especially in sicker patients, leading to biases that compromised the accuracy of INR predictions—critical for warfarin dosing decisions. MICE’s assumption that missingness could be explained by other observed variables often faltered in complex cases involving systematic missingness. MICE also encountered difficulties with high-dimensional clinical data, where the relationships between variables are often non-linear. This limitation affected the precision of INR predictions, a metric influenced by multiple interacting factors. Moreover, MICE was not designed to handle temporal dependencies, treating variables as static, which led to inaccuracies in time-sensitive predictions like fluctuating INR levels that are crucial for adjusting warfarin dosages. While MICE outperformed simpler methods like mean/mode imputation, it fell short compared to advanced techniques like DAEs and GANs, which better captured non-linearities and temporal dynamics, resulting in 15-25% lower RMSE and MAE for INR predictions. MICE’s reliance on predictor equations also increased the risk of overfitting or mis-specification, particularly in rare or atypical clinical cases. Thorough validation was essential to ensure MICE’s imputations did not compromise INR prediction accuracy, especially in high-risk groups where precise warfarin dosing is critical.

  • k-nearest neighbors (KNN): imputation was applied to the MIMIC-III dataset, using nearby data points based on clinical features like vital signs and lab values. While effective for common conditions, KNN introduced bias when handling rare conditions or atypical cases by favoring frequent patterns, leading to less accurate imputations. This issue was particularly significant for INR predictions in complex cases where clinical data deviated from norms. Additionally, KNN failed to account for temporal trends, which are essential for accurate INR predictions and warfarin dosing adjustments. Ignoring time-dependent data disrupted patient trajectories, causing inaccuracies in modeling time-sensitive outcomes. Comparative analyses showed KNN performed similarly to simpler methods like mean/mode imputation but was consistently outperformed by deep learning models like DAEs and GANs. These models captured non-linear and time-dependent relationships better, reducing RMSE and MAE by 10–30%. Moreover, KNN was vulnerable to bias in MNAR, disproportionately affecting sicker patients with more incomplete data, skewing INR predictions and increasing the risk of complications like bleeding or thrombosis. Careful validation was needed to ensure imputed values did not compromise clinical accuracy, especially for rare conditions or time-sensitive situations (Yoon, Jordon & Schaar, 2018).

  • Gaussian mixture model (GMM): GMM imputation was applied to the MIMIC-III dataset to address the complex patterns of missing data by modeling them as a mixture of multivariate normal distributions. While GMM effectively handled non-linear relationships in the data, its performance was mixed when faced with MIMIC-III’s unique challenges. For common clinical conditions with well-represented distributions, such as heart failure or pneumonia, GMM produced reasonable imputations. However, for rare conditions or patients with atypical trajectories, GMM often struggled to accurately estimate missing values, as its mixture components could not adequately capture the sparsity and variability inherent in these cases. The absence of temporal modeling in GMM proved a critical limitation in MIMIC-III, where patient trajectories and clinical trends over time (e.g., changes in INR) are essential for accurate predictions. GMM imputation, by treating data points as static, often failed to reflect the time-sensitive nature of clinical variables. This resulted in disruptions to the continuity of patient data, particularly for outcomes like warfarin dosage adjustments that rely on tracking fluctuations in INR over time. Comparative evaluations showed that while GMM outperformed basic imputation techniques such as mean/mode imputation in capturing static relationships, it was consistently outperformed by advanced models like DAEs and GANs, which accounted for both non-linear and temporal dependencies. These models achieved a 15-30% reduction in RMSE and MAE compared to GMM, particularly for predicting INR—a key factor in optimizing warfarin dosing in critical care settings. GMM also exhibited biases in cases of non-random missing data (MNAR), a common issue in MIMIC-III, where sicker patients or those with more severe conditions tend to have more incomplete data. GMM’s reliance on well-represented distribution patterns resulted in biased imputations, which skewed predictions for these high-risk subgroups. Rigorous validation was necessary to ensure that GMM’s imputations did not compromise clinical accuracy, particularly in patients requiring precise monitoring and individualized treatment strategies.

  • Denoising autoencoders (DAEs): DAEs were applied to the MIMIC-III dataset to impute missing data by learning latent representations and reconstructing complete data from corrupted inputs. They were particularly effective at capturing non-linear relationships in high-dimensional clinical data, such as vitals, lab values, and patient history, which significantly improved the accuracy of INR predictions compared to simpler methods (Vincent et al., 2008; Gondara & Wang, 2018). By handling noise and maintaining the integrity of data distributions, DAEs contributed to more reliable INR imputation, which directly impacted clinical decisions, such as warfarin dosing (Nazabal et al., 2020). However, in cases of extreme missingness or outliers, DAEs sometimes produced unrealistic imputations, especially when latent features failed to generalize to rare clinical events (Hastie et al., 2009). Careful tuning of the model was necessary to mitigate these risks. DAEs also effectively reduced bias in MNAR, a common issue in sicker patients, improving INR predictions in these challenging cases (Rubin, 2004). Despite their strong performance, DAEs struggled with temporal dependencies, limiting their ability to accurately predict time-dependent INR fluctuations, which are critical for guiding warfarin adjustments. Optimizing the model and validating the imputations were essential to ensure accurate INR predictions, particularly in patients with rare conditions.

  • Generative adversarial imputation networks (GAIN): a GAN-based model, was applied to the MIMIC-III dataset to impute missing data by leveraging an adversarial framework that captured complex, non-linear relationships in high-dimensional clinical data (Yoon, Jordon & Schaar, 2018). The generator modeled missing values while the discriminator distinguished real from imputed data, allowing GAIN to produce realistic imputations that closely matched the original data distributions. This was especially valuable for MNAR data, common in sicker patients, where traditional methods like k-NN and MICE faltered (Rubin, 2004; Sterne et al., 2009). GAIN excelled in capturing patterns in vitals, lab values, and patient history, contributing to more accurate INR predictions and better-informed warfarin dosing decisions. However, the adversarial nature of GAIN occasionally led to extreme or unrealistic values, particularly when the generator and discriminator dynamics were not well-balanced during training (Goodfellow et al., 2020). These distortions were more pronounced in rare clinical cases or when imputing data with extreme missingness patterns, leading to overfitting or the generation of anomalous values that deviated from the expected distribution (Nazabal et al., 2020). Careful hyperparameter tuning and validation were essential to mitigate these effects and ensure the reliability of the imputations (Gondara & Wang, 2018). Temporal dependencies also presented challenges for GAIN, as it excelled in cross-sectional imputations but was less effective with time-dependent variables such as INR trends, which are critical for guiding warfarin adjustments. Despite these limitations, GAIN consistently outperformed simpler methods, achieving 20-35% reductions in RMSE and MAE (Yoon, Jordon & Schaar, 2018). Proper training and validation were crucial to maintain stability, avoid extreme imputations, and ensure accurate clinical predictions, particularly for complex or rare patient profiles (Hastie et al., 2009).

  • Variational autoencoders (VAEs): We applied VAEs to the MIMIC-III dataset to impute missing data by learning latent representations and using probabilistic models for reconstruction. VAEs excelled in capturing non-linear relationships in clinical data, but the probabilistic nature of the model introduced bias, particularly in cases of non-random missingness from critically ill patients (Kingma & Welling, 2013). This bias directly impacted the accuracy of INR predictions, a crucial factor in determining precise warfarin dosing (Rubin, 2004). Misjudged INR predictions led to improper dosing, increasing the risk of complications such as bleeding or thrombosis (Yoon, Jordon & Schaar, 2018). We also observed that VAEs struggled with temporal dependencies since they were designed for cross-sectional imputation, limiting their ability to model time-dependent trends like fluctuating INR levels (Murray, 2018). This reduced the model’s effectiveness in predicting INR trends over time, which are essential for making informed adjustments to warfarin dosages based on a patient’s evolving clinical condition. Despite these challenges, VAEs demonstrated a strong ability to handle high-dimensional data, though they required significant hyperparameter tuning to optimize layers and latent dimensions. Validation was essential to ensure that the imputed values were clinically accurate and did not compromise INR predictions (Donders et al., 2006). With proper tuning and validation, we improved the accuracy of INR predictions, minimizing the risk of adverse outcomes related to improper warfarin dosing in critical care settings. However, occasional spikes or drops in INR predictions were observed due to the bias and difficulty with non-random missingness, affecting the smoothness of the prediction patterns (Nazabal et al., 2020).
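The deep generative imputers above (DAEs, GAIN, VAEs) require custom networks, but the classical baselines they are compared against can be sketched with scikit-learn, whose `IterativeImputer` is a MICE-style chained-equations imputer and whose `KNNImputer` implements k-NN imputation. The data, 20% MCAR masking rate, and RMSE comparison below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.RandomState(0)

# Correlated synthetic data standing in for lab values.
latent = rng.normal(size=(400, 3))
X_true = latent @ rng.normal(size=(3, 6)) + rng.normal(scale=0.1, size=(400, 6))

# Mask 20% of entries completely at random (MCAR).
mask = rng.rand(*X_true.shape) < 0.2
X_miss = X_true.copy()
X_miss[mask] = np.nan

def rmse(imputed):
    """Reconstruction error on the masked entries only."""
    return float(np.sqrt(np.mean((imputed[mask] - X_true[mask]) ** 2)))

# MICE-style chained equations vs. k-nearest-neighbors imputation.
mice_rmse = rmse(IterativeImputer(random_state=0).fit_transform(X_miss))
knn_rmse = rmse(KNNImputer(n_neighbors=5).fit_transform(X_miss))

print(round(mice_rmse, 3), round(knn_rmse, 3))
```

Because the synthetic data here are strongly linear, the chained-equations imputer should win; on MIMIC-III's non-linear, temporally structured data the paper finds the opposite ordering against DAEs and GANs.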

Discussion

Rationale for excluding step 1 and step 2 data

  • The mean INR values in both steps were alarmingly outside the recommended therapeutic range, suggesting significant issues with data integrity. In Step 1, the mean INR reached 54, while in Step 2, it was 37—dramatically exceeding the standard therapeutic range of 2.0 to 3.0 (or 2.5 to 3.5 for high-risk patients). Such extreme values indicate the presence of substantial outliers or data inconsistencies. Including these outliers in the analysis would introduce significant bias, distort clinical insights, and compromise the overall reliability of the findings. Consequently, relying on these skewed values for clinical decisions would create unacceptable levels of uncertainty, jeopardizing the accuracy of the model and its ability to produce meaningful and reliable predictions.

  • Dimensionality reduction: techniques like PCA and t-SNE were applied to differentiate patterns across steps. Step 1 showed clear clusters with strong linear relationships, while Step 2 presented more dispersion and variability, and Step 3 demonstrated more uniform patterns with reduced outlier influence. This progression highlighted how the underlying data patterns evolved across steps, with Step 3 providing the most stable foundation for model training and prediction.

  • The increasing noise and variability observed across steps significantly impacted the analysis. Step 1 data was relatively clean with well-defined clusters, but by Step 2, noise led to greater dispersion and more complex nonlinear relationships, complicating reliable predictions. Despite preprocessing, Step 2’s elevated complexity presented challenges for critical tasks like warfarin dosing, while Step 3’s reduced noise and improved patterns allowed for more reliable analysis.

  • A crucial finding emerged from the feature importance analysis. Key coagulation parameters, such as Factors II, VII, and X—pharmacokinetically critical for determining warfarin dosage—were ranked unexpectedly low in steps 1 and 2. This discrepancy from established clinical knowledge suggested that these steps failed to properly represent the factors driving warfarin metabolism. However, in Step 3, the ranking of these critical features aligned more closely with clinical expectations, indicating that Step 3 better captured the relevant dynamics of warfarin dosing.

Interpretability of machine learning models

Scalability and clinical applications

Limitations and future directions

Conclusions

Supplemental Information

Part of the filtered dataset.

DOI: 10.7717/peerj-cs.2612/supp-1

Working Python Code.

DOI: 10.7717/peerj-cs.2612/supp-2

Raw patient-level clinical data extracted from the Stage 3 dataset, containing laboratory and physiological measurements.

DOI: 10.7717/peerj-cs.2612/supp-3

Normalized and scaled patient-level clinical data extracted from the Stage 3 dataset for machine learning applications.

DOI: 10.7717/peerj-cs.2612/supp-4

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Aasim Ayaz Wani conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Fatima Abeer analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data is available at the MIMIC-III Clinical Database: https://physionet.org/content/mimiciii/1.4.

Funding

The authors received no funding for this work.
