Comparison of pediatric scoring systems for mortality in septic patients and the impact of missing information on their predictive power: a retrospective analysis

Background: Scores can assess the severity and course of disease and predict outcome in an objective manner. This information is needed for proper risk assessment and stratification. Furthermore, scoring systems support optimal patient care and resource management and are gaining in importance in the context of artificial intelligence.
Objective: This study evaluated and compared the prognostic ability of various common pediatric scoring systems (PRISM, PRISM III, PRISM IV, PIM, PIM2, PIM3, PELOD, PELOD 2) in order to determine which score is most applicable to pediatric sepsis patients in terms of timing of assessment and insensitivity to missing data.
Methods: We retrospectively examined data from 398 patients under 18 years of age who were diagnosed with sepsis. Scores were assessed at ICU admission and re-evaluated on the day of peak C-reactive protein. The scores were compared for their ability to predict mortality in this specific patient population and for their impairment due to missing data.
Results: PIM (AUC 0.76 (0.68–0.76)), PIM2 (AUC 0.78 (0.72–0.78)) and PIM3 (AUC 0.76 (0.68–0.76)) scores together with PRISM III (AUC 0.75 (0.68–0.75)) and PELOD 2 (AUC 0.75 (0.66–0.75)) are the most suitable scores for determining patient prognosis at ICU admission. Once sepsis is pronounced, PELOD 2 (AUC 0.84 (0.77–0.91)) and PRISM IV (AUC 0.8 (0.72–0.88)) become significantly better in their performance and count among the best prognostic scores for use at this time together with PRISM III (AUC 0.81 (0.73–0.89)). PELOD 2 is good for monitoring and, like the PIM scores, is also largely insensitive to missing values.
Conclusion: Overall, PIM scores show comparatively good performance, are stable with respect to the timing of assessment, and are also relatively stable in terms of missing parameters. PELOD 2 is best suited for monitoring the clinical course.


INTRODUCTION
Early identification of critically ill patients is essential for timely, appropriate care in a suitable facility. Sepsis remains one of the leading causes of childhood death, although our understanding of its pathophysiology has changed drastically in the last couple of decades due to the development of new diagnostic approaches and treatment strategies for this complex illness (Dellinger et al., 2013).
Prognostic scoring systems were established in the 1980s and have since been improved and validated. They help to assess severity of illness for risk stratification in terms of required resources, stratify patients prior to randomization in clinical trials, compare intra- and interinstitutional outcome and survival, improve quality assessment and cost-benefit analysis, and facilitate clinical decision making (Lemeshow & Le, 1994; Marcin et al., 1998).
The first scoring systems were developed for adults and were less suitable for use in children. Eventually, corresponding scores were introduced specifically for children and have been continuously developed and improved (Leteurtre et al., 1999; Pollack, Ruttimann & Getson, 1988; Shann et al., 1997). Some of them permit the probability of survival to be estimated as a function of the determined score. Today's scores designed specifically for children include the Pediatric Risk of Mortality (PRISM) score and its further developments, the PRISM III and PRISM IV scores; the Pediatric Index of Mortality (PIM) score and its successors, the PIM2 and PIM3 scores; and the Pediatric Logistic Organ Dysfunction (PELOD) score, followed by the PELOD 2 score (Leteurtre et al., 2006; Leteurtre et al., 2013; Leteurtre et al., 1999; Pollack et al., 2016; Pollack, Patel & Ruttimann, 1996b; Pollack, Ruttimann & Getson, 1988; Shann et al., 1997; Slater, Shann & Pearson, 2003; Straney et al., 2013). Only a few studies deal with the question of whether the individual scores are equally suitable for all types of conditions (Dewi et al., 2019; Gemke & van Vught, 2002; Leteurtre et al., 2001; Muisyo et al., 2019; Qiu et al., 2017b).
For each score, the developers identified specific times or timescales for patient enrollment, within which the score provides the most accurate indication of the patient's condition and the likelihood of survival (Leteurtre et al., 2006; Leteurtre et al., 2013; Leteurtre et al., 1999; Pollack et al., 2016; Pollack, Patel & Ruttimann, 1996b; Pollack, Ruttimann & Getson, 1988; Shann et al., 1997; Slater, Shann & Pearson, 2003; Straney et al., 2013). However, some patients develop certain life-threatening conditions, such as sepsis, only during their inpatient stay and were actually hospitalized for a completely different reason, for example following a surgical intervention (Sidhu et al., 2015). In such a case, the condition of the patient determined at admission can predict the course of a complication that develops later only to a limited extent. Although some scores consider the reason for admission in their evaluation (Pollack et al., 2016; Straney et al., 2013), the further course remains uncertain.
Scores are calculated based on vital signs, laboratory parameters and other patient parameters. In everyday clinical practice, and certainly as a consequence of a retrospective study design, it is often not possible to determine all required data, because they were not recorded, not collected or can no longer be found. Incomplete data, however, raise the question of score accuracy. Some evidence suggests that knowing the completeness of the necessary data is essential for correct score results (Agor et al., 2019; Gorges et al., 2018).
This knowledge of data completeness and its impact on the results is becoming even more relevant with regard to artificial intelligence. More and more artificial intelligence assessments are based on patient stratification and different kinds of scores (Abbasi, 2018; Komorowski et al., 2018). For this reason, it is important to know more about the influence of missing values on the accuracy of the scores, since not all parameters needed to calculate the scores are always available or measured.
In this study, the common pediatric scores are compared in terms of their predictive value in a group of septic children. In addition, we evaluated the optimal timing for determining these scores during the clinical course of sepsis as well as the influence of missing data on the predictive value of the different scores.

METHODS
This retrospective analysis included 398 critically ill pediatric patients treated at Innsbruck Medical University Hospital.

Inclusion of patients
The medical files of patients younger than 18 years of age with diagnosed sepsis or a proven blood stream infection between 2000 and 2019 were reviewed. Children fulfilling the definitions according to the International Pediatric Consensus Conference (Goldstein, Giroir & Randolph, 2005) were included. The current definition of pediatric sepsis is systemic inflammatory response syndrome (SIRS) in the presence of or as a result of suspected or proven infection (Goldstein, Giroir & Randolph, 2005). SIRS is present when at least two of the four criteria are met, one of which must be abnormal temperature or leukocyte count (Goldstein, Giroir & Randolph, 2005). Whether the SIRS criteria are fulfilled depends on age-specific normal values. The study protocol was approved by the institutional review board of the Medical University of Innsbruck (AN2013-0044 and EK Nr: 1109/2019).
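The inclusion rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the class and function names are hypothetical, the four criteria are taken to be temperature, heart rate, respiratory rate and leukocyte count, and each flag is assumed to have already been evaluated against the age-specific normal range.

```python
from dataclasses import dataclass


@dataclass
class SirsFlags:
    """Hypothetical container: each flag is True when the value is
    abnormal for the child's age group (age adjustment done upstream)."""
    abnormal_temperature: bool
    abnormal_heart_rate: bool
    abnormal_respiratory_rate: bool
    abnormal_leukocyte_count: bool


def meets_sirs(flags: SirsFlags) -> bool:
    """At least two of four criteria, one of which must be
    abnormal temperature or abnormal leukocyte count."""
    criteria_met = sum([
        flags.abnormal_temperature,
        flags.abnormal_heart_rate,
        flags.abnormal_respiratory_rate,
        flags.abnormal_leukocyte_count,
    ])
    has_mandatory = flags.abnormal_temperature or flags.abnormal_leukocyte_count
    return criteria_met >= 2 and has_mandatory


def meets_sepsis_definition(flags: SirsFlags, suspected_or_proven_infection: bool) -> bool:
    """Sepsis as SIRS in the presence of suspected or proven infection."""
    return meets_sirs(flags) and suspected_or_proven_infection
```

Note that a child with only tachycardia and tachypnea meets two criteria but does not fulfill SIRS under this rule, because neither temperature nor leukocyte count is abnormal.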

Data collection
We collected demographic variables such as age and sex as well as the diagnosed underlying disease. The underlying disease was assigned to the appropriate organ category: central nervous system, cardiovascular system, respiratory system, hepatic or renal failure. Also recorded was whether the patient suffered from an oncologic disease. Furthermore, we collected routinely measured laboratory parameters on the day of peak C-reactive protein.
C-reactive protein was chosen as the marker for the most severe stage of sepsis since it reflects the inflammatory process and is widely used in clinical routine. Many studies have described an association between an elevated C-reactive protein level and sepsis (Koozi, Lengquist & Frigyesi, 2020; Maury, 1989; Povoa et al., 1998; Presterl et al., 1997; Schentag et al., 1984) and have shown that C-reactive protein is highest at the most severe point of sepsis (Castelli et al., 2004; Lobo et al., 2003; Povoa et al., 1998; Povoa et al., 2005). Furthermore, elevated C-reactive protein is also associated with organ failure (De Beaux et al., 1996; Ikei, Ogawa & Yamaguchi, 1998; Pinilla et al., 1998; Rau et al., 2000; Waydhas et al., 1996), which makes C-reactive protein a suitable parameter for the surveillance of sepsis severity.
PRISM, PRISM III, PRISM IV, PIM, PIM2, and PIM3 as well as PELOD and PELOD 2 scores were retrospectively assessed. Since in this study the scores were evaluated for their ability to depict the disease process and etiology, we chose two time points for score assessment, namely the day of admission and the day of peak C-reactive protein (File S1). In this way we were able to analyze whether the time of assessment influences the predictive power of the individual scores. In-hospital mortality and multi-organ dysfunction syndrome (MODS) were chosen as outcome parameters. During data acquisition, the percentage of missing values was recorded for each score as a function of the number of its required parameters.
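The per-score percentage of missing values can be illustrated as follows. This is a sketch under stated assumptions: the function name and the parameter names in the usage example are hypothetical, and a parameter is treated as missing when it is absent from the record or stored as `None`.

```python
from typing import Mapping, Optional, Sequence


def missing_percentage(record: Mapping[str, Optional[float]],
                       required: Sequence[str]) -> float:
    """Percentage of a score's required parameters that are missing
    in one patient record (absent key or None both count as missing)."""
    if not required:
        raise ValueError("a score needs at least one required parameter")
    n_missing = sum(1 for name in required if record.get(name) is None)
    return 100.0 * n_missing / len(required)


# Usage with hypothetical parameter names (not the actual inputs of any score):
example_record = {"systolic_bp": 85.0, "lactate": None}
example_required = ["systolic_bp", "lactate", "base_excess", "pupils"]
example_pct = missing_percentage(example_record, example_required)
```

In this example three of the four required parameters are unavailable, so the record would be counted at 75% missing for that score.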

Statistical analysis
A mathematician (TH) not involved in the study procedures or patient assessment was responsible for the statistical analyses conducted using R, version 3.5.3. We present continuous data as median (25th to 75th percentile) and categorical variables as frequencies (%). We show effect size and precision with estimated median differences between survivors and non-survivors for continuous data and odds ratios (OR) for binary variables, with 95% CIs. All statistical assessments were two-sided, and a significance level of 5% was used. We applied the Wilcoxon rank sum test and Fisher's exact test to assess differences between survivors and non-survivors.
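For readers unfamiliar with the effect measure for binary variables, the following sketch computes an odds ratio with a 95% confidence interval from a 2x2 table. It uses the common log-based (Wald) interval, which is an assumption: the text does not state which interval method was applied, and the estimate reported alongside Fisher's exact test in R is a conditional one that can differ slightly. The function name and the counts in the usage line are illustrative only.

```python
import math
from typing import Tuple


def odds_ratio_wald_ci(a: int, b: int, c: int, d: int,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio and Wald 95% CI for a 2x2 table:
        a = non-survivors with the condition, b = survivors with the condition,
        c = non-survivors without it,         d = survivors without it.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative counts only; these are not data from the study.
or_est, ci_low, ci_high = odds_ratio_wald_ci(10, 20, 5, 40)
```

An interval that excludes 1 corresponds to a difference between survivors and non-survivors at the two-sided 5% level used here.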
The precision of the scores, defined as the difference between predicted and actual mortality, is presented as a function of the percentage of missing parameters. With respect to their diagnostic ability, the scores were compared by means of ROC curves, and DeLong's test was used to assess differences in ROC AUC. Corresponding 95% CIs were provided for the ROC AUC of the scores and for the differences in ROC AUC between scores. For this analysis only complete data were used.
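The ROC AUC used to compare the scores can be read as the probability that a randomly chosen non-survivor has a higher score than a randomly chosen survivor. The sketch below computes it directly from that definition; it is an illustration in Python rather than the R code used for the analysis, the function name is hypothetical, and DeLong's test for comparing two AUCs is not implemented here.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], died: Sequence[bool]) -> float:
    """AUC as the share of (non-survivor, survivor) pairs in which the
    non-survivor has the higher score; ties count as half a pair."""
    non_survivors = [s for s, y in zip(scores, died) if y]
    survivors = [s for s, y in zip(scores, died) if not y]
    if not non_survivors or not survivors:
        raise ValueError("need at least one survivor and one non-survivor")
    concordant = 0.0
    for s_dead in non_survivors:
        for s_alive in survivors:
            if s_dead > s_alive:
                concordant += 1.0
            elif s_dead == s_alive:
                concordant += 0.5
    return concordant / (len(non_survivors) * len(survivors))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of survivors and non-survivors. The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of a few hundred.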

RESULTS

Patient characteristics
A total of 398 patients met the eligibility criteria for study inclusion. In-hospital mortality in these septic children was 13.6% (n = 54). The median age of the children was 29.6 months, and 14.6% of the study population were neonates. There was no difference in survival between males and females (see Table 1).
The most frequently affected organ system in terms of underlying disease was the respiratory system (26.8% of the children), followed by the central nervous system (22.3%) and the cardiovascular system (21.6%). Only the rate of kidney failure was significantly higher in the non-survivors; the proportions of patients with central nervous system and digestive tract involvement also tended to be higher in the non-survivors.

Evaluation of scores for the predictive ability for mortality
As seen in Fig. 1, the best predictive abilities in our study are seen for PIM (AUC 0.76; 0.68 to 0.76), PIM2 (AUC 0.78; 0.72 to 0.78) and PIM3 (AUC 0.76; 0.68 to 0.76), although there is no significant difference between them and the other tested scores except for PRISM. PRISM shows the poorest mortality prediction of all tested scores and is significantly poorer than PRISM III (p = 0.0122), PIM (p = 0.0059), PIM2 (p = 0.0125) and PELOD 2 (p = 0.0359). The predictive ability of the PRISM IV and PELOD scores is as poor as that of PRISM, although with a slightly higher AUC. The most recent scores, PRISM IV and PIM3, show no improvement in mortality prediction. On the contrary, they even show a deterioration in predictive value in our specific septic population as compared to their predecessors.
No difference was seen in thromboembolic complications or bleeding events between survivors and non-survivors (Table 2). The parameters for organ function with regard to renal or hepatic impairment also show no difference. Furthermore, in this septic population none of the recorded inflammatory parameters, namely C-reactive protein, procalcitonin, and interleukin-6, differentiate between survivors and non-survivors. Only the coagulation parameters show different values depending on the survival of the septic children. Fibrinogen, antithrombin and platelets were significantly higher in the survivors. As seen in the global coagulation tests, prothrombin time (PT; Quick) and activated partial thromboplastin time (aPTT), the patients who did not survive were in a hypocoagulable state. This is also reflected in the statistically larger number of bleeding complications seen in the non-survivors.

Admission versus peak C-reactive protein: does the time of score evaluation matter?
To address the next question, namely whether the time of scoring makes a difference, two time points were compared: admission and the time when sepsis was most severe according to peak C-reactive protein. Except for PELOD, all scores improved towards the peak in C-reactive protein, as seen from their AUCs in Table 3. PRISM IV and PELOD 2 even improved significantly and became, together with PRISM III, the scores with the highest predictive ability, with AUCs of 0.8, 0.84 and 0.81, respectively. The worst performance at this time was seen for PRISM, followed by PELOD and PIM3.

Missing values
Because the retrospective design meant that not all data required for scoring were available, we examined whether missing values influence the predictive ability of the different scores. For this purpose, we compared the observed mortality with the individual mortality predictions, and compared the AUCs, as a function of the accepted extent of missing values.
Comparison of predicted versus actual mortality starts with only patients having no missing values for scoring, as seen in Fig. 2. The more missing values are accepted for scoring, the more patients are included, until 100% of the total patient population is included for scoring independently of the extent of their missing values.
As expected, the line depicting the difference between predicted and observed mortality stabilizes only once a certain population size is reached. When only patients with few missing values are included, the number of patients is too small to allow a valid statement about the difference between predicted and observed mortality.
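The cumulative comparison described above can be sketched as follows. For each accepted level of missing values, all patients at or below that level are included and the mean predicted mortality is compared with the observed mortality. The type names, the function name and the sign convention (predicted minus observed, so negative values indicate underestimation) are assumptions made for this illustration.

```python
from typing import List, NamedTuple, Sequence


class Patient(NamedTuple):
    predicted_mortality: float  # score-derived probability of death, 0..1
    died: bool                  # observed in-hospital outcome
    missing_pct: float          # percent of the score's parameters missing


class ThresholdResult(NamedTuple):
    max_missing_pct: float
    n_patients: int
    predicted_minus_observed: float


def calibration_by_missingness(patients: Sequence[Patient],
                               thresholds: Sequence[float]) -> List[ThresholdResult]:
    """For each threshold, include every patient whose share of missing
    parameters does not exceed it, then compare mean predicted mortality
    with the observed mortality in that subset."""
    results: List[ThresholdResult] = []
    for threshold in thresholds:
        subset = [p for p in patients if p.missing_pct <= threshold]
        if not subset:
            continue  # no patient is complete enough at this threshold
        predicted = sum(p.predicted_mortality for p in subset) / len(subset)
        observed = sum(p.died for p in subset) / len(subset)
        results.append(ThresholdResult(threshold, len(subset), predicted - observed))
    return results
```

Reporting the subset size alongside each difference makes the instability at strict thresholds visible: with only a handful of complete cases, a single death changes the observed mortality by a large amount.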
As the number of missing values allowed increases, all scores except the PELOD score underestimate the actual mortality. The fewer missing parameters are accepted, the more similar the predicted and the actual mortality become. Exceptions are the PELOD, PRISM and PIM3 scores, which are strongly and adversely affected by missing parameters, although the small number of cases limits this statement. When comparing AUCs, the small sample size is also limiting, especially for PIM3.

Table 3 ROC analysis of scores predicting mortality. Results of the score analyses for mortality prediction at hospital admission and on the day with the highest level of C-reactive protein (CRP). The dark grey fields show the AUC with 95% CI. The numbers in the fields above the dark grey fields give the estimated difference in ROC curves with 95% CI and the fields below show the correlation with the corresponding ROC curves. Red p values indicate a significant difference. Only complete cases with all scores available are included.

The AUC of the scores changes only minimally as a consequence of the number of missing values. The exception is PIM3, whose AUC with complete parameters is lower than its AUCs when missing values are allowed.
The AUC of the PRISM score differed by 0.1 across the levels of missing values allowed, with the highest AUC of 0.66 at 50% missing values allowed. The AUCs of PRISM III and PRISM IV differed by 0.11, with their highest AUCs of 0.84 at 30% missing values allowed. The AUC of the PELOD score also differed by 0.11 depending on the degree of missing values, with the highest AUC of 0.74 at 30%-40% missing values allowed.
The AUCs of PIM and PIM2 differed by only 0.05, with their highest AUCs at 30%-40% and 20% missing values allowed, respectively. PELOD 2 also showed a small difference of 0.07, with its highest AUC at 30% missing values allowed.

DISCUSSION
The aim of this study was to investigate and compare various common mortality risk assessment scoring systems, namely PRISM, PRISM III, PRISM IV, PIM, PIM2, PIM3, PELOD and PELOD 2, in pediatric sepsis patients. In doing so, we also evaluated different time points for the score assessments, namely PICU admission and the day of peak C-reactive protein. Furthermore, we investigated the influence of missing parameters on the predictive power of the scores.

Comparison of the scores at admission
At 13.6%, the mortality rate in our study was in the mid-range of that reported by other PICUs in developed countries (Ruth et al., 2014). The difference between the predicted and the actual mortality of the individual scores in our septic patient population is roughly comparable to that of other studies (Arias Lopez et al., 2018; Dewi et al., 2019; Hamshary et al., 2017; Patki, Raina & Antin, 2017; Qiu et al., 2017a; Qureshi, Ali & Ahmad, 2007; Taori, Lahiri & Tullu, 2010).
The PIM scores (PIM, PIM2, PIM3) underestimate overall mortality as compared to actual mortality, as confirmed by other studies (Patki, Raina & Antin, 2017; Qiu et al., 2017a; Taori, Lahiri & Tullu, 2010). By contrast, the mortality predicted by the PRISM score approximates the actual mortality of the entire population quite well. This has also been confirmed in other studies (Qiu et al., 2017a; Taori, Lahiri & Tullu, 2010). While PELOD showed a slight overprediction of mortality in our population, PELOD 2 showed a significant underprediction of the observed mortality. An even greater discrepancy was found in a study by Goncalves et al. (2015).
Nevertheless, when looking at the ability to predict mortality for the individual patient, the PIM2 score shows the best performance as reflected by the highest AUC, closely followed by PIM and PIM3 but also by PRISM III and PELOD 2. Other studies also found a slightly higher AUC for PIM2 than for PIM (Brady et al., 2006; Slater & Shann, 2004). The good performance of PRISM III and PELOD 2 for individual mortality prediction was also shown in a general critically ill pediatric population (Goncalves et al., 2015).
We found the PRISM score to be the worst performer (AUC 0.63) in our septic population. In contrast to our findings, a prospective study conducted in pediatric patient populations of specialist multidisciplinary ICUs showed the AUC of the PRISM score (0.90) to be clearly higher than our result (Slater & Shann, 2004). In another study of children with meningococcal sepsis, the PRISM score even outperformed the PIM score (Leteurtre et al., 2001).
There is ongoing discussion as to whether newer versions of the individual scores improve the predictive value. While a multicenter study in Italy confirmed a significant improvement in PIM3 compared to PIM2 (Wolfler et al., 2016), Tyagi, Tullu & Agrawal (2018) found no relevant improvement between PIM2 and PIM3. We were able to determine an increase in the predictive value of the scores across versions, but the latest versions (PIM3 and PRISM IV) brought no further improvement.

Comparison of the scores recorded at the time of the C-reactive protein peaks
The scores are intended for a broad patient population, meaning all kinds of diseases and conditions. For each score, a specific point in time or timeframe was determined for which the best performance is to be expected. While the PRISM and PRISM III scores are calculated after 24 h in hospital, the PIM scores are computed within the first hour after admission. One drawback of a 24-hour assessment is that the patient may already have died before the score can give a prognosis. In the case of an assessment made in the first hour, however, preclinical care introduces inaccuracy: in seriously ill patients who have been well cared for and stabilized, a score may be deceptively low. As of version PIM2, an attempt was made to compensate for this with an additional parameter ("main reason for ICU admission").
In septic patients there is a similar problem: in some cases sepsis was the reason for admission, while in other cases sepsis developed during the course of the hospital stay, possibly in postoperative patients (Sidhu et al., 2015; Wang et al., 2018). In such a case, sepsis cannot be predicted at the time of admission and thus at the time of data collection. A score calculated during the most severe septic phase would therefore be expected to show better performance. Thus, our patients' scores were reassessed at peak C-reactive protein. This analysis revealed that with disease progression PRISM IV and PELOD 2 became significantly more precise in predicting mortality. We conclude that for PRISM IV and PELOD 2 the time when evaluation is performed is important for mortality prediction, while for the other scores the time of evaluation has no significant influence on predictive ability. Like Leteurtre et al. (2015), we feel that the PELOD 2 score in particular serves well to monitor the progression of disease severity and predict outcome when evaluated regularly during the course of the disease.

Comparison of scores for missing parameters
Although it is difficult, even in a prospective setting, to collect all the data necessary for calculating the scores (Tibby et al., 2002), it is even more difficult with a retrospective study design. Parameters are not or not fully recorded, not available at the specified time, not collected or lost due to incomplete documentation. However, this reflects the reality of everyday clinical practice. This problem has already been addressed by the developers of the PRISM score, who concluded that missing values are often normal and therefore will hardly influence the score (Pollack et al., 1996a). The same assumption that missing values are normal was partially implemented in the scoring validation studies (Gorges et al., 2018; Leclerc et al., 2017). It was also incorporated into the PIM score, so that it is possible to specify missing data as such and thus there is no change in score points, which makes sense to a certain extent. For example, a lactate level that describes tissue hypoxia may not have been ordered by the treating physician because the patient's medical condition was not presumed to be so poor. The situation is similar for other parameters.
Nevertheless, this assumption might be misleading: a parameter may be missing for reasons unrelated to the patient's condition. For example, only a small blood volume is available for laboratory testing, especially in young pediatric patients (Sztefko et al., 2013). This concern is supported by a validation study in which PELOD 2 and PRISM III scores show decreased performance when it is assumed that the unavailable data are within normal ranges (Gorges et al., 2018).
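The "missing means normal" convention discussed above can be made concrete with a small sketch. The point rules below are hypothetical and chosen purely for illustration; they are not the cut-offs of PRISM, PIM, PELOD or any other published score, and the parameter and function names are likewise assumptions.

```python
from typing import Callable, Dict, Mapping, Optional

# Hypothetical point rules for illustration only; NOT the cut-offs of
# PRISM, PIM, PELOD or any other published score.
POINT_RULES: Dict[str, Callable[[float], int]] = {
    "lactate_mmol_l": lambda v: 2 if v >= 5.0 else 0,
    "platelets_g_l": lambda v: 1 if v < 100.0 else 0,
}


def score_with_missing_as_normal(values: Mapping[str, Optional[float]]) -> int:
    """Sum the points of all measured parameters; an unmeasured parameter
    is treated as normal and therefore contributes zero points."""
    total = 0
    for name, rule in POINT_RULES.items():
        value = values.get(name)
        if value is None:
            continue  # missing value assumed normal: no points added
        total += rule(value)
    return total
```

Under this convention a patient whose lactate was never measured receives the same points as a patient with a measured normal lactate, which is exactly why the assumption can lead to underestimated risk when the value was missing for logistical rather than clinical reasons.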
We were able to show that only in a few cases was it possible to retrospectively collect 100% of the data for scoring. Here only a small patient population remained, so that the analyses could no longer be performed validly. However, it was seen that patients with a low percentage of missing values have high mortality.
With increasing data availability, predicted and actual mortality approached each other, similar to what Agor and his team found in their study of the impact of missing data on adult scores (Agor et al., 2019). For our patients, the predicted and the actual mortality were quite close, except for the PRISM, PELOD and PIM3 scores, where the difference between the predicted and the actual mortality fluctuated, especially when only few missing values were allowed.
The most stable scores in terms of missing values, with stability defined as the maximum deviation between predicted and actual mortality, proved to be PRISM III, PIM and PIM2.
It should be noted that, when a high percentage of missing values is allowed, mortality is underestimated by the scores, while with increasing data availability the scores tend to overestimate it. The exceptions are the PELOD score, for which the pattern is reversed, and the PELOD 2 score, which consistently slightly overestimates mortality.
The AUC of the scores, however, changes only minimally with the number of missing values allowed. The smallest difference in the AUCs depending on the allowed missing values was seen in the PIM and PIM2 scores as well as the PELOD 2 score, which thus proved to be stable with respect to missing values.

Limitations
The retrospective study design limits the assessment of score performance because some patients did not have all the data needed to calculate the scores. Since this is a single-center study, the number of patients available for a valid statistical analysis is low overall, especially in the group of patients with 100% availability of the required data. In our study we assessed only the average effect of the missing values and not the weighting of the individual missing parameters required for each score. On the other hand, we were able to depict a realistic scenario of data availability for score assessment in a retrospective study design.

CONCLUSION
The results demonstrate that at ICU admission all PIM scores together with PRISM III and PELOD 2 are the most suitable scores for mortality prognosis, while the PRISM score is the worst. Once sepsis is pronounced, PELOD 2 and PRISM IV become significantly better in their performance and count among the best prognostic scores for use at this time. Therefore, when calculated at multiple times, the PELOD 2 score is best suited for assessing the prognosis on an ongoing basis.
Overall, PIM scores show comparatively good performance, are stable as far as timing of the disease survey is concerned, and they are also relatively stable in terms of missing parameters. PELOD 2 is the best for monitoring and is also relatively stable in relation to missing values.