A new parametric information-gain criterion for tree-based machine learning algorithms
- Academic Editor
- Shibiao Wan
- Subject Areas
- Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Science
- Keywords
- Decision tree, Random forest, Machine learning, Information gain, Entropy
Abstract
Decision Trees (DTs) remain one of the most important algorithms in machine learning for their simplicity, interpretability, and often satisfactory performance. Furthermore, they are critical foundational components for more performant models such as Random Forests (RFs) and Gradient Boosted Trees. Central to DTs is the splitting process, where data is partitioned according to criteria traditionally based on information-theoretic measures such as Shannon entropy or Gini index. In this article, we propose a novel parametric entropy-based information gain criterion designed to generalize and extend classical entropic measures to improve classification performance in DTs and RFs. We introduce a five-parameter entropy formulation capable of replicating and extending known entropy measures. This new criterion was incorporated into DT and RF classifiers and evaluated on a collection of 18 benchmarking datasets, including both synthetic and real-world data retrieved from publicly available repositories. Performance was assessed using 5-fold cross-validation and optimized via Bayesian hyperparameter search, with weighted F1-score as the primary metric. Compared to splitting criteria based on existing entropy/purity measures (e.g., Gini, Shannon, Rényi, and Tsallis), our method yielded statistically significant improvements in classification performance across most datasets. On multiclass and imbalanced datasets, such as the Wine Quality dataset, F1-score improvements exceeded 40% using RF algorithms. Bayesian signed-rank tests confirmed the robustness of our method, which never underperformed relative to standard approaches. The proposed entropy-based splitting criterion offers a flexible and effective alternative to classical information-theoretic measures, delivering improvements in classification performance.
Introduction
Decision Trees (DTs) are still among the most widely used models in Machine Learning (ML), despite their roots being traced to as far back as the early 1960s (Morgan & Sonquist, 1963). This enduring relevance is owed to these models’ underlying simplicity and practical performance. Succinctly, DTs operate by recursively splitting the dataset into subsets based on feature values, aiming to create groups that are as pure as possible concerning a given target variable. Further aiding their adoption, the output of these recursive operations can be represented in a tree-like flowchart structure, greatly increasing their interpretability (Kotsiantis, 2013). Moreover, DTs also introduce a low computational cost and are compatible with both classification and regression tasks, which has justified their use in a wide range of applications, including, for example, in the medical, financial, and industrial sectors (Costa & Pedreira, 2023; Mienye & Jere, 2024).
In the decades since their inception, DTs have lost their performance edge to the best supervised learning approaches (James et al., 2021). However, they remain a critical foundational component of more powerful ensemble methods like Random Forests (RFs) and Gradient Boosted Trees, algorithms that remain highly competitive, particularly on tabular datasets, even against state-of-the-art neural network models (Grinsztajn, Oyallon & Varoquaux, 2022; Uddin & Lu, 2024). Moreover, there has been newfound interest in the development of DTs due to their interpretability (Hwang, Yeo & Hong, 2020; Hernández et al., 2021), as the black-box nature commonly associated with most predictive and classifier models becomes an increasingly pressing issue in an age where explainable machine learning is becoming a necessary condition (Rudin, 2019; Roscher et al., 2020).
Central to the induction (or construction) of DTs is the process of recursive data partitioning (Costa & Pedreira, 2023), which is guided by the output of a discrete function over the input attributes (Kotsiantis, 2013). The selection of the most appropriate function is typically determined by a splitting criterion, such as information gain, the Gini value, or the Gain ratio (GR), and this choice directly influences the structure and performance of the resulting tree. In the now-classical implementations of DT models, such as the Iterative Dichotomiser 3 (ID3) algorithm (Quinlan, 1986), information-theoretic measures (e.g., Shannon entropy) have well-established roles as splitting criteria. Nonetheless, past research, such as the work done by Nowozin (2012), shows that limitations remain in the standard methods, as their performance is not uniform across different domains, data distributions, or learning objectives.
Recently, there has been additional research into either the development of novel splitting criteria (Leroux, Boussard & Dès, 2018; Ayllón-Gavilán et al., 2025; Hwang, Yeo & Hong, 2020; Loyola-González, Ramírez-Sáyago & Medina-Pérez, 2023) or the re-adaptation of existing methods with less common information-theoretic measures (Ignatenko, Surkov & Koltcov, 2024; Nowozin, 2012; Maszczyk & Duch, 2008; Gajowniczek, Zabkowski & Orłowski, 2015). In these works, the goal is the development of alternative and corrected measures aimed at improving split balance, reducing overfitting, and enhancing the generalization of the output DTs, with the adequate selection of the splitting criterion becoming particularly important in the context of high-dimensional, imbalanced, or noisy datasets. A detailed analysis provided in Hernández et al. (2021) shows that no single split evaluation measure is capable of consistently outperforming all others. Alternatively, one approach is to combine multiple evaluation measures and select the candidate splits that better adapt to the input data (Loyola-González, Ramírez-Sáyago & Medina-Pérez, 2023).
In this article, we introduce an alternative perspective, based on the development of a more general and more richly parameterized notion of entropy that generalizes the information-theoretic criteria most commonly used as the basis of splitting procedures. Through the added degrees of freedom provided by this approach, it is possible to construct splits that better adjust to the input data. The method, evaluated on datasets from open data repositories as well as on synthetic datasets, achieves performance improvements of over 40% in F1-score relative to standard methods on common datasets such as the ‘Wine Quality’ dataset, when using RF algorithms. Moreover, we provide a Bayesian statistical analysis showing that the proposed method presents a statistically significant improvement while never underperforming, at worst matching the performance of the best standard model.
The remainder of this article is organized as follows. ‘Decision Trees for Classification’ provides a general discussion of DTs, their construction, and their splitting criteria. ‘Generalized Entropy’ introduces the 5-parameter expression for the generalization of entropy and its formulation into a target function for use as a DT splitting criterion. ‘Description of Computational Experiments’ describes the datasets used for the computational experiments, the experimental methodology, and the result evaluation metrics. ‘Results of Computational Experiments’ provides the summarized results obtained from the computational experiments and discusses the Bayesian statistical analysis conducted to validate the quality of the introduced method. ‘Discussion’ interprets the obtained results and statistical analysis. ‘Conclusions’ provides the concluding remarks regarding the introduced method. ‘Proofs of Known Entropies’ shows the mathematical proofs of the generalizations achievable with the presented parametric entropy expression. ‘Complete Results for Proposed Method’ provides additional tabular results.
Decision trees for classification
Multiple versions of the DT algorithm exist; most notable are perhaps their most “traditional” forms, as seen in the original ID3 (Quinlan, 1986), its successor C4.5 (Quinlan, 1993), and the Classification and Regression Tree (CART) (Breiman et al., 1984). In these forms, the structure of a DT consists of internal and terminal nodes: the former represent logical tests (splits), each with a binary outcome (true or false), while the latter are the leaves of the tree and correspond to output predictions (Costa & Pedreira, 2023). DTs in formulations such as CART are compatible with both classification and regression tasks, hence leaves can correspond to either labels or constant numbers. These values are selected during the induction of the tree, where, starting at a root node that contains the full training dataset, the input space is recursively partitioned into regions that are homogeneous with respect to a given target variable. This forms the basis of the splitting process.
In the case of ID3, C4.5, and CART, they all take on greedy approaches to constructing DTs (Han, Kamber & Pei, 2012), meaning that they will evaluate all possible splits across all features and select the one that best separates the data according to a predefined criterion. The most commonly used criteria include information gain (based on information-theoretic measures), the Gini index, and the gain ratio. These measures assess the purity of the resulting subsets, aiming to maximize class purity in the child nodes, where a pure partition would mean that every element would have the same label.
Information gain (IG), used in ID3, is calculated as the reduction in entropy, e.g., using the Shannon entropy, before and after the split. Mathematically, this corresponds to computing (Mienye & Jere, 2024)
(1) $IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$, where $S$ is our dataset, $A$ is an attribute of $S$, $\mathrm{Values}(A)$ are the unique values in attribute $A$, $S_v$ the subset of $S$ for which attribute $A$ takes on value $v$, and $H$ the entropy function, which in the case of Shannon’s entropy is given by Shannon (1948)
(2) $H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$, where $C$ is the number of unique classes in the set $S$ and $p_i$ is the proportion of the samples in the set that belong to class $i$. Essentially, the higher the information gain, the more adequate an attribute is for partitioning the data, as the resulting subsets become more homogeneous. In more advanced DT algorithms, such as C4.5, the Gain Ratio is used, a method in which IG is normalized in order to correct its bias toward attributes with many distinct values (Quinlan, 1993; Han, Kamber & Pei, 2012). The gain ratio for a dataset $S$ and an attribute $A$ is given by
(3) $GR(S, A) = \dfrac{IG(S, A)}{SplitInfo(S, A)}$, where
(4) $SplitInfo(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}.$
Lastly, the third of the most common splitting criteria is the Gini index (also referred to as the Gini value or Gini impurity), employed by the CART algorithm. This metric measures the probability of misclassifying a randomly chosen instance from the dataset and is defined as (Jost, 2006; Kotsiantis, 2013)
(5) $Gini(S) = 1 - \sum_{i=1}^{C} p_i^{2}.$
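As an illustration of how these criteria are computed in practice, the following minimal sketch (written for this article, not taken from the original implementation) evaluates the Shannon entropy of Eq. (2), the Gini index of Eq. (5), and the information gain of Eq. (1) for a categorical split, directly from label counts.

```python
import numpy as np

def shannon_entropy(labels):
    """Shannon entropy of a label array, Eq. (2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_index(labels):
    """Gini impurity of a label array, Eq. (5)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, attribute_values):
    """Information gain, Eq. (1), of splitting `labels` by a categorical attribute."""
    gain = shannon_entropy(labels)
    n = len(labels)
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        gain -= (mask.sum() / n) * shannon_entropy(labels[mask])
    return gain

# Toy example: a binary target split by a three-valued attribute.
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])
a = np.array(["a", "a", "b", "b", "c", "c", "b", "a"])
print(information_gain(y, a), gini_index(y))
```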
Once the best split is identified, the dataset is divided accordingly, and the algorithm recurses on each subset. This process continues until a stopping criterion is met, such as reaching a maximum tree depth, achieving a minimum number of samples per node, or obtaining pure nodes (Nowozin, 2012). In classification tasks, each leaf node is assigned the majority class of the samples it contains.
Currently, DTs are seldom used in isolation, being much more commonly employed as a component of ensemble models. The most well-known of these is perhaps Random Forests (RFs), a method based on the induction of multiple DTs and then combining their outputs to improve predictive accuracy and control overfitting (Breiman, 2001). This algorithm operates by training each tree on a different bootstrap sample (random sample with replacement) of the dataset and using a random subset of features at each split to ensure diversity among trees. For classification, predictions are made by majority voting across trees; for regression, predictions are averaged. This combination of bagging (bootstrap aggregating) and random feature selection makes RFs more robust to noise, resistant to overfitting, and effective for high-dimensional data. A more comprehensive insight into these methods can be found in works such as Mienye & Jere (2024) and Costa & Pedreira (2023).
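To make the bagging and voting mechanics concrete, the sketch below builds a minimal forest from scikit-learn decision trees: each tree is trained on a bootstrap sample, random feature subsets are drawn at every split via max_features='sqrt', and predictions are combined by majority vote. It is an illustrative toy (integer class labels assumed), not the RF implementation used in this article.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_mini_forest(X, y, n_estimators=25, seed=0):
    """Train each tree on a bootstrap sample of the data; max_features='sqrt'
    draws a random feature subset at every split, as in standard RFs."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_estimators):
        rows = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        forest.append(tree.fit(X[rows], y[rows]))
    return forest

def predict_mini_forest(forest, X):
    """Combine the ensemble by majority vote (assumes integer class labels)."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```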
The focus of this article lies in the development of novel criteria for the splitting process of tree-based algorithms and on surpassing some of the limitations that remain within the standard methods. Several previous works have addressed this issue. For instance, in Leroux, Boussard & Dès (2018), a balanced gain ratio is discussed, aimed at addressing the bias towards unbalanced splits in the GR method used by Quinlan (1986), essentially by correcting the split information by a constant value. Empirical evaluation shows that this approach limits the depth of the resulting trees while improving classification accuracy. In other settings, splitting criteria are adapted to serve particular tasks. For instance, in Ayllón-Gavilán et al. (2025), a splitting criterion is defined for use with ordinal classification. Another example can be found in Hwang, Yeo & Hong (2020), where the main goal is not to achieve the best performance, but rather to create the most interpretable tree possible.
More in line with the motivation of this article is the work done by Nowozin (2012), where the classical entropy estimators, such as Shannon entropy or the Gini index, are replaced, in this case, with the Grassberger entropy estimator, enabling an increase in predictive performance on classification tasks. We can further see the use of nonclassical entropies for the induction of DTs in Gajowniczek, Zabkowski & Orłowski (2015) and Maszczyk & Duch (2008). In both of these past examples, the authors employed the Rényi and Tsallis parametric entropies. Introduced by Rényi (1961), the Rényi entropy of order $w$ of a set $S$, with $w > 0$ and $w \neq 1$, is defined as
(6) $R_w(S) = \dfrac{1}{1-w} \log\left(\sum_{i=1}^{C} p_i^{w}\right).$
In the case $w = 1$, it is defined as
(7) $R_1(S) = -\sum_{i=1}^{C} p_i \log p_i.$
Rényi’s entropy generalizes various other notions of entropy. For instance, as $w \to 1$ (notation meaning: when $w$ tends to $1$ by a valid branch), the Shannon entropy is recovered. Alternatively, the Tsallis entropy of a set $S$, introduced by Tsallis (1988), is defined as
(8) $T_w(S) = \dfrac{1}{w-1}\left(1 - \sum_{i=1}^{C} p_i^{w}\right)$, where $w > 0$ and $w \neq 1$. Tsallis’ entropy is also capable of recovering other entropic definitions. For example, as $w \to 1$, the Boltzmann–Gibbs entropy is obtained.
Advancements are still occurring in this field, namely in the use of dynamically adjustable criteria at split time (Loyola-González, Ramírez-Sáyago & Medina-Pérez, 2023) or in the use of deformed entropies to improve target functions. One recent example of the latter can be found in the work by Ignatenko, Surkov & Koltcov (2024), where the potential use of nonclassical entropies for the computation of information gain in RFs under classification and regression tasks was studied. In that work, the application of the Rényi, Tsallis, and Sharma–Mittal entropies enabled substantial performance gains (in terms of accuracy). The Sharma–Mittal entropy (Akturk, Bagci & Sever, 2007) is defined as
(9) $SM_{q,r}(S) = \dfrac{1}{1-r}\left[\left(\sum_{i=1}^{C} p_i^{q}\right)^{\frac{1-r}{1-q}} - 1\right]$, and is further capable of retrieving the Rényi entropy for $r \to 1$ and the Tsallis entropy for $r \to q$.
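For reference, the three parametric entropies above can be computed as follows. This is a small sketch based on the standard formulas reproduced in Eqs. (6)–(9), using w for the Rényi/Tsallis order and (q, r) for Sharma–Mittal; it is not code from the cited works.

```python
import numpy as np

def renyi_entropy(p, w):
    """Rényi entropy of order w (w > 0, w != 1), Eq. (6); Eq. (7) / Shannon at w = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(w, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** w)) / (1.0 - w)

def tsallis_entropy(p, w):
    """Tsallis entropy (w > 0, w != 1), Eq. (8); Boltzmann-Gibbs in the limit w -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(w, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** w)) / (w - 1.0)

def sharma_mittal_entropy(p, q, r):
    """Sharma-Mittal entropy, Eq. (9), for q != 1 and r != 1;
    recovers Rényi as r -> 1 and Tsallis as r -> q."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return ((np.sum(p ** q)) ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

probs = [0.5, 0.3, 0.2]
print(renyi_entropy(probs, 2.0), tsallis_entropy(probs, 2.0),
      sharma_mittal_entropy(probs, 2.0, 0.5))
```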
Although these previous works are comparable in motivation to ours, in the sense that the goal is to improve the splitting process of tree-based algorithms through the use of nonstandard entropies, the implementation differs substantially. Here, the approach is to develop a new parametric expression that encompasses the classical methods (e.g., retrieving the Shannon entropy or the Gini index) while, through its additional degrees of freedom, enabling the deduction of further criteria that may be more suitable for separating the data at each split.
Generalized entropy
This section introduces the 5-parameter expression for the generalization of entropy, first presented in the authors’ past work in the scope of data complexity estimation (Costa, Rocha & Ferreira, 2024). Consider a set of probabilities $P = \{p_1, \ldots, p_C\}$ with $\sum_{i=1}^{C} p_i = 1$; the generalized entropy Ê is defined by
(10) where α, β, γ, and the two further parameters listed in Table 1 are adequate parameters,
(11) and the λ-logarithm, for any λ and $x > 0$, is given by
(12) $\ln_{\lambda}(x) = \dfrac{x^{1-\lambda} - 1}{1 - \lambda}.$
For $\lambda = 1$, $\ln_{\lambda}$ coincides with the natural logarithm $\ln$, which can be verified by computing the limit as $\lambda \to 1$ using L’Hôpital’s rule and the fact that $x^{1-\lambda} = e^{(1-\lambda)\ln x}$. Two constants represent the minimum and maximum theoretical values of Ê, respectively: the minimum is obtained whenever a single class concentrates all the probability, while the maximum occurs whenever each event is equally probable, i.e., $p_i = 1/C$ for all $i$. Table 1 showcases the parameter configurations required to achieve a normalized version of the most common generalizations of impurity and entropy used as splitting criteria in DTs. Proofs for the retrieval of these entropic measures using Ê are given in the Supplemental File ‘Proofs of Known Entropies’. Note, however, that the list shown in Table 1 is not comprehensive: beyond these known estimators, Ê can produce further entropy estimations through parametric variation. In Fig. 1, some of these behaviors are plotted.
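A minimal sketch of the deformed logarithm follows, assuming the standard form reconstructed in Eq. (12); the check at the end illustrates the λ → 1 limit discussed above.

```python
import numpy as np

def lambda_log(x, lam):
    """Deformed lambda-logarithm of Eq. (12): (x**(1 - lam) - 1) / (1 - lam),
    which reduces to the natural logarithm as lam -> 1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(lam, 1.0):
        return np.log(x)                      # the lambda = 1 limit
    return (x ** (1.0 - lam) - 1.0) / (1.0 - lam)

# The limit behaviour discussed above: for lam close to 1, ln_lambda(x) ~ ln(x).
print(lambda_log(2.0, 0.999), np.log(2.0))   # both ~ 0.693
```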
| α | β | γ |  |  | Generalization |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 0 | Gini impurity |
| 1 | 1 | 1 | 1 | 0 | Shannon entropy |
| w | 0 | 0 | 0 | 1 | Rényi entropy Rw |
| w | 0 | 0 | 0 | 0 | Tsallis entropy Tw |
| q | 0 | 0 | 0 |  | Sharma–Mittal entropy |
Figure 1: Illustration of the behavior of the proposed generalized entropy function Ê for a random variable with two possible outcomes, plotted against $P_i$, where $P = [P_i, 1 - P_i]$, showing how the entropy value varies with different parameter settings.
In each of panels (A)–(C), a subset of the parameters is held fixed while one parameter is varied; the Shannon entropy is recovered as a special case in (A) and (B) and is plotted alongside for reference in (C). These visualizations demonstrate the behavior of the proposed entropy function with changing data distributions and its ability to recover classical entropy measures as special cases.

To employ the measure Ê as a splitting criterion for DTs in classification tasks, it is first necessary to formulate a target function. Considering Eq. (1), the target function is formulated in a similar way to the work by Ignatenko, Surkov & Koltcov (2024). If $Q$ is the set of data points falling into node $m$, then $Q$ is partitioned into $Q_L$ and $Q_R$, the subsets which fall into the left and right subtrees, respectively. Therefore, information gain is given by
(13) $IG(Q) = \hat{E}(Q) - \dfrac{|Q_L|}{|Q|}\,\hat{E}(Q_L) - \dfrac{|Q_R|}{|Q|}\,\hat{E}(Q_R),$ where $|\cdot|$ denotes the number of samples in a set.
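The split evaluation of Eq. (13) can be sketched as below. The generalized entropy is passed in as a function argument (entropy_fn), standing in for Eq. (10), whose full five-parameter expression is given in the original formulation (Costa, Rocha & Ferreira, 2024); any of the entropy functions sketched earlier could be plugged in for experimentation.

```python
import numpy as np

def class_probabilities(labels):
    """Empirical class probabilities of the samples falling into a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def split_information_gain(y_node, left_mask, entropy_fn, **entropy_params):
    """Eq. (13): entropy of the parent node minus the size-weighted entropies of
    the left/right children. `entropy_fn` stands in for the generalized entropy."""
    y_left, y_right = y_node[left_mask], y_node[~left_mask]
    n, n_l, n_r = len(y_node), len(y_left), len(y_right)
    parent = entropy_fn(class_probabilities(y_node), **entropy_params)
    left = entropy_fn(class_probabilities(y_left), **entropy_params) if n_l else 0.0
    right = entropy_fn(class_probabilities(y_right), **entropy_params) if n_r else 0.0
    return parent - (n_l / n) * left - (n_r / n) * right

# Example with a toy node and a threshold split on one feature (any entropy works).
y = np.array([0, 0, 1, 1, 1, 2])
x_feature = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.95])
gini = lambda p: 1.0 - np.sum(np.asarray(p) ** 2)
print(split_information_gain(y, x_feature < 0.5, gini))
```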
As such, when employing this definition of information gain, each of the five parameters accepted by Ê becomes a new hyperparameter of the DT. To help guide hyperparameter tuning, a sensitivity analysis was conducted, focusing on parameters α, β, and γ; the two remaining parameters were held fixed, as this simplifies the analysis whilst maintaining the generic properties of Ê intact. The analysis was two-stage. First, the Morris screening method (Morris, 1991; Campolongo, Cariboni & Saltelli, 2007) was applied to estimate the mean absolute effect, μ*, of each parameter as a measure of overall importance, and the standard deviation, σ, to quantify nonlinearity and interaction effects. Second, a Sobol variance-based sensitivity analysis (Sobol, 2001; Saltelli, 2002; Saltelli et al., 2010) was performed to compute both first-order (S1) indices, which measure main effects, and total-effect (ST) indices, which capture the combined impact of main and interaction effects. Both methods were applied using the implementation made available in the SALib library (Herman & Usher, 2017; Iwanaga, Usher & Herman, 2022) (SALib version 1.4.7, https://github.com/salib/salib).
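A minimal sketch of this two-stage sensitivity analysis with SALib is shown below. The parameter names, bounds, and the stand-in model function are illustrative assumptions, not the grid or the Ê implementation used in the article; they only demonstrate how μ*, σ, S1, and ST are obtained.

```python
import numpy as np
from SALib.sample.morris import sample as morris_sample
from SALib.sample import saltelli
from SALib.analyze import morris, sobol

# Three varied parameters of the generalized entropy; names and bounds here
# are illustrative only, not the search grid used in the article.
problem = {
    "num_vars": 3,
    "names": ["alpha", "beta", "gamma"],
    "bounds": [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]],
}

def model(params):
    """Stand-in scalar response; in the article this would be the generalized
    entropy (Eq. (10)) evaluated on a fixed probability distribution."""
    a, b, g = params
    p = np.array([0.7, 0.2, 0.1])
    return float(-np.sum(p ** (1.0 + 0.1 * a) * np.log(p)) * (1.0 + 0.1 * b) ** g)

# Morris screening: mu_star (overall influence) and sigma (nonlinearity/interactions).
X_m = morris_sample(problem, N=100, num_levels=4)
Y_m = np.array([model(x) for x in X_m])
res_morris = morris.analyze(problem, X_m, Y_m, num_levels=4)
print(res_morris["mu_star"], res_morris["sigma"])

# Sobol indices: first-order (S1, main effects) and total-effect (ST, incl. interactions).
X_s = saltelli.sample(problem, 1024)
Y_s = np.array([model(x) for x in X_s])
res_sobol = sobol.analyze(problem, Y_s)
print(res_sobol["S1"], res_sobol["ST"])
```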
The comparison of the results yielded by both methods (shown in Figs. 2 and 3) enables the identification of parameters with strong and stable influences vs. those whose effects are driven primarily by interactions. In this case, no single parameter exerts a predominant influence through main effects, as indicated by the low first-order indices. While two of the parameters exhibit the highest first-order influences, these values remain considerably lower than the corresponding total-effect indices, suggesting that the behavior of Ê is mostly driven by interactions between parameters. The Morris results for the remaining parameter appear to suggest a high degree of influence; however, this is contradicted by the Sobol method, suggesting that its interactions are less pervasive or more localized.
Figure 2: Morris screening results showing mean absolute effect (μ*) vs. standard deviation (σ) for parameters α, β, and γ. μ* indicates overall parameter influence, while σ reflects nonlinearity and interaction strength. Parameters in the top-right are both influential and involved in interactions.

Figure 3: Sobol sensitivity indices for parameters α, β, and γ, showing first-order effects (S1) and total effects (ST). S1 measures the proportion of output variance explained by each parameter alone, while ST captures both main and interaction effects. Large gaps between ST and S1 indicate strong interactions.

Description of computational experiments
The proposed splitting criterion was evaluated for classification tasks across two distinct groups of datasets, namely datasets retrieved from open repositories and synthetically generated datasets. The former were retrieved from the UCI (https://archive.ics.uci.edu) and OpenML (https://www.openml.org) repositories, and their characteristics are shown in Table 2. The latter were generated using the make_classification function found in scikit-learn (Pedregosa et al., 2011) (scikit-learn version 1.6.1 (https://scikit-learn.org/1.6/api/index.html)), and their description is shown in Table 3. Dataset selection took into account the following requirements: (i) be reasonably broad in terms of application/problem areas; (ii) have a reasonably comprehensive combination of number of features/number of instances across datasets; (iii) include both binary and multiclass datasets; (iv) include both balanced and imbalanced datasets (in the case of imbalanced datasets, with varying degrees of imbalance); and, (v) contain numerical (continuous), ordinal, and categorical features across datasets.
| Repository | Name | ID | # of features | # of instances | # of classes | Feature type | Class proportions (# of instances/class) |
|---|---|---|---|---|---|---|---|
| UCI | Breast Cancer Wisconsin | 15 | 9 | 699 | 2 | Ordinal | 458:241 |
| Iris | 53 | 4 | 150 | 3 | Continuous | 50:50:50 | |
| Spambase | 94 | 57 | 4,601 | 2 | Continuous | 2,788:1,813 | |
| Statlog (Shuttle) | 148 | 7 | 58,000 | 7 | Continuous | 45,586:8,903:3,267:171:50:13:10 | |
| Wine Quality | 186 | 11 | 4,898 | 11 | Continuous | 2,836:2,138:1,079:216:193:30:5 | |
| Students’ Dropout and Academic Success | 697 | 36 | 4,424 | 3 | Continuous, Categorical, Ordinal | 2,209:1,421:794 | |
| OpenML | mfeat-morphological | 18 | 7 | 2,000 | 10 | Continuous, Categorical | 200:200:200:200:200:200:200:200:200:200 |
| diabetes | 37 | 9 | 768 | 2 | Continuous, Ordinal | 500:268 | |
| wdbc | 1510 | 31 | 569 | 2 | Continuous | 357:212 | |
| wilt | 1570 | 6 | 4,839 | 2 | Continuous | 4,578:261 | |
| Titanic | 40704 | 4 | 2,201 | 2 | Categorical, Ordinal | 1,490:711 | |
| dataset_31_credit-g | 42633 | 21 | 1,000 | 2 | Categorical | 700:300 | |
| phoneme | 44087 | 6 | 3,172 | 2 | Continuous | 1,586:1,586 |
| Name | # of features | # of informative features | # of instances | # of classes |
|---|---|---|---|---|
| synth_1 | 10 | 2 | 100 | 2 |
| synth_2 | 10 | 5 | 100 | 4 |
| synth_3 | 5 | 2 | 1,000 | 2 |
| synth_4 | 10 | 5 | 1,000 | 2 |
| synth_5 | 10 | 5 | 1,000 | 4 |
Two algorithms were implemented: (i) a (classification) DT, based on the conventional ID3 architecture, which supported both numeric and categorical features and used as its target function the equation defined in Eq. (13); and (ii) a RF based on an ensemble of DTs with feature bagging and voting based on the most common label values. The source code for the algorithms used in these experiments was made available (https://doi.org/10.5281/zenodo.15241909), and was developed with standardization in mind, following the conventions of the popular Python package of scikit-learn, making it compatible with its methods. The fixed parameters for DTs were: 2 samples as the minimum number required to split an internal node; and, minimum impurity decrease set as 0. The foundational DTs for the RFs were constructed with these same fixed parameters.
In terms of non-fixed parameters, for the tests focused on the DTs, their maximum depth was varied over {10, 50, 100, 200}. In the case of RFs, the number of estimators in each forest was varied over {10, 25, 50}, and the maximum depth of individual trees was varied over {10, 25}. The search space for the parameters in Ê was constructed considering preliminary empirical findings, with a bounded range and step size defined for each of the five parameters. These search spaces were selected based on the prior sensitivity analysis, in which the effect of one parameter was comparatively stable while the others produced large variability in effects, predominantly through interactions. Consequently, a coarser grid was used for the more stable parameter, concentrating computational effort where the proposed entropic function is most interaction-sensitive and reducing the risk of missing narrow high-performance regions while keeping the total number of evaluations manageable.
The search for the best-performing hyperparameters of the proposed splitting criterion (i.e., for the five parameters of Ê) was conducted for all combinations of maximum depth and number of estimators in the DT and RF algorithms, where applicable. Due to the elevated number of test combinations and datasets, a Bayesian search was conducted over the search space, using the BayesSearchCV class in scikit-optimize (https://scikit-optimize.github.io) (scikit-optimize version 0.10.2), with a set number of 100 iterations (i.e., 100 parameter combinations sampled per dataset, per maximum depth, and, in the case of RFs, per number of estimators). Furthermore, these parameters were evaluated through 5-fold cross-validation with the target metric set to the weighted F1-score, given by
(14) $F1_{weighted} = \dfrac{\sum_{i=1}^{N} w_i \, F1_i}{\sum_{i=1}^{N} w_i}$, where $N$ is the number of classes, $w_i$ represents the support for the $i$-th class, and $F1_i$ is the F1-score for the $i$-th class, computed through
(15) $F1_i = \dfrac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$, where $TP_i$ is the number of true positives, $FP_i$ is the number of false positives, and $FN_i$ is the number of false negatives for the $i$-th class. In addition to the weighted F1-score used for model selection during the hyperparameter search, complementary evaluation metrics were computed, namely Balanced Accuracy, Precision, and Recall. Balanced Accuracy is defined as
(16) $BA = \dfrac{1}{N}\sum_{i=1}^{N} \dfrac{TP_i}{TP_i + FN_i}$, where $TP_i$ and $FN_i$ are the true positives and false negatives for the $i$-th class, respectively. The Precision and Recall for each class are computed as:
(17) $Precision_i = \dfrac{TP_i}{TP_i + FP_i}$
(18) $Recall_i = \dfrac{TP_i}{TP_i + FN_i}$
These metrics were also averaged in a weighted manner according to class support, ensuring that the performance evaluation reflects the influence of class distribution in the dataset.
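These evaluation metrics correspond directly to standard scikit-learn functions; a brief sketch with toy predictions (illustrative values only) is given below.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])   # toy labels (imbalanced)
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

f1_w = f1_score(y_true, y_pred, average="weighted")            # Eqs. (14)-(15)
bal_acc = balanced_accuracy_score(y_true, y_pred)              # Eq. (16)
prec_w = precision_score(y_true, y_pred, average="weighted",
                         zero_division=0)                      # Eq. (17), support-weighted
rec_w = recall_score(y_true, y_pred, average="weighted")       # Eq. (18), support-weighted
print(f1_w, bal_acc, prec_w, rec_w)
```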
The usage of the weighted F1-score is uncommon in comparable works found in the literature, which tend to favor reporting accuracy measurements (see, for instance, the experimental methodology in Nowozin (2012) or Ignatenko, Surkov & Koltcov (2024)). However, accuracy can be misleading here, as the datasets selected for testing frequently show some degree of class imbalance. For example, the ‘Statlog (Shuttle)’ dataset, used for testing in multiple comparable works (including the two previously mentioned), is a seven-class set in which the majority class represents nearly 80% of instances. Hence, the weighted F1-score is selected as the performance evaluation metric for this work, and any mention of F1-score refers to its weighted implementation. The weighted version was selected (instead of the macro or micro extensions) as a compromise: it attenuates the extent to which very low-support classes skew results, while still providing enough sensitivity to penalize misclassifications in imbalanced datasets.
The introduced parametric entropies were benchmarked against the known splitting criteria given by the Gini index, Shannon entropy, Rényi entropy, and Tsallis entropy, as these are the most utilized in comparable works. The benchmarks were obtained using the same tree structures and parameters as the introduced method. In the case of Rényi, the order parameter $w$ in Eq. (6), and, in the case of Tsallis, the parameters in Eq. (8), were set to fixed values.
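For illustration, the sketch below reproduces the shape of the tuning setup: a BayesSearchCV over 100 iterations with 5-fold cross-validation and weighted F1 as the scoring metric. The estimator and search dimensions are stand-ins (scikit-learn's RandomForestClassifier with generic hyperparameters); in the actual experiments, the article's own DT/RF implementation with the five Ê parameters exposed would be plugged in.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Stand-in estimator and search dimensions; in the experiments, the article's
# own DT/RF implementation exposing the five parameters of E-hat would be used.
search = BayesSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, max_depth=10, random_state=0),
    search_spaces={
        "max_features": Real(0.1, 1.0),       # illustrative dimensions only
        "min_samples_leaf": Integer(1, 10),
    },
    n_iter=100,                # 100 sampled parameter settings, as in the article
    cv=5,                      # 5-fold cross-validation
    scoring="f1_weighted",     # weighted F1-score as the target metric
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```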
Results of computational experiments
To aid in interpreting the results, only a summary presentation is given in this section (shown in Table 4), where, for each result obtained, two additional metrics were computed: Lowest Improvement (LI) and Highest Improvement (HI). Respectively, these correspond to the lowest and highest percentage gains in F1-score of the parametric entropies obtained using Ê compared to the known entropic measures. The complete results for the computational experiments can be seen in the Supplemental File ‘Complete Results for Proposed Method’, which includes: results for DTs (shown in Tables S1, S2, S3 and S4); results for RFs using a maximum depth of 10 for the DTs (shown in Tables S5, S6, and S7); and results for RFs using a maximum depth of 25 for the DTs (shown in Tables S8, S9, and S10).
| Dataset | Decision Tree | Random Forest | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Depth estimators | 10 | 50 | 100 | 200 | 10 | 25 | |||||||||||||||
| 10 | 25 | 50 | 10 | 25 | 50 | ||||||||||||||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | ||
| BreastCancer | 1.463 | 1.985 | 1.463 | 1.985 | 1.463 | 1.985 | 1.463 | 1.985 | 1.030 | 1.648 | 0.617 | 1.233 | 0.206 | 1.336 | 0.414 | 1.344 | 0.617 | 1.028 | 0.515 | 1.133 | |
| Iris | 0 | 2.976 | 0 | 2.976 | 0 | 2.976 | 0 | 2.976 | 0 | 3.594 | 1.458 | 2.917 | 1.469 | 3.673 | 1.469 | 4.302 | 0 | 3.673 | 0.735 | 2.099 | |
| Spambase | 0.325 | 0.650 | 1.816 | 3.312 | 0.863 | 2.373 | 1.289 | 2.793 | 1.732 | 3.580 | 0.449 | 1.798 | 0.114 | 1.362 | 1.606 | 4.358 | 1.575 | 2.925 | 0.454 | 2.497 | |
| Statlog | 1.807 | 4.570 | 0 | 0.502 | 0 | 0.401 | 0 | 0.401 | 5.079 | 12.817 | 10.508 | 14.640 | 9.211 | 11.842 | 9.032 | 15.778 | 5.216 | 12.320 | 10.560 | 15.194 | |
| Wine Quality | 3.473 | 10.603 | 5.960 | 13.411 | 2.689 | 11.092 | 3.477 | 12.252 | 13.688 | 37.262 | 18.333 | 36.667 | 18.116 | 38.768 | 20.398 | 33.167 | 22.742 | 40.323 | 23.344 | 40.536 | |
| Students | 1.910 | 3.138 | 3.626 | 5.718 | 3.905 | 5.718 | 3.905 | 5.718 | 2.806 | 5.613 | 2.541 | 3.587 | 0.607 | 2.580 | 3.254 | 4.882 | 1.783 | 2.972 | 0.448 | 2.990 | |
| mfeat-morphological | 1.429 | 4.143 | 3.730 | 6.743 | 3.736 | 7.615 | 1.903 | 5.857 | 3.698 | 18.182 | 2.115 | 7.855 | 4.405 | 7.636 | 3.762 | 7.837 | 1.997 | 7.373 | 3.598 | 8.546 | |
| diabetes | 4.110 | 6.164 | 5.772 | 8.322 | 3.306 | 5.923 | 5.391 | 7.951 | 1.955 | 6.917 | 2.392 | 5.381 | 0.459 | 6.279 | 5.365 | 10.283 | 3.907 | 7.236 | 2.534 | 5.067 | |
| wdbc | 0.528 | 2.218 | 0.842 | 2.526 | 0.946 | 2.629 | 0.946 | 2.629 | 0 | 1.572 | 0.415 | 1.350 | 1.037 | 1.971 | 0.210 | 2.306 | 0 | 0.935 | 0.208 | 1.249 | |
| wilt | 0.204 | 0.714 | 0.204 | 0.613 | 0.204 | 0.613 | 0.204 | 0.613 | 0.217 | 0.217 | 0 | 0 | 0 | 0 | 0 | 0.108 | 0 | 0.108 | 0 | 0.108 | |
| Titanic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.836 | 11.807 | 5.299 | 10.190 | 2.286 | 2.714 | 4.911 | 9.686 | 4.959 | 6.336 | 2.680 | 3.385 | |
| dataset_31_credit-g | 2.374 | 4.050 | 2.917 | 4.306 | 3.719 | 5.096 | 3.851 | 5.227 | 1.495 | 9.716 | 1.468 | 2.936 | 1.000 | 2.667 | 3.324 | 10.983 | 1.286 | 4.341 | 2.114 | 2.764 | |
| phoneme | 0 | 1.425 | 0.476 | 2.143 | 0.476 | 2.143 | 1.065 | 2.722 | 1.225 | 3.554 | 0.845 | 1.932 | 1.202 | 2.644 | 2.589 | 4.192 | 1.086 | 2.654 | 0.601 | 1.803 | |
| synth_1 | 4.040 | 4.040 | 4.040 | 4.040 | 4.040 | 4.040 | 4.040 | 4.040 | 3.061 | 8.673 | 3.000 | 6.000 | 3.000 | 5.000 | 4.040 | 7.172 | 2.020 | 3.030 | 1.000 | 5.000 | |
| synth_2 | 18.584 | 27.655 | 22.851 | 31.027 | 22.851 | 31.027 | 22.851 | 31.027 | 0 | 19.770 | 1.342 | 24.385 | 14.675 | 36.688 | 12.646 | 27.635 | 5.139 | 27.409 | 0 | 26.389 | |
| synth_3 | 0.209 | 1.885 | 0 | 1.157 | 0 | 1.157 | 0 | 1.157 | 0.735 | 2.731 | 0.520 | 1.873 | 0.208 | 0.833 | 0.950 | 1.795 | 0.730 | 1.564 | 0.418 | 1.461 | |
| synth_4 | 0 | 1.896 | 0.589 | 2.473 | 0.236 | 2.128 | 0.354 | 2.243 | 3.496 | 5.868 | 1.695 | 4.722 | 0.957 | 3.469 | 2.270 | 5.422 | 2.881 | 3.962 | 1.437 | 2.156 | |
| synth_5 | 2.358 | 5.548 | 2.107 | 6.320 | 1.135 | 5.390 | 1.554 | 5.791 | 4.249 | 8.801 | 4.403 | 7.812 | 4.172 | 9.179 | 5.113 | 15.038 | 3.026 | 7.349 | 3.329 | 4.993 | |
To more effectively assess the method’s performance, the parameter combination yielding the highest weighted F1-score for each dataset in the hyperparameter search was selected. This optimal configuration was subsequently compared with the known entropy values. The evaluation was carried out using a class-wise stratified 5-fold cross-validation procedure. In this approach, the dataset is split into five equally sized folds while preserving the proportion of classes in each fold. For each iteration, the model is trained on four folds and tested on the remaining one. This process is repeated five times so that each fold serves once as the test set, and the results are then averaged to obtain the final performance metrics. The results are presented in terms of the Highest Improvement and Lowest Improvement observed. For the DT algorithms, these results are shown in Tables 5, 6, and 7, which correspond to Balanced Accuracy, Precision, and Recall, respectively. The equivalent results for the RF algorithms are reported in Tables 8, 9, and 10.
| Dataset | Depth | |||||||
|---|---|---|---|---|---|---|---|---|
| 10 | 50 | 100 | 200 | |||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | |
| BreastCancer | 1.998 | 2.679 | 1.998 | 2.679 | 1.998 | 2.679 | 1.998 | 2.679 |
| Iris | 2.174 | 2.920 | 2.174 | 2.920 | 2.174 | 2.920 | 2.174 | 2.920 |
| Spambase | 0.243 | 0.691 | 1.822 | 3.540 | 0.819 | 2.521 | 1.165 | 2.873 |
| Statlog | 5.711 | 6.411 | 1.019 | 12.035 | 2.431 | 9.123 | 2.580 | 9.282 |
| Wine Quality | 7.744 | 16.516 | 15.554 | 35.840 | 7.285 | 26.680 | 7.483 | 28.863 |
| Students | 2.254 | 3.208 | 3.458 | 6.545 | 3.338 | 5.984 | 3.362 | 6.009 |
| mfeat-morphological | 2.632 | 2.632 | 2.291 | 5.695 | 3.372 | 7.779 | 1.722 | 6.058 |
| diabetes | 2.614 | 5.425 | 6.495 | 10.759 | 4.256 | 8.430 | 5.777 | 10.013 |
| wdbc | 0.758 | 2.443 | 0.862 | 2.548 | 1.180 | 2.872 | 1.180 | 2.872 |
| wilt | 1.420 | 2.893 | 0.978 | 2.445 | 0.978 | 2.445 | 0.978 | 2.445 |
| Titanic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| dataset_31_credit-g | 0.148 | 3.199 | 1.303 | 2.447 | 2.817 | 5.798 | 1.464 | 4.405 |
| phoneme | 1.135 | 1.483 | 0.490 | 2.147 | 0.527 | 2.185 | 1.057 | 2.724 |
| synth_1 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 |
| synth_2 | 18.421 | 32.353 | 28.947 | 44.118 | 28.947 | 44.118 | 28.947 | 44.118 |
| synth_3 | 0.207 | 1.904 | 0.643 | 1.179 | 0.631 | 1.166 | 0.631 | 1.166 |
| synth_4 | 0.004 | 1.933 | 0.577 | 2.548 | 0.210 | 2.173 | 0.339 | 2.305 |
| synth_5 | 2.225 | 5.701 | 1.970 | 6.834 | 0.981 | 5.798 | 1.398 | 6.235 |
| Dataset | Depth | |||||||
|---|---|---|---|---|---|---|---|---|
| 10 | 50 | 100 | 200 | |||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | |
| BreastCancer | 1.375 | 2.049 | 1.375 | 2.049 | 1.375 | 2.049 | 1.375 | 2.049 |
| Iris | 1.391 | 1.942 | 1.391 | 1.942 | 1.391 | 1.942 | 1.391 | 1.942 |
| Spambase | 0.3 | 0.621 | 1.819 | 3.382 | 0.862 | 2.411 | 1.209 | 2.763 |
| Statlog | 0.814 | 2.514 | 0.049 | 0.475 | 0.009 | 0.415 | 0.018 | 0.423 |
| Wine Quality | 0.843 | 8.628 | 6.403 | 13.469 | 2.648 | 11.222 | 3.487 | 12.595 |
| Students | 2.782 | 4.081 | 4.652 | 7.005 | 3.180 | 5.256 | 3.204 | 5.280 |
| mfeat-morphological | 2.716 | 7.648 | 5.382 | 9.250 | 8.365 | 13.303 | 1.574 | 6.203 |
| diabetes | 5.193 | 7.714 | 6.165 | 9.408 | 3.545 | 6.708 | 5.625 | 8.852 |
| wdbc | 0.484 | 2.367 | 0.798 | 2.686 | 0.744 | 2.632 | 0.744 | 2.632 |
| wilt | 0.243 | 0.678 | 0.251 | 0.649 | 0.251 | 0.649 | 0.251 | 0.649 |
| Titanic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| dataset_31_credit-g | 1.784 | 3.960 | 1.967 | 3.948 | 3.512 | 5.523 | 3.196 | 5.2 |
| phoneme | 1.227 | 1.481 | 0.472 | 2.099 | 0.529 | 2.156 | 1.024 | 2.659 |
| synth_1 | 3.645 | 3.645 | 3.645 | 3.645 | 3.645 | 3.645 | 3.645 | 3.645 |
| synth_2 | 25.937 | 38.572 | 27.612 | 43.238 | 27.612 | 43.238 | 27.612 | 43.238 |
| synth_3 | 0.173 | 1.818 | 0.609 | 1.123 | 0.568 | 1.081 | 0.568 | 1.081 |
| synth_4 | 1.936 | 1.936 | 0.383 | 2.580 | 0.030 | 2.220 | 0.118 | 2.309 |
| synth_5 | 2.716 | 5.796 | 2.585 | 7.061 | 1.657 | 6.093 | 2.019 | 6.471 |
| Dataset | Depth | |||||||
|---|---|---|---|---|---|---|---|---|
| 10 | 50 | 100 | 200 | |||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | |
| BreastCancer | 1.516 | 1.978 | 1.516 | 1.978 | 1.516 | 1.978 | 1.516 | 1.978 |
| Iris | 2.174 | 2.920 | 2.174 | 2.920 | 2.174 | 2.920 | 2.174 | 2.920 |
| Spambase | 0.283 | 0.640 | 1.820 | 3.409 | 0.875 | 2.449 | 1.229 | 2.809 |
| Statlog | 1.421 | 3.508 | 0.048 | 0.457 | 0.002 | 0.394 | 0.010 | 0.399 |
| Wine Quality | 1.756 | 7.892 | 5.318 | 11.580 | 2.054 | 9.186 | 3.411 | 10.983 |
| Students | 2.013 | 3.944 | 4.222 | 6.772 | 5.255 | 7.443 | 5.288 | 7.476 |
| mfeat-morphological | 2.632 | 2.632 | 2.291 | 5.695 | 3.372 | 7.779 | 1.722 | 6.058 |
| diabetes | 5.572 | 7.763 | 6.496 | 8.919 | 3.352 | 5.703 | 6.120 | 8.534 |
| wdbc | 0.567 | 2.276 | 0.941 | 2.657 | 0.943 | 2.659 | 0.943 | 2.659 |
| wilt | 0.211 | 0.701 | 0.211 | 0.658 | 0.211 | 0.658 | 0.211 | 0.658 |
| Titanic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| dataset_31_credit-g | 3.863 | 5.217 | 5.180 | 6.250 | 5.036 | 6.105 | 5.755 | 6.831 |
| phoneme | 1.135 | 1.482 | 0.491 | 2.147 | 0.528 | 2.185 | 1.056 | 2.722 |
| synth_1 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 | 4.211 |
| synth_2 | 18.421 | 32.353 | 28.947 | 44.118 | 28.947 | 44.118 | 28.947 | 44.118 |
| synth_3 | 0.210 | 1.921 | 0.635 | 1.170 | 0.635 | 1.170 | 0.635 | 1.170 |
| synth_4 | 1.932 | 1.932 | 0.592 | 2.536 | 0.237 | 2.174 | 0.355 | 2.295 |
| synth_5 | 2.263 | 5.702 | 2.0 | 6.886 | 1.0 | 5.838 | 1.429 | 6.287 |
| Dataset | 10 | 25 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Depth estimators | 10 | 25 | 50 | 10 | 25 | 50 | |||||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | ||
| BreastCancer | 0.224 | 1.667 | 0.007 | 0.649 | 0.119 | 0.119 | 1.180 | 1.180 | 0.423 | 1.290 | 0.526 | 0.526 | |
| Iris | 1.460 | 1.460 | 0.704 | 2.878 | 0.714 | 1.439 | 0.719 | 4.478 | 0 | 0 | 0.714 | 1.439 | |
| Spambase | 1.146 | 1.146 | 1.214 | 1.214 | 1.432 | 2.410 | 1.257 | 1.257 | 0.567 | 2.841 | 0.285 | 2.514 | |
| Statlog | 15.051 | 56.367 | 0.791 | 12.131 | 9.526 | 22.152 | 7.614 | 16.073 | 10.970 | 35.527 | 9.785 | 26.675 | |
| Wine Quality | 16.573 | 29.998 | 15.264 | 35.021 | 19.801 | 46.307 | 51.036 | 88.412 | 55.639 | 102.701 | 54.338 | 97.229 | |
| Students | 0 | 0 | 0.566 | 1.297 | 0.580 | 0.580 | 0.193 | 3.998 | 0.955 | 2.245 | 0.390 | 1.684 | |
| mfeat-morphological | 0.976 | 4.370 | 3.343 | 9.237 | 2.161 | 7.953 | 3.292 | 7.173 | 1.615 | 2.483 | 0.601 | 4.528 | |
| diabetes | 5.041 | 10.8 | 0.010 | 2.650 | 1.566 | 1.566 | 1.769 | 6.213 | 0.707 | 3.686 | 1.331 | 1.941 | |
| wdbc | 0.562 | 0.878 | 0.646 | 0.854 | 0.262 | 0.766 | 0 | 0 | 0.152 | 0.556 | 0.309 | 0.390 | |
| wilt | 0 | 0 | 0 | 0 | 0 | 0 | 0.380 | 0.380 | 0.376 | 0.376 | 0.377 | 0.377 | |
| Titanic | 7.110 | 7.110 | 0.587 | 0.587 | 0.055 | 0.361 | 1.847 | 2.370 | 0.418 | 8.352 | 0.121 | 0.636 | |
| dataset_31_credit-g | 2.289 | 3.151 | 1.276 | 1.276 | 0.419 | 1.458 | 0.474 | 3.872 | 0 | 0 | 0.563 | 0.563 | |
| phoneme | 0.239 | 1.670 | 0.038 | 0.273 | 0.192 | 0.891 | 0.555 | 3.465 | 0 | 0 | 0.114 | 0.114 | |
| synth_1 | 1.064 | 7.955 | 2.128 | 2.128 | 2.105 | 2.105 | 2.151 | 7.955 | 0 | 0 | 1.042 | 2.105 | |
| synth_2 | 0 | 0 | 0 | 0 | 0 | 0 | 8.108 | 8.108 | 6.452 | 6.452 | 2.564 | 11.111 | |
| synth_3 | 0.5 | 2.385 | 0.113 | 0.316 | 0 | 0 | 0.430 | 0.430 | 0.519 | 1.919 | 0.094 | 0.423 | |
| synth_4 | 1.281 | 2.120 | 2.299 | 2.299 | 2.059 | 3.559 | 1.577 | 3.615 | 0.364 | 2.388 | 0 | 0 | |
| synth_5 | 0.817 | 4.559 | 0 | 0 | 2.486 | 6.199 | 10.052 | 12.678 | 0.202 | 0.202 | 1.441 | 2.167 | |
| Dataset | 10 | 25 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Depth estimators | 10 | 25 | 50 | 10 | 25 | 50 | |||||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | ||
| BreastCancer | 0.280 | 1.352 | 0.104 | 0.537 | 0.006 | 0.173 | 0.615 | 0.615 | 0.269 | 0.866 | 0.227 | 0.227 | |
| Iris | 1.256 | 1.256 | 0.637 | 1.827 | 0.356 | 1.528 | 0.856 | 3.286 | 0 | 0 | 0.082 | 0.929 | |
| Spambase | 0.572 | 0.572 | 0.435 | 0.435 | 0.794 | 1.335 | 0.799 | 0.799 | 0.191 | 1.308 | 0.029 | 1.208 | |
| Statlog | 2.733 | 13.548 | 4.281 | 7.797 | 1.482 | 20.6 | 1.034 | 2.501 | 4.670 | 6.982 | 2.885 | 5.179 | |
| Wine Quality | 0.474 | 27.1 | 11.910 | 39.5 | 19.167 | 45.395 | 4.645 | 51.624 | 9.330 | 35.216 | 11.167 | 50.746 | |
| Students | 0 | 0 | 0.136 | 6.086 | 0.659 | 0.892 | 1.870 | 5.320 | 0.409 | 1.408 | 0.570 | 4.405 | |
| mfeat-morphological | 0.818 | 7.128 | 4.433 | 8.806 | 4.668 | 6.939 | 2.228 | 6.798 | 1.234 | 3.546 | 0.134 | 0.350 | |
| diabetes | 6.410 | 11.507 | 2.217 | 2.217 | 2.687 | 3.296 | 2.365 | 7.2 | 2.280 | 4.052 | 1.813 | 5.232 | |
| wdbc | 0.435 | 0.524 | 0.5 | 0.528 | 0.197 | 0.495 | 0 | 0 | 0.138 | 0.450 | 0.135 | 0.381 | |
| wilt | 0 | 0 | 0 | 0 | 0 | 0 | 0.020 | 1.193 | 1.177 | 1.177 | 0.020 | 0.020 | |
| Titanic | 1.845 | 25.020 | 1.559 | 1.559 | 0.341 | 0.641 | 1.246 | 14.221 | 3.699 | 4.256 | 0.3 | 0.3 | |
| dataset_31_credit-g | 3.248 | 3.850 | 6.862 | 6.862 | 1.532 | 4.795 | 1.061 | 7.128 | 1.203 | 1.203 | 0.031 | 10.180 | |
| phoneme | 1.190 | 1.718 | 0.007 | 0.007 | 0.013 | 0.582 | 0.538 | 3.295 | 0 | 0 | 0.093 | 0.093 | |
| synth_1 | 0.350 | 8.270 | 1.757 | 2.280 | 1.743 | 1.937 | 1.972 | 7.004 | 0 | 0 | 0.785 | 1.743 | |
| synth_2 | 7.965 | 7.965 | 1.764 | 1.764 | 3.135 | 3.135 | 0.640 | 15.250 | 19.820 | 19.820 | 7.574 | 28.616 | |
| synth_3 | 0.365 | 2.097 | 0.112 | 0.307 | 0 | 0 | 0.231 | 0.231 | 0.458 | 1.849 | 0.105 | 0.518 | |
| synth_4 | 0.054 | 1.333 | 2.407 | 2.407 | 2.023 | 3.210 | 1.325 | 3.320 | 0.452 | 2.254 | 0 | 0 | |
| synth_5 | 1.426 | 6.108 | 0 | 0 | 2.794 | 6.984 | 8.851 | 13.087 | 0.224 | 0.224 | 0.398 | 2.143 | |
| Dataset | 10 | 25 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Depth estimators | 10 | 25 | 50 | 10 | 25 | 50 | |||||||
| LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | LI (%) | HI (%) | ||
| BreastCancer | 0.298 | 1.359 | 0.150 | 0.596 | 0.002 | 0.152 | 0.001 | 0.753 | 0.295 | 0.894 | 0.148 | 0.148 | |
| Iris | 1.460 | 1.460 | 0.704 | 2.878 | 0.714 | 1.439 | 0.719 | 4.478 | 0 | 0 | 0.714 | 1.439 | |
| Spambase | 0.926 | 0.926 | 0.824 | 0.824 | 1.089 | 1.845 | 0.987 | 0.987 | 0.397 | 2.039 | 0.173 | 1.853 | |
| Statlog | 3.860 | 8.295 | 0.021 | 3.307 | 2.843 | 4.760 | 2.873 | 4.864 | 5.782 | 9.301 | 4.722 | 8.347 | |
| Wine Quality | 7.561 | 12.798 | 6.827 | 19.283 | 9.536 | 25.987 | 13.059 | 27.313 | 16.629 | 35.921 | 17.573 | 36.318 | |
| Students | 0 | 0 | 0.640 | 0.640 | 0.320 | 0.320 | 1.671 | 4.303 | 0.282 | 1.590 | 0.346 | 0.536 | |
| mfeat-morphological | 0.976 | 4.370 | 3.343 | 9.237 | 2.161 | 7.953 | 3.292 | 7.173 | 1.615 | 2.483 | 0.601 | 4.528 | |
| diabetes | 3.984 | 6.816 | 1.715 | 1.715 | 0.570 | 1.337 | 1.307 | 4.041 | 1.285 | 2.818 | 0.565 | 2.281 | |
| wdbc | 0.564 | 0.565 | 0.549 | 0.553 | 0.185 | 0.554 | 0 | 0 | 0.187 | 0.553 | 0.002 | 0.370 | |
| wilt | 0 | 0 | 0 | 0 | 0 | 0 | 0.022 | 0.022 | 0.022 | 0.022 | 0.022 | 0.022 | |
| Titanic | 4.093 | 4.093 | 0.927 | 0.927 | 0.061 | 0.061 | 1.119 | 2.129 | 0.980 | 3.451 | 0.122 | 0.306 | |
| dataset_31_credit-g | 1.288 | 1.724 | 0.714 | 0.714 | 0.142 | 0.712 | 0.422 | 3.030 | 0 | 0 | 0.142 | 0.571 | |
| phoneme | 0.237 | 1.669 | 0.040 | 0.273 | 0.192 | 0.889 | 0.555 | 3.466 | 0 | 0 | 0.116 | 0.116 | |
| synth_1 | 1.064 | 7.955 | 2.128 | 2.128 | 2.105 | 2.105 | 2.151 | 7.955 | 0 | 0 | 1.042 | 2.105 | |
| synth_2 | 0 | 0 | 0 | 0 | 0 | 0 | 8.108 | 8.108 | 6.452 | 6.452 | 2.564 | 11.111 | |
| synth_3 | 0.536 | 2.404 | 0.105 | 0.316 | 0 | 0 | 0.431 | 0.431 | 0.526 | 1.919 | 0.105 | 0.423 | |
| synth_4 | 1.339 | 2.159 | 2.302 | 2.302 | 2.063 | 3.571 | 1.579 | 3.624 | 0.372 | 2.402 | 0 | 0 | |
| synth_5 | 0.810 | 4.538 | 0 | 0 | 2.5 | 6.187 | 10.017 | 12.650 | 0.148 | 0.148 | 1.439 | 2.174 | |
Bayesian statistical analysis
Comparing the performance of two cross-validated classifiers is traditionally achieved using Student’s t-test (Nadeau & Bengio, 2003) for comparisons over a single dataset, or using the Wilcoxon signed-rank test (Demsar, 2006; Nowozin, 2012) for comparisons over multiple datasets. Both of these tests are based on null-hypothesis significance testing (NHST), which Benavoli et al. (2017) argue is inadequate for the comparison of classifiers, particularly when their performance is nearly equivalent (possibility of draws). As such, to avoid these shortcomings, here we take inspiration from the analysis conducted in Hernández et al. (2021) and apply Bayesian statistical analysis methods for the comparison of ML models, as proposed in Benavoli et al. (2017) (to which we refer the reader for additional details). Namely, we employ the Bayesian signed-rank test, which enables the comparison of the cross-validation performance of two classifiers over a group of datasets, effectively replacing the traditionally used Wilcoxon signed-rank test.
Succinctly, the Bayesian signed-rank test compares two classifier models (M1 and M2) on a selected performance measure (in this case, weighted F1-score) over multiple cross-validation folds and multiple datasets. The output of the Bayesian test is a tuple of three posterior probabilities: the probability that M1 is better than M2, the probability that M1 and M2 are practically equivalent, and the probability that M2 is better than M1. Benavoli et al. (2017) further define the term “practically equivalent” through the region of practical equivalence (ROPE). For example, a ROPE of 0.01 (assuming a performance metric ranging in [0, 1]) means that two classifiers whose mean difference in the performance metric is less than 0.01 (or 1%) are practically equivalent. Moreover, to enable sensible automatic decisions based on the three posterior probabilities returned by the Bayesian signed-rank test, it is suitable to define a threshold p*, such that
if P(M1 > M2) ≥ p*, M1 is better than M2,
if P(M2 > M1) ≥ p*, M2 is better than M1,
if P(M1 = M2) ≥ p*, M1 and M2 are practically equivalent,
otherwise, no decision can be made.
The Bayesian signed-rank test was thus applied to the cross-validation results using the baycomp (https://baycomp.readthedocs.io/en/latest/) Python package (baycomp version 1.0.3); a minimal usage sketch is given after the decision rules below. Test results are shown in Table 11 for the DT algorithms and in Table 12 for the RF algorithms. We considered p* = 0.95, as in Benavoli et al. (2017), and performed tests for a ROPE of 1% and 2%. Moreover, for the proposed method Ê and each of the classically established entropies C, we consider that
if P(Ê > C) ≥ p*, it is a significant win (W) for Ê,
if P(Ê = C) ≥ p*, it is a significant draw (D) between Ê and C,
if P(Ê < C) ≥ p*, it is a significant loss (L) for Ê.
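A minimal baycomp sketch of this procedure is given below; the per-dataset scores are placeholders, while the ROPE of 1% and the decision threshold p* = 0.95 follow the text above.

```python
import numpy as np
from baycomp import two_on_multiple

# Mean cross-validated weighted F1-scores per dataset for the proposed criterion
# (first argument) and one classical criterion C; the values are placeholders.
scores_ehat = np.array([0.93, 0.88, 0.97, 0.74, 0.61, 0.82])
scores_c = np.array([0.91, 0.87, 0.96, 0.70, 0.55, 0.81])

# Bayesian signed-rank test with a 1% region of practical equivalence (ROPE).
p_win, p_rope, p_loss = two_on_multiple(scores_ehat, scores_c, rope=0.01)

p_star = 0.95                                   # decision threshold
if p_win >= p_star:
    print("significant win (W) for the proposed criterion")
elif p_rope >= p_star:
    print("significant draw (D)")
elif p_loss >= p_star:
    print("significant loss (L) for the proposed criterion")
else:
    print("no decision")
```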
| C | Depth | ROPE = 1% | ROPE = 2% | ||||||
|---|---|---|---|---|---|---|---|---|---|
| P (Ê > C) | P(Ê = C) | P(Ê < C) | Results | P(Ê > C) | P(Ê = C) | P(Ê < C) | Results | ||
| Gini | 10 | 0.99422 | 0.00578 | 0 | W | 0.11896 | 0.88104 | 0 | – |
| 50 | 1.00000 | 0 | 0 | W | 0.92820 | 0.07180 | 0 | – | |
| 100 | 0.99842 | 0.00158 | 0 | W | 0.40562 | 0.59438 | 0 | – | |
| 200 | 0.99984 | 0.00016 | 0 | W | 0.61030 | 0.38970 | 0 | – | |
| Shannon | 10 | 0.99682 | 0.00318 | 0 | W | 0.36518 | 0.63482 | 0 | – |
| 50 | 0.99858 | 0.00142 | 0 | W | 0.61088 | 0.38912 | 0 | – | |
| 100 | 0.99630 | 0.00370 | 0 | W | 0.40206 | 0.59794 | 0 | – | |
| 200 | 0.99976 | 0.00024 | 0 | W | 0.48300 | 0.51700 | 0 | – | |
| Rényi | 10 | 0.99994 | 0.00006 | 0 | W | 0.54572 | 0.45428 | 0 | – |
| 50 | 1.00000 | 0 | 0 | W | 0.84586 | 0.15414 | 0 | – | |
| 100 | 0.99994 | 0.00006 | 0 | W | 0.67818 | 0.32182 | 0 | – | |
| 200 | 1.00000 | 0 | 0 | W | 0.62926 | 0.37074 | 0 | – | |
| Tsallis | 10 | 0.99326 | 0.00674 | 0 | W | 0.11070 | 0.88930 | 0 | – |
| 50 | 1.00000 | 0 | 0 | W | 0.87864 | 0.12136 | 0 | – | |
| 100 | 0.99664 | 0.00336 | 0 | W | 0.30874 | 0.69126 | 0 | – | |
| 200 | 0.99964 | 0.00036 | 0 | W | 0.49862 | 0.50138 | 0 | – | |
| C | Depth | Estimators | ROPE = 1% | ROPE = 2% | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| P(Ê > C) | P(Ê = C) | P(Ê < C) | Results | P(Ê > C) | P(Ê = C) | P(Ê < C) | Results | | |
| Gini | 10 | 10 | 1.00000 | 0 | 0 | W | 0.99812 | 0.00188 | 0 | W |
| 10 | 25 | 0.99998 | 0.00002 | 0 | W | 0.89364 | 0.10636 | 0 | – | |
| 10 | 50 | 1.00000 | 0 | 0 | W | 0.99958 | 0.00042 | 0 | W | |
| 25 | 10 | 0.99886 | 0.00114 | 0 | W | 0.34598 | 0.65402 | 0 | – | |
| 25 | 25 | 0.99954 | 0.00046 | 0 | W | 0.57040 | 0.42960 | 0 | – | |
| 25 | 50 | 0.99992 | 0.00008 | 0 | W | 0.58474 | 0.41526 | 0 | – | |
| Shannon | 10 | 10 | 1.00000 | 0 | 0 | W | 0.99990 | 0.00010 | 0 | W |
| 10 | 25 | 1.00000 | 0 | 0 | W | 0.95558 | 0.04442 | 0 | W | |
| 10 | 50 | 1.00000 | 0 | 0 | W | 1.00000 | 0 | 0 | W | |
| 25 | 10 | 1.00000 | 0 | 0 | W | 0.97280 | 0.02720 | 0 | W | |
| 25 | 25 | 1.00000 | 0 | 0 | W | 0.93018 | 0.06982 | 0 | – | |
| 25 | 50 | 0.99996 | 0.00004 | 0 | W | 0.84228 | 0.15772 | 0 | – | |
| Rényi | 10 | 10 | 1.00000 | 0 | 0 | W | 0.99954 | 0.00046 | 0 | W |
| 10 | 25 | 1.00000 | 0 | 0 | W | 0.95534 | 0.04466 | 0 | W | |
| 10 | 50 | 1.00000 | 0 | 0 | W | 0.99986 | 0.00014 | 0 | W | |
| 25 | 10 | 1.00000 | 0 | 0 | W | 0.95498 | 0.04502 | 0 | W | |
| 25 | 25 | 0.99968 | 0.00032 | 0 | W | 0.55596 | 0.44404 | 0 | – | |
| 25 | 50 | 0.99886 | 0.00114 | 0 | W | 0.55484 | 0.44516 | 0 | – | |
| Tsallis | 10 | 10 | 1.00000 | 0 | 0 | W | 0.99944 | 0.00056 | 0 | W |
| 10 | 25 | 1.00000 | 0 | 0 | W | 0.99930 | 0.00070 | 0 | W | |
| 10 | 50 | 1.00000 | 0 | 0 | W | 0.99992 | 0.00008 | 0 | W | |
| 25 | 10 | 0.99998 | 0.00002 | 0 | W | 0.81052 | 0.18948 | 0 | – | |
| 25 | 25 | 0.99994 | 0.00006 | 0 | W | 0.42182 | 0.57818 | 0 | – | |
| 25 | 50 | 0.99962 | 0.00038 | 0 | W | 0.81208 | 0.18792 | 0 | – | |
Discussion
Through the analysis of Table 4, it is possible to see that the introduced method shows improvements over the classical entropies across almost all tree structures and datasets. Even with a relatively small number of search iterations (100), it frequently outperformed the existing methods. These improvements are, as expected, less noticeable for the weaker learner (DTs), where a higher proportion of null LI values is found. Nonetheless, it must be considered that the LI metric reflects a worst-case outcome, since comparisons are always made against the best-performing of the existing entropic measures. Generally, the combination of the proposed method with RFs yields a more substantial increase in performance, with fewer null LI values; the notable exception is the ‘wilt’ dataset, in which almost all methods showed similarly high performance. Moreover, in terms of HI, improvements of over 40% are observed, especially for the stronger RF models and particularly for the ‘Wine Quality’ dataset.
In general terms, the highest performance gains occur on multiclass and imbalanced datasets (e.g., ‘Wine Quality’, ‘Statlog’), where the chosen performance metric (weighted F1-score) is most demanding. Another example can be seen in the synthetic datasets: ‘synth_2’ and ‘synth_5’ are the two synthetic sets with the highest performance increase and, simultaneously, the only two multiclass synthetic datasets, albeit with balanced classes. Regarding data types, the performance of the proposed method does not appear to suffer; for instance, in datasets with mixed datatypes (e.g., ‘Students’), the introduced criterion is still able to outperform the classical methods. The results for the remaining metrics, namely Balanced Accuracy, Precision, and Recall (seen in Tables 5-10), generally show a smaller improvement of the proposed method over the standard methods. This is, however, expected, as these metrics do not accurately reflect the large data imbalance that exists in several of the tested datasets. Nevertheless, the results provide some additional insights into the method’s behavior. Specifically, the gains in weighted F1-score appear to stem not only from a modest overall improvement in classification balance, as reflected in the slight increases in Balanced Accuracy, but also from an improved detection of minority classes, as indicated by the relative improvements in Recall for underrepresented categories.
The largest advantage of the proposed method as a splitting criterion is that it never introduces a performance penalty: at worst, it equals the classical measures. This is shown more clearly in Tables 11 and 12. When using a ROPE value of 1% (as suggested in Benavoli et al. (2017) and later applied in Hernández et al. (2021) for the comparison of split criteria), our method wins against all other methods for both DTs and RFs. When increasing the ROPE to 2%, it still wins significantly in half the tests when using RFs. Note that draws were never significant and that the proposed method never lost against the classical ones. Furthermore, the statistical results in Table 12 show that using Ê as a splitting criterion is particularly advantageous in shallow tree ensembles. A possible interpretation is that the low maximum depth, in combination with our criterion Ê, allows the classifier to generalize more easily to unseen data.
The proposed method does come with implementation challenges: the foundational structure of the devised DTs/RFs is not optimized to the level of established implementations such as scikit-learn, resulting in considerably higher computing times, although the proposed split criterion could nonetheless be integrated as a component within those algorithms. As such, a compromise had to be made in the number of sampled hyperparameters (both for the introduced parametrization and for conventional DT/RF parameters). Moreover, compared with, for instance, Gini, the computational cost of Ê is much greater due to the introduction of logarithmic functions. This nonetheless seems a reasonable middle ground considering the added flexibility introduced by our method, which can be of particular value in combination with solutions such as that discussed in Loyola-González, Ramírez-Sáyago & Medina-Pérez (2023), where the dynamic selection of splitting criteria at each split could greatly benefit from a unified parametric expression that can easily be searched, for instance, with a genetic search algorithm.
Conclusions
Tree-based classifiers are widely used in practical machine-learning applications. The DT, in particular, despite its simplicity, can deliver reasonable performance while being easily interpretable. It is also the foundational component of stronger learning models, such as RFs. These factors continue to motivate research into the further development of DTs. The work described in this article focuses on the data-splitting process, which is one of the key algorithmic mechanisms of DTs. For this, we introduced a 5-parameter expression that aims to generalize the entropies traditionally employed as target functions to partition data at DT nodes. This expression, Ê, is capable of, for instance, retrieving the Gini impurity, Shannon entropy, Rényi entropy, and Tsallis entropy. Moreover, because of its flexibility, other entropic estimators can also be obtained through parametric manipulation.
To test its applicability as a splitting criterion, a target function was formulated based on Ê and employed in both DTs and RFs. This criterion was then validated against the most common information-theoretic measures of entropy on multiple datasets (both synthetic and derived from open data repositories), using 5-fold cross-validation and parameter tuning through a reduced 100-iteration Bayesian search. From the preliminary results, two main conclusions can be drawn: (i) performance improvements can reach upwards of 40% in F1-score compared to classical measures of entropy; and (ii) in the worst-case scenario, no performance penalization is introduced through the use of Ê (it is merely equivalent to the best method). To validate the latter claim, an additional Bayesian statistical analysis was conducted (signed-rank test), showing that, for the conventional ROPE value of 1%, our method only produced significant wins against comparable alternatives. Furthermore, our method never suffered a significant loss, confirming that, at worst, it matches the performance of the best standard approach. These findings underscore the importance of a careful selection of the splitting strategy.
As future research paths, multiple potential advancements can be identified, but perhaps the most relevant are: a deeper study into the influence of each parameter of Ê on the overall performance of the classifier, guiding prospective practitioners and reducing the computational cost of extensive search spaces; the adaptation of this target function to regression problems; the combination of this method with the dynamic selection of split criteria (as discussed in Loyola-González, Ramírez-Sáyago & Medina-Pérez (2023)), taking further advantage of the flexibility introduced by Ê; and the application of the proposed method in streaming-data scenarios through the incorporation of Ê into the internal operating structure of Hoeffding Trees.