A feature selection method utilizing path accumulation cost, redundancy minimization, and interaction maximization for the diagnosis of coronary heart disease

PeerJ Computer Science

Introduction

Coronary heart disease (CHD) presents a significant global health challenge, accounting for approximately 17.9 million deaths annually. Myocardial infarctions and cerebrovascular accidents are primary contributors to CHD mortality, often affecting individuals without apparent symptoms in the early stages (World Health Organization, 2024). Rising stress levels and unhealthy lifestyles, together with risk factors such as obesity, hypercholesterolemia, hypertension, and smoking, have exacerbated the prevalence of CHD in younger populations (Gonçalves et al., 2022).

Early identification of CHD risk is crucial to prevent severe outcomes. However, conventional diagnostic methods, which rely on numerous overlapping markers, can be resource-intensive and inefficient (Shahid et al., 2020). Feature selection techniques offer promising approaches by reducing diagnostic markers, leading to improved screening models (Li et al., 2020).

Related work

Recent advancements in machine learning have made it a powerful tool for the early diagnosis and prediction of CHD, particularly in handling high-dimensional medical datasets. Machine learning models, such as decision trees, random forests, and support vector machines, have been widely applied in feature selection for CHD prediction, as they are capable of capturing complex, non-linear relationships within the data.

Feature selection methodologies can be broadly categorized into filter, wrapper, and embedded approaches (Zeng et al., 2015). In early CHD prediction, wrapper and embedded methods have gained prominence. Lin et al. (2022) employed a combination of gradient boosting tree feature selection and recursive feature elimination to predict acute coronary syndrome (ACS) risk, successfully extracting 25 key variables from 430 complex medical features and achieving an impressive 98.8% accuracy in ACS risk prediction. Similarly, Javeed et al. (2019) integrated a random search algorithm (RSA) with an optimized random forest (RF) model, achieving a commendable 93.33% accuracy on the Cleveland dataset. Mohamed et al. (2020) developed a novel Parasite-Predator Algorithm (PPA) by amalgamating the strengths of Cat Swarm Optimization (CSO), Cuckoo Search (CS), and the Crow Search Algorithm (CSA). In PPA, a fitness function is used to select the feature subset that maximizes accuracy while minimizing the number of selected features. This approach achieved an 86.17% accuracy on the Statlog dataset, ultimately selecting only four key features (Mohamed et al., 2020).

$\mathrm{Fitness}=\operatorname{maximize}\left(Acc+w_f\left(1-\frac{L_f}{L_t}\right)\right)$, where $Acc$ is the classification accuracy, $L_f$ is the length of the selected feature subset, $L_t$ is the total number of features, and $w_f$ is a weighting coefficient.
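For illustration, a minimal Python sketch of this fitness computation (the function name and the default weight value are ours, not from the cited work):

def ppa_fitness(acc, n_selected, n_total, w_f=0.01):
    """Fitness = Acc + w_f * (1 - L_f / L_t): reward accuracy, lightly penalize larger subsets."""
    return acc + w_f * (1.0 - n_selected / n_total)

# Example: 86.17% accuracy with 4 of 13 features selected.
print(ppa_fitness(0.8617, 4, 13))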

Additionally, Bharti et al. (2021) conducted experiments on the UCI Heart dataset and found that combining feature selection with outlier detection yielded the best results. They applied LASSO for feature selection and trained models using machine learning and deep learning techniques. In the end, the deep learning model achieved an accuracy of 94.2%, while the KNN model achieved 84.8% (Bharti et al., 2021). Fajri, Wiharto & Suryani (2023) proposed a hybrid feature selection method combining Bee Swarm Optimization (BSO) and Q-learning (QBSO-FS), which improved heart disease detection accuracy, achieving 84.86% accuracy under the KNeighbors model, with faster convergence and better performance than traditional methods.

$FC_s=F_s-\theta C_s$, where $FC_s$ is the evaluation value of feature selection, $F_s$ is the accuracy of feature selection, $C_s$ is the cost of the feature subset, and $\theta$ is the adjustment coefficient.

Despite their targeted nature, wrapper and embedded methods have clear limitations. In particular, wrapper methods require extensive model training and evaluation in the search for an optimal feature subset, a process that becomes especially laborious for large datasets or an abundance of features. Conversely, embedded methods rely heavily on specific models, which may limit their generalizability.

In contrast, filter methods, which rapidly identify features related to the target variable through statistical theory and are computationally simple and independent of any classification model, have emerged as a practical and efficient alternative for managing complex medical datasets, such as those used in early screening prediction for CHD. Muhammad et al. (2020) introduced a machine learning-based CHD detection model, employing four feature selection techniques with ten classification algorithms, significantly improving classification accuracy. Reddy et al. (2023) enhanced the performance of various classifiers in cardiac risk prediction by combining the Cleveland and Statlog heart datasets and employing principal component analysis (PCA) and correlation-based feature selection (CFS). Peng, Long & Ding (2005) proposed the mRMR method based on information theory, which uses mutual information between features to discern relevant and redundant features. Yu & Liu (2003) introduced the FCBF method, which selects predominant features and eliminates highly redundant ones through the computation of symmetrical uncertainty (SU) and Markov blankets.

However, these filter methods primarily evaluate the association of individual features with classification, disregarding the synergistic impact of combined features on the prediction. This oversight could potentially lead to the erroneous exclusion of important features or the retention of irrelevant ones, particularly in complex applications like CHD risk prediction. To address this challenge, Fleuret (2004), Bennasar, Hicks & Setchi (2015), Zeng et al. (2015), and Wang, Jiang & Jiang (2021) introduced various feature selection criteria (CMIM, JMIM, NJMIM, IWFS, MRMI), focusing on enhancing feature relevance, interaction, and redundancy removal, significantly improving the quality of feature selection and prediction accuracy. These advancements have collectively enhanced the quality of feature selection and prediction accuracy to a considerable extent. Notwithstanding the success of the above methods, they do not fully consider the relationships among all existing features and candidate features, potentially leading to the exclusion of relevant features or the inclusion of too many redundant features.

Contribution

Existing early screening predictive models for coronary heart disease (CHD) often struggle with managing the intricate relationships between existing and candidate features. This issue may lead to the omission of important features or the inclusion of unnecessary redundancies. To address this issue, this article introduces an early screening predictive model based on a novel CHD feature selection algorithm. Our model emphasizes the interactions among multiple diagnostic markers and their linkage to CHD. Further, it selects a feature subset where the strong associations between features yield more valuable information combinations, overcoming the above limitations and laying the groundwork for developing more sophisticated and effective CHD screening models. The main contributions are as follows:

  • 1)

    We propose a feature selection approach based on soft path accumulative cost. This approach identifies the feature with the highest interaction by calculating the interaction of each current feature with all previous features. The final outcome is a subset of features with the strongest interrelationships, which best represent CHD.

  • 2)

    To comprehensively evaluate the interactions among multiple features, we propose a novel interaction metric, Rel, and an accumulative interaction evaluation function SPA that measures the relationships between all existing features and the current candidate feature.

  • 3)

    We introduce a novel stop criterion evaluation function to balance model evaluation performance and feature subset size. This function determines when to halt feature selection by measuring the increase in information gain and setting hyperparameters to balance the number of features with the final model evaluation performance.

The article is structured as follows: “Prior Knowledge of Feature Selection” introduces the background theoretical knowledge relevant to our discussion; “Methods” outlines the development process of our proposed early screening predictive model; “Results” describes the experimental setup and analyzes the findings. Finally, “Discussion and Conclusion” examines the implications of our study for improving CHD early screening and proposes directions for future advancements.

Prior knowledge of feature selection

Before introducing the algorithm, we define the relevant formulas.

For a pair of discrete features Fi,Fj, the relationship between features is defined by Mutual Information (MI) I(Fi,Fj):

$I(F_i,F_j)=\sum_{F_i}\sum_{F_j}p(F_i,F_j)\log\left(\frac{p(F_i,F_j)}{p(F_i)p(F_j)}\right)$, where $p(F_i,F_j)$ is the joint probability distribution of $F_i$ and $F_j$, and $p(F_i)$ and $p(F_j)$ are their marginal distributions, respectively.

Normalizing MI results in Symmetrical Uncertainty (SU), which is given by the following formula:

$SU(F_i,F_j)=\frac{2I(F_i,F_j)}{H(F_i)+H(F_j)}$, where $H(F_i)$ and $H(F_j)$ are the entropies of $F_i$ and $F_j$, respectively.
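As a concrete illustration, the following minimal Python sketch estimates I(Fi, Fj) and SU(Fi, Fj) from two discrete feature columns using plug-in estimates of the empirical frequencies (function names and the base-2 logarithm are our choices, not the authors' implementation):

import numpy as np
from collections import Counter

def entropy(x):
    """Empirical entropy H(X) of a discrete sequence, in bits."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    """Plug-in estimate of I(X, Y) = sum p(x, y) * log[ p(x, y) / (p(x) p(y)) ]."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 I(X, Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_information(x, y) / (hx + hy)

fi = [0, 0, 1, 1, 1, 0]
fj = [0, 0, 1, 1, 0, 0]
print(symmetrical_uncertainty(fi, fj))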

Given that our proposed algorithm is based on feature interaction, we use the three-way interaction mutual information I(Fi,Fj,C) to measure the relationship between a pair of discrete features {Fi,Fj} and the label C (Tang, Dai & Xiang, 2019).

$I(F_i,F_j,C)=\sum_{F_i}\sum_{F_j}\sum_{C}p(F_i,F_j,C)\log\left(\frac{p(F_i,F_j)\,p(F_i,C)\,p(F_j,C)}{p(F_i)\,p(F_j)\,p(C)\,p(F_i,F_j,C)}\right).$

For I(Fi,Fj,C), the selected feature Fj should meet the condition of being relevant to the label C and, in combination with the already selected feature Fi, providing more effective information (Wang, Jiang & Jiang, 2021), that is,

$I(F_i,F_j,C)>I(F_i,C)+I(F_j,C)\quad\text{or}\quad I(F_i,F_j,C)>0.$
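Sign conventions for interaction information differ across the literature; the short sketch below computes the synergy-oriented quantity used in the selection condition above, I({Fi, Fj}, C) − I(Fi, C) − I(Fj, C), which is positive exactly when the pair is jointly more informative about C than the two features are separately. It reuses mutual_information from the previous sketch and is a stand-in estimator, not the authors' implementation:

def interaction_gain(fi, fj, c):
    """I({Fi, Fj}; C) - I(Fi; C) - I(Fj; C): positive when Fi and Fj together
    tell us more about the label C than they do individually."""
    joint = list(zip(fi, fj))                      # treat the pair (Fi, Fj) as a single variable
    return (mutual_information(joint, c)
            - mutual_information(fi, c)
            - mutual_information(fj, c))

# XOR-like toy example: each feature alone carries no information about C,
# but the pair determines C completely, so the gain is +1 bit.
fi = [0, 0, 1, 1]; fj = [0, 1, 0, 1]; c = [0, 1, 1, 0]
print(interaction_gain(fi, fj, c))   # 1.0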

A Markov Blanket is a criterion for judging the redundancy of a given feature $F_i$ in a feature set $F=\{F_1,F_2,F_3,\ldots,F_m\}$. For a subset $M_i\subseteq F$ with $F_i\notin M_i$, $M_i$ is called the Markov Blanket of $F_i$ if and only if

$P(F-M_i-\{F_i\},C\mid F_i,M_i)=P(F-M_i-\{F_i\},C\mid M_i).$

If a given feature Fi in the current feature set F has its corresponding Markov Blanket Mi, it is considered redundant to the remaining features in the set.

For MI and SU, solely using mutual information I(Fi,Fj) to assess the relationships between features can inadvertently bias the algorithm towards selecting features with higher entropy, which indicates more significant variability or complexity in their distribution. Although features with high entropy are information-rich, this does not inherently mean they share the closest relationship with the target variable. The normalization of MI through symmetrical uncertainty (SU) accounts for the entropies of both variables involved, thus ensuring that the metric no longer depends on the intrinsic information content of the variables themselves. By applying SU to normalize MI, the inherent bias associated with selecting features based on MI is rectified, shifting the focus towards the relative importance of features in relation to the target (Yu & Liu, 2003). Additionally, this normalization introduces an increased amount of feature information into the selection process.

On the other hand, unlike MI, the I(Fi,Fj,C) does not need to be positive. If Fi,Fj have a strong positive interaction, this means they can provide more information for the label C than Fi,Fj can each provide on their own. Moreover, for Fi,Fj, missing any one of the features would reduce the accuracy of the final classification result. If the interaction is negative or zero, it implies that Fi and Fj may offer redundant information or information irrelevant to the label C. Hence, an effective interaction between features should be where I(Fi,Fj,C) is greater than zero and at its maximum.

An information-theoretic perspective on feature selection

The fundamental goal of feature selection is to extract a key subset from the original feature set such that a classifier trained on this subset achieves classification performance comparable to, or even better than, that obtained using the full feature set. This goal can be formalized by the following conceptual function: under the constraint of classification performance, one seeks the smallest feature subset. That is,

$\min_{S\subseteq F}|S|\quad\text{subject to}\quad f(S)\geq f(F)-\varepsilon,$ where $f(S)$ denotes the accuracy of a classifier trained on the subset $S$ and $\varepsilon$ represents the permissible degradation in performance. This formulation not only encapsulates the requirement to maintain performance close to that of the full feature set but also underscores the objectives of dimensionality reduction and redundancy minimization.
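As a minimal operational illustration of this formulation, one can check whether a candidate subset S satisfies the performance constraint with a cross-validated classifier (the classifier choice, fold count, and epsilon value are placeholders, not prescribed by the article):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def satisfies_constraint(X, y, subset, epsilon=0.01, cv=10):
    """True if f(S) >= f(F) - epsilon, with f(.) estimated by mean cross-validated accuracy."""
    clf = RandomForestClassifier(random_state=0)
    f_full = cross_val_score(clf, X, y, cv=cv).mean()               # f(F): all features
    f_subset = cross_val_score(clf, X[:, subset], y, cv=cv).mean()  # f(S): candidate subset only
    return f_subset >= f_full - epsilon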

Filter-based feature selection methods grounded in information theory assess two key aspects: the redundancy among features and the relevance of each feature to the target labels. Traditionally, mutual information (MI) is used for both purposes, with I(Fi,Fj) measuring the redundancy between features Fi and Fj, and I(Fi,C) quantifying the relevance between a feature Fi and the target C. However, MI’s inherent limitations in scale and normalization have led to the adoption of symmetric uncertainty (SU) as a normalization tool, thereby enabling more consistent comparisons. Despite these improvements, the use of bivariate MI remains insufficient when trying to capture the additional joint gain achieved by combining a candidate feature with an already selected feature to explain the target. This deficiency motivates the introduction of three-way interaction mutual information, which is designed to account for the cooperative contribution of feature pairs in relation to the target.

The three-way interaction mutual information, denoted by I(Fi,Fj,C), quantifies the additional information provided by the candidate feature Fj in the presence of a selected feature Fi when explaining the target C. Nevertheless, directly using this three-variable measure presents two issues. On one hand, it does not distinguish the redundancy that may exist between Fi and Fj. On the other hand, the candidate feature Fj might already exhibit high direct relevance to the target C, leading to potential double counting of its contribution.

To address these issues, we introduce two modulation terms based on symmetric uncertainty. The first modulation term is

$\frac{I(F_i,F_j,C)}{SU(F_i,F_j)},$ which adjusts for the redundancy between $F_i$ and $F_j$. A high value of $SU(F_i,F_j)$ indicates substantial redundancy between the two features, thereby reducing the value of this term and reflecting that a large portion of the joint information is redundant.

The second modulation term is

$\frac{I(F_i,F_j,C)}{SU(F_j,C)},$ which accounts for the direct relevance of the candidate feature $F_j$ to the target $C$. When $F_j$ is highly correlated with $C$, the high value of $SU(F_j,C)$ implies that much of the joint information is already captured by $F_j$'s individual relevance, so this contribution should not be overemphasized when assessing its unique gain.

Compared to conventional methods that rely solely on bivariate MI or SU to evaluate feature relevance and redundancy, this approach, which employs three-way interaction mutual information modulated by SU, offers a more comprehensive framework. It enables a refined assessment of a candidate feature's joint gain, facilitating the selection of a compact feature subset that preserves or enhances classification performance while reducing redundancy.

Methods

We propose a feature selection algorithm based on feature correlation, redundancy, and interaction, named Soft Path Feature Selection (SPFS), which iteratively selects the features with the highest interaction through a path accumulation cost while minimizing redundancy as much as possible.

As illustrated in Fig. 1, the SPFS algorithm introduces an interaction metric, Rel(Fi,Fj,C), which quantifies the interactions between features to balance their redundancy and relevance. Furthermore, the algorithm introduces an evaluation function called soft path accumulation (SPA). While ensuring the relevance of candidate features to the class label, it comprehensively considers relationships between all the selected features and each candidate feature, aiming to find the optimal balance between feature count and information.


Figure 1: Flowchart of the proposed method.

This flowchart illustrates the SPFS algorithm, which uses interaction metrics and soft path accumulation (SPA) to iteratively select features, balancing redundancy, relevance, and information to optimize the feature subset.

The algorithm inputs a subset of candidate features and utilizes the Rel and the SPA to comprehensively assess each candidate feature’s interaction with the previously selected best feature subset. It then selects the candidate with the strongest interaction to add to the current best subset. The feature selection process is iterative. At each step, the algorithm uses the selection stop scoring function score to determine whether to continue or stop. If a negative score appears, the algorithm records the current best feature subset and begins a count. If this count exceeds the predefined threshold, k_stale, the algorithm stops iteration and selects the subset with the highest score from the recorded best subsets as the final best feature subset.

Quantitative evaluation of feature interactions and SPA evaluation function

From the interaction condition given in “Prior Knowledge of Feature Selection”, it is evident that a newly added feature Fj, when combined with a previously selected best feature, should provide more information than each feature does individually. However, as described in “An Information-Theoretic Perspective on Feature Selection”, highly correlated overlapping features may introduce excessive redundancy, which can be mitigated by using symmetrical uncertainty (SU) to balance redundancy and relevance. Specifically, for a candidate feature Fj and an already selected feature Fi, if SU(Fj,C)<SU(Fi,Fj), the relationship between Fj and Fi is stronger than that between Fj and the label C; in such cases, Fj and Fi are considered highly redundant, indicating an overlap of information between them.

Building on the characteristics of SU, we further propose an interaction metric, Rel(Fi,Fj,C). Rel(Fi,Fj,C) is defined as:

$Rel(F_i,F_j,C)=\frac{I(F_i,F_j,C)}{SU(F_i,F_j)}-\frac{I(F_i,F_j,C)}{SU(F_j,C)}.$

This metric evaluates the interactions between features by considering the three-way mutual information I(Fi,Fj,C) and their symmetrical uncertainty SU, aiding in identifying feature combinations that may contribute limited predictive efficacy due to high redundancy.

Through Rel(Fi,Fj,C), we can identify features Fj that are highly related to Fi yet exhibit low redundancy. However, focusing solely on Rel(Fi,Fj,C) may lead to an overemphasis on the relationships among features without adequately considering the direct relevance of the candidate feature Fj to the label C. Neglecting the direct association between candidate features and the label may result in a selected feature set that fails to establish effective predictive relationships, thereby impacting the model’s predictive performance.
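A minimal sketch of Rel under the difference-of-ratios form given above (the interaction normalized by the inter-feature redundancy, minus the interaction normalized by the candidate's direct relevance). It reuses symmetrical_uncertainty and interaction_gain from the earlier sketches; since the interaction estimator is our stand-in, treat this as an illustration of the metric's shape rather than the authors' exact implementation:

def rel(fi, fj, c):
    """Rel(Fi, Fj, C) = I(Fi, Fj, C) / SU(Fi, Fj) - I(Fi, Fj, C) / SU(Fj, C).
    For a positive interaction, Rel is negative exactly when SU(Fj, C) < SU(Fi, Fj),
    i.e. when Fj is more strongly tied to Fi than to the label (high redundancy)."""
    i3 = interaction_gain(fi, fj, c)
    su_ij = symmetrical_uncertainty(fi, fj)
    su_jc = symmetrical_uncertainty(fj, c)
    if su_ij == 0 or su_jc == 0:          # guard against degenerate (constant) features
        return float("-inf")
    return i3 / su_ij - i3 / su_jc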

To address this issue, the SPA evaluation function is designed to consider both the interactions among features and their direct information contributions to the label. Specifically, the SPA scoring criterion incorporates the interaction measure Rel(Fi,Fj,C) and explicitly includes the information contribution measure between the candidate feature Fj and the label C. The SPA function evaluates the composite effect of a candidate feature on the entire best feature subset by cumulatively calculating the weighted interactions between the candidate feature and all the features already selected in the best subset. The SPA scoring criterion is defined as:

$SPA=\sum_{F_i\in best}SU(F_j,C)\,Rel(F_i,F_j,C)$, where $best$ is the subset of already selected best features, $F_j$ is the candidate feature, and $F_i$ is a previously selected best feature.

Based on the SPA evaluation function, we defined the SPFS selection strategy as follows:

$SPFS=\operatorname*{arg\,max}_{F_j\in candidate}\left(\sum_{F_i\in best}SU(F_j,C)\,Rel(F_i,F_j,C)\right)$, where $candidate$ is the subset of candidate features. Through the SPFS scoring criterion, by considering the relevance of the current candidate feature $F_j$ to the label $C$ (weighted by $SU(F_j,C)$), we accumulate the interactive effects between the candidate feature and all previously selected features to select the candidate feature with the highest interaction with any feature in the best subset. This means that, for the selected best features, the chosen candidate feature must demonstrate an unparalleled interaction with at least one feature in the best subset.
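The per-candidate accumulation and the argmax selection can then be sketched as follows, reusing the helper functions above (the column-index bookkeeping and NumPy conventions are ours):

def spa_score(X, y, j, best_idx):
    """SPA for candidate column j: sum over selected columns i of SU(Fj, C) * Rel(Fi, Fj, C)."""
    fj, c = list(X[:, j]), list(y)
    su_jc = symmetrical_uncertainty(fj, c)
    return sum(su_jc * rel(list(X[:, i]), fj, c) for i in best_idx)

def spfs_select(X, y, candidates, best_idx):
    """SPFS selection strategy: the candidate with the largest SPA score joins the best subset."""
    return max(candidates, key=lambda j: spa_score(X, y, j, best_idx))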

Stop scoring function

Because the search for the most interactive features could otherwise continue indefinitely, a reasonable stop scoring function is needed to identify an optimal feature subset that balances the number of features against prediction accuracy. We propose that a feature should be included in the best subset only if its addition significantly increases the information gain of the subset. The information gain ratio is widely used to differentiate the contributions of features because it reflects both a feature's entropy and its information gain with respect to the target variable, allowing us to precisely evaluate the effectiveness of different feature subsets.

The score is defined as follows, where gR(D,best) and gR(D,best_before) are the information gain ratios of the current best feature subset and the previous best feature subset over the entire dataset D, respectively.

$score=g_R(D,best)-g_R(D,best\_before)=\frac{g(D,best)}{H_{best}(D)}-\frac{g(D,best\_before)}{H_{best\_before}(D)}.$

The score is calculated to determine whether the current best feature subset provides a more significant information gain boost compared to the previous best feature subset. The conditions are as follows:

$score\;\Rightarrow\;\begin{cases}\text{The newly selected feature brings useless information,}&\text{if }score<0,\\\text{The newly selected feature brings beneficial information,}&\text{if }score>0.\end{cases}$

More precisely, these conditions are explained in detail as follows:

  • If score<0, it indicates that the newly selected feature does not aid in the prediction accuracy and only increases the feature complexity.

  • If score>0, it suggests that the newly selected feature brings more beneficial information, aiding the model in making accurate predictions.

However, in some cases, the newly selected feature may not initially provide a significant boost in information gain, but as subsequent features are combined, it may produce better effects. To address this, we introduced a hyperparameter k_stale to delay the stop. When the number of times score<0 reaches the threshold specified by k_stale, the selection of new features will be stopped. This approach allows for more flexible handling of early stopping issues caused by local minima in information gain.
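A compact sketch of the stop criterion, reusing entropy and mutual_information from the earlier sketch: the information gain ratio of a subset is its joint information gain about the label divided by the subset's own (split) entropy, and score is the change in that ratio after adding the newest feature. Treating the subset as a single joint variable is our stand-in for computing gR over a feature subset:

def gain_ratio(X, y, subset_idx):
    """g_R(D, subset) = [H(C) - H(C | subset)] / H(subset)."""
    if not subset_idx:
        return 0.0
    joint = list(zip(*[X[:, i] for i in subset_idx]))   # the subset viewed as one joint variable
    h_subset = entropy(joint)
    if h_subset == 0:
        return 0.0
    return mutual_information(joint, list(y)) / h_subset  # I(subset; C) = H(C) - H(C | subset)

def stop_score(X, y, best_idx, best_before_idx):
    """score = g_R(D, best) - g_R(D, best_before); negative means the newest feature adds no useful information."""
    return gain_ratio(X, y, best_idx) - gain_ratio(X, y, best_before_idx)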

Procedures of soft path feature selection

As illustrated in Algorithm 1, given a dataset D containing n samples, m features, and a class feature C, the SPFS algorithm identifies the best feature subset best. The SPFS feature selection process involves four primary stages: initialization, k_stale tuning, pre-computation, and feature selection.

Algorithm 1:
Feature selection algorithm.
1: Input: Dataset D with a full feature set F = {F1, F2, …, Fm}, the class C, and the range of k_stale (k_range)
2: Output: Selected feature subset
3: Initialization
4: candidate ← F; best ← ∅; count ← 0; score ← −∞; best_score ← −∞;
5: SSU ← ∅; SRel ← ∅; gain_least ← [];
6: k_stale Tuning
7: for each k_stale_value in k_range do
8:    Perform the Feature-selection step using the current k_stale_value;
9:    Update best_k_stale if performance improves
10: end for
11: Pre-computation
12: for each Fi, Fj ∈ F and Fi ≠ Fj do
13:    Calculate SSU[i] = SU(Fi, C);
14:    if I(Fi, Fj, C) ≥ 0 then
15:      Set SRel[i, j] = Rel(Fi, Fj, C)
16:    else
17:      Set SRel[i, j] = −∞
18:    end if
19: end for
20: Order candidate by SSU in descending order
21: Feature-selection
22: while candidate is not empty and count < k_stale do
23:    Fi ← getFirstElement(candidate)
24:    Remove Fi from candidate
25:    best ← best ∪ {Fi}
26:    for each Fj ∈ candidate do
27:      Calculate SPA for Fj considering the current best
28:      Select the feature Fbest with the SPFS selection strategy, and best ← best ∪ {Fbest}
29:      Remove Fbest from candidate
30:      Calculate score for the current best
31:      if score < 0 then
32:        count ← count + 1
33:      end if
34:      if count > k_stale then
35:        Select the best feature subset best with the largest score
36:      end if
37:    end for
38: end while
39: End with the best feature subset best
DOI: 10.7717/peerj-cs.2834/table-6

In the initialization stage, the candidate feature subset, the best feature subset, and relevant algorithm parameters are initialized. During the pre-computation stage, the SU values for each feature and the mutual interaction Rel table between every pair of features and the label are calculated. The features are then ranked in descending order of their SU values to form the candidate feature subset.

The third stage focuses on tuning the hyperparameter k_stale. Each k_stale_value in k_range is applied during the feature selection phase, and the performance of each is evaluated to select the best k_stale for the final feature selection stage.

The fourth stage is the key step of the SPFS algorithm and is divided into two steps:

  1. Select individual features: First, when the best feature subset is empty, the top-ranked feature from the sorted candidate feature subset is selected as the first feature of the best feature subset. In each iteration, the SPA evaluation function accumulates the interaction score of each candidate feature Fj with the selected subset. Then, using the SPFS selection strategy, the highest-scoring feature Fbest is chosen as the new feature in the best subset.

  2. Select the final best feature subset: After each feature selection, the score is calculated; if the score is positive, iteration continues. When a negative score first appears, iteration continues, but it enters a delayed stopping step, counting the number of times the score is below zero and recording the best subset selected. When the number of times the score is below zero exceeds k_stale, the addition of new features stops, and the algorithm selects the highest-scoring feature subset from these k_stale candidate best subsets as the final determined best subset.

Moreover, our algorithm also focuses on the stability of features. By observing whether the selected features remain consistent across multiple iterations and different data partitions, the stability of the feature selection process is assessed. A highly stable feature selection process indicates that the selected features have a genuine and reliable association with the target variable.

Ultimately, the SPFS algorithm identifies an optimal feature subset best that is not overly complex but can effectively predict the class feature C, derived after considering feature importance, interaction, and feature stability. Through this approach, the algorithm not only enhances the prediction accuracy of the model but also avoids unnecessary model complexity, thereby providing an efficient and reliable feature selection solution for practical applications.
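Putting the pieces together, a condensed sketch of the overall selection loop (SU-ranked initialization, SPA-driven selection, and the k_stale delayed stop), based on our reading of Algorithm 1 rather than the authors' reference implementation; whether the stale counter is cumulative or resets on a positive score is an assumption here:

def spfs(X, y, k_stale=2):
    """Return the recorded feature-index subset with the highest gain ratio."""
    c = list(y)
    # Pre-computation: order candidate columns by SU(Fi, C), descending.
    candidates = sorted(range(X.shape[1]),
                        key=lambda i: symmetrical_uncertainty(list(X[:, i]), c),
                        reverse=True)
    best = [candidates.pop(0)]                                   # seed with the top-ranked feature
    recorded = [(gain_ratio(X, y, best), list(best))]
    negatives = 0
    while candidates and negatives < k_stale:
        prev = list(best)
        chosen = spfs_select(X, y, candidates, best)             # SPA-driven choice
        candidates.remove(chosen)
        best.append(chosen)
        if stop_score(X, y, best, prev) < 0:                     # delayed stop: tolerate k_stale misses
            negatives += 1
        recorded.append((gain_ratio(X, y, best), list(best)))
    return max(recorded, key=lambda t: t[0])[1]                  # best-scoring recorded subset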

Results

In this section, we first design experiments to evaluate the performance of the SPFS algorithm by comparing it with other representative feature selection algorithms, and then report the empirical results.

Experiment setup

This research is dedicated to the development of an early screening and prediction model for coronary heart disease (CHD) that is particularly tailored for the young and middle-aged demographic, aiming to minimize the number of diagnostic markers needed. For this purpose, we analyzed four datasets associated with cardiac health and CHD. The first dataset, Heart, from the UCI Machine Learning Repository, features a simplified binary classification of heart disease severity and excludes records with missing values. The Z-Alizadeh Sani dataset contains data from 303 patients at the Shaheed Rajaei Cardiovascular Center (Alizadehsani et al., 2013). Following this, Cardiovascular (Lin et al., 2022), from a tertiary hospital in Fujian, China, includes 2,702 patient records and focuses on predicting all-cause mortality, with all data anonymized. Lastly, CHD (Cao et al., 2022), approved by the ethics committee and anonymized, contains records from 715 patients at a tertiary hospital in Anhui, China. The Cardiovascular and Coronary Heart Disease datasets are open-source datasets from published articles that already underwent ethical review in their original studies; therefore, this research does not require additional ethical approval.

Table 1 details the datasets used, including the number of features, cases, and categories. In the data preprocessing stage, since the obtained datasets had already undergone certain preprocessing steps by their original authors, we only applied discretization to the datasets. Discretization aims to convert continuous numerical variables into discrete categorical values (such as 0 and 1), facilitating subsequent feature selection and classification prediction. Specifically, we utilized the Minimum Description Length Principle (MDLP), a widely recognized discretization method proposed by Fayyad & Irani (1993), to discretize all continuous numerical features. Additionally, for the imbalanced Cardiovascular dataset, the SMOTE technique was applied to address data imbalance issues during the classification stage, consistent with the methodology described in the original article. The implementation code for SPFS can be found at https://github.com/YUkiJiang559/SPFS.
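The preprocessing step can be sketched as follows; scikit-learn's KBinsDiscretizer is used here purely as a stand-in for the MDLP discretizer described above, and imblearn's SMOTE handles the class imbalance of the Cardiovascular dataset:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from imblearn.over_sampling import SMOTE

def preprocess(X_continuous, y, n_bins=2, balance=False):
    """Discretize continuous columns (stand-in for MDLP) and optionally rebalance with SMOTE."""
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    X_disc = disc.fit_transform(X_continuous)
    if balance:                                   # applied only to the imbalanced Cardiovascular data
        X_disc, y = SMOTE(random_state=0).fit_resample(X_disc, y)
    return X_disc, np.asarray(y)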

Table 1:
Summary of benchmark datasets.
Dataset Features Samples Positive Negative Classes
Heart 13 297 137 160 2
Z-Alizadeh Sani 54 303 216 87 2
Cardiovascular 87 2,702 121 2,582 2
Coronary heart disease 43 715 262 453 2
DOI: 10.7717/peerj-cs.2834/table-1

To validate the efficacy of the proposed method, this study compared various widely used supervised feature selection algorithms, including FCBF (Yu & Liu, 2003), Consistency (Dash & Liu, 2003), mRMR (Peng, Long & Ding, 2005), CFS (Hall, 2000), Relief-F (Urbanowicz et al., 2018), CMIM (Fleuret, 2004), JMIM (Bennasar, Hicks & Setchi, 2015), IWFS (Zeng et al., 2015), and MRMI (Wang, Jiang & Jiang, 2021). These algorithms aim to identify the most relevant features, eliminate redundancies, and consider interactions among features to obtain a best feature subset. If the original articles of the datasets provided related feature selection results, these were also included for comparison. Among them, the CFS, FCBF, and mRMR algorithms obtain the best feature subset by finding the most relevant features and removing redundant features; the CMIM, JMIM, and IWFS algorithms focus more on the interaction between features to obtain the best feature subset, while MRMI uses the interaction between features while considering the relevance and redundancy of the features. Specifically, to ensure a fair comparison with the SPFS algorithm, CFS, Consistency, FCBF, and Relief-F were implemented using Weka’s built-in library with default parameters (Eibe Frank & Witten, 2016). Similarly, CMIM, JMIM, and IWFS were implemented through the ITMO-FS library with default parameter settings (Computer Technologies Laboratory, 2024). It should be noted that the feature selection methods chosen for this study represent a set of well-established, conventional techniques that have been widely adopted in high-quality research publications as standard benchmarks for comparing feature selection performance.

Empirical results

In line with the experimental design outlined earlier, we evaluated the performance of the feature selection algorithms by examining the number of selected features and their classification accuracy. This evaluation involved classifying the best feature subsets with the XGBoost, naive Bayes, Linear-SVM, and random forest algorithms, all configured with default parameter settings (Chen & Guestrin, 2016; Cervantes et al., 2020; Antoniadis, Lambert-Lacroix & Poggi, 2021; van de Schoot et al., 2021). To ensure the stability and accuracy of the prediction results, ten-fold cross-validation was applied across all models. Two metrics were used for the quantitative evaluation of the feature selection algorithms: classification accuracy and effectiveness in reducing feature dimensionality. Ideally, an algorithm that significantly surpasses the others in classification accuracy is considered to have the best performance. However, when the difference in classification accuracy among algorithms is not significant, those that more effectively reduce feature dimensionality are deemed superior.
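The evaluation protocol described above can be sketched as follows: ten-fold cross-validation of the four classifiers at their default settings on a selected feature subset (the specific scikit-learn and XGBoost classes, e.g., GaussianNB for naive Bayes, are our assumptions about reasonable defaults):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

def evaluate_subset(X, y, subset_idx, cv=10):
    """Mean ten-fold cross-validated accuracy of each classifier on the selected columns."""
    models = {
        "XGBoost": XGBClassifier(),
        "Naive Bayes": GaussianNB(),
        "Linear SVM": LinearSVC(),
        "Random Forest": RandomForestClassifier(),
    }
    X_sub = X[:, subset_idx]
    return {name: cross_val_score(m, X_sub, y, cv=cv).mean() for name, m in models.items()}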

Among the involved algorithms, mRMR, Relief-F, CMIM, JMIM, and IWFS use a ranking procedure to select features, while the others select the best feature subset based on their specific stopping criteria. Therefore, this study's experimental design compares feature selection algorithms in two directions: first, for algorithms that determine the best feature subset based on their stopping criteria (CFS, Consistency, and FCBF), we compare their performance in reducing feature dimensions and the classification accuracy of their selected feature subsets; second, for rank-based algorithms (mRMR, Relief-F, CMIM, JMIM, and IWFS), we evaluate the incremental gain in classification prediction accuracy for each feature selected in rank order. Because the MRMI and SPFS algorithms include both feature ranking and specific stopping criteria, this study evaluates their performance in both directions.

Performance comparison with CFS, consistency, FCBF, and MRMI

Table 2 showcases the reduced number of features after applying the CFS, Consistency, FCBF, MRMI, and SPFS algorithms, with the columns for MRMI and SPFS specifically indicating the best feature subsets arranged in ranking order.

Table 2:
Number of features selected with different algorithms.
Dataset Full set features CFS Consistency FCBF MRMI SPFS
Heart 13 7 8 5 6 (13,9,12,8,3,10) 3 (13,9,12)
Z-Alizadeh Sani 54 11 11 10 5 (25,53,39,14,15) 3 (25,53,1)
Cardiovascular 87 19 18 9 5 (74,6,26,45,29) 4 (74,6,18,71)
Coronary heart disease 43 11 13 11 5 (41,1,30,28,29) 5 (41,1,22,9,28)
DOI: 10.7717/peerj-cs.2834/table-2

The experimental outcomes from Table 2 reveal that all feature selection algorithms achieved varying degrees of feature dimensionality reduction compared to the original datasets, with SPFS achieving the most significant reduction. This is particularly notable in the Cardiovascular dataset, where SPFS’s reduction effect was the most pronounced. This suggests that SPFS can efficiently identify key features that improve the model’s ability to distinguish between high-risk and low-risk patients, which is crucial in clinical decision-making for cardiovascular disease.

Tables 3 and 4 detail the performance evaluation results of the four classification models for the Heart, Z-Alizadeh Sani, Cardiovascular, and Coronary Heart Disease datasets, respectively. These results are presented as the average classification accuracy ± standard deviation from ten-fold cross-validation. The term “Full set features” in the first column denotes the performance obtained when all features of the original dataset are used for classification prediction, with the best feature selection results highlighted in bold.

Table 3:
Classification accuracy (%) of heart and Z-Alizadeh Sani with selected features.
Algorithm Dataset Full set features CFS Consistency FCBF MRMI SPFS
Xgboost Heart 75.20 ± 4.75 76.60 ± 4.65 74.20 ± 3.52 75.00 ± 4.58 76.20 ± 4.85 79.00 ± 3.71
Z-Alizadeh Sani 81.37 ± 6.15 85.49 ± 4.98 88.04 ± 4.76 85.88 ± 4.37 75.88 ± 2.64 84.51 ± 4.84
Naive Bayes Heart 74.40 ± 4.96 75.00 ± 5.88 74.60 ± 5.14 75.80 ± 5.02 76.00 ± 5.73 75.80 ± 5.02
Z-Alizadeh Sani 41.18 ± 6.44 49.80 ± 6.46 45.88 ± 7.19 49.41 ± 6.84 42.35 ± 21.36 77.84 ± 2.91
Linear SVM Heart 76.60 ± 4.74 76.80 ± 4.40 77.80 ± 4.24 80.40 ± 3.67 79.00 ± 4.02 80.40 ± 3.67
Z-Alizadeh Sani 86.08 ± 5.07 83.92 ± 4.94 84.12 ± 4.15 84.31 ± 4.64 76.47 ± 2.77 87.06 ± 4.74
Random Forest Heart 78.00 ± 4.47 76.00 ± 4.10 76.20 ± 3.94 73.80 ± 5.90 76.40 ± 4.36 79.20 ± 3.82
Z-Alizadeh Sani 81.76 ± 4.48 83.92 ± 5.87 88.04 ± 4.15 85.29 ± 5.42 75.49 ± 3.19 84.51 ± 4.84
Average Heart 76.05 ± 4.73 76.10 ± 4.76 75.70 ± 4.21 76.25 ± 4.79 76.90 ± 4.74 78.60 ± 4.05
Z-Alizadeh Sani 72.60 ± 5.54 75.78 ± 5.56 76.52 ± 5.06 76.23 ± 5.32 67.55 ± 7.49 83.48 ± 4.33
DOI: 10.7717/peerj-cs.2834/table-3

Note:

The best results will be shown in bold.
Table 4:
Classification accuracy (%) of cardiovascular and coronary heart disease with selected features.
Algorithm Dataset Full set features CFS Consistency FCBF MRMI SPFS
Xgboost Cardiovascular 95.68 ± 0.61 91.71 ± 0.96 91.84 ± 1.23 85.83 ± 1.86 86.90 ± 2.63 94.32 ± 0.64
Coronary heart disease 95.75 ± 2.25 92.92 ± 1.46 93.25 ± 1.80 92.92 ± 1.46 81.67 ± 2.01 93.50 ± 2.52
Naive Bayes Cardiovascular 93.75 ± 1.39 95.57 ± 0.47 95.45 ± 0.60 95.19 ± 0.79 96.85 ± 0.46 95.45 ± 0.91
Coronary heart disease 84.58 ± 2.39 83.50 ± 2.13 82.25 ± 2.77 83.50 ± 2.13 81.25 ± 2.18 80.42 ± 2.02
Linear SVM Cardiovascular 92.35 ± 0.41 88.38 ± 0.90 88.31 ± 1.47 79.67 ± 1.96 83.48 ± 3.16 94.28 ± 0.63
Coronary heart disease 84.50 ± 2.56 83.17 ± 1.86 82.42 ± 1.99 83.17 ± 1.86 81.25 ± 2.18 80.42 ± 2.02
Random Forest Cardiovascular 96.34 ± 0.44 90.84 ± 1.40 92.35 ± 0.90 85.43 ± 1.80 87.27 ± 2.71 94.32 ± 0.64
Coronary heart disease 94.25 ± 2.12 92.83 ± 1.40 93.33 ± 2.36 92.92 ± 1.13 81.83 ± 2.13 93.42 ± 2.40
Average Cardiovascular 94.53 ± 0.71 91.62 ± 0.93 91.99 ± 1.05 86.53 ± 1.60 88.63 ± 2.24 94.60 ± 0.71
Coronary heart disease 89.77 ± 2.33 88.10 ± 1.71 87.81 ± 2.23 88.13 ± 1.64 81.50 ± 2.13 86.94 ± 2.24
DOI: 10.7717/peerj-cs.2834/table-4

Note:

The best results will be shown in bold.

Tables 3 and 4 highlight the effectiveness of the SPFS algorithm on the four datasets. On Heart and Z-Alizadeh Sani, SPFS consistently recorded the highest average accuracy and, in most instances, achieved the highest classification accuracy. Under the XGBoost classifier on the Heart dataset, SPFS achieved 79.00% accuracy, surpassing CFS (76.60%), Consistency (74.20%), FCBF (75.00%), and MRMI (76.20%). Moreover, SPFS exhibited a remarkable capability for dimensionality reduction, achieving the highest reduction at 76.92%. This finding demonstrates that SPFS not only enhances accuracy but also helps streamline diagnostic processes, a significant advantage in clinical settings, where fewer but more informative features can lead to faster decision-making. Remarkably, on the Z-Alizadeh Sani dataset, SPFS achieved significantly higher classification accuracy in the naive Bayes evaluation than the other methods, with p-values all less than 0.001. Analysis of the selected feature subsets revealed that CFS, Consistency, FCBF, and MRMI inevitably included features that were less relevant or not recommended by the dataset's original study (e.g., features 14, 29, 28). Additionally, the feature subset from the original Z-Alizadeh Sani study performed worse than SPFS with XGBoost, naive Bayes, and Linear SVM, and comparably with random forest (original-study subset: XGBoost 83.14%, naive Bayes 44.51%, Linear SVM 83.92%, random forest 85.10%). This further supports the notion that SPFS selects the most appropriate features that directly contribute to the accurate classification of patients, which is particularly beneficial in predicting coronary heart disease risk.

On the Cardiovascular dataset, SPFS outperformed the feature subset selected in the original article across all evaluated models, with p-values less than 0.001 for all models except naive Bayes (original-study subset: XGBoost 93.33%, naive Bayes 58.56%, Linear SVM 87.83%, random forest 93.37%). This highlights SPFS's capability to manage large, complex datasets while retaining essential predictive information, making it a useful tool for clinical diagnosis.

Finally, on the Coronary Heart Disease dataset, the SPFS algorithm achieved a classification accuracy of 93.50% with the XGBoost classifier, highlighting its notable advantage in feature selection, with p-values all less than 0.01 compared to the other methods. SPFS performed particularly well with models that handle complex data relationships, such as XGBoost and random forest. Models that assume independence between features (naive Bayes) or are designed for linear relationships (Linear SVM) showed reduced performance, which indicates that SPFS emphasizes feature interactions and nonlinear relationships rather than selecting uninformative features. Unlike the other algorithms, which did not account for interactions among multiple features, SPFS's consideration of feature interactions contributed to its superior performance. Although SPFS did not achieve the highest average accuracy here, its ability to effectively reduce dimensionality and select the most relevant features underscores the critical role of feature selection strategies in enhancing model performance. The original article utilized 24 features, whereas SPFS selected only 5, showcasing SPFS's strength in reducing the feature count while preserving high performance (original-study subset: XGBoost 94.58%, naive Bayes 83.17%, Linear SVM 83.83%, random forest 93.50%). This demonstrates that SPFS can identify the core features crucial for diagnosing coronary heart disease while eliminating unnecessary ones, which is important for improving clinical efficiency. Furthermore, compared to using all features for prediction, SPFS demonstrated superior average accuracy on all datasets except the Coronary Heart Disease dataset.

Performance comparison of mRMR, Relief-F, CMIM, JMIM, IWFS, and MRMI

For algorithms that select features based on a ranking procedure, specifying a required number of features N makes the algorithm output the top N features as the final ranked subset of most important features. For the four datasets covered in the experiment, N was set to 10. For these datasets, each algorithm ranked the features by importance and provided the corresponding ranked subset. Table 5 lists the features selected by each algorithm in order of importance. We can observe that, although the ranking orders vary, certain key features consistently appear at the top across the subsets of different algorithms. These commonly deemed important features form the core of the SPFS algorithm's selection. This indicates that while each feature selection algorithm can capture the most important information among many features, it may also capture incorrect information or noise. Meanwhile, the SPFS algorithm exhibits significant efficiency in feature selection, adeptly identifying the most crucial features within complex information while disregarding irrelevant ones. This underscores the SPFS algorithm's effectiveness in capturing essential information within datasets.

Table 5:
Top-10 ranked features by different algorithms.
Methods Heart Z-Alizadeh Sani Cardiovascular Coronary heart disease
mRMR 13, 12, 3, 11, 9, 2, 8, 10, 1, 4 25, 54, 1, 35, 53, 6, 28, 18, 24, 32 75, 49, 69, 29, 45, 72, 26, 81, 56, 74 9, 41, 22, 28, 16, 43, 29, 1, 23, 33
Relief-F 13, 2, 9, 12, 8, 3, 11, 10, 6, 1 25, 28, 7, 53, 26, 8, 1, 35, 4, 6 10, 56, 57, 40, 5, 4, 6, 8, 3, 2 22, 41, 1, 12, 28, 43, 16, 18, 38, 11
CMIM 13, 12, 3, 9, 11, 2, 10, 8, 7, 6 25, 54, 35, 53, 18, 6, 1, 55, 32, 38 75, 49, 45, 20, 29, 69, 26, 57, 44, 21 9, 41, 22, 28, 17, 16, 1, 33, 27, 18
JMIM 13, 12, 3, 9, 11, 2, 10, 8, 1, 7 25, 54, 28, 7, 53, 55, 35, 1, 29, 6 75, 49, 69, 45, 81, 71, 76, 70, 20, 29 9, 41, 22, 28, 43, 17, 23, 27, 16, 29
IWFS 13, 12, 3, 9, 10, 11, 2, 8, 1, 7 25, 54, 28, 53, 24, 35, 1, 29, 6, 32 74, 81, 32, 72, 75, 69, 80, 38, 22, 77 41, 22, 28, 9, 43, 29, 23, 33, 17, 16
MRMI 13, 9, 12, 8, 3, 10 25, 53, 39, 14, 15 74, 6, 26, 45, 29 41, 1, 30, 28, 29
SPFS 13, 9, 12 25, 53, 1 74, 6, 18, 71 41, 1, 22, 9, 28
DOI: 10.7717/peerj-cs.2834/table-5

The experimental results displayed in Fig. 2 indicate that the accuracy of almost all feature selection algorithms improves with an increase in the number of features. However, after a certain point, the performance improvement of the algorithms plateaus or even declines, highlighting the importance of selecting an optimal number of feature subsets to enhance model performance.

Figure 2: Average classification accuracy vs. number of top-ranked features.
(A)–(D) correspond to the four datasets, with each panel labeled by the name of its dataset. The figure shows that SPFS achieves competitive or superior performance with fewer features, while the accuracy of the other algorithms plateaus or declines after a certain point.

Compared with the rank-based algorithms, the results obtained with SPFS are very close to those of the mRMR, Relief-F, CMIM, JMIM, IWFS, and MRMI methods, and are sometimes even better.

In the Heart and Cardiovascular datasets, SPFS was almost always the best-performing algorithm, while the other algorithms exhibited significant performance drops or sharp declines after their initial improvements. The sudden decline and subsequent improvement in the performance of the SPFS algorithm on the Heart dataset may be because the second feature alone does not contribute much information; however, it has a strong mutual correlation with the third feature, leading to a significant increase in accuracy once the third feature is included. This can be attributed to the effectiveness of the k_stale parameter, which prevents premature stopping and allows the algorithm to continue selecting features that provide additional useful information when combined with previous ones. As shown by the “SPFS without k_stale” curve in the line graph, selection on the Heart dataset would stop at the second feature without k_stale. In the Z-Alizadeh Sani dataset, SPFS's steady improvement was second only to the mRMR algorithm, while the other algorithms required more features to achieve comparable or higher accuracy levels, contrary to our objective of reducing feature dimensionality. In the Coronary Heart Disease dataset, the SPFS algorithm's performance increased steadily as features were added; although its final result was not the best, it was similar to the performance of other superior algorithms, which showed little improvement or even deteriorated with additional features. In the Cardiovascular dataset, the k_stale parameter effectively stopped the feature selection before accuracy declined sharply, ensuring that the selected feature subset remained highly effective.

In summary, the SPFS algorithm can select the most important and accurate features with a lower required number of features, achieving rapid improvements in accuracy while balancing high accuracy and low feature dimensionality. Comparisons with the other six feature selection methods show that SPFS can achieve better or similar performance with fewer features, proving its superiority.

Discussion and Conclusion

In our study, we introduced SPFS, a pioneering approach to feature selection that intricately assesses the relevance, redundancy, and interactivity among features. By computing the path cumulative cost, SPFS adeptly minimizes redundancy while pinpointing the most interactively significant features, demonstrating exceptional efficacy across diverse datasets. One notable finding emerged when applying SPFS to large-scale, highly imbalanced cardiac datasets, where the algorithm not only maintained classification accuracy but also enhanced it by selecting a smaller subset of features. This success is likely due to SPFS’s unique ability to account for the complex interactions between features, ensuring that even in the presence of extreme class imbalances, the selected features retain predictive power. This finding highlights a gap in current medical data analytics, where methods for handling imbalanced datasets are often insufficient or underdeveloped. Despite the significant prevalence of class imbalances in medical datasets, many existing techniques are not well-suited to address this challenge. SPFS, however, demonstrates its potential by successfully handling such imbalances.

However, SPFS is not without its limitations, notably its dependence on precise parameterization. Its performance is markedly influenced by the critical parameter, k_stale, underscoring the importance of accurate parameter configuration for achieving the best feature subset selection. The parameter tuning process, often reliant on exhaustive search techniques, demands an in-depth understanding of the dataset, which can be daunting for users with limited domain expertise. Additionally, the algorithm’s sensitivity to parameter changes might hinder its broad applicability.

Looking ahead, we aim to expand the use of SPFS to a wider range of medical health datasets, especially those with extreme class imbalances. Furthermore, we plan to improve SPFS’s handling of such imbalances by explicitly incorporating information about the differences between positive and negative class distributions in the training data. This could involve developing strategies that adaptively select features based on these class disparities, further improving its performance in large-scale, imbalanced datasets. Additionally, we will integrate SPFS with cutting-edge machine learning techniques to enhance its potential for high-dimensional and complex data structure recognition and feature extraction. Future developments will focus on improving algorithm adaptability by automating parameter adjustments, potentially using reinforcement learning or meta-learning approaches. Structural improvements will also be made to more effectively process large-scale, complex datasets, allowing SPFS to better handle the diverse challenges posed by real-world medical data.

Supplemental Information

Codebook of Categorical Data.

DOI: 10.7717/peerj-cs.2834/supp-1