Feature selection based on neighborhood rough sets and Gini index

Yuchao Zhang; Bin Nie; Jianqiang Du; Jiandong Chen; Yuwen Du; Haike Jin; Xuepeng Zheng; Xingxin Chen; Zhen Miao

doi:10.7717/peerj-cs.1711

Feature selection based on neighborhood rough sets and Gini index

Yuchao Zhang, Bin Nie , Jianqiang Du, Jiandong Chen, Yuwen Du, Haike Jin, Xuepeng Zheng, Xingxin Chen, Zhen Miao

School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China

DOI: 10.7717/peerj-cs.1711

Published: 2023-12-12
Accepted: 2023-10-30
Received: 2023-09-05

Academic Editor: Khursheed Aurangzeb

Subject Areas: Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Mining and Machine Learning
Keywords: Neighborhood rough set, Gini index, Feature selection

Copyright: © 2023 Zhang et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Zhang Y, Nie B, Du J, Chen J, Du Y, Jin H, Zheng X, Chen X, Miao Z. 2023. Feature selection based on neighborhood rough sets and Gini index. PeerJ Computer Science 9:e1711 https://doi.org/10.7717/peerj-cs.1711

The authors have chosen to make the review history of this article public.

Abstract

Neighborhood rough set is considered an essential approach for dealing with incomplete data and inexact knowledge representation, and it has been widely applied in feature selection. The Gini index is an indicator used to evaluate the impurity of a dataset and is also commonly employed to measure the importance of features in feature selection. This article proposes a novel feature selection methodology based on these two concepts. In this methodology, we present the neighborhood Gini index and the neighborhood class Gini index and then extensively discuss their properties and relationships with attributes. Subsequently, two forward greedy feature selection algorithms are developed using these two metrics as a foundation. Finally, to comprehensively evaluate the performance of the algorithm proposed in this article, comparative experiments were conducted on 16 UCI datasets from various domains, including industry, food, medicine, and pharmacology, against four classical neighborhood rough set-based feature selection algorithms. The experimental results indicate that the proposed algorithm improves the average classification accuracy on the 16 datasets by over 6%, with improvements exceeding 10% in five. Furthermore, statistical tests reveal no significant differences between the proposed algorithm and the four classical neighborhood rough set-based feature selection algorithms. However, the proposed algorithm demonstrates high stability, eliminating most redundant or irrelevant features effectively while enhancing classification accuracy. In summary, the algorithm proposed in this article outperforms classical neighborhood rough set-based feature selection algorithms.

Introduction

In data mining and machine learning, the goal of feature selection is to choose the most representative and valuable features from the original dataset to improve the performance and interpretability of models. In classification problems, a crucial feature selection step is establishing an effective feature evaluation function for candidate feature subsets. Common feature evaluation functions currently include consistency (Shin & Miyazaki, 2016), correlation (Gao et al., 2018; Malhotra & Jain, 2022), information gain (Zhang et al., 2020; Prasetiyowati, Maulidevi & Surendro, 2022; Shu et al., 2023), mutual information (Gao, Hu & Zhang, 2020; Lall et al., 2021), classifier error rate (Got, Moussaoui & Zouache, 2021; Moslehi & Haeri, 2020; Solorio-Fernández, Carrasco-Ochoa & Martínez-Trinidad, 2016; Li et al., 2021), distance (Lee & Oh, 2016), Gini index (Park & Kwon, 2011; Manek et al., 2017; Liu, Zhou & Liu, 2019), etc.

Rough set theory, proposed by Pawlak (1982) and continuously improved by subsequent researchers, is a mathematical tool for dealing with uncertain and inexact information. Recently, this theory has been widely applied to feature selection in data mining and machine learning (Sang et al., 2022; Huang, Li & Qian, 2022; Zhang et al., 2023; Wan et al., 2023; Yang et al., 2023). The rough set theory divides the dataset into equivalence classes to reveal the dependency relationships among attributes and the process of generating decision rules. For classification problems, rough set theory uses features to induce binary relations and divides samples into different information granules based on these binary relations. These information granules are then used to approximate decision variables and represent upper and lower approximations of decisions. Based on this, a feature evaluation function called the dependency function is defined. Different types of binary relations lead to different granulation mechanisms, resulting in various rough set models, such as classical rough set (Pawlak, 1982), similarity relation-based rough set (Dai, Gao & Zheng, 2018), dominance relation-based rough set (Greco, Matarazzo & Slowinski, 1999; Shao & Zhang, 2004), fuzzy rough set (Pawlak, 1985), and other rough set models (Sang et al., 2018).

The neighborhood rough set (NRS) (Hu et al., 2008) is one of the most crucial rough set models proposed to address the challenges of handling continuous features in classical rough set theory. Since its application to feature selection (Hu et al., 2008), NRS has gained widespread attention in data mining and machine learning (Liu et al., 2016; Zeng, She & Niu, 2014). Many scholars have proposed different feature evaluation functions based on this model and developed corresponding feature selection algorithms. Hu et al. (2011) proposed neighborhood information entropy to address the fact that Shannon entropy cannot directly evaluate uncertainty on continuous features. Wang et al. (2018) explored some neighborhood distinguishability measures to assess data uncertainty. They proposed the K-nearest neighbor NRS by combining the advantages of neighborhood and K-nearest neighbors while focusing on data distribution (Wang et al., 2019b). Wang et al. (2019a) proposed neighborhood self-information to utilize deterministic and uncertain information better. Sun et al. (2019a, 2019b) introduced the Lebesgue measure into NRS, enabling feature selection on infinite sets (Sun et al., 2019a) and incomplete sets (Sun et al., 2019b). Li, Xiao & Wang (2018) extended discernibility matrices to NRS and applied them to a power system stability assessment.

Many of the evaluation methods above extend the evaluation metrics for discrete features to continuous random variables through neighborhood relations. For example, neighborhood information entropy extends Shannon entropy to continuous random variables, neighborhood discernibility matrix extends rough set discernibility matrix, and neighborhood self-information extends self-information to continuous random variables. The Gini index (GI) (Breiman et al., 1984), first introduced by Breiman in 1984 and applied to node splitting in decision trees, accurately quantifies the impurity of a dataset. Especially in classification problems, it effectively measures the contribution of features to classification results and has been widely used in feature selection in data mining and machine learning (Breiman, 2001; Wang, Deng & Xu, 2023).

This article combines NRS with GI from two perspectives and proposes two unique feature importance evaluation metrics. First, from the standpoint of sample neighborhoods, the Neighborhood Gini index is proposed to measure the importance of features through neighborhood information. Second, from the standpoint of class neighborhoods, the Neighborhood Class Gini index is proposed to reveal the differences in features among different classes. The properties of these two evaluation metrics and their relationships with attributes are discussed. Based on these evaluation metrics, two forward greedy algorithms are designed for feature selection. Finally, the effectiveness and stability of the proposed algorithms are validated through experiments.

The structure of this article is as follows: In the “Materials and Methods” section, a review of the fundamental concepts of NRS and GI is provided, and the combination of NRS and GI is used to propose two distinct feature importance evaluation indicators. The properties of these two evaluation indicators and their relationships with attributes are discussed. Subsequently, the importance of candidate features is defined based on the two evaluation indicators. Building upon this, two separate forward greedy feature selection algorithms are formulated. In the “Experimental Analysis and Discussion” section, the effectiveness and stability of the proposed algorithms are verified. In the “Conclusions” section, we concluded the article with possible directions for future research.

Materials and Methods

Neighborhood rough set

In rough sets, information tables are often represented by $< U, A >$ , where $U = {x_{1}, x_{2}, . . ., x_{n}}$ is a non-empty finite set of samples and $A = {a_{1}, a_{2}, . . ., a_{m}}$ is a non-empty finite set of attributes used to describe those samples.

Let $< U, A >$ be a table of information, $B \subseteq A$ and $d_{B}$ is a binary functional relation defined on $U$ with an attribute subset B, that is, $d_{B} : U \times U \to R^{+}$ . Then, $d_{B}$ is said to be a distance metric on $U$ when it satisfies the following relation:

(1) $d_{B} (x_{1}, x_{2}) \geq 0$ , $d_{B} = 0$ if and only if $x_{1} = x_{2}$ , $\forall x_{1}, x_{2} \in U$ ;

(2) $d_{B} (x_{1}, x_{2}) = d_{B} (x_{2}, x_{1})$ , $\forall x_{1}, x_{2} \in U$ ;

(3) $d_{B} (x_{1}, x_{3}) \leq d_{B} (x_{1}, x_{2}) + d_{B} (x_{2}, x_{3})$ , $\forall x_{1}, x_{2}, x_{3} \in U$ .

The Euclidean distance is a commonly used distance measure, and all the subsequent distance references in this article are in terms of the Euclidean distance. For any two samples, the calculation of the Euclidean distance is as follows:

(1) $d_{B} (x_{i}, x_{j}) = \sum_{a \in B} \sqrt{{(x_{i}^{a} - x_{j}^{a})}^{2}}$

In an information table $< U, A >$ , for any sample $x \in U$ , attribute subset $B \subseteq A$ , the neighborhood similarity relation $R_{B}^{σ}$ is defined as follows:

(2) $R_{B}^{σ} = {(x, y) \in U \times U | d_{B} (x, y) \leq σ}$ where $σ \geq 0$ is a user-defined constant. For any $x \in U$ , its neighborhood similarity class $[x]_{B}^{σ}$ is defined as follows:

(3) $[x]_{B}^{σ} = {y \in U : (x, y) \in R_{B}^{σ}}$

In an information system, neighborhood similarity classes are also referred to as neighborhood information granules, abbreviated as neighborhood granules. Here, $σ$ is called the radius of the neighborhood granules. In the information table $< U, A >$ , the neighborhood granule family ${[x_{i}]_{B}^{σ} | i = 1, 2, . . ., n}$ forms a covering of $U$ . The neighborhoods of all objects in the domain constitute the granulation of the domain, and the neighborhood granule family constitutes the fundamental concept system in the domain space. Through these fundamental concepts, we can approximate any concept in the space.

For any sample set $X \subset U$ , its lower approximation ${\underline{R}}_{B}^{σ}$ and upper approximation ${\bar{R}}_{B}^{σ}$ are defined as follows:

(4) ${\underline{R}}_{B}^{σ} = {x \in U : [x]_{B}^{σ} \subseteq X}$

(5) ${\bar{R}}_{B}^{σ} = {x \in U : [x]_{B}^{σ} \cap X \neq Ø}$

Let $D$ be a classification decision attribute defined on $U$ , and $A \cap D = Ø$ . In this case, the triple $< U, A, D >$ is referred to as a decision table. In the decision table $< U, A, D >$ , attribute D divides U into $r$ decision classes, denoted as $U / D = {E_{1}, E_{2}, . . ., E_{r}}$ . Here, $E_{i} (i = 1, 2, . . ., r)$ is called a general equivalence class, meaning that all the samples in $E_{i}$ have the same class labels. In a decision table $< U, A, D >$ , where $B \subseteq A$ , and $R_{B}^{σ}$ is a neighborhood similarity relation defined on the attribute set $B$ in $U$ with a neighborhood radius of $σ$ , the upper approximation ${\bar{R}}_{B}^{σ} (D)$ and lower approximation ${\underline{R}}_{B}^{σ} (D)$ of the decision attribute $D$ with respect to the attribute set $B$ at a neighborhood granule size of $σ$ are defined as follows:

(6) ${\bar{R}}_{B}^{σ} (D) = \cup_{k = 1}^{r} {\bar{R}}_{B}^{σ} (E_{k})$

(7) ${\underline{R}}_{B}^{σ} (D) = \cup_{k = 1}^{r} {\underline{R}}_{B}^{σ} (E_{k})$

The positive domain of the decision table is written as:

(8) $P O S_{B}^{σ} (D) = \cup_{E_{k} \in U / D} {\underline{R}}_{B}^{σ} (E_{k})$

The boundary domain of the decision table is written as:

(9) $R n_{B}^{σ} (D) = U_{B} (D) - P O S_{B}^{σ} (D)$

The dependency function $γ_{B}^{σ} (D)$ of D associated with B is formulated as:

(10) $γ_{B}^{σ} (D) = \frac{| P O S_{B}^{σ} (D) |}{| U |}$ where $| . |$ indicates the cardinality of a set.

Gini index

GI is a metric used to measure the impurity of a dataset and is commonly employed in feature selection for decision tree algorithms. The values of GI range from 0 to 1. When GI = 0, the dataset’s impurity is minimal, meaning all elements in the dataset are the same. Conversely, when GI = 1, the dataset’s impurity is maximal, indicating that all elements in the dataset are different. For a dataset $D$ with $r$ categories, where each category’s proportion of samples is denoted as $p_{i}$ , the formula for calculating GI is as follows:

(11) $G I (D) = 1 - \sum_{i = 1}^{r} p_{i}^{2}$

GI evaluates the impurity of a dataset based on the distribution of class probabilities to determine the importance of the corresponding features. A smaller GI indicates higher dataset purity and better discriminative power of the feature. However, in the same dataset, different feature subsets result in the same class probability distribution, making it unsuitable for directly evaluating the classification performance of different feature subsets. Therefore, a new influencing factor must be introduced to make the class probability distributions vary across feature subsets. For instance, in the Classification and Regression Trees, a tree-like structure is introduced to partition the dataset. This partitioning leads to substantial differences between data subsets created by different feature divisions, resulting in distinct class probability distributions. This makes the Gini index enable the measurement of feature importance. In NRS, when the neighborhood radius is consistent, different attribute sets lead to distinct neighborhoods for samples. Conversely, when the attribute set is constant, various neighborhood radii correspond to different neighborhoods. Different neighborhoods could lead to varying class probability distributions. GI values would also differ. This makes GI applicable for feature selection in NRS.

Next, two different feature importance evaluation metrics integrating NRS and GI will be proposed.

The proposed method

Neighborhood Gini index

Samples with certain similarities should be grouped into the same category, and samples within the neighborhood range of a sample are considered similar from a distance perspective. Their features determine the similarity of samples. However, for reasons such as data collection, some features may be redundant or irrelevant to class labels. Therefore, the class labels of samples within a neighborhood range may not be consistent under a subset of features. It is necessary to select features that can effectively represent the characteristics of all categories so that the class labels within the neighborhood range of samples are as consistent as possible.

The more consistent the class labels of samples within the neighborhood range, or the higher the purity of classes within the neighborhood range, the better the corresponding subset of features can represent that class. At this point, the subset of features can represent the local characteristics of that class well. If a subset of features can represent all the local characteristics of all classes well, i.e., the class purity within the neighborhood range of all samples in the dataset is high, then the subset of features can distinguish all classes well. In this case, the importance of features is also higher.

Based on this idea, we use GI to represent the impurity of the dataset and propose the Neighborhood Gini index (NGI). NGI evaluates the importance of a subset of features by assessing the purity of all samples’ neighborhood ranges under that feature subset. The definition of NGI is given below:

Definition 1: Given a decision table $< U, A, D >$ , for any $x_{i} \in U$ , $B \subseteq A$ , the impurity of $σ$ neighborhood $R_{B}^{σ} (x_{i})$ is defined as:

(12) $G I_{B}^{σ} (x_{i}) = 1 - \sum_{j = 1}^{r} p_{j}^{2}$ where $r$ represents the number of categories, and $p_{j}$ signifies the proportion of the jth category within the $σ$ neighborhood of $x_{i}$ .

Throughout the decision table, the impurity of the decision table is the mean value of the impurity within the neighborhood of each sample:

(13) $N G I_{B}^{σ} (D) = \frac{1}{n} \sum_{i = 1}^{n} G I_{B}^{σ} (x_{i})$

From Definition 1, it can be observed that NGI is influenced by two parameters: the feature subset $B$ and the neighborhood radius $σ$ . As GI focuses on the distribution of classes, changes in the number of samples within the neighborhood range can lead to class distribution changes, thereby affecting GI’s magnitude. However, the behavior of GI to changes in the feature subset and neighborhood radius is not strictly monotonic. The following will analyze the variations of NGI to changes in the feature subset and neighborhood radius separately.

Impact of feature on NGI

For any subset of features $B_{1} \subset B_{2} \subseteq A$ , adding one or more features to $B_{1}$ to obtain $B_{2}$ does not necessarily guarantee that $N G I_{B_{2}} (D)$ will always be smaller than $N G I_{B_{1}} (D)$ , and the process of its change is not completely monotonous, as shown in Fig. 1.

Figure 1 shows the relationship between different sizes of feature subsets and NGI under the same neighborhood. The x-axis is the number of features, and the y-axis is the NGI of the corresponding feature. The smaller feature subset is a proper subset of the larger feature subset. In Fig. 1A, the feature subset consists of 18 continuous features, namely [17, 59, 1, 18, 51, 30, 21, 10, 52, 20, 50, 39, 53, 55, 29, 54, 48, 25], and the data is the “Sonar” dataset from the UCI Machine Learning Repository (Kelly, Longjohn & Nottingham, 1998). The neighborhood radius in Fig. 1A is set to 0.15. In Fig. 1B, the feature subset comprises 18 discrete features, namely [4, 7, 15, 10, 9, 13, 14, 6, 17, 16, 0, 12, 3, 2, 5, 8, 11, 1], and the data is the “anneal” dataset from the UCI Machine Learning Repository. The neighborhood radius in Fig. 1B is set to 0.4.

When the features are continuous, the variation in the samples within the neighborhood range is small, leading to minor changes in the class distribution. As a result, the variation curve is relatively smooth, as shown in Fig. 1A. However, when the features are discrete, introducing new features might drastically reduce the number of samples within the neighborhood range, causing larger changes in the class distribution. This results in a fluctuating variation curve, as depicted in Fig. 1B.

From Fig. 1, it is evident that the variation of NGI is not strictly monotonic at a local level, yet it generally exhibits a descending trend as a whole. This phenomenon can be attributed to the overarching effect that, with an increase in the number of features, the data provides a more precise portrayal of the samples, making their inherent characteristics more prominent. When the features a sample emphasizes are more aligned with its class attributes, the purity of the sample’s neighborhood increases, leading to a smaller NGI. Conversely, when the emphasized features deviate from the class attributes, the neighborhood’s purity decreases, resulting in a larger NGI. As the number of features expands, characteristics relevant to the class gradually come into sharper focus, consequently contributing to the observed overall decreasing trend.

It is worth noting that not all continuous feature subsets follow smooth and monotonic variation curves, and not all discrete features yield fluctuating curves. Continuous features might also exhibit fluctuations, while discrete features can exhibit smooth and monotonic behaviors. However, regardless of whether the features are continuous or discrete, the overall tendency is characterized by a decrease.

The subsequent explanation illustrates the variation of NGI through changes in the class distribution within the feature space neighborhood of sample $x_{i}$ :

We utilize the change in GI within the neighborhood feature subspace of sample $x_{i}$ to symbolize the overall changes in NGI across the entire dataset. The distribution of samples within this localized neighborhood feature subspace is depicted in Fig. 2. Among them, the hollow circle class accounts for $\frac{1}{3}$ , and the solid circle class accounts for $\frac{2}{3}$ . At this time, $N G I_{B}^{σ} (x_{i}) = 1 - {\frac{1}{3}}^{2} - {\frac{2}{3}}^{2} = \frac{4}{9}$ . Let $a_{i} \in A - B$ , $N G I_{B \cup {a_{i}}}^{σ}$ change relative to $N G I_{B}^{σ}$ as follows:

Figure 2: Sample distribution in original neighborhood feature subspace.

Download full-size image

DOI: 10.7717/peerj-cs.1711/fig-2

1. NGI increases when the number of samples in the neighborhood feature subspace with a large proportion of categories decreases proportionally more than the number of samples with a small proportion of categories. As shown in Fig. 3A, the number of samples in the hollow circle category decreases by 1, at which point the percentage is $\frac{2}{4}$ , and the number of samples in the solid circle category decreases by 4, at which point the percentage is $\frac{2}{4}$ , $N G I_{B}^{σ} = \frac{4}{9} < N G I_{B \cup {a_{1}}}^{σ} = 1 - {\frac{2}{4}}^{2} - \frac{2}{4} = \frac{1}{2}$ ;

Figure 3: (A–C) Sample distribution in the neighborhood feature subspace after adding one feature.

Download full-size image

DOI: 10.7717/peerj-cs.1711/fig-3

2. NGI remains unchanged when there is no change in the samples in the neighborhood feature subspace or when the samples in the neighborhood, according to the proportion of categories in equal, are reduced. As shown in Fig. 3B, at this time the proportion of the hollow circle category is still $\frac{1}{3}$ , and the proportion of the solid circle category is $\frac{2}{3}$ , $N G I_{B \cup {a_{2}}}^{σ} = N G I_{B}^{σ} = \frac{4}{9}$ ;

3. NGI decreases when the number of samples in the neighborhood feature subspace with a large proportion of categories decreases proportionally less than the number of samples with a small proportion of categories. As shown in Fig. 3C, the hollow circle category samples are reduced by 2, at this time the proportion of $\frac{1}{5}$ , the solid circle category samples are reduced by 2, at this time the proportion of $\frac{4}{5}$ , $N G I_{B}^{σ} = \frac{4}{9} > N G I_{B \cup {a_{3}}}^{σ} = 1 - {\frac{1}{5}}^{2} - {\frac{4}{5}}^{2} = \frac{8}{25}$ .

Impact of neighborhood radius on NGI

In addition to the influence of feature subsets on NGI, the size of the neighborhood radius also affects the changes in the distribution of classes within the neighborhood feature subspace, consequently impacting the magnitude of NGI. So, we have delved into the impact of varying neighborhood radius sizes on NGI. We set the neighborhood radius to range from 0 to 1 with a step size of 0.025, and the relationship between the neighborhood radius and NGI is depicted in Fig. 4.

Figure 4 illustrates the relationship between different sizes of the neighborhood radius and NGI for the same feature subset. The x-axis is the size of the neighborhood radius, and the y-axis is the NGI of the corresponding neighborhood radius. In Fig. 4A, the feature subset consists of 10 continuous features, namely [46, 8, 3, 2, 44, 53, 59, 24, 25, 42], sourced from the “Sonar” dataset in the UCI Machine Learning Repository. In Fig. 4B, the feature subset comprises five discrete features, namely [7, 14, 10, 1, 17], sourced from the “anneal” dataset in the UCI Machine Learning Repository.

As the value of $σ$ gradually increases, the number of samples within the neighborhood range also increases, leading to a rise in impurity. When $σ$ is relatively small, the change in the number of samples within the neighborhood is small, and the newly added samples are mostly from the same category. Consequently, the change curve remains relatively stable. When $σ$ exceeds a certain threshold (as depicted in Fig. 4A, e.g., 0.195), the category labels of the newly added samples start to deviate from those of the original samples. This leads to a change in NGI, eventually converging to the GI of the entire dataset. While an overall trend increases as the neighborhood radius gradually enlarges, this change is not necessarily monotonic. The reasons behind the variation of NGI with $σ$ are analogous to the reasons for its variation with the size of the feature subset. These reasons will not be reiterated here.

Neighborhood Class Gini index

In the context of neighborhood rough sets based on decision tables, the upper approximation of a category refers to the set of samples within the neighborhood range of that category. This sample set includes all samples from the current category and some from others. It is obvious that the fewer categories in the upper approximation and the fewer samples from other categories, the higher the purity of the upper approximation of the category. A higher purity indicates that the corresponding feature provides a more accurate description of that category, making it easier to distinguish it from others. If the upper approximations of all categories have higher purity, all types within the dataset can be better distinguished, and the corresponding features are more important. Based on this principle, this article proposes the Neighborhood Class Gini index (NCGI). It evaluates features’ importance by assessing the upper approximation’s impurity under different feature subsets. The definition of NCGI is provided below:

Definition 2: Given a decision table $< U, A, D >$ , let $B \subseteq A$ , $E_{k} \in U / D (k = 1, 2, . . ., r)$ , and ${\bar{R}}_{B}^{σ} (E_{k})$ is the upper approximations of $E_{K}$ , so the impurity of ${\bar{R}}_{B}^{σ} (E_{k})$ is defined as:

(14) $G I_{B}^{σ} ({\bar{R}}_{B}^{σ} (E_{k})) = 1 - \sum_{i = 1}^{r} {p_{i}}^{2}$

In the entire decision table, the impurity of the decision table is the average of the impurities of all category upper approximations:

(15) $N C G I_{B}^{σ} (D) = \frac{1}{r} \sum_{k = 1}^{r} G I_{B}^{σ} ({\bar{R}}_{B}^{σ} (E_{k}))$

Similar to NGI, the magnitude of NCGI is also influenced by the neighborhood radius $σ$ and the feature subset $B$ . The following comparison illustrates the changes in the two evaluation metrics concerning the number of features and the neighborhood radius. The data in Fig. 5 corresponds to the data in Fig. 1, while the data in Fig. 6 corresponds to that in Fig. 4.

Figure 5: Impact of neighborhood radius on two evaluation metrics.
(A) Continuous feature. (B) Categorical feature.

Download full-size image

DOI: 10.7717/peerj-cs.1711/fig-5

Figure 6: Impact of feature on two evaluation metrics.
(A) Continuous feature. (B) Categorical feature.

Download full-size image

DOI: 10.7717/peerj-cs.1711/fig-6

In the case of continuous features, the trend of NCGI with changing $σ$ is closely similar to that of NGI, displaying relatively smooth changes. NCGI exhibits a lower overall sensitivity to variations in the number of features and the neighborhood radius yet displays higher sensitivity within certain intervals, such as when $σ$ ranges from 0.2 to 0.375 in Fig. 5A. This phenomenon stems from NGI being rooted in the sample neighborhood, with the class distribution altering as the neighborhood radius expands. Conversely, NCGI assesses feature importance from a class neighborhood perspective. As the neighborhood radius expands, the number of samples within the neighborhood increases. However, when $σ$ is small, the newly added samples within the neighborhood share the same category as the current sample. Consequently, the category distribution in the upper approximation remains unchanged. When $σ$ is big enough, the upper approximation of the class encompasses all samples within the dataset, resulting in NCGI equating to the overall GI and ceasing to change with variations in the neighborhood radius.

In the case of discrete data, as the neighborhood radius varies, a sudden influx of samples within the neighborhood range can significantly alter the class distribution, causing larger fluctuations in the change curve, particularly noticeable in Fig. 5B. However, overall, NCGI experiences smaller changes in amplitude compared to NGI.

The number of samples within the neighborhood range gradually decreases with increased features. In the context of continuous features, the reduction in sample count is relatively smooth, as shown in Fig. 6A. Consequently, the class distribution alteration of neighborhoods is similarly gradual. With the increase of purity within the neighborhood, GI decreases until it converges to 0.

For categorical features, the introduction of new features can exert a substantial influence on the class distribution within the neighborhood, leading to larger fluctuations, as shown in Fig. 6B. This is particularly evident upon the inclusion of the 11th feature, where both NGI and NCGI exhibit a sharp decline. This decline implies that adding this feature enhances the purity within the neighborhood, facilitating the differentiation of various categories. Beyond the 11th feature, sample neighborhoods and class neighborhoods’ results diverge. These features can decrease the impurity within the sample neighborhood but paradoxically lead to an increase in the impurity within the class neighborhood.

Feature selection

Definition 3: Given a decision table $< U, A, D >$ , $B \subseteq A$ , $a_{i} \in A - B$ , the importance of $a_{i}$ with respect to $B$ is calculated as follow:

(16) $S I G (a_{i}, B, D) = G I_{B}^{σ} (D) - G I_{B \cup {a_{i}}}^{σ} (D)$ where $G I_{B}^{σ}$ stands for NGI and NCGI proposed in this article.

In Definition 3, we defined the importance of feature $a_{i}$ relative to a given feature subset B. In the case where feature subset B is known, it is adding a feature $a_{i}$ to B and observing its GI (which refers to either NGI or NCGI). If GI decreases, it indicates that $a_{i}$ is a crucial feature relative to B. Conversely, if the GI remains unchanged or increases, it suggests that $a_{i}$ is a redundant feature relative to B or even an irrelevant feature with respect to the decision table $< U, A, D >$ .

To achieve better classification performance, we aim to select each feature $a_{i}$ in such a way that it is the most crucial feature relative to B. Therefore, we have designed heuristic algorithm based on Neighborhood Gini index (HANGI) and heuristic algorithm based on Neighborhood Gini index (HANCGI) feature selection algorithms using a forward greedy approach to select the optimal feature subset. The two algorithms differ only in calculating $S I G (a_{i}, B, D)$ , and their processes are illustrated in Fig. 7.

Figure 7: Flowchart of HANGI and HANCGI.

Download full-size image

DOI: 10.7717/peerj-cs.1711/fig-7

In HANGI and HANCGI, the algorithm starts by taking as input a decision table $< U, A, D >$ , a neighborhood radius $σ$ , and a minimum threshold $β$ for the relative importance of candidate features with respect to the reduced subset. Subsequently, the reduced subset and the candidate feature subset are initialized. An evaluation is made to determine if the candidate feature subset is empty. The current reduced subset is directly output if the candidate feature subset is empty. Conversely, if the candidate feature subset is not empty, all candidate features are iterated through. Each candidate feature’s importance concerning the reduced subset is computed using NGI or NCGI, denoted as $S I G (a_{i}, r e d, D)$ . The feature with the highest importance is selected, and its importance is marked as $S I G_{m a x}$ . Following this, an assessment is carried out to determine whether $S I G_{m a x} > β$ . If true, the feature with the highest importance is removed from the candidate feature subset and incorporated into the reduced subset. The candidate feature subset is then revisited. Output the reduced subset until the candidate feature subset is empty or $S I G_{m a x} \leq β$ .

Assuming a dataset contains n samples, m features, and r categories, the best feature in each iteration is the one with the longest search time, with a worst-case search time of $(m^{2} + m) / 2$ . Calculate the time $n (n - 1) / 2$ required to determine the neighborhood relationship between samples in the dataset. The time to compute the Gini index within the neighborhood range is also nr. Therefore, the time complexity of the NGI and NCGI forward greedy feature selection algorithms is both $O (m^{2} n^{2})$ .

In HANGI and HANCGI, two parameters, $σ$ and $β$ , are present. Parameter $σ$ controls the neighborhood radius, which determines the granularity of the neighborhood particles. The parameter $β$ is a threshold that stops the algorithm when the reduction of the GI is less than a particular value. Theoretically, the optimal values for these two parameters should be searched from the entire range of the dataset's space. Fortunately, as discussed in Hu et al. (2008, 2011), for algorithms with two parameters, such as the neighborhood rough set model, it is possible to approximate the optimal performance of the algorithm if one parameter is fixed at a particular value and the optimal value of the other parameter is searched across the entire space. Since the meaning of the same-sized evaluation metric in different algorithms is not the same, in this case, all $β$ values in all algorithms are set to 0. This means that adding a new feature will not lead to any improvement. Based on this, in the experimental analysis section, the value of parameter $β$ is set as a constant 0, and the optimal value for the neighborhood radius $σ$ is searched within the interval [0, 1], with a step size of 0.025.

Experimental analysis and discussion

In this section, we conduct experiments to validate the effectiveness and stability of the proposed methods. We select four classic feature importance evaluation metrics based on NRS to form corresponding forward greedy feature selection algorithms: Neighborhood Rough Set Dependency (HANRS) (Hu et al., 2008), Neighborhood Entropy (HANRE) (Hu et al., 2011), Neighborhood Discrimination Index (HANDI) (Wang et al., 2018), and Neighborhood Self-Information (HANSI) (Wang et al., 2019b). We compare these algorithms with the two proposed methods. The stopping parameter $β = 0$ is employed as the termination condition for these algorithms.

All the datasets are sourced from the UCI Machine Learning Repository, and their specific descriptions are provided in Table 1. Where “Continuous” and “Categorical” represent the number of continuous and categorical features in each dataset. Before feature selection, all attributes are normalized to the interval [0, 1], and missing values are filled using the mean.

Table 1:

Description of datasets.

Datasets	Samples	Features	Continuous	Categorical	Classes
Anneal	798	19	1	18	5
Arrhythmia	452	263	32	231	13
Autos	205	27	5	22	6
Breast-cancer	286	10	1	9	2
DARWIN	174	452	429	23	2
Dermatology	366	35	1	34	6
HillValley	606	101	101	0	2
Horse_colic	300	28	2	26	2
Ionosphere	351	34	33	1	2
Musk1	476	169	85	84	2
Parkinsons	195	24	23	1	2
Sonar	208	61	61	0	2
Spambase	4,601	59	3	56	2
Toxicity	171	1,204	857	347	2
Voting_records	434	17	1	16	2
Wine	178	14	12	2	3

DOI: 10.7717/peerj-cs.1711/table-1

Note:

Continuous and categorical respectively represent the number of continuous and categorical features in each dataset.

We compare the selected feature count and the corresponding classification accuracy to evaluate the algorithms’ performance comprehensively. We employ four classical classifiers, support vector classifier (SVC), K-nearest neighbors (KNN), Extreme Gradient Boosting (XGBoost), and artificial neural network (ANN), to assess the performance of these feature selection algorithms. Since our primary focus is evaluating the feature selection algorithms, default parameter settings are used for SVC and ANN from the scikit-learn library. XGBoost also uses default parameters. For the KNN classifier, K is set to 3.

Ten-fold cross-validation is employed to perform feature selection on these datasets. Specifically, for a given neighborhood radius $σ$ stopping parameter $β$ and a dataset, the dataset is randomly divided into ten parts, with nine parts used as the training set and one used as the test set. During the training phase, feature selection is performed on the training set to identify an optimal feature subset. The optimal feature subset is then used to extract a sub-dataset from the original dataset. During the testing phase, ten-fold cross-validation is applied to the sub-dataset, computing the accuracy of the four classifiers. Finally, the mean of the output accuracy values obtained from four classifiers serves as the ultimate evaluation metric, providing a comprehensive assessment of feature selection effectiveness across the entire dataset.

Training parameters

In NRS-based models, the size of the neighborhood granule significantly impacts the model results. Determining the neighborhood granule’s size is essential to achieve optimal experimental outcomes. Thus, we employ ten-fold cross-validation with a step size of 0.025 in the range (0, 1) (Wang et al., 2019b) to obtain the optimal neighborhood radius parameter $σ$ for each algorithm. The search range in the “Spambase” dataset is (0, 0.225). Subsequently, we use four datasets and one algorithm to illustrate the selection process. Figure 8 displays the variation of classification accuracy with changing neighborhood radius for different datasets, using NGI as the evaluation metric.

Figure 8: Variation of classification accuracies with a neighborhood radius.

Download full-size image

DOI: 10.7717/peerj-cs.1711/fig-8

Evidently, the neighborhood radius has a pronounced impact on classification accuracy. As the parameter changes, the four datasets exhibit varying accuracy levels in all classifiers. We select the radius that corresponds to relatively higher accuracy in all classifiers as the optimal radius. For instance, in the “Anneal” dataset, $σ = 0.05$ is deemed the optimal neighborhood radius. Using the same training methodology, we determine the optimal neighborhood radius for each algorithm on various datasets, as presented in Table 2. In subsequent comparisons of algorithm performance, the neighborhood radius parameters are set based on this table.

Table 2:

Optimal neighborhood radius parameters.

Datasets	HANRS	HANMI	HANDI	HANSI	HANGI	HANCGI
Anneal	0.05	0.725	0.05	0.05	0.05	0.05
Arrhythmia	0.35	0.95	0.275	0.875	0.275	0.3
Autos	0.1	0.6	0.075	0.15	0.125	0.125
Breast-cancer	0.4	0.025	0.725	0.4	0.125	0.25
DARWIN	0.075	0.925	0.975	0.85	0.975	0.025
Dermatology	0.175	0.025	0.425	0.225	0.275	0.575
HillValley	0.225	0.525	0.2	0.425	0.425	0.425
Horse_colic	0.325	0.825	0.15	0.325	0.225	0.575
Ionosphere	0.175	0.525	0.175	0.2	0.15	0.125
Musk1	0.65	0.95	0.975	0.65	0.95	0.425
Parkinsons	0.1	0.375	0.1	0.1	0.1	0.375
Sonar	0.475	0.775	0.55	0.525	0.45	0.325
Spambase	0.175	0.15	0.1	0.175	0.15	0.125
Toxicity	0.975	0.075	0.05	0.95	0.075	0.925
Voting_records	0.025	0.025	0.875	0.25	0.875	0.425
Wine	0.025	0.95	0.025	0.025	0.025	0.05

DOI: 10.7717/peerj-cs.1711/table-2

Note:

The underlines represent that the results corresponding to all neighborhood radii under this algorithm are exactly the same.

In Table 2, the first column represents the dataset name, and each subsequent column header corresponds to the algorithm’s name. The values inside the table indicate the optimal neighborhood radius for each algorithm.

It is important to note that for the “Voting_records” dataset, HANRS cannot select features at any neighborhood radius. Therefore, we set it to the minimum value of 0.025 for subsequent comparisons.

Evaluation of feature validity

In the context of classification problems, feature selection algorithms aim to extract the most representative and discriminative features from the original feature set, creating a more compact subset. Constructing a classification model using the selected feature subset, achieving higher accuracy indicates that these features are more effective for the classification task on the given data. Based on the optimal neighborhood radius, the feature selection algorithms (HANRS, HANRE, HANDI, HANSI, HANGI, and HANCGI) were applied to 16 datasets, and the number of features selected is presented in Table 3. Where “Original” denotes the original dataset’s number of features, each subsequent column represents the average number of features selected by each algorithm over ten runs. Underscored numbers indicate the fewest selected features relative to other algorithms. Notably, HANRS did not select features in the “Voting_records” dataset and, therefore, is not included in the comparison.

Table 3:

Number of selected features.

Datasets	Original	HANRS	HANMI	HANDI	HANSI	HANGI	HANCGI
Anneal	19	7.20	3.20	8.50	8.00	7.70	7.60
Arrhythmia	263	34.00	6.80	17.10	115.60	16.40	28.20
Autos	27	9.10	1.10	7.20	10.00	9.10	8.60
Breast-cancer	10	1.00	1.00	5.80	1.00	8.10	1.00
DARWIN	452	4.80	9.60	36.90	48.60	45.40	3.40
Dermatology	35	10.10	1.00	12.60	11.20	10.30	18.80
HillValley	101	5.00	1.90	2.40	2.80	18.60	2.40
Horse_colic	28	16.00	3.00	10.90	15.90	13.30	1.00
Ionosphere	34	10.80	2.00	8.90	11.90	8.10	7.70
Musk1	169	34.90	9.50	64.60	35.50	63.70	16.60
Parkinsons	24	4.00	1.30	4.00	4.40	4.00	4.20
Sonar	61	21.70	3.00	24.90	25.60	18.40	11.10
Spambase	59	50.10	58.00	36.20	50.10	49.00	44.40
Toxicity	1,204	6.60	4.90	4.00	6.60	4.90	1.30
Voting_records	17	0.00	1.00	10.60	10.80	13.00	9.80
Wine	14	2.90	2.00	2.90	2.90	2.90	3.00
Mean	157.3125	13.64	6.83	16.09	22.56	18.31	10.57

DOI: 10.7717/peerj-cs.1711/table-3

Note:

Underscored numbers indicate the fewest selected features relative to other algorithms.

Comparing the number of selected features in Table 3, we observe that HANGI and HANCGI successfully achieve feature reduction. There is no significant difference in the average number of features reduced among the six algorithms. HANMI shows the strongest reduction capability, while HANSI demonstrates the weakest. Across the 16 datasets, the average number of features was reduced by HANGI to 18, ranking fifth on average among the six algorithms. HANCGI reduces the average number of features to 11, ranking second on average among the six algorithms.

Next, we employ SVC, KNN, XGBoost, and ANN to train the selected feature subsets and compare their classification accuracies, as presented in Tables 4–7. Table 8 presents the mean accuracy of each dataset across the four classifiers. In these tables, underscored numbers indicate the best classification accuracy achieved through feature reduction relative to other algorithms.

Table 4:

Average accuracy on SVC.

Datasets	Original	HANRS	HANMI	HANDI	HANSI	HANGI	HANCGI
Anneal	0.7619	0.7998	0.8344	0.8047	0.8046	0.8047	0.7997
Arrhythmia	0.6106	0.5961	0.5839	0.6306	0.6024	0.6534	0.6454
Autos	0.4105	0.3385	0.4418	0.4428	0.3665	0.4009	0.3997
Breast-cancer	0.6873	0.7622	0.7622	0.7449	0.7622	0.7405	0.7622
DARWIN	0.4647	0.7611	0.7178	0.5897	0.5654	0.5654	0.7631
Dermatology	0.7297	0.9520	0.3240	0.9209	0.9550	0.9566	0.9758
HillValley	0.5100	0.4774	0.4751	0.4777	0.4764	0.4745	0.4759
Horse_colic	0.6567	0.6730	0.6823	0.6613	0.6740	0.6637	0.7667
Ionosphere	0.9344	0.9378	0.8766	0.9387	0.9432	0.9312	0.9359
Musk1	0.7737	0.8075	0.7557	0.8482	0.8100	0.8382	0.7405
Parkinsons	0.8100	0.8250	0.7970	0.8250	0.8225	0.8250	0.8164
Sonar	0.6395	0.8104	0.6921	0.8095	0.7961	0.8108	0.7645
Spambase	0.9538	0.9940	0.9946	0.9940	0.9940	0.9940	0.9939
Toxicity	0.6500	0.6717	0.6717	0.6729	0.6717	0.6729	0.6711
Voting_records	0.9585	0.0000	0.6154	0.9636	0.9621	0.9641	0.9344
Wine	0.6810	0.8610	0.5833	0.9096	0.9096	0.9096	0.8576
Mean	0.7020	0.7042	0.6755	0.7646	0.7572	0.7628	0.7689

DOI: 10.7717/peerj-cs.1711/table-4

Note: