A Sparse-Modeling Based Approach for Class Specific Feature Selection
 Academic Editor
 Tzung-Pei Hong
 Subject Areas
 Bioinformatics, Data Mining and Machine Learning, Data Science
 Keywords
 Feature selection, Sparse coding, Bioinformatics, Dictionary learning, Ensemble learning
 Copyright
 © 2019 Nardone et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 2019. A Sparse-Modeling Based Approach for Class Specific Feature Selection. PeerJ Computer Science 5:e237 https://doi.org/10.7717/peerjcs.237
Abstract
In this work, we propose a novel Feature Selection framework called Sparse-Modeling Based Approach for Class Specific Feature Selection (SMBA-CSFS), which simultaneously exploits the ideas of Sparse Modeling and Class-Specific Feature Selection. Feature selection plays a key role in several fields (e.g., computational biology), making it possible to work with models that have fewer variables and are, in turn, easier to explain: it provides valuable insights on the importance of each variable and likely speeds up the experimental validation. Unfortunately, as also corroborated by the no free lunch theorems, none of the approaches in the literature is the most apt to detect the optimal feature subset for building a final model in every setting; thus, feature selection still represents a challenge. The proposed feature selection procedure is a two-step approach: (a) a sparse modeling-based learning technique is first used to find the best subset of features for each class of a training set; (b) the discovered feature subsets are then fed to a class-specific feature selection scheme, in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built, where each classifier is trained on its own feature subset discovered in the previous phase, and a proper decision rule is adopted to compute the ensemble responses. In order to evaluate the performance of the proposed method, extensive experiments have been performed on publicly available datasets, in particular belonging to the computational biology field where feature selection is indispensable: the acute lymphoblastic leukemia and acute myeloid leukemia, the human carcinomas, the human lung carcinomas, the diffuse large B-cell lymphoma, and the malignant glioma. SMBA-CSFS is able to identify/retrieve the most representative features that maximize the classification accuracy. 
With the top 20 and 80 features, SMBA-CSFS exhibits a promising performance when compared to its competitors from the literature on all considered datasets, especially those with a higher number of features. Experiments show that the proposed approach may outperform the state-of-the-art methods when the number of features is high. For this reason, the introduced approach is suited to the selection and classification of data with a large number of features and classes.
Introduction
Data analysis is the process of evaluating data, which are often represented in high-dimensional feature spaces whatever the area of study, from biology to pattern recognition to computer vision. High dimensionality often translates into overfitting, large computational costs and poor performance, thus hampering the learning task. Consequently, the dimensionality of such feature spaces needs to be lowered, since their features are generally uninformative, redundant, correlated with each other and also noisy. In this paper, we focus on feature selection, which identifies discriminative features by eliminating those with little or no predictive information, based on certain criteria, in order to work with data in low-dimensional spaces.
Feature Selection (FS) is the process of selecting a subset of relevant features to use in model construction. FS plays a key role in computational biology: for instance, microarray data analysis involves a huge number of genes with respect to (w.r.t.) a small number of samples, and effectively identifying the most significant differentially expressed genes under different conditions is of prime importance (Xiong, Fang & Zhao, 2001). The selected genes are very useful in clinical applications such as recognizing diseased profiles (Calcagno et al., 2010; Staiano et al., 2013; Di Taranto et al., 2015; Camastra, Di Taranto & Staiano, 2015). Nonetheless, because of their high costs, the number of experiments that can be used for classification purposes is usually limited: the number of samples is small compared to the large number of genes in an experiment, which gives rise to the Curse of Dimensionality problem (Friedman, Hastie & Tibshirani, 2001) and challenges classification as well as other data analysis tasks (Staiano et al., 2004; Ciaramella et al., 2008). Furthermore, microarray data are usually not immune from several issues, such as sensitivity, accuracy, specificity, reproducibility of results, and noisy data (Draghici et al., 2006). For these reasons, it is unsuitable to use microarray data as they are; however, after several corrections, the relevant genes can be selected by FS approaches and the results validated, for instance, by Real-Time PCR (Xiong, Fang & Zhao, 2001).
Taking a look at the literature, by searching for the keyword “feature selection”, one gets lost in an ocean of techniques (the reader may refer to the classical reviews in Saeys, Inza & Larrañaga (2007), Guyon & Elisseeff (2003), Hoque, Bhattacharyya & Kalita (2014) on the topic), often designed to tackle a specific data set. The reasons for this abundance of techniques lie in the heterogeneity of the available scientific data sets and in the limitations dictated by the no free lunch theorems (Wolpert & Macready, 1997), which imply that no general-purpose technique is well suited to a plethora of different kinds of data. A typical taxonomy organizes FS techniques (Jović, Brkić & Bogunović, 2015) into three main categories, namely filter, wrapper and embedded methods, whose algorithms select a single feature subset from a complete list of features. Another perspective, instead, divides FS techniques into two classes, namely Traditional Feature Selection (TFS) for all classes (which includes the filter, wrapper and embedded methods mentioned so far), and Class-Specific Feature Selection (CSFS) (Fu & Wang, 2002). Usually, a TFS algorithm selects one subset of features for all classes, although it may not be the best one for some classes, thus leading to undesirable results. Differently, a CSFS policy selects a distinct subset of features for each class, and it can use any traditional feature selector to choose, given the set of classes of a classification problem, one distinct group of features for each class. Depending on the type of feature selector, the overall process may slightly change. Nevertheless, it is worth pointing out that a CSFS scheme heavily depends on the use of a specific classifier, while its use should be independent of both the classifier of the classification step and the feature selector strategy. 
To this end, a General Framework for CSFS has been proposed in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011), which allows using any traditional feature selector as well as any classifier.
In this paper, on the basis of the general framework for CSFS, we propose a novel strategy for FS, namely a Sparse-Modeling Based Approach for Class-Specific Feature Selection, consisting of a two-step procedure. Firstly, a sparse modeling based learning technique is used to find the best subset of features for each class of the training set. In doing so, it is assumed that a class is represented by a subset of features, called representatives, such that each sample in a specific class can be described as a linear combination of them. Secondly, the discovered feature subsets are fed to a class-specific feature selection scheme in order to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built by training a given classifier, one for each class, on its own feature subset, i.e., the one discovered in the previous step, and a proper decision rule is adopted to compute the ensemble responses. In this way, the dilemma of choosing a specific TFS strategy and classifier in the CSFS framework is effectively mitigated.
Methods
The sparse-modeling based approach for class-specific feature selection is based on the concepts of sparse modeling and class-specific feature selection, which need to be properly introduced.
Sparse Modeling fundamentals
An actively developing field of statistical learning is focused around the notion of sparsity (Tibshirani, 1994; Ciaramella & Giunta, 2016). A Sparse Model (SM) is a model that can be much easier to estimate and interpret than a dense model. The sparsity assumption allows extracting meaningful features from large data sets. The aim of the first phase of the proposed approach is to use sparse modeling for finding data representatives directly in the data space, without any transformation. In other words, we wish to find a ranking of the most representative features that best reconstruct the data collection. Most approaches are based on an ℓ_{1}-norm regularization, such as LASSO (Tibshirani, 1994) and Sparse Dictionary Learning (Elhamifar, Sapiro & Vidal, 2012). Formally, given a set of features in ℝ^{m} arranged as columns of a data matrix X = [x_{1}, …, x_{n}], the task is to find representative features given a fixed feature space belonging to a collection of data points (see Mairal et al., 2008; Aharon, Elad & Bruckstein, 2006; Engan, Aase & Husoy, 1999; Jolliffe, 1986; Ramirez, Sprechmann & Sapiro, 2010). That task can conveniently be described in the Dictionary Learning (DL) framework, where the aim is to simultaneously learn a compact dictionary D = [d_{1}, …, d_{k}] ∈ ℝ^{m×k} and coefficients C = [c_{1}, …, c_{n}] ∈ ℝ^{k×n}, with k ≪ n, that can well represent collections of data points (Ciaramella, Gianfico & Giunta, 2016). The best representation of the data is obtained by minimizing the following objective function (1) $\sum_{i=1}^{n}\|\mathbf{x}_{i}-\mathbf{D}\mathbf{c}_{i}\|_{2}^{2}=\|\mathbf{X}-\mathbf{DC}\|_{F}^{2}$ w.r.t. the dictionary D and the coefficient matrix C, subject to appropriate constraints.
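As a quick numerical check of Eq. (1), the following minimal sketch verifies that the sum of the per-column squared reconstruction errors equals the squared Frobenius norm of the residual. The matrices are random stand-ins (the dictionary is not learned here) and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 10, 3                      # illustrative sizes, with k << n
X = rng.normal(size=(m, n))             # data matrix, one column per x_i
D = rng.normal(size=(m, k))             # dictionary (random here, not learned)
C = rng.normal(size=(k, n))             # coefficients, one column per c_i

# Left-hand side of Eq. (1): sum of squared l2 errors over the columns.
per_column = sum(np.sum((X[:, i] - D @ C[:, i]) ** 2) for i in range(n))
# Right-hand side of Eq. (1): squared Frobenius norm of the residual.
frobenius = np.linalg.norm(X - D @ C, ord="fro") ** 2
```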
However, the learned dictionary atoms almost never correspond to the original feature space (Aharon, Elad & Bruckstein, 2006; Ramirez, Sprechmann & Sapiro, 2010; Mairal et al., 2009). In order to find a subset of features that best represent the entire feature space, the optimization problem in Eq. (1) is reformulated by forcing the dictionary D to be the data matrix X (Elhamifar, Sapiro & Vidal, 2012): (2) $\sum_{i=1}^{n}\|\mathbf{x}_{i}-\mathbf{X}\mathbf{c}_{i}\|_{2}^{2}=\|\mathbf{X}-\mathbf{XC}\|_{F}^{2},$ where ‖⋅‖_{F} denotes the Frobenius norm. Equation (2) is minimized w.r.t. the coefficient matrix C ≜ [c_{1}, …, c_{n}] ∈ ℝ^{n×n}, subject to additional constraints. In other words, the reconstruction error of each feature component is minimized by linearly combining all the components of the feature space. To choose k ≪ n representatives involved in the linear reconstruction of each component in Eq. (2), the following constraint is added to the model (3) $\|\mathbf{C}\|_{0,q}\le k,$ where the mixed ℓ_{0}∕ℓ_{q} norm is defined as $\|\mathbf{C}\|_{0,q}\triangleq\sum_{i=1}^{N}I\left(\|\mathbf{c}^{i}\|_{q}>0\right)$, c^{i} denotes the ith row of C, and I(⋅) denotes the indicator function. In a nutshell, $\|\mathbf{C}\|_{0,q}$ counts the number of nonzero rows of C. The indices of the nonzero rows of C correspond to the indices of the columns of X which are chosen as the representative features. Since the aim is to select k ≪ n representative features that can reconstruct each feature of the X matrix up to a fixed error, the optimization problem to solve is (4) $\underset{\mathbf{C}}{\text{minimize}}\;\|\mathbf{X}-\mathbf{X}\mathbf{C}\|_{F}^{2}\quad\text{subject to}\quad\|\mathbf{C}\|_{0,q}\le k,\;\mathbf{1}^{T}\mathbf{C}=\mathbf{1}^{T}$ where 1^{T}C = 1^{T} is the affine constraint for selecting representatives that are invariant w.r.t. 
a global translation of the data (as requested by dimensionality reduction methods). This is an NP-hard problem, as it implies a combinatorial search over every subset of k columns of X. Therefore, relaxing the ℓ_{0} norm to the ℓ_{1} norm, the problem becomes (5) $\underset{\mathbf{C}}{\text{minimize}}\;\|\mathbf{X}-\mathbf{X}\mathbf{C}\|_{F}^{2}\quad\text{subject to}\quad\|\mathbf{C}\|_{1,q}\le\tau,\;\mathbf{1}^{T}\mathbf{C}=\mathbf{1}^{T}$ where $\|\mathbf{C}\|_{1,q}\triangleq\sum_{i=1}^{N}\|\mathbf{c}^{i}\|_{q}$ is the sum of the ℓ_{q} norms of the rows of C and τ > 0 is an appropriately chosen parameter. The solution of the optimization problem in Eq. (5) not only provides the representative features as the nonzero rows of C, but also provides information about the ranking of the selected features. More precisely, a representative that has a higher ranking takes part in the reconstruction process more than the others; hence, its corresponding row in the optimal coefficient matrix C has many nonzero elements with large values. Conversely, a representative with a lower ranking takes part in the reconstruction process less than the others; hence, its corresponding row in C has few nonzero elements with smaller values. Thus, the k representative features x_{i_{1}}, …, x_{i_{k}} are ranked as i_{1} ≥ i_{2} ≥ ⋯ ≥ i_{k} whenever for the corresponding rows of C one gets (6) $\|\mathbf{c}^{i_{1}}\|_{q}\ge\|\mathbf{c}^{i_{2}}\|_{q}\ge\cdots\ge\|\mathbf{c}^{i_{k}}\|_{q}.$
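To make the mixed norms and the ranking rule of Eq. (6) concrete, here is a minimal sketch that counts the nonzero rows of a coefficient matrix, sums their row norms, and orders the representatives by decreasing row norm. The helper names, the tolerance `tol` and the toy matrix are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def mixed_l0q(C, q=2, tol=1e-10):
    """Number of rows of C whose l_q norm exceeds tol (the l_0/l_q count)."""
    row_norms = np.linalg.norm(C, ord=q, axis=1)
    return int(np.sum(row_norms > tol))

def mixed_l1q(C, q=2):
    """Sum of the l_q norms of the rows of C (the l_1/l_q relaxation)."""
    return float(np.sum(np.linalg.norm(C, ord=q, axis=1)))

def rank_representatives(C, q=2, tol=1e-10):
    """Indices of the nonzero rows, sorted by decreasing row norm as in Eq. (6)."""
    row_norms = np.linalg.norm(C, ord=q, axis=1)
    order = np.argsort(-row_norms, kind="stable")
    return [int(i) for i in order if row_norms[i] > tol]

# Toy coefficient matrix: rows 1 and 2 are nonzero, with l2 norms 5 and 1.
C = np.array([[0.0, 0.0, 0.0],
              [3.0, 4.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
```

On this toy matrix, row 1 ranks above row 2 because its row norm is larger, while the two zero rows are not selected at all.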
Procedure SMBA
Input:  X, N × M matrix, where N is the number of observations and M is the number of features
        θ = {α, δ, ρ, η}, parameters vector
Output: I, set of selected features
Variables initialization
while ϵ > δ and t > ρ do
    β^{t+1} ← (X^{T}X + ρI)^{−1}
    θ^{t+1} ← S_{λ∕ρ}(β^{t+1} + μ^{t}∕ρ)
    μ^{t+1} ← μ^{t} + ρ(β^{t+1} − θ^{t+1})
    ϵ ← compute_error(β, θ)
end
I ← find_representatives(θ, η)
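The alternating structure of the procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it solves the penalized row-sparse problem with q = 2, drops the affine constraint for brevity, uses a scaled dual variable `U` in place of μ∕ρ, and uses the standard closed-form least-squares step for the first update. The function names, default parameters and stopping rule are hypothetical.

```python
import numpy as np

def row_soft_threshold(V, kappa):
    """Shrink the l2 norm of each row of V by kappa (rows below kappa become zero)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - kappa / np.maximum(norms, 1e-12))
    return scale * V

def smba_admm(X, lam=1.0, rho=1.0, tol=1e-6, max_iter=500):
    """ADMM sketch for min_C 0.5*||X - XC||_F^2 + lam*||C||_{1,2} (no affine constraint)."""
    n = X.shape[1]
    G = X.T @ X
    inv = np.linalg.inv(G + rho * np.eye(n))   # computed once, reused at every iteration
    Z = np.zeros((n, n))                       # sparse copy of C (theta in the procedure)
    U = np.zeros((n, n))                       # scaled dual variable (mu / rho)
    for _ in range(max_iter):
        C = inv @ (G + rho * (Z - U))          # least-squares update (beta)
        Z_new = row_soft_threshold(C + U, lam / rho)
        U = U + C - Z_new                      # dual update
        converged = (np.linalg.norm(C - Z_new) < tol
                     and np.linalg.norm(Z_new - Z) < tol)
        Z = Z_new
        if converged:
            break
    return Z

def find_representatives(Z, eta=1e-8):
    """Indices of rows with norm above eta, sorted by decreasing row norm."""
    row_norms = np.linalg.norm(Z, axis=1)
    order = np.argsort(-row_norms, kind="stable")
    return [int(i) for i in order if row_norms[i] > eta]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 6))              # 20 observations, 6 features
Z_demo = smba_admm(X_demo, lam=0.1)
ranking = find_representatives(Z_demo)
```

The regularization weight `lam` controls how many rows survive the shrinkage: a very large value drives every row to zero, so no representative is returned.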
From a practical point of view, the optimization problem (Eq. (5)) can be expressed by using the Lagrange multipliers (7) $\underset{\mathbf{C}}{\text{minimize}}\quad\frac{1}{2}\|\mathbf{X}-\mathbf{XC}\|_{F}^{2}+\lambda\|\mathbf{C}\|_{1,q}\quad\text{subject to}\quad\mathbf{1}^{T}\mathbf{C}=\mathbf{1}^{T}.$ In practice, the algorithm is implemented using an Alternating Direction Method of Multipliers (ADMM) optimization framework (Boyd et al., 2011). In particular, the features of a given data set are selected as representatives with small pairwise coherence, as in a sparse dictionary learning method. It is worth observing the resemblance with the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1994). The latter is an approach to regression analysis that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. Recall that the objective of LASSO, in its basic form, is to solve (8) $\underset{\beta}{\text{minimize}}\;\frac{1}{N}\|y-\mathbf{X}\beta\|_{2}^{2}\quad\text{subject to}\quad\|\beta\|_{1}\le t,$ where y = [y_{1}, …, y_{N}] is the N-dimensional vector of outcomes, X is the covariate matrix, t is a free parameter that determines the amount of regularization and β is the sparse vector to estimate.
From Eq. (8), one can observe that a sparse matrix can be estimated as in Eq. (7) by considering X itself as the outcome and adding the affine constraint. In the following, the LASSO will be used for classification tasks, adopting a sigmoid function, as described in the experimental setup.
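A hedged sketch of how a sigmoid can be combined with the LASSO objective of Eq. (8) for binary classification is given below. It assumes the penalized (rather than constrained) form, a squared loss on the sigmoid output, labels in {0, 1} and a fixed-step proximal gradient solver; all of these choices, as well as the function names and default values, are illustrative assumptions and not the implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def sigmoid_lasso(X, y, lam=0.01, step=0.5, n_iter=500):
    """Proximal gradient on (1/N)*||y - sigmoid(X b)||_2^2 + lam*||b||_1."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        # Gradient of the smooth part via the chain rule through the sigmoid.
        grad = (2.0 / n_samples) * X.T @ ((p - y) * p * (1.0 - p))
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy data: only feature 0 carries information about the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(float)
beta_demo = sigmoid_lasso(X_demo, y_demo)
```

Features can then be ranked by the magnitude of their coefficients in `beta_demo`; on this toy data the informative feature receives the largest weight.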
Algorithm 1: Sparse-Modeling Based Approach for Class-Specific Feature Selection
Input:  X = {x_{1}, …, x_{n}}, data set
        y, class labels
        θ, SMBA parameters
        m, maximum number of features to select
        C, classifier model (e.g., SVM, KNN, etc.)
        K, number of folds for performing K-fold Cross Validation
Output: $\overline{ACM}$, Average Classification Metrics on K folds
1  X ← Data standardization
2  X ← Class balancing(X) by using SMOTE (Chawla et al., 2002)
3  X ← Random shuffling(X)
4  Divide X into K folds
5  foreach k_{i} ∈ K folds do
6      Set the k_{i} fold as the test set X_{test}
7      Use the remaining K − 1 folds as the train set X_{train}
8      Perform the Class-sample separation on the train set X_{train}
9      (Note that I is the subset of features selected for each class c_{i} ∈ X_{train})
10     foreach X_{ci} ∈ X_{train} do
11         I = {I_{ci} … I_{cc}} ← SMBA(X_{ci}, θ)
12     end
13     for j ← 1 to m do
14         Build an ensemble classifier E_{j} = {e_{1,j}, …, e_{c,j}} using the jth selected feature ∈ I_{ci} and the classifier C
15         foreach O ∈ X_{test} do
16             (ACM_{j}) ← Use E_{j} to classify the instance O
17         end
18         (ACM) ← (ACM_{j})
19     end
20 end
21 ($\overline{ACM}$) ← Average(ACM)
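The outer structure of Algorithm 1 (fold splitting, class-sample separation and per-class feature selection) can be sketched as follows. This is an illustrative skeleton only: the classifier ensemble and SMOTE balancing are omitted, and `variance_selector` is a placeholder ranking used so that the example runs on its own; it is not the SMBA procedure. All function names are hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def class_sample_separation(X, y):
    """Split the rows of X into one disjoint block per class label."""
    return {c: X[y == c] for c in np.unique(y)}

def per_class_selection(X_train, y_train, selector, m):
    """Run `selector` on each class block and keep its top-m feature indices."""
    blocks = class_sample_separation(X_train, y_train)
    return {c: list(selector(Xc))[:m] for c, Xc in blocks.items()}

def variance_selector(Xc):
    """Placeholder ranking by decreasing variance; NOT the SMBA procedure."""
    return np.argsort(-Xc.var(axis=0), kind="stable")

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))                  # 30 samples, 8 features
y = np.repeat([0, 1, 2], 10)                  # 3 balanced classes
folds = kfold_indices(len(y), n_folds=5)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    # One feature subset per class, computed on the training folds only.
    selected = per_class_selection(X[train_idx], y[train_idx], variance_selector, m=3)
```

Any function that maps a class block to a ranked list of feature indices can be passed as `selector`, which mirrors the role played by the call to SMBA at line 11 of the algorithm.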
A Sparse-Modeling Based Approach for Class Specific Feature Selection
A General Framework for Class-Specific Feature Selection (GF-CSFS) is described in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011). The proposed Sparse-Modeling Based Approach for Class-Specific Feature Selection (SMBA-CSFS) tries to best represent each class-sample set of an input data set by using only a few representative features. More specifically, the method is made up of the following steps:

Class-sample separation: Unlike the GF-CSFS, SMBA-CSFS does not employ the Class binarization stage to transform a c-class problem into c binary problems; instead, it just uses a simple Class-sample separation. Basically, it consists of partitioning the samples of the training set of a given data set into several disjoint sets/configurations of samples, one for each class (see Fig. 1).

Class balancing: Once the training set has been split apart (by applying the above Class-sample separation step), the resulting class-subsets may be unbalanced. Therefore, the SMOTE (Chawla et al., 2002) resampling method is applied to balance each class-subset. Technically speaking, it is important to point out that steps 1–2 are interchangeable, meaning that the order in which they are performed makes no difference.

Intra-Class-Specific feature selection: The Sparse-Modeling Based Approach is used for retrieving, by minimizing Eq. (7), the most representative features for each class-sample set of the training set, i.e., those that best represent/reconstruct the whole class of objects. In doing so, the approach takes advantage of the intra-class properties for selecting the best feature subset (describing each class), which is used to improve the classification accuracy against TFS and GF-CSFS.

Classification: Since the training set gets split into different class-sample subsets, we embraced the idea of using a wise-ensemble procedure for training a classification model for discriminating new incoming instances. As in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011), given a class c_{i}, a classifier e_{i} is trained on the original data set using only the selected features for c_{i}, for i = 1, …, c. Overall, an ensemble classifier E = {e_{1}, …, e_{c}} is constructed. In order to classify a new instance O through the ensemble, the natural dimension of O needs to be lowered to the dimension d_{i} of the classifier e_{i}, i = 1, …, c. Then, for determining to which class O belongs, an ad-hoc majority rule is used:

If a classifier e_{i} outputs the same class for which the features used for training it were selected, i.e., the output of e_{i} is c_{i}, then O belongs to c_{i}. In case of a tie, i.e., when several classifiers respond with their own class, a majority vote among all classifiers is needed to determine the class of O. If a tie still occurs, O will belong to the class that received more votes among the tied classes.

If no classifier e_{i} outputs the class c_{i} whose selected features were used for training it, O belongs to the class winning the majority voting. If there is a tie, then O will belong to the class that received more votes among the tied classes.

Finally, since a recursive tie may occur, in that case the instance O is classified by randomly choosing a class among all the tied classes. The algorithm in Fig. 1 illustrates the pseudocode describing the CSFS-SMBA procedure. Basically, it first standardizes, class-balances and shuffles the data set X, then divides it into K folds, assigning the k_{i}th fold as test set X_{test} and the remaining K − 1 folds as train set X_{train}. The algorithm iteratively performs the task of class-sample separation, to split the samples belonging to the different classes X_{ci}, on which Algorithm 1 (illustrated in page 4) is performed to output the m most representative features for each class (line 12). The selected features are first used, one at a time, for training an ensemble classifier E_{j}, and later for classifying each instance O belonging to the test set X_{test}. Finally, for all the ensemble models up to m selected features, the algorithm outputs the $\overline{ACM}$ matrix, storing several model evaluation metrics.
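
One possible reading of the ad-hoc majority rule of the Classification step is sketched below. It assumes that classes are indexed 0, …, c − 1, that `predictions[i]` is the output of the classifier e_{i} trained on the features selected for class i, and that remaining ties are broken at random; the function name and the fixed default seed are illustrative assumptions.

```python
import random
from collections import Counter

def ensemble_decision(predictions, rng=None):
    """Combine the outputs of the c class-specific classifiers into one label."""
    rng = rng or random.Random(0)
    votes = Counter(predictions)
    # Classes whose own classifier recognized the instance (e_i outputs c_i).
    self_claims = [c for c, pred in enumerate(predictions) if pred == c]
    # Without any self-claim, fall back to a plain majority vote over all outputs.
    candidates = self_claims if self_claims else list(votes)
    if len(candidates) == 1:
        return candidates[0]
    # Tie among candidates: keep those with the most votes over all classifiers.
    best = max(votes[c] for c in candidates)
    tied = [c for c in candidates if votes[c] == best]
    # A remaining tie is resolved by a random choice among the tied classes.
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

For example, with three classifiers answering `[0, 1, 0]`, both e_{0} and e_{1} claim their own class, and class 0 wins because it collects two of the three votes.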

Experimental results
In the experiments, the SMBA-CSFS performance has been assessed on nine publicly available microarray data sets. The classifiers used to determine the goodness of the selected feature subsets are a Support Vector Machine (SVM) with a linear kernel and parameter C = 1, a Naive Bayes, a K-Nearest Neighbors (KNN) using k = 5, and a Decision Tree.
Data sets description
In order to validate the introduced approach, a number of data sets exemplifying the typical data processing in the biological field are used in the experiments. A brief description of all the data sets employed in the experiments follows.

The ALLAML data set (Golub et al., 1999) contains 72 samples in total, in 2 classes, ALL and AML, which have 47 and 25 samples, respectively. Every sample contains 7,129 gene expression values.

The LEUKEMIA data set (Golub et al., 1999) contains 72 samples in total, in 2 classes: acute lymphoblastic and acute myeloid. It is a modified version of the original ALLAML data set, where the original baseline genes (7,129) were cut off before further analysis. The number of genes used in the binary classification task is 7,070.

The CLL_SUB_111 data set (Haslinger et al., 2004) has gene expressions from high-density oligonucleotide arrays containing genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). The data set consists of 11,340 attributes, 111 instances and 3 classes.

The GLIOMA data set (Nutt et al., 2003) contains 50 samples in total, in 4 classes: cancer glioblastomas, non-cancer glioblastomas, cancer oligodendrogliomas and non-cancer oligodendrogliomas, which have 14, 14, 7, 15 samples, respectively. Each sample has 12,625 genes. After preprocessing, the data set has been shrunk to 50 samples and 4,433 genes.

The LUNG data set (Bhattacharjee et al., 2001) contains 203 samples in total, in 5 classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6, 17 samples, respectively. The genes with standard deviations smaller than 50 expression units were removed, yielding a data set with 203 samples and 3,312 genes.

The LUNG_DISCRETE data set (Peng, Long & Ding, 2005) contains 73 samples in 7 classes, where each sample consists of 325 gene expressions. The class cardinalities in the LUNG_DISCRETE data set are 6, 5, 5, 16, 7, 13, 21, respectively.

The DLBCL data set (Alizadeh et al., 2000) is a modified version of the original DLBCL data set. It consists of 96 samples in 9 classes, where each sample is defined by the expression of 4,026 genes. The class cardinalities in the DLBCL data set are 46, 10, 9, 11, 6, 6, 4, 2, 2, respectively.

The CARCINOM data set (Su et al., 2001) contains 174 samples in 11 classes: prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas and lung squamous cell carcinoma, with 26, 8, 26, 23, 12, 11, 7, 27, 6, 14, 14 samples, respectively. After preprocessing as described in Yang et al. (2006), the data set has been shrunk to 174 samples and 9,182 genes.

The GCM data set (Ramaswamy et al., 2001) contains 190 samples in 14 classes: breast, prostate, lung, colorectal, lymphoma, bladder, melanoma, uterus, leukemia, renal, pancreas, ovary, mesothelioma and central nervous system, where each sample consists of 16,063 gene expression signatures. The class cardinalities in the data set are 11, 11, 20, 11, 30, 11, 22, 10, 11, 11, 11, 10, 11, 10, respectively.
All data sets are available at the following data repository (Nardone, Ciaramella & Staiano, 2019a). All the information about the data sets is summarized in Table 1.
Experiment setup
To validate the effectiveness of the SMBA-CSFS model, it has been compared against several TFS methods and the GF-CSFS proposed in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011). SMBA-CSFS is first compared against TFS methods and, since the framework in Pineda-Bautista, Carrasco-Ochoa & Martínez-Trinidad (2011) can use any TFS method as a base for performing CSFS, some experiments using both filter and wrapper methods (injection process) were made. In addition, the accuracy results were also compared against those obtained on the basis of all the features (BSL). The following TFS methods have been chosen for comparison purposes:
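| Data set | Size | # Features | # Classes |
|---|---|---|---|
| ALLAML | 72 | 7,129 | 2 |
| LEUKEMIA | 72 | 7,070 | 2 |
| CLL_SUB_111 | 111 | 11,340 | 3 |
| GLIOMA | 50 | 4,434 | 4 |
| LUNG_C | 203 | 3,312 | 5 |
| LUNG_D | 73 | 325 | 7 |
| DLBCL | 96 | 4,026 | 9 |
| CARCINOM | 174 | 9,182 | 11 |
| GCM | 190 | 16,063 | 14 |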

LASSO (Tibshirani, 1994): The LASSO method penalizes the absolute size of the regression coefficients and is usually used for creating parsimonious models in the presence of a large number of features. The model implemented is a modified version of the classical LASSO, adapted for classification purposes. In particular, in Eq. (8), the product Xβ is transformed by a sigmoid function in order to address the classification problem.

EN (Zou & Hastie, 2005): Elastic Net is a hybrid of ridge regression and LASSO regularization. Like LASSO, Elastic Net can generate reduced models by producing zero-valued coefficients. Experimental studies have suggested that the Elastic Net technique can outperform LASSO on data with highly correlated features. As for LASSO, a modified version adapted for classification purposes has been implemented.

RFS (Nie et al., 2010): The Robust Feature Selection method is a sparse-based learning approach for feature selection which emphasizes the joint ℓ_{2,1}-norm minimization on both the loss and the regularization functions.

ls-ℓ_{2,1} (Tang, Alelyani & Liu, 2014): ls-ℓ_{2,1} is a supervised sparse feature selection method. It exploits the ℓ_{2,1}-norm regularized regression model for joint feature selection from multiple tasks, where the classification objective function is a quadratic loss.

ll-ℓ_{2,1} (Tang, Alelyani & Liu, 2014): ll-ℓ_{2,1} is a supervised sparse feature selection method which uses the same concept as ls-ℓ_{2,1} but with a logistic loss as the classification objective function.

Fisher (Gu, Li & Han, 2012): Fisher is one of the most widely used supervised filter feature selection methods. It scores each feature as the ratio of inter-class separation to intra-class variance; features are evaluated independently, and the final feature selection is made by aggregating the m top-ranked ones.

ReliefF (Kira & Rendell, 1992; Kononenko, 1994): ReliefF is an iterative, randomized and supervised filter approach that estimates the quality of the features according to how well their values differentiate data samples that are near to each other; it does not discriminate among redundant features, and its performance decreases with few data.

mRMR (Peng, Long & Ding, 2005): Minimum-Redundancy-Maximum-Relevance is a mutual information based filter algorithm which selects features according to the maximal statistical dependency criterion.

MI (Kraskov, Stögbauer & Grassberger, 2004; Ross, 2014): Mutual Information is a non-negative value which measures the dependency between variables. Features are selected in a univariate way. The function relies on non-parametric methods based on entropy estimation from k-nearest neighbors distances.

SMBA: The Sparse-Modeling Based Approach is our SMBA-CSFS model restricted to the SDL strategy: it selects a single subset of features considering all the classes in the feature selection process.
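
As an illustration of the filter methods listed above, the following sketch computes a Fisher score per feature and keeps the m top-ranked ones. It uses one common formulation (class-size-weighted between-class scatter divided by within-class variance, each feature scored independently); the exact variant used in the experiments is not specified here, so the formula, function names and toy data are assumptions.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature ratio of between-class separation to within-class variance."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / np.maximum(within, 1e-12)   # guard against zero variance

def top_m_features(scores, m):
    """Indices of the m highest-scoring features, best first."""
    return np.argsort(-scores, kind="stable")[:m]

# Toy data: feature 0 follows the class label, the others are pure noise.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 20)
X_demo = rng.normal(size=(60, 4))
X_demo[:, 0] += 5.0 * y_demo
scores = fisher_score(X_demo, y_demo)
```

On this toy data the first feature dominates the ranking, since its class means are far apart relative to its within-class spread.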
We preprocessed all the data sets by using the Z-score (Kreyszig, 2010) normalization. To fairly compare the considered supervised feature selection methods, we first tuned the parameters of all methods by using a “grid-search” strategy (Tang, Alelyani & Liu, 2014); then, for evaluating the performance of all the methods, a number of features ranging from 1 to 80 was considered, by performing a 5-fold Cross Validation (CV).
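A minimal sketch of Z-score normalization is shown below. Whether the statistics are computed on the whole data set or on the training folds only is a design choice that this sketch leaves to the caller; the function names and the handling of constant features are illustrative assumptions.

```python
import numpy as np

def zscore_fit(X_ref):
    """Per-feature mean and standard deviation of the reference matrix."""
    mean = X_ref.mean(axis=0)
    std = X_ref.std(axis=0)
    std = np.where(std > 0, std, 1.0)     # leave constant features at zero
    return mean, std

def zscore_apply(X, mean, std):
    """Center and scale each feature with the given statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=3.0, scale=2.0, size=(50, 4))
X_demo[:, 3] = 7.0                        # a constant feature
mean, std = zscore_fit(X_demo)
X_norm = zscore_apply(X_demo, mean, std)
```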
The performance of the classification algorithms for all the methods has been evaluated by using the Accuracy along with its standard deviation (ACC ± STD), Precision (P), Recall (R) and F-measure (F), which are computed as illustrated in Sokolova & Lapalme (2009). In addition, to give a better and summarized comparison of the performance of the models, we also computed the Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC), where the former is a useful tool for evaluating the quality of class separation for a classifier, while the latter makes it easier to compare the ROC curve of one model to another.
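The multi-class metrics can be derived from a confusion matrix, as in the following sketch. It assumes integer class labels and macro-averaging (the unweighted mean of the per-class values); the averaging scheme actually used for the reported results is not stated here, so this choice and the function names are assumptions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t, p] counts the samples of true class t predicted as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall and F-measure."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)            # column sums: samples assigned to each class
    actual = cm.sum(axis=1)               # row sums: samples belonging to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc, p, r, f = macro_metrics(cm)
```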
Discussion
The experiments have been performed on a workstation with a dual Intel(R) Xeon(R) 2.40 GHz and 64 GB RAM. The developed code is available at Nardone, Ciaramella & Staiano (2019b). For the sake of readability, all the results presented here account only for the SVM classifier, since the results showed that the proposed approach is barely sensitive to the choice of a specific classifier (indeed, the performance of the classifiers is rather comparable). Nevertheless, the interested reader may refer to the Supplemental Material for details on additional results concerning all the used classifiers. The experimental results on 5-fold CV for the SVM classifier are summarized in Tables 2–5. Figures 2–5 show all the accounted model evaluation metrics for the ten feature selection methods on the nine considered data sets.
Average Accuracy of top 20 features (%)  

ALLAML  LEUKEMIA  CLL_SUB_111  GLIOMA  LUNG_C  LUNG_D  DLBCL  CARCINOM  GCM  
Fisher  96.84 ± 0.04(19)  98.95 ± 0.02(16)  75.20 ± 0.1(19)  80 ± 0.04(13)  91.94 ± 0.02(19)  91.24 ± 0.1(20)  97.11 ± 0.02(19)  65.33 ± 0.05(20)  94.9 ± 0.00(20) 
Relief  95.78 ± 0.04(8)  97.89 ± 0.03(12)  76.45 ± 0.03(15)  80 ± 0.07(19)  97.12 ± 0.01(20)  95.2 ± 0.03(14)  99.76 ± 0.00(20)  86.52 ± 0.03(18)  97.14 ± 0.01(20) 
mRMR  66.14 ± 0.13(12)  98.95 ± 0.02(9)  71.27 ± 0.1(20)  66.67 ± 0.1(17)  95.68 ± 0.013(19)  95.22 ± 0.02(20)  99.03 ± 0.01(16)  89.57 ± 0.04(20)  97.79 ± 0.01(20) 
MI  96.84 ± 0.042(15)  98.95 ± 0.02(10)  81.03 ± 0.06(17)  78.33 ± 0.04(12)  97.41 ± 0.014(17)  94.53 ± 0.03(18)  98.79 ± 0.01(19)  93.25 ± 0.05(20)  95.58 ± 0.01(20) 
ls_l21  71.34 ± 0.14(19)  59.42 ± 0.2(12)  60.30 ± 0.14(19)  55 ± 0.07(20)  92.66 ± 0.05(19)  93.86 ± 0.04(20)  92.52 ± 0.01(20)  66.99 ± 0.03(20)  96.56 ± 0.01(20) 
ll_l21  83 ± 0.11(15)  88.36 ± 0.06(20)  73.12 ± 0.06(15)  0.75 ± 0.12(17)  98.27 ± 0.015(16)  93.24 ± 0.04(16)  94.44 ± 0.02(19)  83.49 ± 0.03(20)  97.69 ± 0.01(20) 
RFS  87 ± 0.01(15)  74.33 ± 0.1(18)  64.73 ± 0.09(15)  66.67 ± 0.07(17)  94.10 ± 0.022(20)  89.77 ± 0.02(19)  91.06 ± 0.03(18)  81.85 ± 0.07(18)  96.77 ± 0.01(20) 
LASSO  98.95 ± 0.02(17)  71.3 ± 0.08(21)  68.02 ± 0.06(20)  83.33 ± 0.05(17)  97.99 ± 0.012(16)  92.51 ± 0.03(12)  99.52 ± 0.01(16)  82.14 ± 0.05(18)  97.07 ± 0.01(20) 
EN  98.95 ± 0.02(17)  71.3 ± 0.08(21)  68.02 ± 0.06(20)  83.33 ± 0.05(17)  97.99 ± 0.012(16)  92.51 ± 0.03(12)  99.52 ± 0.01(16)  82.14 ± 0.05(18)  97.07 ± 0.01(20) 
SMBA  93.68 ± 0.084(16)  88.36 ± 0.06(20)  70.60 ± 0.10(19)  71.67 ± 0.134(17)  97.84 ± 0.00(20)  92.55 ± 0.03(20)  99.28 ± 0.01(20)  83.49 ± 0.03(20)  97.69 ± 0.01(20) 
SMBACSFS  88.24 ± 0.04(20)  81.93 ± 0.02(20)  75.53 ± 0.06(20)  73.34 ± 0.18(16)  98.41 ± 0.014(19)  97.93 ± 0.03(19)  98.30 ± 0.02(13)  94.95 ± 0.02(19)  99.2 ± 0.01(20) 
BSL  97.89 ± 0.04  98.95 ± 0.021  84.26 ± 0.06  85 ± 0.1  99.57 ± 0.00  98.62 ± 0.02  100 ± 0.00  98.65 ± 0.01  100 ± 0.00 
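Table 3. Precision (P), Recall (R) and F-measure (F)
Method  ALLAML P  ALLAML R  ALLAML F  LEUKEMIA P  LEUKEMIA R  LEUKEMIA F  CLL_SUB_111 P  CLL_SUB_111 R  CLL_SUB_111 F  GLIOMA P  GLIOMA R  GLIOMA F  LUNG_C P  LUNG_C R  LUNG_C F  LUNG_D P  LUNG_D R  LUNG_D F  DLBCL P  DLBCL R  DLBCL F  CARCINOM P  CARCINOM R  CARCINOM F  GCM(14) P  GCM(14) R  GCM(14) F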
Fisher  0.98(18)  0.98(18)  0.98  0.99(15)  0.99(15)  0.99  0.75(11)  0.75(11)  0.75  0.68(20)  0.67(14)  0.67  0.92(19)  0.92(19)  0.92  0.89(20)  0.88(15)  0.88  0.9(17)  0.99(20)  0.93  0.9(19)  0.89(20)  0.89  0.64(20)  0.64(20)  0.64 
Relief  0.96(12)  0.96(12)  0.96  0.99(4)  0.99(4)  0.99  0.75(17)  0.75(17)  0.75  0.77(19)  0.77(19)  0.77  0.97(20)  0.97(20)  0.97  0.95(20)  0.95(15)  0.95  0.89(18)  1.0(20)  0.94  0.89(18)  0.88(18)  0.88  0.8(20)  0.8(20)  0.8 
mRMR  0.8(19)  0.8(19)  0.8  0.98(6)  0.98(17)  0.98  0.64(14)  0.66(14)  0.65  0.7(12)  0.7(12)  0.7  0.97(20)  0.97(20)  0.97  0.96(19)  0.95(19)  0.95  0.95(20)  0.99(14)  0.92  0.88(20)  0.91(20)  0.89  0.85(20)  0.85(20)  0.85 
MI  0.98(12)  0.98(12)  0.98  0.98(2)  0.98(2)  0.98  0.76(16)  0.76(16)  0.76  0.74(20)  0.73(17)  0.73  0.97(20)  0.97(20)  0.97  0.95(20)  0.95(20)  0.95  0.95(17)  0.99(19)  0.9  0.95(17)  0.95(17)  0.83  0.69(20)  0.69(20)  0.69 
ls_l21  0.83(18)  0.81(18)  0.82  0.84(20)  0.82(20)  0.83  0.7(20)  0.7(20)  0.7  0.7(16)  0.7(17)  0.7  0.97(20)  0.97(20)  0.97  0.89(19)  0.88(19)  0.88  0.81(19)  0.93(17)  0.87  0.81(19)  0.81(20)  0.81  0.76(20)  0.76(20)  0.76 
ll_l21  0.92(15)  0.91(15)  0.91  0.83(20)  0.83(20)  0.83  0.69(20)  0.69(20)  0.69  0.65(9)  0.65(9)  0.65  0.98(18)  0.98(18)  0.98  0.94(20)  0.93(20)  0.93  0.92(18)  0.96(19)  0.92  0.9(17)  0.86(20)  0.88  0.84(20)  0.84(20)  0.84 
RFS  0.86(18)  0.84(19)  0.85  0.84(20)  0.76(20)  0.8  0.63(12)  0.64(12)  0.63  0.71(12)  0.7(12)  0.7  0.96(19)  0.96(19)  0.96  0.88(18)  0.86(18)  0.87  0.89(19)  0.93(16)  0.84  0.89(18)  0.84(19)  0.86  0.77(20)  0.77(20)  0.77 
LASSO  0.84(20)  0.84(13)  0.84  0.77(20)  0.77(20)  0.77  0.71(6)  0.71(10)  0.71  0.79(14)  0.78(14)  0.78  0.94(20)  0.94(19)  0.94  0.93(19)  0.9(20)  0.91  0.84(18)  0.97(19)  0.9  0.84(18)  0.84(18)  0.84  0.8(20)  0.8(20)  0.8 
EN  0.84(20)  0.84(13)  0.84  0.77(20)  0.77(20)  0.77  0.71(6)  0.71(10)  0.71  0.79(14)  0.78(14)  0.78  0.94(20)  0.94(19)  0.94  0.91(19)  0.9(20)  0.9  0.84(18)  0.97(19)  0.9  0.84(18)  0.84(18)  0.84  0.8(20)  0.8(20)  0.8 
SMBA  0.9(13)  0.89(16)  0.89  0.83(20)  0.83(20)  0.83  0.7(11)  0.7(11)  0.7  0.68(15)  0.68(15)  0.68  0.97(18)  0.97(18)  0.97  0.91(19)  0.9(19)  0.9  0.92(19)  0.99(17)  0.92  0.9(19)  0.86(20)  0.88  0.84(20)  0.84(20)  0.84 
SMBACSFS  0.83(16)  0.83(16)  0.83  0.86(20)  0.86(20)  0.86  0.67(20)  0.68(20)  0.67  0.8(20)  0.77(20)  0.78  0.98(15)  0.98(15)  0.98  0.99(19)  0.99(19)  0.99  1.0(20)  1.0(20)  1.0  0.99(20)  0.98(20)  0.98  0.97(20)  0.97(20)  0.97 
BSL  1  1  1  1  1  1  0.74  0.74  0.74  0.92  0.92  0.92  0.93  0.93  0.93  0.8  0.8  0.8  1  1  1  0.98  0.98  0.98  1  1  1 
Table 4. Average Accuracy of top 20 features (%)
Method  ALLAML  LEUKEMIA  CLL_SUB_111  GLIOMA  LUNG_C  LUNG_D  DLBCL  CARCINOM  GCM
Fisher  95.90 ± 0.03(13)  98.57 ± 0.03(18)  80.41 ± 0.02(7)  82 ± 0.16(17)  95.09 ± 0.03(20)  86.38 ± 0.14(16)  100 ± 0.00(14)  90.86 ± 0.08(20)  98.98 ± 0.0(18) 
Relief  92.95 ± 0.04(5)  95.81 ± 0.03(10)  82.41 ± 0.05(12)  80 ± 0.19(12)  91.63 ± 0.02(20)  86.39 ± 0.07(20)  100 ± 0.00(11)  89.68 ± 0.03(17)  98.71 ± 0.0(20) 
mRMR  75.14 ± 0.09(16)  98.57 ± 0.03(11)  70.69 ± 0.07(12)  62 ± 0.12(14)  89.16 ± 0.03(20)  86.48 ± 0.09(17)  99.52 ± 0.01(15)  81.61 ± 0.07(20)  98.71 ± 0.0(20) 
MI  94.38 ± 0.03(18)  97.14 ± 0.03(4)  81.03 ± 0.05(20)  82 ± 0.21(19)  95.07 ± 0.015(11)  79.90 ± 0.18(14)  100 ± 0.00(19)  90.86 ± 0.06(11)  98.67 ± 0.0(19) 
ls_l21  76.47 ± 0.13(6)  65.52 ± 0.08(3)  63.44 ± 0.03(20)  46 ± 0.21(7)  73.88 ± 0.04(19)  75.43 ± 0.07(18)  93.46 ± 0.03(20)  39.68 ± 0.04(19)  97.59 ± 0.0(19) 
ll_l21  82.1 ± 0.05(16)  80.67 ± 0.09(15)  74.58 ± 0.07(20)  68 ± 0.13(18)  91.15 ± 0.02(15)  67.24 ± 0.12(15)  96.38 ± 0.02(17)  72.40 ± 0.05(17)  96.87 ± 0.0(20) 
RFS  79.24 ± 0.168(17)  74.95 ± 0.09(6)  71.94 ± 0.10(19)  68 ± 0.21(13)  82.79 ± 0.05(17)  68.67 ± 0.07(18)  96.62 ± 0.01(20)  58.03 ± 0.18(20)  96.97 ± 0.01(20) 
LASSO  95.73 ± 0.02(6)  70.3 ± 0.08(15)  71.29 ± 0.05(18)  81.67 ± 0.08(19)  96.26 ± 0.00(18)  93.22 ± 0.021(20)  100 ± 0.00(10)  87.88 ± 0.03(18)  96.09 ± 0.0(20) 
EN  95.73 ± 0.04(10)  70.3 ± 0.08(15)  68.73 ± 0.10(19)  81.67 ± 0.08(19)  95.97 ± 0.012(18)  93.22 ± 0.021(20)  100 ± 0.00(10)  88.56 ± 0.03(19)  96.09 ± 0.0(20) 
SMBACSFS  88.24 ± 0.04(20)  81.93 ± 0.02(20)  75.53 ± 0.06(20)  73.34 ± 0.18(16)  98.41 ± 0.014(19)  97.93 ± 0.03(19)  98.30 ± 0.02(13)  94.95 ± 0.02(19)  99.2 ± 0.01(20) 
BSL  97.89 ± 0.04  98.95 ± 0.021  84.26 ± 0.06  85 ± 0.1  99.57 ± 0.00  98.62 ± 0.02  100 ± 0.00  98.65 ± 0.01  100 ± 0.00 
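Table 5. Precision (P), Recall (R) and F-measure (F)
Method  ALLAML P  ALLAML R  ALLAML F  LEUKEMIA P  LEUKEMIA R  LEUKEMIA F  CLL_SUB_111 P  CLL_SUB_111 R  CLL_SUB_111 F  GLIOMA P  GLIOMA R  GLIOMA F  LUNG_C P  LUNG_C R  LUNG_C F  LUNG_D P  LUNG_D R  LUNG_D F  DLBCL P  DLBCL R  DLBCL F  CARCINOM P  CARCINOM R  CARCINOM F  GCM(14) P  GCM(14) R  GCM(14) F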
Fisher  0.96(15)  0.96(14)  0.96  0.97(2)  0.97(2)  0.97  0.84(4)  0.84(4)  0.84  0.76(8)  0.75(8)  0.75  0.96(18)  0.96(18)  0.96  0.97(16)  0.97(16)  0.97  1.0(17)  1.0(17)  1.0  0.95(13)  0.95(13)  0.95  0.93(18)  0.93(18)  0.93 
Relief  0.98(16)  0.98(16)  0.98  0.97(8)  0.97(8)  0.97  0.82(5)  0.82(5)  0.82  0.72(19)  0.7(15)  0.71  0.95(19)  0.95(19)  0.95  0.96(9)  0.95(9)  0.95  1.0(10)  1.0(10)  1.0  0.96(17)  0.96(17)  0.96  0.91(20)  0.91(20)  0.91 
mRMR  0.69(8)  0.69(8)  0.69  0.97(13)  0.97(4)  0.97  0.84(15)  0.84(15)  0.84  0.77(20)  0.77(20)  0.77  0.97(18)  0.97(18)  0.97  0.97(17)  0.97(17)  0.97  1.0(11)  1.0(11)  1.0  0.97(15)  0.97(15)  0.97  0.91(20)  0.91(20)  0.91 
MI  0.99(17)  0.99(17)  0.99  0.98(2)  0.98(17)  0.98  0.8(13)  0.8(13)  0.8  0.75(3)  0.75(3)  0.75  0.94(18)  0.94(18)  0.94  0.97(11)  0.97(11)  0.97  1.0(12)  1.0(12)  1.0  0.97(17)  0.97(16)  0.97  0.91(19)  0.91(19)  0.91 
ls_l21  0.82(18)  0.78(18)  0.8  0.92(17)  0.91(17)  0.91  0.7(14)  0.69(14)  0.69  0.67(20)  0.67(20)  0.67  0.96(20)  0.96(20)  0.96  0.9(16)  0.9(16)  0.9  0.91(19)  0.91(19)  0.91  0.77(18)  0.77(18)  0.77  0.83(19)  0.83(19)  0.83 
ll_l21  0.91(19)  0.9(19)  0.9  0.87(14)  0.86(14)  0.86  0.76(20)  0.76(20)  0.76  0.73(19)  0.73(19)  0.73  0.96(16)  0.96(16)  0.96  0.91(18)  0.9(18)  0.9  0.97(17)  0.97(17)  0.97  0.85(20)  0.85(20)  0.85  0.78(20)  0.78(20)  0.78 
RFS  0.87(14)  0.85(14)  0.86  0.96(19)  0.96(19)  0.96  0.68(12)  0.69(12)  0.68  0.69(20)  0.67(20)  0.68  0.95(20)  0.95(20)  0.95  0.93(19)  0.91(19)  0.92  0.94(20)  0.93(20)  0.93  0.85(19)  0.85(19)  0.85  0.79(20)  0.79(20)  0.79 
LASSO  0.87(16)  0.87(16)  0.87  0.72(16)  0.71(16)  0.71  0.78(18)  0.78(18)  0.78  0.8(18)  0.78(18)  0.79  0.94(17)  0.94(17)  0.94  0.89(20)  0.88(20)  0.88  0.97(19)  0.97(19)  0.97  0.84(20)  0.85(20)  0.84  0.73(20)  0.73(20)  0.73 
EN  0.87(16)  0.87(16)  0.87  0.72(16)  0.71(16)  0.71  0.78(18)  0.78(18)  0.78  0.8(18)  0.78(18)  0.79  0.94(17)  0.94(17)  0.94  0.89(20)  0.88(20)  0.88  0.97(19)  0.97(19)  0.97  0.84(20)  0.85(20)  0.84  0.73(20)  0.73(20)  0.73 
SMBACSFS  0.83(16)  0.83(16)  0.83  0.86(20)  0.86(20)  0.86  0.67(20)  0.68(20)  0.67  0.8(20)  0.77(20)  0.78  0.98(15)  0.98(15)  0.98  0.99(19)  0.99(19)  0.99  1.0(20)  1.0(20)  1.0  0.99(20)  0.98(20)  0.98  0.97(20)  0.97(20)  0.97 
BSL  1  1  1  1  1  1  0.74  0.74  0.74  0.92  0.92  0.92  0.93  0.93  0.93  0.8  0.8  0.8  1  1  1  0.98  0.98  0.98  1  1  1 
We compared the performance of our method against TFS methods (see Tables 2–3) and the GFCSFS framework (see Tables 4–5). Looking at accuracy, precision, recall and F-measure, SMBACSFS discriminates better among the classes of the LUNG_C, LUNG_D, CARCINOM, DLBCL and GCM data sets in most cases, when both the top 20 and the top 80 features are considered. In the cases where SMBACSFS performs worse than its competitors, the corresponding performance tends to be comparable. On the remaining data sets, each with fewer than five classes, namely ALLAML, LEUKEMIA, CLL_SUB_111 and GLIOMA, SMBACSFS is instead outperformed by some of the competitors. Consequently, we can assert that SMBACSFS behaves better when working with data sets with many classes (at least five). One possible reason lies in the sparse-modeling approach used to select the features, combined with the use of an ensemble classifier. Indeed, since the ensemble is based on a majority voting scheme, SMBACSFS is able to guess, with higher probability, the class membership of samples coming from data sets with many classes. Consider that, whenever our method classifies a sample from a two-class data set, the probability of a right guess is comparable to that of a coin toss. Therefore, while this leads to good performance when the data set consists of many classes, the probability of failure increases for data sets consisting of fewer classes. In any case, the local structure of the data distribution, which is crucial for feature selection as stated in He, Cai & Niyogi (2005), may be a plausible reason why the SMBA scheme performs better on certain data sets than on others. In addition, as shown in Fig. 2, it is worth observing that SMBACSFS seems to perform better than its TFS competitors with fewer features. This would suggest that SMBACSFS is able to identify/retrieve the most representative features that maximize the classification accuracy.
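The majority voting step mentioned above can be sketched as follows. This is only an illustrative Python example: the member classifiers, their class-specific feature subsets and the tie-breaking rule are hypothetical stand-ins, not the decision rule of the released code.

```python
from collections import Counter


def ensemble_predict(members, sample):
    """Majority vote over classifiers trained on class-specific features.

    `members` is a list of (feature_indices, predict_fn) pairs: each member
    sees only its own feature subset and returns a class label.
    """
    votes = []
    for feature_indices, predict_fn in members:
        projected = [sample[i] for i in feature_indices]
        votes.append(predict_fn(projected))
    counts = Counter(votes)
    best = max(counts.values())
    # Assumed tie-break: the earliest vote among the most voted labels wins.
    for label in votes:
        if counts[label] == best:
            return label


# Hypothetical members: simple threshold rules on different feature subsets.
members = [
    ([0], lambda x: "A" if x[0] > 0.5 else "B"),
    ([1], lambda x: "B" if x[0] > 0.5 else "C"),
    ([0, 2], lambda x: "C" if sum(x) > 1.0 else "A"),
]

first = ensemble_predict(members, [0.9, 0.2, 0.3])   # votes: A, C, C
second = ensemble_predict(members, [0.9, 0.8, 0.0])  # votes: A, B, A
```

With only a few members, a single disagreeing vote can flip the outcome, which is one way to read the sensitivity to the number of classes discussed above.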
To corroborate the previous results, we computed the average ROC curves of SMBACSFS and the other TFS methods on subsets of 20 and 80 features, respectively. The AUC values in Fig. 3 suggest SMBACSFS as the best model for identifying the most representative features in a classification task when dealing with data sets with many classes. Concerning the GFCSFS competitors, Fig. 4 suggests that the sparse modeling process underlying the proposed SMBA scheme for feature selection is more suitable for retrieving the best features for classification, often leading to satisfactory results. This statement is also supported by the good balance between precision and recall shown in Table 5 and by the average ROC curves shown in Fig. 5, where SMBACSFS remains competitive with the GFCSFS methods. The reader is referred to the Supplemental Material for all the experimental results and considerations concerning the top 80 features.
To statistically validate the results and compare all the competing classifiers against the proposed SMBACSFS, on both the 20 and 80 feature subsets, we ran non-parametric multiple comparison tests (all vs. all) (Demšar, 2006; Rodríguez-Fdez et al., 2015), which sequentially perform the popular Friedman non-parametric test (Friedman, 1937) followed by a Nemenyi post-hoc multiple comparison (Dunn, 1961). The ranking of the classifiers, when the top 20 and 80 features are selected, along with the corresponding p-values, is described in the Supplemental Material. Looking at the Cumulative Rank (CR) of each classifier, one can notice that SMBACSFS achieves very good results, always finishing in the first three positions. Moreover, our method systematically ranks in the top position when considering data sets consisting of five or more classes (named CR_{≥5}). These results confirm that SMBACSFS achieves good performance on data sets with many classes. Furthermore, using different classifiers does not produce noteworthy differences in the results, meaning that the methodology is suitable for the classification of this kind of data, independently of the selected classifier. By looking at the p-values corresponding to the single ranking method, one can better verify which algorithms have significantly different performance w.r.t. SMBACSFS. For detailed information regarding the results, see the Supplemental Material. Concerning the computational complexity, from several experiments we observed that the proposed methodology may be slower than other techniques (e.g., FS and Relief, whose running times are on the order of a few seconds) but is comparable with SMBA. Its running time depends on several parameters, especially the number of instances and classes of the data sets, and may vary from a couple of hours to at most one day (see Table S9 for details on the computational time).
Nevertheless, SMBACSFS achieves appreciable performance on large data sets with many classes and, in the biological field, accuracy in finding the key features responsible for some biological processes is sometimes preferred to execution time. Moreover, since most of the time consumed by the proposed approach is due to solving the optimization problem with the ADMM method, and because the methodology is based on an ensemble of classifiers, a parallel computing approach could be adopted to reduce the computational time (Deng et al., 2017).
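The Friedman ranking used for the statistical validation above can be sketched in plain Python. The snippet assigns average ranks per data set, then computes the Friedman chi-square statistic and the Nemenyi critical difference as formulated in Demšar (2006). The accuracy matrix is made up for illustration, the function names are assumptions, and the critical value 2.343 (three methods, α = 0.05) is the one tabulated in Demšar (2006); no tie correction or p-value computation is included.

```python
import math


def average_ranks_row(scores):
    """Rank the methods on one data set (1 = best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(score_matrix):
    """Return (mean ranks per method, Friedman chi-square statistic).

    Rows of `score_matrix` are data sets, columns are methods.
    """
    n, k = len(score_matrix), len(score_matrix[0])
    rank_rows = [average_ranks_row(row) for row in score_matrix]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


def nemenyi_critical_difference(k, n, q_alpha):
    """Two methods differ significantly if their mean ranks differ by more."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical accuracies: 4 data sets (rows) x 3 methods (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.70, 0.75, 0.60],
    [0.95, 0.90, 0.85],
    [0.80, 0.70, 0.75],
]
mean_ranks, chi2 = friedman_statistic(scores)  # mean ranks: [1.25, 2.0, 2.75]
cd = nemenyi_critical_difference(k=3, n=4, q_alpha=2.343)
```

In this toy case the largest gap between mean ranks (1.5) stays below the critical difference (about 1.66), so no pair of methods would be declared significantly different.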
Conclusions
We proposed a Sparse-Modeling Based Approach for Feature Selection that emphasizes joint ℓ_{1,2}-norm minimization and Class-Specific Feature Selection. Experimental results on nine different data sets validate the unique aspects of SMBACSFS and demonstrate the promising performance achieved against state-of-the-art methods. One of the main characteristics of our framework is that, by jointly exploiting the ideas of Sparse Modeling and Class-Specific Feature Selection, it is able to identify/retrieve the most representative features that maximize the classification accuracy when a given data set is made up of many classes. Based on our experimental results, we can conclude that applying TFS usually achieves better results than using all the available features. However, in many cases, applying the proposed SMBACSFS method improves on the performance of both plain TFS and GFCSFS injected with several TFS methods. It has to be stressed that SMBACSFS seems suitable for large data sets consisting of many classes, while on data sets with fewer than five classes other methods appear to be more effective. Although the performance of SMBA, SMBACSFS and TFS differs only slightly on the whole, it is worth highlighting that SMBACSFS achieves its best performance when considering fewer features (i.e., from 1 to 20) on data sets with many classes, which is an important goal when certain biological tasks are taken into account. We believe that these techniques might be effectively used in a systematic way after a microarray analysis. Indeed, a better gene selection step could avoid wasting many resources in post-array wet analysis (e.g., Real-Time PCR), allowing researchers to focus their attention only on relevant features. Finally, we think this method has proved to be an interesting alternative among FS approaches on microarray data.
As future work, the focus will move towards the biological interpretation of the behavior of the SMBA framework, by systematically studying the selected genes, especially taking into account the SMBACSFS approach which, as shown by the experimental results, is more effective in selecting genes of interest than the standard SMBA. Furthermore, we are planning to test our approach on the EPIC data set (Demetriou et al., 2013), after a thorough analysis of pre-filtering, and to develop a parallel implementation to substantially reduce its computational time.
Supplemental Information
Friedman statistical test and execution time on nine data sets, when the SVM classifier is used
SVM. Accuracy, Precision, Recall, F1-measure and Friedman statistical test result tables, comparing the state-of-the-art methods against our proposed methods, SMBA and SMBACSFS
The model evaluation metrics were computed taking into account the top 20 and 80 selected features with 5-fold CV.
Naive Bayes. Classification model evaluation metrics on 5-fold CV
Accuracy plots and ROC curves comparing the state-of-the-art methods and the proposed methods, SMBA and SMBACSFS, on nine data sets.
Logistic Regression. Classification model evaluation metrics on 5-fold CV
Accuracy plots and ROC curves comparing the state-of-the-art methods and the proposed methods, SMBA and SMBACSFS, on nine data sets.
KNN. Classification model evaluation metrics on 5-fold CV
Accuracy plots and ROC curves comparing the state-of-the-art methods and the proposed methods, SMBA and SMBACSFS, on nine data sets.
Decision Tree. Classification model evaluation metrics on 5-fold CV
Accuracy plots and ROC curves comparing the state-of-the-art methods and the proposed methods, SMBA and SMBACSFS, on nine data sets.
SVM. Accuracy, Precision, Recall, F1-measure and Friedman statistical test result tables, comparing the state-of-the-art methods against our proposed methods, SMBA and SMBACSFS
The model evaluation metrics were computed taking into account the top 20 and 80 selected features with 5-fold CV.
Naive Bayes. Accuracy, Precision, Recall, F1-measure and Friedman statistical test result tables, comparing the state-of-the-art methods against our proposed methods, SMBA and SMBACSFS
The model evaluation metrics were computed taking into account the top 20 and 80 selected features with 5-fold CV.
Logistic Regression. Accuracy, Precision, Recall, F1-measure and Friedman statistical test result tables, comparing the state-of-the-art methods against our proposed methods, SMBA and SMBACSFS
The model evaluation metrics were computed taking into account the top 20 and 80 selected features with 5-fold CV.
KNN. Accuracy, Precision, Recall, F1-measure and Friedman statistical test result tables, comparing the state-of-the-art methods against our proposed methods, SMBA and SMBACSFS
The model evaluation metrics were computed taking into account the top 20 and 80 selected features with 5-fold CV.
Decision Tree. Accuracy, Precision, Recall, F1-measure and Friedman statistical test result tables, comparing the state-of-the-art methods against our proposed methods, SMBA and SMBACSFS
The model evaluation metrics were computed taking into account the top 20 and 80 selected features with 5-fold CV.