A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data

Cancer classification is a topic of major interest in medicine, since it enables accurate and efficient diagnosis and facilitates successful treatment outcomes. Previous studies have classified human tumors using large-scale RNA profiling and supervised Machine Learning (ML) algorithms to construct a molecular-based classification of carcinoma cells from breast, bladder, adenocarcinoma, colorectal, gastroesophageal, kidney, liver, lung, ovarian, pancreatic, and prostate tumors. These datasets are collectively known as the 11_tumor database. Although this database has been used in several ML studies, no comparative study of different algorithms can be found in the literature. Meanwhile, advances in both hardware and software have fostered considerable improvements in the precision of ML-based solutions, such as Deep Learning (DL). In this study, we compare the most widely used classical ML and DL algorithms for classifying the tumors described in the 11_tumor database. We obtained tumor identification accuracies between 90.6% (Logistic Regression) and 94.43% (Convolutional Neural Networks) using k-fold cross-validation. We also show how a tuning process may or may not significantly improve an algorithm's accuracy. Our results demonstrate an efficient and accurate classification approach based on gene expression (microarray data) and ML/DL algorithms, facilitating tumor type prediction in a multi-cancer-type scenario.

to learn how to classify by cancer type. The cancer types and the number of patients for each type are shown in Table 1. The classes of each cancer type are unbalanced and remained so in the experimentation.

Preparing the data
For the experiments, we divided the information into two groups: the first group corresponds to the features (X) and the second to the classes (Y). The features form a matrix of size m x n and the classes a vector of size m x 1, where m is the number of samples and n is the number of genes per sample (12,533). The dataset, containing 174 samples, was randomly subdivided into two subsets (80% training and 20% validation), comprising 139 samples for training and 35 samples for validation. Initial calibration (training) of the ML and DL algorithms was done using the training set; hyperparameter tuning was then performed with the validation set, on which the accuracy of the algorithms was measured. We calculated the accuracy of each algorithm using the tuned hyperparameters with k-fold cross-validation (k = 10) to avoid overfitting.

The dataset used in this paper suffers from the curse of dimensionality, since the number of features (12,533) is much higher than the number of samples (174) (Powell, 2007). As a consequence, the data are dispersed and the results are not statistically stable or reliable, directly affecting the accuracy achieved by the ML and DL algorithms. Two preprocessing techniques were used to address this problem: scaling (Géron, 2017) and principal component analysis (PCA) (Wold, Esbensen & Geladi, 1987). The first technique guarantees that the data lie in a range of values suitable for calibrating the model. The second improves statistical significance and reduces the noise introduced by irrelevant features during model training.
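The splitting and validation scheme described above can be sketched with scikit-learn. This is a minimal sketch using synthetic data in place of the 174-sample microarray matrix; the feature count is reduced to keep it fast, and the labels are roughly balanced here for illustration only (the real classes are unbalanced, as noted above).

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 11_tumor matrix: 174 samples x 12,533 genes
# (a smaller feature count keeps the sketch fast; the 11 labels stand in
# for the cancer types).
rng = np.random.default_rng(0)
X = rng.normal(size=(174, 200))
y = np.arange(174) % 11

# 80% training / 20% validation hold-out split (139 / 35 samples).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 10-fold cross-validation of the model to limit overfitting bias.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=10)
print(len(X_train), len(X_val))  # 139 35
print(scores.mean())             # mean cross-validated accuracy
```

Note that `stratify=y` is an assumption made here so that every class appears in both subsets; the text does not state whether the original split was stratified.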
In this paper, we worked with several combinations of the preprocessing techniques mentioned above to find the best performance.

Four different datasets were created for the training and validation of each ML or DL algorithm. For the first dataset, we did not apply any preprocessing; for the second, we applied scaling; for the third, we applied PCA with a retained variance of 96% to reduce data dimensionality, obtaining a reduction from 12,533 to 83 features. Finally, for the last dataset, we applied both scaling and PCA, obtaining a reduction from 12,533 to 113 features (principal components).

Classification performance is highly correlated with the degree of separability of a dataset; therefore, we analyzed performance using clustering techniques. Based on the data labels, we can gain a priori insight into which algorithm works best on the distribution of the gene expression microarray dataset.

Two types of networks were used for deep learning: a fully connected neural network (FNN) and a convolutional neural network (CNN). The FNN consists of three fully connected layers of 100 neurons each with the Softsign activation function, followed by a final layer of 11 neurons with the sigmoid activation function to produce the probability of each cancer type. The CNN consists of three convolutional layers with 128 filters each, a kernel size of 3, and a linear activation function, followed by a layer of 100 fully connected neurons with the Softsign activation function and, finally, a layer of 11 neurons with the Softmax activation function to produce the probability of each cancer type. Figure 1 shows the architectures used in the experiment: the top scheme is the FNN and the bottom scheme is the CNN. The parameter ranges explored for each algorithm are listed in Table 2.
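The construction of the four input variants can be sketched as follows. This is a minimal sketch on synthetic data; on the real 12,533-gene matrix, PCA at 96% retained variance yielded 83 components for raw data and 113 for scaled data, as reported above, whereas the component counts below depend on the random data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(174, 200))  # stand-in for the expression matrix

# Dataset 1: raw data (no preprocessing).
# Dataset 2: scaled data (zero mean, unit variance per gene).
X_scaled = StandardScaler().fit_transform(X_raw)

# Dataset 3: PCA keeping 96% of the variance, applied to raw data.
X_pca = PCA(n_components=0.96).fit_transform(X_raw)

# Dataset 4: scaling followed by PCA with 96% retained variance.
X_scaled_pca = PCA(n_components=0.96).fit_transform(X_scaled)

print(X_pca.shape[1], X_scaled_pca.shape[1])  # principal components kept
```

In practice the scaler and PCA should be fitted on the training split only and then applied to the validation split, to avoid information leakage between the subsets.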
With these results, we plotted the accuracy values for all datasets created in the training and validation processes and also created confusion matrices. Finally, we cross-validated each algorithm to find the accuracy least affected by bias. Additionally, for the FNNs and CNNs, we performed a hyperparameter search with a grid-search method (GridSearchCV, from the sklearn module), considering the variables shown in Table 3. Due to the high number of parameters, tuning the FNNs and CNNs involved choosing the parameter values that achieved the best accuracy and then using those values to find the others. The best parameter values were found in the following order: 1) batch size and epochs, 2) training optimization algorithm, 3) learning rate and momentum, 4) network weight initialization, 5) neuron activation function, 6) dropout regularization, and 7) number of neurons in the hidden layers.

We performed a test for difference in proportions to determine whether the difference between the accuracies of the algorithms is significant. We calculated the differences between the observed and expected accuracies under the assumption of a normal distribution. Given the number of correct test predictions, c, and the number of test instances, n, accuracy is defined as follows:

accuracy = c / n

This test allowed us to determine whether the accuracies of an algorithm change significantly after the tuning process, and whether there are significant differences between the two algorithms with the highest average accuracies. Based on this, we evaluated whether parameter tuning of the algorithms was necessary or whether the choice of ML algorithm was more relevant.

Before evaluating the classification algorithms, we visualized the intrinsic groupings in the data and determined how these groups are influenced by the different preprocessing methodologies applied to our data (Figure 2).
Using the downloaded raw data, we created a hierarchical graph (unsupervised learning) using different methodologies (Fig. S1) and concluded that Ward's method produced the most balanced clusters (Figure 3). Then, using only Ward's method, we performed additional analyses on the different datasets: raw data, scaled data, data transformed by PCA, and data scaled and transformed by PCA. Finally, we created a dendrogram and a heat map to determine whether the data can be clustered into groups, without any given class, with the best results. Figure 4 shows four well-separated groups, but the heat map revealed other well-conserved groups, which may indicate that the four main clusters could be divided into subgroups.
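Hierarchical clustering with Ward's linkage can be sketched with SciPy. This is a minimal sketch on synthetic data standing in for the expression matrix; the class labels are ignored, since the clustering here is unsupervised, and the cut into four flat clusters mirrors the four well-separated groups reported above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic stand-in for the microarray matrix (174 samples).
X = rng.normal(size=(174, 50))

# Ward's method merges, at each step, the pair of clusters that
# minimizes the increase in total within-cluster variance.
Z = linkage(X, method="ward")

# Cut the dendrogram into four flat clusters.
labels = fcluster(Z, t=4, criterion="maxclust")
print(sorted(set(labels)))  # [1, 2, 3, 4]
```

The linkage matrix `Z` is also what `scipy.cluster.hierarchy.dendrogram` consumes to draw the dendrograms shown in the figures.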
Ward's method created four groups, while the other methods clustered the individuals into fewer groups and, in most cases, these groups were largely unbalanced. On the other hand, the raw data and the data transformed by PCA performed better in the hierarchical clustering analysis: employing these datasets, we obtained four and five clusters, respectively. Finally, the heat maps plotted in Figure 4 showed one group greatly distant from the others (green in Figure 4A and light blue in Figure 4B), while the remaining clusters showed low intra-cluster distances, an ideal feature in classification problems (light blue in Figure 4A and green in Figure 4B).
Based on the a priori knowledge that the number of cancer types is eleven (11), we were interested in determining how the hierarchical clustering algorithm created the cluster assignments. Therefore, we applied the best parameters found previously (clustering method: Ward; input: raw data and data reduced by PCA). The results shown in Figure 5 and Tables 4 and 5 demonstrate that, although the hierarchical clustering algorithm displays good performance, it does not group the data into the correct number of groups.

Another unsupervised learning assessment involved the K-means algorithm. We used all datasets and changed the number of clusters iteratively from one to eleven, increasing by one cluster at a time. We then calculated the accuracy of each iteration and plotted a confusion matrix for the best results (Figure 6). Additionally, we calculated other metrics, such as precision, recall, and F1-score for each class. Overall, the best results were obtained by K-means with 11 clusters and input data processed by PCA, achieving an accuracy of 68.34% (validation set, using the hold-out splitting method). Moreover, classes 6, 7, and 9 showed precisions of 100%, and class 5 of 91% (Table 5).

The algorithms were tuned by varying several parameters within given value ranges (Table 2) to find the best behavior using all datasets. Through this, we aimed to find the best hyperparameters for each algorithm and determine which dataset is the most appropriate. The highest validation accuracies are shown in Table 6. To evaluate overfitting or underfitting, we plotted the accuracy values of the training and validation processes on all datasets described above (Figure 7). RF and DT were not plotted since more than one hyperparameter was tuned. The best results were obtained using LR and raw data.
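The tuning procedure over a range of parameter values can be sketched with scikit-learn's GridSearchCV, which the paper uses. This is a minimal sketch on synthetic data with a hypothetical parameter grid for Logistic Regression; the actual grids are the ranges listed in Table 2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(139, 50))  # stand-in for the training split
y = np.arange(139) % 11         # 11 cancer-type labels (illustrative)

# Hypothetical parameter grid; the real ranges are those in Table 2.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=10,                # 10-fold cross-validation, as in the text
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For the FNN/CNN case, the same search object is applied stepwise to one parameter group at a time (batch size and epochs first, then the optimizer, and so on), fixing each winner before tuning the next group.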
We also calculated a confusion matrix for these results, finding very good classification rates (Figure 8); the corresponding metrics are summarized in Table 7. The grid-search method yielded the hyperparameter values that provided the best accuracy for the FNN and CNN architectures (Table 8). Finally, we performed a test of significant differences, with a 95% confidence level, between the two best-performing algorithms (LR and CNN). Accordingly, we found no significant differences between the accuracies of these two algorithms (p-value = 0.447).
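The comparison of two accuracies can be sketched with a standard pooled two-proportion z-test, which matches the normal-approximation test for difference in proportions described earlier; the counts below are hypothetical, and the exact variant used by the authors may differ.

```python
import math

def proportion_diff_test(c1, n1, c2, n2):
    """Two-sided z-test for the difference between two accuracies
    (proportions of correct predictions), assuming the sampling
    distribution is approximately normal."""
    p1, p2 = c1 / n1, c2 / n2
    p_pool = (c1 + c2) / (n1 + n2)  # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Hypothetical counts: two classifiers scored on 35 validation samples.
z, p = proportion_diff_test(33, 35, 31, 35)
print(z, p)
```

A p-value above 0.05 (as with the 0.447 reported above for LR vs. CNN) means the observed accuracy gap is compatible with chance at the 95% confidence level.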

Discussion
In this work, we show the application of unsupervised and supervised learning approaches of ML and DL for the classification of 11 cancer types based on a microarray dataset. We observed that the best average results on the training and validation data were obtained using the raw dataset and the Logistic Regression (LR) algorithm, yielding an accuracy of 100% (validation set, using the hold-out splitting method). One could suspect overfitting, since the confusion matrix showed extremely good behavior; however, the comparison of training and validation accuracies across parameters using the entire dataset may indicate genuinely perfect accuracy in both the training and validation sets. Additional tests with independent data should be done to rule out potential overfitting.

Figure 4 caption: Hierarchical maps using Ward as the clustering method and A) raw data, B) scaled data, C) data reduced by PCA, and D) data scaled and reduced by PCA.
Due to the large number of features in the dataset, it is recommended to transform the data so that only the most relevant and informative variables are used; this is known as the preprocessing step.

Figure 3 caption: Hierarchical map using Ward's method as the criterion for choosing the pair of clusters to merge at each step. This hierarchical map was generated from untransformed data with the class labels removed. Clustering approaches demonstrate whether the data contain relevant patterns for grouping.

Table 2 (header): Algorithm | Parameter | Range
Table 3 (header): Step | Description