Windows malware detection based on static analysis with multiple features

PeerJ Computer Science

Introduction

Malicious software, commonly called malware, can be classified into viruses, worms, Trojans, spyware, ransomware, logic bombs, etc. based on its behavior and characteristics (Gibert, Mateu & Planes, 2020). Computer malware poses a major threat to computer and network security. For this reason, research on developing new systems to detect malware is a hot topic in data mining, machine learning, and deep learning. Our work aims to develop a static malware detection system that detects Portable Executable (PE) malware using multiple features. We not only extract multiple features from PE malware but also combine them to create integrated features in a bid to improve the accuracy of our malware detection system. To the best of our knowledge, the multiple and integrated features used in this work have never been considered together for malware detection.

Most commercial anti-virus software relies on signature-based detection of malware; however, this approach is not effective against unknown malware or zero-day attacks. In the last decade or so, research on malware detection has focused on finding generalized and scalable features to identify previously unknown malware and counter zero-day attacks effectively (Guo, 2023). There are two basic types of malware analysis: static analysis and dynamic analysis (Damaševičius et al., 2021). In static analysis, features are extracted from the code and structure of a program without actually running it, whereas in dynamic analysis features are gathered after running the program in a virtual environment.

During the last decade, machine learning has solved many problems in different sectors, including cyber security. It is now believed that AI-powered anti-virus tools can help in detecting zero-day attacks (Alhaidari et al., 2022). A typical machine learning workflow for detecting malware involves data collection, data cleaning and pre-processing, building and training models, validation, and deployment into production. In this regard, the success of supervised machine learning models depends on two factors: (1) the amount of labeled data used to train the model and (2) the features extracted from the malware. There have been numerous studies (Sharma, Rama Krishna & Sahay, 2019; Chowdhury, Rahman & Islam, 2017; Kim et al., 2021; Patidar & Khandelwal, 2019; Zhang, Kuo & Yang, 2019) on static malware analysis using machine learning, but most of them train their models on only one or two types of features and have their own limitations.

In this work, we propose a malware detection system for detecting Portable Executable (PE) malware based on static analysis with multiple features. We extract four types of feature sets and also merge them to create two additional feature sets. The research contributions made by the authors are listed below.

  1. Collection of the latest samples to create a new dataset of PE malware and benign files.

  2. We extract four feature sets: the list of imported DLLs, the API functions called by these samples, the values of 52 attributes of the PE Header, and 100 attributes of the PE Sections.

  3. We merge extracted features for creating new integrated features of PE samples.

  4. Comprehensive analysis and evaluation of different machine learning classifiers, ensemble learning and feature selection techniques to maximize the malware detection rate.

The remainder of this article is organized as follows. We discuss some previous studies on static malware detection in the Literature Review section. The Research Method section describes our main approach to detecting malware in the Windows environment and explains the process of data collection, feature extraction, and feature selection. We present our empirical results in the Experimental Results section along with the evaluation criteria, details of the different experiments, and our findings. We conclude our work in the Conclusion section.

Literature Review

In this section, we cover previous work on detecting malware in the Windows environment using machine learning methods. Mainly, we cover related works in this field that deal with malware detection using static analysis of Windows Portable Executables (PEs).

Several studies have applied machine learning for malware classification and detection. The authors (Catak, Yazi & Elezaj, 2020) proposed a long short-term memory (LSTM) method for classifying sequential data of Windows exe API calls. They also contributed to the ongoing research on malware detection by developing a new dataset that contains API calls made on the Windows operating system to represent the behavior of malware. They achieved an accuracy of up to 95%. Sharma, Rama Krishna & Sahay (2019) proposed a system based on the frequency of opcode occurrence for detecting malware. The authors used Fisher score, information gain, gain ratio, Chi-square and symmetric uncertainty for selecting the top-20 features. They found that five machine learning methods, namely random forest, LMT, NBT, J48 Graft and REPTree, detect the malware with almost 100% accuracy. Naval et al. (2015) focus on proposing an evasion-proof solution that is not vulnerable to system-call injection attacks. They proposed an approach that characterizes program semantics using the asymptotic equipartition property to extract information-rich call sequences. These call sequences are further quantified to detect malicious binaries. The results showed that the solution is effective in identifying real malware instances with 95.4% accuracy. Tang & Qian (2019) detected malicious code based on the API call sequence. They converted the API call sequence into a characteristic image that can represent the behavior of the malicious code. A convolutional neural network was used to classify the malicious code into nine families and achieved a true positive rate of 99%. The authors (Raff et al., 2018) introduced malware detection from raw byte sequences of the entire executable file using neural networks. In this initial work, they discussed many interesting challenges faced in building a neural network for processing raw byte sequences. Fuyong & Tiezhu (2017) proposed a new malware detection and classification method based on n-gram attribute similarity. They extract all n-grams of byte codes from training samples and select the most relevant as attributes. After calculating the average value of the attributes in malware and benign samples separately, they determine whether a test sample is malware or benign by attribute similarity. The results of this study show that the proposed system outperforms traditional machine learning methods. Wojnowicz et al. (2016) developed a method to quantify the extent to which patterned variations in a file's entropy signal make it suspicious. By extracting only string and entropy features from samples, they can obtain almost 99% detection of parasitic malware.

Zhang et al. (2020) explored function call graph vectorization representation (FCGV) as the input feature to machine learning algorithms for classification and noted that this representation loses some critical features of PE files due to the hashing technique being used. They improved the classification accuracy of the FCGV-based machine learning model by applying both graph and non-graph features and achieved a maximum accuracy of 99.5% with non-graph, i.e., statistical, features. The authors (Chowdhury, Rahman & Islam, 2017) used the n-gram approach on PE files. First, they extracted the PE Header and 5-grams as features and then applied PCA to reduce and focus on the important features only. They achieved an accuracy of 97.7% using an artificial neural network. Cepeda, Chia Tien & Ordóñez (2016) found that nine features are enough to distinguish malware from benign files with an accuracy of 99.60%. Kim et al. (2021) proposed a static analysis automation technique for detecting malicious code using the portable executable structure. They selected 12 of the 54 attributes of the PE structure based on importance scores; however, the system achieved a maximum of 80% accuracy in malicious code classification. The Zero-Day Vigilante (ZeVigilante) system (Alhaidari et al., 2022) can detect malware using both static and dynamic analyses. The authors applied six different classifiers and observed that RF achieved the best accuracy for both static and dynamic analyses, 98.21% and 98.92%, respectively. Similarly, the studies (Patidar & Khandelwal, 2019; Gupta & Rani, 2018; Kumar & Singh, 2018; Venkatraman & Alazab, 2018) claim zero-day malware detection using machine learning techniques. The study (Zhang, Kuo & Yang, 2019) focuses on malware type detection, or classification into malware families, instead of binary classification. The work uses several machine learning models to build static malware type classifiers on PE-format files. The evaluation results show that random forest can achieve high performance with a micro average F1-score of 0.96 and a macro average F1-score of 0.89. The work (Pham, Le & Vu, 2018) proposes a static malware detection method based on Portable Executable analysis and the Gradient Boosting decision tree algorithm. The method reduces the training time by appropriately reducing the feature dimension and achieves 99.3% accuracy.

Zhang, Liu & Jiang (2022) argued that most malware solutions only detect malware families that were included in the training data. They proposed to use a soft relevance value based on multiple trained models. They used features such as file sizes, function call names, DLLs, n-grams, etc. Once the models are trained, they predict which malware family from the dataset a sample belongs to. By using the trained models, the soft relevance value is applied to determine whether the malware belongs to one of the original malware families or not. Singh & Singh (2020) proposed a behavior-based malware detection technique. Firstly, printable strings are processed word by word using text mining techniques. Secondly, Shannon entropy is computed over the printable strings and API calls to account for the randomness of the APIs, and finally, all features are integrated to build the malware classifiers using machine learning algorithms. Cannarile et al. (2022) presented a benchmark to compare deep learning and shallow learning techniques for API call malware detection. They considered random forest, CatBoost, XGBoost, and ExtraTrees as shallow learning methods, whereas TabNet and NODE (Neural Oblivious Decision Ensembles) were used as deep learning methods. Based on experimental results, they concluded that shallow learning techniques tend to perform better and converge faster (with less training time) to a suitable solution. Euh et al. (2020) propose low-dimensional but effective features for a malware detection system and analyze them with tree-based ensemble models. They extract five types of malware features represented from binary or disassembly files. The experimental work shows that the tree-based ensemble model is effective and efficient for malware classification with respect to training time and generalization performance. Amer & Zelinka (2020) introduced the use of word embedding to understand the contextual relationship that exists between API functions in the malware call sequence. Their experimental results show that there is a significant distinction between malware and goodware call sequences. They then introduce a new system to detect and predict malware based on a Markov chain.

In conclusion, there is a vast amount of research on malware detection using machine learning and deep learning. Upon reviewing prior studies, a few key points can be noted. Firstly, most prior works only utilize one or two raw features in their malware detection efforts. Only a limited number of studies have combined raw features to create new ones. Secondly, feature selection is usually done through either Information Gain or principal component analysis, with few studies employing both methods. Thirdly, ensemble learning is not widely used in these studies. Our study, however, extracts four raw features and creates two integrated features. We also apply both Information Gain and principal component analysis for feature selection, use seven different classifiers for malware classification, and incorporate three ensemble learning techniques to increase classification accuracy.

Research Methods

In this section, we discuss our approach to detecting PE malware in the Windows environment. Our approach can be divided into two phases: malware collection and malware detection.

Malware collection

We collected the data from the MalwareBazaar Database (https://bazaar.abuse.ch/) using its API. The MalwareBazaar Database offers a Comma Separated Values (CSV) file containing basic information, such as the SHA256 hash, file name, file type, and signature, of all the malware samples available in the database. It also provides an API to download the samples using the information given in the CSV file. We wrote a small script in Python and downloaded more than 30,000 samples of different types of malware, targeting only PE files in our API calls. The motivation for using PE files comes from monitoring the submissions received by different malware databases. For example, more than 26% of the malware samples in the MalwareBazaar database are PE malware, making it a common file type for spreading malware. Similarly, 47.8% of the files submitted to VirusTotal for analysis are PE files (Kumar, Kuppusamy & Aghila, 2019). We discarded samples with incorrect values in the PE header and samples with code obfuscation (O'Kane, Sezer & McLaughlin, 2011). After discarding unwanted samples, we have a total of 27,920 malware samples divided into six categories in our dataset, as described in Table 1. We also collected 1,877 benign files from various sources, including files from a Windows installation. We will make this dataset public very soon.
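For illustration, a minimal sketch of such a download script is shown below. It assumes the SHA256 hashes of PE samples have already been parsed from the CSV export; the hashes shown are hypothetical placeholders, and the request format should be verified against the current MalwareBazaar API documentation.

```python
import os
import requests

API_URL = "https://mb-api.abuse.ch/api/v1/"   # MalwareBazaar API endpoint

def download_sample(sha256_hash, out_dir="samples"):
    """Fetch one sample from MalwareBazaar; the 'get_file' query is expected to
    return a password-protected ZIP archive containing the sample."""
    os.makedirs(out_dir, exist_ok=True)
    response = requests.post(
        API_URL,
        data={"query": "get_file", "sha256_hash": sha256_hash},
        timeout=60,
    )
    # A ZIP archive starts with the bytes 'PK'; anything else is an error message.
    if response.status_code == 200 and response.content[:2] == b"PK":
        with open(f"{out_dir}/{sha256_hash}.zip", "wb") as f:
            f.write(response.content)
        return True
    return False

# Hypothetical hashes taken from the CSV export (placeholders, not real samples).
pe_hashes = ["0123abcd...", "4567ef01..."]
for h in pe_hashes:
    download_sample(h)
```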

Table 1:
Description of KIET dataset.
Malware Type Count Description
RedLineStealer 5,090 This is a password stealer type of Spyware. It steals passwords, credit card information and other sensitive data and sends it to a remote location.
Downloader 5,047 This is a Trojan downloader used by attackers to distribute malware on a large scale. This dataset contains both GuLoader and SmokerLoader samples.
RAT 4,973 These are Remote Access Trojans (RAT) that allow an attacker to remotely control an infected computer. The samples include AveMariaRAT and njRAT.
BankingTrojan 4,864 This is a banking Trojan that targets both businesses and consumers for their data, such as banking information, account credentials and bitcoins etc. This dataset contains both TrickBot and QuakBot samples.
SnakeKeyLogger 4,240 This is a KeyLogger that keeps track of and records the victim's keystrokes as s/he types. It is also spyware and sends the recorded information to the attacker through a command and control server.
Spyware 3,706 This is AgentTesla Spyware that is used by attackers to spy on victims. It can record keystrokes and user interactions on supported programs and web browsers.
Total 27,920
Benign 1,877 Legitimate or goodware files collected from different sources including Windows installation files.
Grand Total 29,797
DOI: 10.7717/peerjcs.1319/table-1

Feature sets

We create four feature sets from our data. Moreover, we also create two integrated feature sets by combining these features.

Dynamic link libraries: The first set of features is the list of dynamic link libraries (or DLLs for short) used by each Windows executable. A DLL is a library that contains code, data, and resources that can be used by more than one program at the same time. Windows programs use DLLs to share functionality and resources between themselves. For example, the Comdlg32 DLL performs common dialog box related functions in Windows. Certain characteristics of a program can be inferred from the set of DLLs it uses. Therefore, we make a list of DLLs called by malware and benign files to help distinguish between them.

API functions: The second set of features is a list of API (Application Program Interface) function names called within the DLLs discovered in the first feature set. Windows APIs are implemented through DLLs and each DLL may contain hundreds of functions in it. A program can be distinguished from others based on the API functions it imports from a DLL. By collecting the list of API functions, we supplement our first feature set in the hope of further improving our ability to differentiate between benign and malware files. The list of API functions can reveal the behavior of the program.

PE Header: The PE header contains useful information about the executable. A PE file contains a number of headers, including the MS-DOS stub, the COFF file header, an optional header, etc. They contain metadata about the file itself, such as the number of sections, the size of the code, the characteristics of the file, etc. We collect the values of 52 fields of the PE Header as our third feature set, as detailed in Table 2: 17 fields from the DOS Header, 7 from the File Header, and 28 from the Optional Header.

Table 2:
The list of 52 fields of PE Header in our 3rd feature set.
Header Name Field Name
DOS Header e_magic, e_cblp, e_cp, e_crlc, e_cparhdr, e_minalloc, e_maxalloc, e_ss, e_sp, e_csum, e_ip, e_cs, e_lfarlc, e_ovno, e_oemid, e_oeminfo, e_lfanew
File Header Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics
Optional Header Magic, MajorLinkerVersion, MinorLinkerVersion, SizeOfCode, SizeOfInitializedData, SizeOfUninitializedData, AddressOfEntryPoint, BaseOfCode, ImageBase, SectionAlignment, FileAlignment, MajorOperatingSystemVersion, MinorOperatingSystemVersion, MajorImageVersion, MinorImageVersion, MajorSubsystemVersion, MinorSubsystemVersion, Reserved1, SizeOfImage, SizeOfHeaders, CheckSum, Subsystem, DllCharacteristics, SizeOfStackReserve, SizeOfHeapReserve, SizeOfHeapCommit, LoaderFlags, NumberOfRvaAndSizes
DOI: 10.7717/peerjcs.1319/table-2

PE sections: A PE file contains many sections, such as the executable code section (.text), data sections (.data, .rdata, .bss), and the resource section (.rsrc). These sections provide a logical and physical separation of the different parts of a program. Since different programs need different sections depending on their functionality, collecting information about PE sections could be useful in distinguishing files from each other. Each section in a PE has properties such as VirtualAddress, VirtualSize, SizeOfRawData, etc. We collect the values of ten properties of each of ten sections (.text, .data, .rdata, .bss, .idata, .edata, .rsrc, .reloc, .tls, .pdata) as our fourth feature set. In a nutshell, this feature set contains 100 features of the PE sections, as detailed in Table 3.

Table 3:
Ten PE sections and their ten fields in our 4th feature set.
Section Name Description
.text This section contains the executable code. It also contains program entry point.
.data This section contains initialized data of a program.
.rdata It contains data that is to be only readable, such as literal strings, and constants.
.bss It represents uninitialized data to reduce the size of executable file.
.idata This section has data about imported functions.
.edata This section contains export data, i.e., information about symbols that other images can access through dynamic linking.
.rsrc This resource-container section contains resource information.
.reloc Relocation information is saved in this section.
.tls TLS stands for Thread Local Storage. Each thread running in Windows uses its own storage called TLS.
.pdata It stores function table entries for exception handling.
Field Name Description
Name An 8-byte encoded string containing the name of the section.
Misc_VirtualSize The total size of the section when loaded into memory.
VirtualAddress The address of the first byte of a section.
SizeOfRawData The size of the section.
PointerToRawData The file pointer to the first page of the section within the COFF file.
PointerToRelocations The file pointer to the beginning of relocation entries for the section.
PointerToLinenumbers The file pointer to the beginning of line-number entries for the section.
NumberOfRelocations The number of relocation entries for the section.
NumberOfLinenumbers The number of line-number entries for the section.
Characteristics The flags that describe the characteristics of the section.
DOI: 10.7717/peerjcs.1319/table-3

Integrated feature set1 (IFS1): We combine the DLLs referred to and the API functions called by a sample to create our first integrated feature set. Since both of the original feature sets contain names (of DLLs and of API functions), we can simply merge them to create an integrated feature set.

Integrated feature set2 (IFS2): We combine the PE header and section feature sets to create our second integrated feature set as both the sets contain numeric values and hence can be merged efficiently.

Extracting raw features

In static malware analysis, we can extract useful information from PE files without running the executable. Our PE extractor extracts all the information such as DLLs, API functions, PE Header and Section information and stores them in separate CSV files. We run our PE extractor on all 27,920 malware and 1,877 benign files and store the raw features in four CSV files, one CSV file per feature set.
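A condensed sketch of such an extractor, using the pefile library (a widely used Python PE parser), could look like the following; only a few representative fields of Tables 2 and 3 are shown, and error handling for malformed samples is omitted.

```python
import pefile

def extract_raw_features(path):
    """Extract DLL names, imported API functions, PE header fields and
    section fields from one PE file (a simplified sketch of our extractor)."""
    pe = pefile.PE(path)

    # Feature sets 1 and 2: imported DLLs and API functions.
    dlls, apis = [], []
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dlls.append(entry.dll.decode(errors="ignore").lower())
        apis.extend(imp.name.decode(errors="ignore")
                    for imp in entry.imports if imp.name)

    # Feature set 3: selected fields of the DOS, File and Optional headers.
    header = {
        "e_magic": pe.DOS_HEADER.e_magic,
        "e_lfanew": pe.DOS_HEADER.e_lfanew,
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "SizeOfOptionalHeader": pe.FILE_HEADER.SizeOfOptionalHeader,
        "AddressOfEntryPoint": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "SizeOfImage": pe.OPTIONAL_HEADER.SizeOfImage,
        # ... the remaining fields of Table 2 are collected in the same way
    }

    # Feature set 4: per-section numeric fields (Table 3).
    sections = {}
    for s in pe.sections:
        name = s.Name.rstrip(b"\x00").decode(errors="ignore")
        sections[name] = {
            "Misc_VirtualSize": s.Misc_VirtualSize,
            "VirtualAddress": s.VirtualAddress,
            "SizeOfRawData": s.SizeOfRawData,
            "Characteristics": s.Characteristics,
            # ... the remaining fields of Table 3
        }
    return dlls, apis, header, sections

# Usage (path is illustrative): dlls, apis, header, sections = extract_raw_features("sample.exe")
```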

DLLs imported: We pass all the PE files, both malware and benign, to our PE extractor and list the names of the DLLs imported by them. On average, a malware sample imports four DLLs whereas a benign file imports three DLLs in this study. Overall, malware files import 531 unique DLLs whereas benign files import 186 unique DLLs. The normalized frequencies of the top 20 DLLs imported by malware and benign files are given in Appendix A (see Table A1).

To feed the DLL raw features to our classifiers, we apply a Bag-of-Words (BoW) approach to represent the DLLs of each sample. That is, we make a large list of unique DLL names and construct a feature vector for each sample such that each index corresponds to a specific DLL and its value is either 1 or 0, indicating whether that DLL is imported by the file or not. The overall dimensionality, i.e., the number of unique DLL names in the bag, was 629 for our dataset.
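A compact way to build such binary vectors, assuming the per-sample DLL lists have already been extracted, is scikit-learn's MultiLabelBinarizer; the DLL names below are purely illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical DLL lists for three samples (one list per file).
dll_lists = [
    ["kernel32.dll", "user32.dll", "advapi32.dll"],
    ["kernel32.dll", "wininet.dll"],
    ["kernel32.dll", "user32.dll", "comdlg32.dll"],
]

mlb = MultiLabelBinarizer()              # one column per unique DLL name
X_dlls = mlb.fit_transform(dll_lists)    # binary matrix: 1 if the DLL is imported

print(mlb.classes_)   # the "bag" of DLL names (629 entries in our dataset)
print(X_dlls)         # shape: (n_samples, n_unique_dlls)
```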

API functions: We extract the names of the API functions called within the DLLs extracted above. On average, a malware sample calls seven API functions whereas a benign file calls eight. Overall, malware files call 18,432 unique API functions whereas benign files call 4,256 unique API functions. The normalized frequencies of the top 20 API functions called by malware and benign files are given in Appendix A (see Table A2).

Similar to the DLLs feature set, we create a Bag-of-Words, i.e., a bag of API functions for constructing a feature vector for each file. The dimensionality of this feature vector was found to be 21,918.

PE Header: We extract the values of 17 fields from the DOS header, 7 from the COFF file header, and 28 from the optional header; a total of 52 fields from the PE header. Since these are numeric values, we normalize them and create a 52-dimensional vector for each file to represent this feature set.

PE section: We extract the values of ten fields from each of the ten sections of the PE (see Table 3). All ten fields are numeric except the Name field, so we omit this field and use the normalized values of the other nine fields from each section. This gives us a feature vector of 90 dimensions.

Integrated feature set1 (IFS1): We simply merge the first two feature sets, i.e., DLLs imported and API functions, to form this integrated feature set. The integrated feature vector contains 22,547 features in total.

Integrated feature set2 (IFS2): We form this integrated feature set by merging the 52 fields of the PE header and the 90 fields of the PE sections. The integrated feature vector has 142 dimensions.
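Because both components of IFS2 are numeric, the merge reduces to a column-wise concatenation followed by normalization. A sketch with pandas and scikit-learn is given below; the placeholder data and the use of min-max scaling are our assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Placeholder frames standing in for the per-sample numeric features produced
# by the PE extractor (52 PE Header fields and 90 PE Section fields).
rng = np.random.default_rng(0)
header_df = pd.DataFrame(rng.random((1000, 52)),
                         columns=[f"hdr_{i}" for i in range(52)])
section_df = pd.DataFrame(rng.random((1000, 90)),
                          columns=[f"sec_{i}" for i in range(90)])

# IFS2: column-wise concatenation of the two numeric feature sets (52 + 90 = 142).
ifs2 = pd.concat([header_df, section_df], axis=1)

# Normalize each column to [0, 1] before feeding the classifiers
# (the choice of min-max scaling is an assumption).
X_ifs2 = MinMaxScaler().fit_transform(ifs2)
print(X_ifs2.shape)   # (1000, 142)
```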

Feature selection

The raw features are numerous, especially in the case of DLLs imported and API functions called by a sample, and it is possible that some features do not contain useful information for a machine learning model. Therefore, we applied two feature selection or dimensionality reduction techniques, namely Information Gain (IG) and principal component analysis (PCA). By applying these feature selection techniques, we are able to reduce the number of features in each feature set significantly. As a result, we decrease the processing time needed to train and test the classifiers and possibly also improve their accuracy in detecting malware. The total number of features in the raw feature sets and in the selected feature sets after applying Information Gain and principal component analysis is given in Table 4.
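Both selection techniques are available in scikit-learn. The sketch below applies them to a placeholder matrix of the same width as the PE Header feature set, using mutual information as the Information Gain estimate and the retained feature counts from Table 4; the random data stands in for the real feature matrix.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

# Placeholder data standing in for the 52 PE Header features and the labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 52))
y = rng.integers(0, 2, size=1000)   # 1 = malware, 0 = benign

# Information Gain: rank features by mutual information and keep the top k.
X_ig = SelectKBest(score_func=mutual_info_classif, k=10).fit_transform(X, y)   # 52 -> 10 (Table 4)

# Principal component analysis: project the features onto the leading components.
X_pca = PCA(n_components=19).fit_transform(X)                                  # 52 -> 19 (Table 4)
```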

Experimental results

Our dataset consists of a total of 29,797 PE samples, of which 27,920 are malware and 1,877 are benign or goodware. We apply seven machine learning classifiers, two feature selection techniques, and three ensemble learning methods to detect malware, as depicted in Fig. 1. We use standard 10-fold cross-validation for training and testing our models. That is, we randomly divide our dataset into 10 smaller subsets such that nine subsets are used for training and one subset is used for testing, and we repeat this process 10 times for every combination. This methodology helps evaluate the robustness of any approach to detect malware without any a priori information. In addition, the dataset was split in a 70:30 ratio for training and testing purposes, i.e., 70% of the data is used for training the classifiers whereas 30% is used for testing them.
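The following minimal sketch shows both evaluation protocols with scikit-learn, using a random forest and placeholder data in place of one of the real feature sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

# Placeholder feature matrix and labels standing in for a real feature set.
rng = np.random.default_rng(0)
X = rng.random((2000, 52))
y = rng.integers(0, 2, size=2000)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 10-fold cross-validation: nine folds for training, one for testing,
# repeated so that every fold is used for testing once.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("10-fold accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))

# 70:30 hold-out split used alongside cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
clf.fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```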

Table 4:
The number of features in raw feature sets and selected feature sets after applying Information Gain and principal component analysis.
Feature Type Raw Features After PCA After IG
DLLs Imported 629 195 125
API Functions 21,918 494 219
PE Header 52 19 10
PE Section 90 26 18
Integrated Feature Set1 22,547 563 225
Integrated Feature Set2 142 49 28
DOI: 10.7717/peerjcs.1319/table-4

Figure 1: Malware detection phase.

Evaluation criteria

To evaluate the performance of our system, we create a confusion matrix for each classifier. A confusion matrix summarizes the performance of a classifier in the form of a table with the help of four quantities namely True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). It helps in measuring the accuracy, recall, precision, and F-score of a classifier. We briefly define the metrics we used to measure the performance of our system.

True Positive (TP): A malware classified as a malware.

True Negative (TN): A benign classified as a benign.

False Positive (FP): A benign classified as a malware.

False Negative (FN): A malware classified as a benign.

Accuracy (ACC): The proportion of correctly classified or predicted samples, i.e., the ratio of correct predictions to the total number of predictions: $ACC = \frac{TP + TN}{TP + TN + FP + FN}$.

Error rate (ERR): The ratio of the number of incorrect predictions to the total number of predictions, also called the misclassification rate: $ERR = \frac{FP + FN}{TP + TN + FP + FN}$.

Recall: The number of correct positive predictions divided by the total number of positives, also called the True Positive Rate (TPR): $Recall = \frac{TP}{TP + FN}$.

Precision: The number of correct positive predictions divided by the total number of positive predictions: $Precision = \frac{TP}{TP + FP}$.

F-score: The harmonic mean of precision and recall: $F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$.
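For reference, these metrics can be computed directly from the confusion matrix produced by scikit-learn; the labels below are a toy example only.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # 1 = malware, 0 = benign (toy labels)
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# With labels=[0, 1], the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
error     = (fp + fn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f_score   = 2 * precision * recall / (precision + recall)
print(accuracy, error, recall, precision, f_score)
```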

Experimental setup

To validate the proposed system shown in Fig. 1, we create an experimental setup on the Windows operating system running on an AMD Ryzen 7 4800H @ 4.2 GHz processor with 16 GB of main memory. We use the Scikit-learn (Pedregosa et al., 2011) library of Python to run all the experiments. It provides implementations of many classifiers and helps in splitting the data into training and testing sets, performing 10-fold cross-validation, and comparing the performance of different classifiers using the confusion matrix and other metrics.
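For reference, the seven classifiers and three ensemble methods of Fig. 1 can be instantiated in scikit-learn roughly as follows; the hyperparameters and the stacking meta-learner shown here are illustrative assumptions rather than the exact settings used in our experiments.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier, VotingClassifier, StackingClassifier)

# The seven base classifiers used in our experiments.
base = [
    ("nb", GaussianNB()),
    ("svm", SVC()),
    ("dt", DecisionTreeClassifier()),
    ("rf", RandomForestClassifier(n_estimators=100)),
    ("knn", KNeighborsClassifier()),
    ("nc", NearestCentroid()),
    ("gb", GradientBoostingClassifier()),
]

# Three ensemble learning techniques built on top of the base classifiers.
voting = VotingClassifier(estimators=base, voting="hard")          # majority voting
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000))
adaboost = AdaBoostClassifier(n_estimators=100)                    # boosting ensemble
```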

Testing with raw features

In the first experiment, we evaluate our system on raw features. We apply all the classifiers and ensemble learning methods to the individual and integrated features, and the results are presented in Table 5. We see that in the case of imported DLLs, the random forest model outperforms the other classifiers with an accuracy of 96.41% and an error rate of only 3.59%. Stacking Generalization gives the best accuracy of 96.47% on this feature set. We see the same trend for the other feature sets. The random forest model achieves the best accuracy of 99.36% on the PE Header feature set while the worst performance, with 92.0% accuracy, is attained by the Nearest Centroid method on the PE Section feature set.

Table 5:
The performance of different classifiers and ensemble learning techniques on individual and integrated raw features.
Feature Classifier ACC (%) ERR (%) Recall Precision F-Score
DLLs Imported Naïve Bayes 96.03 3.97 0.998 0.961 0.979
SVM 96.27 3.73 0.998 0.964 0.981
Decision Tree 96.37 3.63 0.998 0.965 0.981
Random Forest 96.41 3.59 1.000 0.964 0.981
KNN 96.20 3.80 0.999 0.962 0.980
Nearest Centroid 95.73 4.27 0.996 0.960 0.978
Gradient Boost 96.08 3.92 1.000 0.960 0.980
Ensemble Learning
Majority Voting 96.31 3.69 0.999 0.963 0.981
Stacking Generalization 96.47 3.53 0.999 0.965 0.982
AdaBoost 96.41 3.59 0.998 0.965 0.981
API Functions Naïve Bayes 95.42 4.58 0.988 0.965 0.976
SVM 94.84 5.15 1.000 0.948 0.973
Decision Tree 96.46 3.54 0.999 0.965 0.982
Random Forest 96.59 3.41 0.999 0.966 0.982
KNN 95.56 4.44 0.999 0.955 0.977
Nearest Centroid 95.80 4.20 0.994 0.963 0.978
Gradient Boost 96.51 3.49 0.999 0.965 0.982
Ensemble Learning
Majority Voting 96.19 3.81 1.000 0.961 0.980
Stacking Generalization 96.37 3.63 0.999 0.964 0.981
AdaBoost 96.36 3.64 0.999 0.963 0.981
PE Header Naïve Bayes 95.09 4.91 0.994 0.956 0.974
SVM 97.10 2.90 0.995 0.975 0.985
Decision Tree 99.11 0.89 0.997 0.994 0.995
Random Forest 99.36 0.64 0.999 0.994 0.997
KNN 98.71 1.29 0.994 0.992 0.993
Nearest Centroid 93.74 6.26 0.978 0.957 0.967
Gradient Boost 98.86 1.14 0.997 0.991 0.994
Ensemble Learning
Majority Voting 98.83 1.17 0.998 0.989 0.994
Stacking Generalization 99.31 0.69 0.998 0.994 0.996
AdaBoost 99.11 0.89 0.997 0.994 0.995
PE Section Naïve Bayes 94.02 5.98 0.989 0.949 0.969
SVM 95.80 4.20 0.996 0.961 0.978
Decision Tree 96.47 3.53 0.981 0.981 0.981
Random Forest 97.32 2.67 0.991 0.981 0.986
KNN 96.72 3.28 0.988 0.978 0.983
Nearest Centroid 92.00 8.00 0.956 0.959 0.958
Gradient Boost 97.01 2.99 0.996 0.973 0.984
Ensemble Learning
Majority Voting 97.03 2.97 0.998 0.972 0.984
Stacking Generalization 97.31 2.70 0.992 0.980 0.986
AdaBoost 96.69 3.31 0.984 0.981 0.982
Integrated Feature Set1 Naïve Bayes 95.97 4.03 0.988 0.970 0.979
SVM 95.69 4.31 1.000 0.956 0.978
Decision Tree 97.69 2.31 0.999 0.977 0.988
Random Forest 97.53 2.47 0.999 0.975 0.987
KNN 95.80 4.20 0.999 0.958 0.978
Nearest Centroid 95.64 4.36 0.993 0.962 0.977
Gradient Boost 97.24 2.76 1.000 0.972 0.986
Ensemble Learning
Majority Voting 97.05 2.95 1.000 0.970 0.985
Stacking Generalization 97.68 2.32 0.998 0.977 0.988
AdaBoost 97.85 2.15 0.999 0.979 0.989
Integrated Feature Set2 Naïve Bayes 93.23 6.77 0.973 0.956 0.964
SVM 97.61 2.39 0.998 0.977 0.987
Decision Tree 99.14 0.86 0.996 0.995 0.995
Random Forest 99.41 0.59 0.999 0.995 0.997
KNN 98.52 1.48 0.994 0.991 0.992
Nearest Centroid 93.67 6.33 0.974 0.960 0.967
Gradient Boost 98.91 1.09 0.998 0.991 0.994
Ensemble Learning
Majority Voting 98.87 1.13 0.999 0.989 0.994
Stacking Generalization 99.48 0.52 0.999 0.996 0.997
AdaBoost 99.14 0.86 0.996 0.995 0.995
DOI: 10.7717/peerjcs.1319/table-5

Regarding the integrated features, the decision tree model and the AdaBoost ensemble learning method stand out with accuracies of 97.69% and 97.85%, respectively, when we integrate DLLs and API functions into one feature set, i.e., IFS1. For the combined PE Header and Section set, i.e., IFS2, random forest and Stacking Generalization give the best results with accuracies of 99.41% and 99.48%, respectively, whereas naïve Bayes gives the worst results. On average, we detect malware with an accuracy of more than 97% using raw features. The maximum F-score of 0.997 is delivered by both the random forest model and the Stacking Generalization method on IFS2, while the minimum error rate is given by Stacking Generalization on the same feature set.

Testing with selected features

In the second experiment, we evaluate the performance of our system on the features selected after applying the Information Gain (IG) and principal component analysis (PCA) methods to choose the important features. The results obtained with the features selected using Information Gain are presented in Table 6. The table shows that the performance of different classifiers decreases slightly compared to their performance on raw features. However, overall the performance improves on the integrated feature sets. Moreover, using selected features we can reduce the training time significantly. We achieve the best accuracy of 99.5% and the best F-score of 0.998 with the Stacking Generalization method on IFS2 when we apply it to the top 20% of features ranked by their IG score. Similarly, the results shown in Table 7 depict that the PCA transformation slightly degrades the performance. We achieve the best accuracy of 99.41% and the best F-score of 0.997 with the Stacking Generalization method on IFS2 when we apply it to the top 30% of features selected by their principal components. The reason that the accuracy does not improve significantly with selected features is that, on average, we select only the 20% and 30% most important features after applying IG and PCA, respectively (in the case of API functions we use the top 1% of features). The main purpose of selecting a small number of important features was to reduce the training time significantly while maintaining good overall performance. On the other hand, the ensemble learning techniques show promising results on both the raw and selected features, and their performance improves marginally on the selected integrated features.

Table 6:
The performance of different classifiers and ensemble learning techniques on individual and integrated features selected using the Information Gain method.
Feature Classifier ACC (%) ERR (%) Recall Precision F-Score
DLLs Imported Naïve Bayes 95.36 4.64 0.997 0.956 0.976
SVM 95.80 4.20 0.999 0.958 0.978
Decision Tree 95.83 4.17 0.999 0.959 0.978
Random Forest 95.78 4.22 0.999 0.958 0.978
KNN 95.78 4.22 0.999 0.958 0.978
Nearest Centroid 95.49 4.51 0.997 0.957 0.977
Gradient Boost 95.73 4.27 1.000 0.957 0.978
Ensemble Learning
Majority Voting 96.31 3.69 1.000 0.963 0.981
Stacking Generalization 96.46 3.54 0.998 0.965 0.982
AdaBoost 96.47 3.53 0.999 0.965 0.982
API Functions Naïve Bayes 93.65 6.35 0.969 0.964 0.966
SVM 96.58 3.42 1.000 0.965 0.982
Decision Tree 96.37 3.63 0.997 0.965 0.981
Random Forest 96.48 3.52 0.998 0.965 0.982
KNN 96.39 3.61 0.998 0.965 0.981
Nearest Centroid 94.41 5.59 0.978 0.963 0.971
Gradient Boost 96.49 3.51 1.000 0.964 0.982
Ensemble Learning
Majority Voting 96.58 3.42 1.000 0.965 0.982
Stacking Generalization 96.42 3.58 0.998 0.965 0.981
AdaBoost 96.37 3.63 0.998 0.965 0.981
PE Header Naïve Bayes 94.21 5.79 1.000 0.942 0.970
SVM 95.54 4.46 0.995 0.959 0.977
Decision Tree 98.76 1.24 0.995 0.992 0.993
Random Forest 99.11 0.89 0.997 0.993 0.995
KNN 98.59 1.41 0.995 0.990 0.993
Nearest Centroid 94.75 5.25 0.988 0.958 0.973
Gradient Boost 98.64 1.36 0.996 0.990 0.993
Ensemble Learning
Majority Voting 98.78 1.22 0.998 0.989 0.994
Stacking Generalization 99.31 0.70 0.998 0.994 0.996
AdaBoost 99.08 0.92 0.997 0.993 0.995
PE Section Naïve Bayes 94.19 5.81 1.000 0.942 0.970
SVM 95.41 4.59 0.997 0.956 0.976
Decision Tree 96.14 3.86 0.980 0.979 0.980
Random Forest 96.89 3.11 0.991 0.976 0.984
KNN 96.44 3.56 0.989 0.973 0.981
Nearest Centroid 92.89 7.11 0.967 0.958 0.962
Gradient Boost 96.66 3.34 0.996 0.969 0.982
Ensemble Learning
Majority Voting 96.98 3.02 0.997 0.971 0.984
Stacking Generalization 97.33 2.67 0.993 0.979 0.986
AdaBoost 96.49 3.51 0.982 0.981 0.981
Integrated Feature Set1 Naïve Bayes 94.37 5.63 0.971 0.969 0.970
SVM 97.03 2.97 0.997 0.972 0.984
Decision Tree 96.97 3.07 0.997 0.971 0.984
Random Forest 97.05 2.92 0.997 0.973 0.985
KNN 96.92 3.08 0.997 0.971 0.984
Nearest Centroid 94.37 5.63 0.977 0.964 0.970
Gradient Boost 96.95 3.05 0.997 0.971 0.984
Ensemble Learning
Majority Voting 97.02 2.98 0.997 0.972 0.984
Stacking Generalization 97.07 2.92 0.997 0.973 0.985
AdaBoost 97.07 2.93 0.997 0.973 0.985
Integrated Feature Set2 Naïve Bayes 95.35 4.65 0.996 0.956 0.976
SVM 95.73 4.27 0.998 0.959 0.978
Decision Tree 99.13 0.86 0.995 0.996 0.995
Random Forest 99.36 0.60 0.999 0.994 0.997
KNN 98.20 1.80 0.994 0.987 0.990
Nearest Centroid 94.81 5.19 0.991 0.956 0.973
Gradient Boost 98.99 1.01 0.997 0.992 0.995
Ensemble Learning
Majority Voting 98.56 1.48 0.999 0.986 0.992
Stacking Generalization 99.50 0.47 0.999 0.996 0.998
AdaBoost 99.16 0.76 0.996 0.996 0.996
DOI: 10.7717/peerjcs.1319/table-6
Table 7:
The performance of different classifiers and ensemble learning techniques on individual and integrated features selected using principal component analysis method.
Feature Classifier ACC (%) ERR (%) Recall Precision F-Score
DLLs Imported Naïve Bayes 95.47 4.53 0.999 0.955 0.977
SVM 95.73 4.27 0.999 0.957 0.978
Decision Tree 95.64 4.36 1.000 0.956 0.977
Random Forest 95.61 4.39 1.000 0.956 0.977
KNN 95.44 4.56 0.999 0.954 0.976
Nearest Centroid 95.54 4.46 0.997 0.957 0.977
Gradient Boost 95.39 4.61 1.000 0.954 0.976
Ensemble Learning
Majority Voting 95.66 4.34 1.000 0.956 0.977
Stacking Generalization 95.69 4.31 0.999 0.957 0.978
AdaBoost 95.71 4.29 1.000 0.957 0.978
API Functions Naïve Bayes 94.75 5.25 0.999 0.948 0.973
SVM 94.70 5.30 0.998 0.948 0.973
Decision Tree 94.80 5.20 0.999 0.948 0.973
Random Forest 94.78 5.22 0.999 0.948 0.973
KNN 94.53 5.47 1.000 0.945 0.972
Nearest Centroid 94.56 5.44 0.994 0.951 0.972
Gradient Boost 94.42 5.58 1.000 0.944 0.971
Ensemble Learning
Majority Voting 94.81 5.20 1.000 0.948 0.973
Stacking Generalization 94.76 5.24 0.996 0.951 0.973
AdaBoost 94.81 5.19 0.999 0.948 0.973
PE Header Naïve Bayes 94.21 5.79 1.000 0.942 0.970
SVM 96.09 3.91 0.995 0.964 0.980
Decision Tree 98.66 1.34 0.994 0.992 0.993
Random Forest 99.20 0.81 0.999 0.993 0.996
KNN 98.54 1.46 0.994 0.991 0.992
Nearest Centroid 90.81 9.19 0.936 0.965 0.950
Gradient Boost 98.39 1.61 0.996 0.987 0.991
Ensemble Learning
Majority Voting 98.46 1.54 0.999 0.985 0.992
Stacking Generalization 99.26 0.74 0.999 0.993 0.996
AdaBoost 99.11 0.89 0.998 0.992 0.995
PE Section Naïve Bayes 94.22 5.78 1.000 0.942 0.970
SVM 95.70 4.30 0.996 0.960 0.978
Decision Tree 96.62 3.38 0.981 0.983 0.982
Random Forest 97.63 2.37 0.993 0.983 0.987
KNN 96.77 3.23 0.988 0.978 0.983
Nearest Centroid 92.89 7.11 0.967 0.958 0.962
Gradient Boost 96.84 3.16 0.997 0.970 0.983
Ensemble Learning
Majority Voting 96.89 3.11 0.998 0.970 0.984
Stacking Generalization 97.60 2.40 0.993 0.982 0.987
AdaBoost 96.91 3.09 0.986 0.981 0.984
Integrated Feature Set1 Naïve Bayes 94.92 5.08 0.998 0.950 0.974
SVM 94.97 5.03 0.998 0.951 0.974
Decision Tree 94.95 5.05 0.999 0.950 0.974
Random Forest 94.93 5.07 0.999 0.949 0.974
KNN 94.68 5.32 0.999 0.948 0.972
Nearest Centroid 94.64 5.36 0.992 0.953 0.972
Gradient Boost 94.49 5.51 1.000 0.945 0.972
Ensemble Learning
Majority Voting 94.95 5.05 0.999 0.950 0.974
Stacking Generalization 94.93 5.07 0.998 0.950 0.974
AdaBoost 95.05 4.95 0.999 0.951 0.974
Integrated Feature Set2 Naïve Bayes 94.28 5.76 1.000 0.942 0.970
SVM 96.47 3.53 0.997 0.966 0.982
Decision Tree 98.98 1.02 0.995 0.994 0.995
Random Forest 99.31 0.69 0.999 0.993 0.996
KNN 98.66 1.34 0.996 0.990 0.993
Nearest Centroid 96.11 3.89 0.988 0.962 0.953
Gradient Boost 98.92 1.08 0.997 0.992 0.994
Ensemble Learning
Majority Voting 98.94 1.06 1.000 0.989 0.994
Stacking Generalization 99.41 0.59 0.999 0.995 0.997
AdaBoost 99.16 0.84 0.996 0.995 0.996
DOI: 10.7717/peerjcs.1319/table-7

ROC curves

The Receiver operating characteristic (ROC) curve graphically shows the performance of a classifier at all classification thresholds. It is created by plotting Recall, or True Positive Rate (TPR), against the False Positive Rate (FPR), where $FPR = \frac{FP}{FP + TN}$, i.e., 1 − Specificity. The ROC curve depicts the discriminative ability of a binary classifier and is considered a good metric when class imbalance might lead to the accuracy paradox (Gibert, Mateu & Planes, 2020).

Figure 2 shows the ROC curves for six classifiers on the raw feature sets. It excludes the Nearest Centroid (NC) classifier, as class probabilities cannot be computed for NC. The figure also shows the AUC, or Area Under the Curve, for each ROC. Both the ROC and AUC values confirm that all the classifiers give a good performance because the feature sets help them discriminate between the malware and benign classes at all thresholds. The figure also shows that the AUC values for IFS2 are very promising and reach maximum values of 1.00 and 0.99 for the RF and GB classifiers, respectively, whereas these values are 1.00 and 0.98 with the PE Header feature set. We can see similar trends for the other classifiers, where the integrated feature sets improve the discriminating ability of a classifier as its AUC values increase. Random Forest outperforms all other classifiers on both the individual and integrated features. The ROC curves on the selected features after applying IG and PCA are given in Appendix A (see Figs. A1 and A2).
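A ROC curve such as those in Fig. 2 can be reproduced per classifier with scikit-learn's roc_curve and auc; the sketch below uses a random forest on placeholder data in place of a real feature set.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Placeholder data standing in for one of the feature sets.
rng = np.random.default_rng(0)
X = rng.random((2000, 90))
y = rng.integers(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Probability of the positive (malware) class, evaluated at every threshold.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label="RF (AUC = %.2f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--")   # chance level
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```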


Figure 2: ROC curves for different classifiers on raw feature sets.


Figure 3: Accuracy of different classifiers on raw feature sets with 10-fold cross validation.

10-fold cross validation

As mentioned above, we use 10-fold cross-validation for training and testing the models. Since a single train-test split has limitations, for example the split might not represent each class proportionally, the more robust cross-validation method is becoming the default. For 10-fold cross-validation, we split the dataset into 10 folds; in each of the 10 iterations, nine folds are used for training and one fold is used for testing. The final result is given as the average over all 10 folds.

Figure 3 shows the accuracy of each model during 10-fold cross-validation in the form of a box plot for each classifier. The figure shows that there is more variation in the case of the DLLs Imported and API Functions feature sets. Though the variation decreases when we combine these features into IFS1, it is still larger than that of the other features. A possible reason is that in every fold the feature vectors contained very different words from the bag of words, and the sparseness of these vectors gave diverse results. We can also see that there are more outliers in this case compared to the other feature sets. On the other hand, PE Header, PE Section, and IFS2 have low variation, as their feature vectors contain normalized numeric values and there is no chance of having a sparse vector in training or testing. In a nutshell, all the classifiers give high accuracy on all the features, but there is more variation in IFS1 and its components compared to IFS2 and its components. The accuracy of different classifiers on the selected features after applying PCA and IG with 10-fold cross-validation is given in Appendix A (see Figs. A3 and A4).
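The per-fold accuracies behind such a box plot can be collected with cross_val_score and plotted with matplotlib; the sketch below shows two of the classifiers on placeholder data standing in for the real features.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data of the same width as IFS2 (142 features).
rng = np.random.default_rng(0)
X = rng.random((2000, 142))
y = rng.integers(0, 2, size=2000)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
models = {"RF": RandomForestClassifier(n_estimators=100),
          "DT": DecisionTreeClassifier()}

# One box per classifier: the spread of the ten per-fold accuracies.
fold_scores = [cross_val_score(m, X, y, cv=cv, scoring="accuracy") for m in models.values()]
plt.boxplot(fold_scores)
plt.xticks(range(1, len(models) + 1), list(models.keys()))
plt.ylabel("Accuracy")
plt.show()
```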

Performance on raw vs. selected features

Table 8 summarizes the maximum accuracy achieved by our system on the raw and selected features. Our system is composed of seven classifiers and three ensemble learning techniques, as shown in Fig. 1, and we pick the best results in each case for comparison. Table 8 shows that IFS2 gives the best results on both raw and selected features. It is also clear from the table that our system achieves its best accuracy of 99.50% on this set after selecting the important features using the Information Gain method. The table also highlights an interesting pattern. When we apply IG to select features, the accuracy of our system slightly improves on IFS2 while it marginally decreases on IFS1. On the other hand, PCA does not seem to help, and the accuracy achieved on PCA-selected features is slightly lower than on the raw features for most feature sets. The table also shows that the system performs consistently better on the PE Header feature set than on the other individual feature sets. It seems that the PE Header feature set, alone or integrated with another feature set (e.g., PE Section), is a good candidate for developing a malware detection system for filtering zero-day malware.

Comparison with previous work

In this section, we compare the performance of our system with previous systems or studies for detecting malware. We compare our work with some recent works that also applied static malware analysis to detect malware using classic machine learning models. We reproduce these works so that we have a fair comparison. There are other works related to the proposed work; however, they either used different feature sets, such as byte n-grams or opcode n-grams, or they applied deep learning neural networks to detect malware, so we skipped them as less relevant.

Table 8:
The maximum accuracy (in percentage) achieved by our system on raw feature sets and selected feature sets after applying Information Gain and principal component analysis.
Feature type Raw features After PCA After IG
DLLs Imported 96.47 95.73 96.47
API Functions 96.59 94.81 96.58
PE Header 99.36 99.26 99.31
PE Section 97.32 97.63 97.33
Integrated Feature Set1 97.85 95.05 97.07
Integrated Feature Set2 99.48 99.41 99.50
DOI: 10.7717/peerjcs.1319/table-8

Table 9 summarizes the accuracy and error rate of the proposed work and the previous works. The work presented by Kumar, Kuppusamy & Aghila (2019) uses 53 field values of the PE header as raw features and then creates an integrated feature set having 68 features. The work then applies six classification models to both raw and integrated features. The random forest model achieves the maximum accuracy of 98.4% and an error rate of 1.47% on integrated features in this work. Azmee et al. (2020) extract 77 features of the PE header and deploy nine classifiers to classify malware samples. They achieve a maximum accuracy of 98.6% and a minimum error rate of 1.41% with the XGBoost model. The work by Damaševičius et al. (2021) implements both machine learning and deep learning models for Windows PE malware detection. The ExtraTrees model achieves an accuracy of 98.7% and an error rate of 1.32% on the 68-dimensional feature set of the PE header in this work. Kim et al. (2021) first extract 54 attributes from the PE structure and then use the top 12 most important features to classify malware. Their work achieves a maximum accuracy of 98.7% and a minimum error rate of 1.31% using the AdaBoost model. It is clear from Table 9 that our proposed work, with an accuracy of 99.5% and an error rate of only 0.47% on the integrated feature set, outperforms the previous works. The table shows that the proposed system produces a very small error. In other words, the probability of misclassification in the proposed system is much lower than in the previous systems. We agree that in terms of accuracy the improvement is marginal; however, when combined with the other metrics, the proposed system gives better results, especially in terms of a very small error rate.

Table 9:
Comparison of the proposed work with the previous works.
Work Accuracy Error rate Classifier Feature set
Proposed work 99.5% 0.47% Random forest Integrated feature set
Kumar, Kuppusamy & Aghila (2019) 98.3% 1.47% Random forest Integrated feature set
Azmee et al. (2020) 98.6% 1.41% XGBoost 77 features of PE header
Damaševičius et al. (2021) 98.7% 1.32% ExtraTrees 68 features of PE header
Kim et al. (2021) 98.7% 1.31% AdaBoost PE Structure
DOI: 10.7717/peerjcs.1319/table-9

Conclusion

This work presents a static malware detection system based on mining the DLLs and the API calls from each DLL, the PE Header, and the PE Sections, and it also combines these features to create integrated features. A new dataset with a total of 27,920 PE malware samples is collected, and the features are extracted and fed to seven machine learning models and three ensemble learning techniques. Moreover, Information Gain and principal component analysis are used to find a reduced set of features. The empirical results show that random forest outperforms all other classifiers while the decision tree stands second. An accuracy of 99.5% with an error rate of only 0.47% is achieved on the integrated feature set, a combination of PE Header and PE Section. On average, the system's accuracy is greater than 96% while the error rate is below 3.5%. The feature set containing the values of the PE Header turns out to be the best feature set, and when combined with the PE Section, the resulting integrated feature set gives the maximum accuracy. Furthermore, the system surpasses the previous studies in terms of higher accuracy and lower error rate.

As a tangible outcome, a preprocessed dataset having 27,920 malware samples is created and available on request along with raw and integrated feature sets for comparing future work with the proposed work. We tested the proposed malware detection system extensively and performed multiple experiments on raw and integrated features to check its performance. By applying two feature selection methods, seven machine learning classifiers and three ensemble learning techniques on multiple features, we tried to bridge the gap in the previous works on malware detection.

Our experiments show that the PE Header forms the best feature set and gives the maximum accuracy and minimum error rate when integrated with the PE Section. However, real-world scenarios can be different from the experimental environment; hence, we cannot recommend using the PE Header alone to detect malware. But we do assert that this could be a starting point to further explore the PE Header and its fields to develop a feature set for detecting zero-day attacks accurately and quickly. In future studies, we can add more file formats such as image, PDF, audio, video, etc. We can also work on adding mobile environments such as iOS and Android.

Supplemental Information

The Python implementation of the machine learning models, ensemble learning methods, and feature selection techniques

DOI: 10.7717/peerj-cs.1319/supp-1