The impact of using large training data set KDD99 on classification accuracy

Electrical Engineering, Başkent University, Ankara, Ankara, Turkey
DOI
10.7287/peerj.preprints.2838v1
Subject Areas
Data Mining and Machine Learning, Data Science
Keywords
Machine Learning, KDD99, Supervised Learning, Classification, Intrusion Detection, Large Datasets
Copyright
© 2017 Özgür et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Özgür A, Erdem H. 2017. The impact of using large training data set KDD99 on classification accuracy. PeerJ Preprints 5:e2838v1

Abstract

This study investigates the effects of using a large dataset on supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect, 12 machine learning algorithms have been applied: (1) AdaBoost, (2) Bayesian Networks, (3) Decision Tables, (4) Decision Trees (J48), (5) Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9) Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (with two different training algorithms), and (12) ZeroR. A well-known IDS benchmark dataset, KDD99, has been used to train and test the classifiers. The full KDD99 training dataset contains 4.9 million instances, while the full test dataset contains 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% of the data for training and 1.2%–100% for testing, this study uses the full training dataset and the full test dataset. The Weka Machine Learning Toolbox has been used for modeling and simulation. The performance of the classifiers has been evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1 score. To show the effects of dataset size, the performance of the classifiers has also been evaluated using the following hardware-related metrics: Training Time, Working Memory, and Model Size. Test results show improvements in the classifiers' standard performance metrics compared to previous studies.
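
For context, the following is a minimal sketch of how such an experiment can be set up with the Weka Java API, using the J48 decision tree as one of the twelve classifiers. The ARFF file names are hypothetical placeholders and are not part of the original study; this is an illustrative setup, not the authors' exact experimental code.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Kdd99J48Sketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical file names; the full KDD99 training and test sets
            // must first be converted to ARFF (or CSV) for Weka to load them.
            Instances train = DataSource.read("kddcup99_full_train.arff");
            Instances test  = DataSource.read("kddcup99_full_test.arff");

            // The attack/normal label is assumed to be the last attribute.
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();            // one of the 12 evaluated classifiers
            tree.buildClassifier(train);     // train on the full training set

            // Evaluate on the full test set; Weka reports TP rate, FP rate,
            // precision, recall, and F-measure per class.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());
        }
    }

Any of the other listed algorithms (for example weka.classifiers.bayes.NaiveBayes or weka.classifiers.trees.RandomForest) can be substituted for J48 without changing the rest of this pipeline.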

Author Comment

This is a submission to PeerJ Computer Science for review.