Complexity curve: a graphical measure of data complexity and classifier performance
- Published
- Accepted
- Subject Areas
- Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Mining and Machine Learning
- Keywords
- Learning curves, Data complexity, Data pruning, Hellinger distance, Bias-variance decomposition, Performance measures
- Copyright
- © 2016 Zubek et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- (2016) Complexity curve: a graphical measure of data complexity and classifier performance. PeerJ Preprints 4:e2095v1 https://doi.org/10.7287/peerj.preprints.2095v1
Abstract
We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular measures, it focuses not on the shape of the decision boundary in a classification task but on the amount of available data relative to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. We use it to propose a new variant of the learning curve plot, called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set's complexity curve. It is a classifier performance measure that shows how well the information present in the data is utilised. We examine the properties of the introduced complexity measure theoretically and experimentally, and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insight into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), making it possible to speed up the learning process significantly without reducing classification accuracy. Associated code is available for download at https://github.com/zubekj/complexity_curve (open source Python implementation).
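The idea of measuring how far a subsample's estimated distribution lies from that of the full data set can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes discrete attributes that are treated as mutually independent, so the joint distribution is a product of marginals and the Bhattacharyya coefficient factorises across attributes. The function names, the `n_repeats` parameter and the sampling scheme are hypothetical choices made for illustration only.

```python
import numpy as np


def empirical_distribution(column, support):
    """Relative frequency of each value of `support` within `column`."""
    column = np.asarray(column)
    counts = np.array([np.sum(column == v) for v in support], dtype=float)
    return counts / counts.sum()


def hellinger_independent(ps, qs):
    """Hellinger distance between two product distributions.

    `ps` and `qs` are lists of per-attribute marginals. For independent
    attributes the Bhattacharyya coefficient is the product of the
    per-attribute coefficients, and H = sqrt(1 - BC).
    """
    bc = 1.0
    for p, q in zip(ps, qs):
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        bc *= np.sum(np.sqrt(p * q))
    # Clip to guard against tiny negative values from rounding.
    return float(np.sqrt(max(0.0, 1.0 - bc)))


def complexity_curve(X, sizes, n_repeats=20, seed=0):
    """Mean Hellinger distance between random subsamples and the full set.

    Assumption (illustrative only): attributes are discrete and independent,
    and subsamples are drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n, d = X.shape
    supports = [np.unique(X[:, j]) for j in range(d)]
    full = [empirical_distribution(X[:, j], supports[j]) for j in range(d)]
    curve = []
    for m in sizes:
        distances = []
        for _ in range(n_repeats):
            idx = rng.choice(n, size=m, replace=False)
            sub = [empirical_distribution(X[idx, j], supports[j])
                   for j in range(d)]
            distances.append(hellinger_independent(sub, full))
        curve.append(np.mean(distances))
    return np.array(curve)


# Toy data: 200 samples, 4 discrete attributes with 3 values each.
X_demo = np.random.default_rng(1).integers(0, 3, size=(200, 4))
demo_sizes = [10, 50, 200]
demo_curve = complexity_curve(X_demo, demo_sizes)
```

In this sketch the curve falls towards zero as the subsample size approaches the size of the full data set. A curve that drops quickly suggests that a small subsample already captures the estimated distribution, which is the intuition behind using such a curve for data pruning.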
Author Comment
This paper introduces new methods of data analysis and classifier evaluation. They can be used in experimental studies of machine learning or applied as diagnostic tools for real-life problems. The code is freely available. The manuscript was submitted to the journal PeerJ Computer Science.
Supplemental Information
Software
The software for complexity curve calculations