This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
We describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. Contrary to some popular measures it is not focused on the shape of decision boundary in a classification task but on the amount of available data with respect to attribute structure. Complexity is expressed in terms of graphical plot, which we call complexity curve. We use it to propose a new variant of learning curve plot called generalisation curve. Generalisation curve is a standard learning curve with x-axis rescaled according to the data set complexity curve. It is a classifier performance measure, which shows how well the information present in the data is utilised. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. Then we apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing to significantly speed up the learning process without reducing classification accuracy. Associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).
This paper introduces new methods of data analysis and classifier evaluation. They can be used within experimental studies of machine learning or applied as diagnostic tools for real-life problems. The code is freely available. The manuscript was submitted to PeerJ Computer Science journal.
"Following" is like subscribing to any updates related to a preprint.
These updates will appear in your home dashboard each time you visit PeerJ.
You can also choose to receive updates via daily or weekly email digests.
If you are following multiple preprints then we will send you
no more than one email per day or week based on your preferences.
Note: You are now also subscribed to the subject areas of this preprint
and will receive updates in the daily or weekly email digests if turned on.
You can add specific subject areas through your profile settings.