A redundancy-removing feature selection algorithm for nominal data

Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangsu, China
Engineering Research Center of Internet of Things Technology Application (Ministry of Education), Jiangsu, China
Department of Computer Science and Engineering, School of Internet of Things Engineering, Jiangnan University, Jiangsu, China
Department of Computer Science, Georgia State University, Atlanta, GA, United States of America
DOI
10.7287/peerj.preprints.1184v1
Subject Areas
Data Mining and Machine Learning, Data Science
Keywords
Nominal data, Feature selection, Redundancy-removing, Mutual information
Copyright
© 2015 Li
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Li Z. 2015. A redundancy-removing feature selection algorithm for nominal data. PeerJ PrePrints 3:e1184v1

Abstract

Nominal data have no inherent ordering or similarity metric, and nominal datasets typically contain considerable redundancy, which makes an efficient mutual information-based feature selection method for nominal data relatively difficult to construct. In this paper, a mutual information-based feature selection method for nominal data that requires no data transformation, called the redundancy-removing more relevance less redundancy algorithm, is proposed. By introducing several new information-theoretic definitions and the corresponding computational methods, the proposed method can compute information-theoretic quantities on nominal data directly. Furthermore, by defining a new evaluation function that accounts for both relevance and redundancy globally, the new feature selection method can score the importance of each nominal feature. Although the presented feature selection method takes the commonly used MIFS-like form, it is capable of handling high-dimensional datasets without expensive computation. We perform extensive experimental comparisons of the proposed algorithm and other methods on three benchmark nominal datasets with two different classifiers. The experimental results demonstrate that, on average, the presented algorithm outperforms the well-known NMIFS algorithm in terms of feature selection and classification accuracy, which indicates that the proposed method has promising performance.
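The abstract does not spell out the proposed evaluation function, but it states that the method takes an MIFS-like form and computes mutual information on nominal data directly, without transforming the values. As a rough illustration only, the sketch below estimates mutual information from co-occurrence counts of nominal values and greedily selects features by the classic MIFS criterion J(f) = I(f; C) − β · Σ_{s∈S} I(f; s); the function names, the β parameter, and the dict-of-columns layout are illustrative assumptions, not the paper's actual algorithm.

```python
# A minimal sketch of MIFS-like greedy selection on nominal data.
# The exact evaluation function of the proposed algorithm is not given
# in the abstract; the classic MIFS criterion is used here as a stand-in.
from collections import Counter
from math import log2


def mutual_information(x, y):
    """I(X; Y) in bits for two equal-length sequences of nominal values."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mifs_select(features, labels, k, beta=0.5):
    """Greedily pick k feature indices; `features` maps index -> column."""
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        # Relevance to the class minus beta-weighted redundancy with
        # the features selected so far (sorted() keeps ties deterministic).
        best = max(
            sorted(remaining),
            key=lambda f: mutual_information(features[f], labels)
            - beta * sum(mutual_information(features[f], features[s])
                         for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage on hypothetical data:
X = {0: ["a", "a", "b", "b"],  # strongly predicts y
     1: ["a", "a", "b", "b"],  # exact duplicate of feature 0 (redundant)
     2: ["x", "y", "x", "y"]}  # carries no information about y
y = ["p", "p", "q", "q"]
print(mifs_select(X, y, k=2))  # -> [0, 1]
```

In this toy run, feature 0 is chosen first for its relevance, and the duplicate feature 1 is penalized by the redundancy term on the second pass; the abstract's claim is that the proposed evaluation function weighs such relevance and redundancy globally rather than pairwise as above.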

Author Comment

This is a submission to PeerJ Computer Science for review.