Unbalanced sentiment classification: an assessment of ANN in the context of sampling the majority class

Unidade de Graduação, Universidade Vale do Rio dos Sinos, São Leopoldo, RS, Brazil
Artificial Intelligence Engineers, Porto Alegre, RS, Brazil
School of Engineering & IT, Centro Universitário Ritter dos Reis, Porto Alegre, Rio Grande do Sul, Brazil
DOI
10.7287/peerj.preprints.26618v1
Subject Areas
Computational Linguistics, Data Mining and Machine Learning, Data Science, Natural Language and Speech, World Wide Web and Web Science
Keywords
Opinion mining, Artificial neural networks, Support vector machines, Sentiment classification, Unbalanced dataset, Comparative study
Copyright
© 2018 Moraes et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Moraes R, Valiati JF, Gavião Neto WP. 2018. Unbalanced sentiment classification: an assessment of ANN in the context of sampling the majority class. PeerJ Preprints 6:e26618v1

Abstract

Many people now make their opinions available on the Internet, and researchers have proposed methods to automate the task of classifying textual reviews as positive or negative. Standard supervised learning techniques have been adopted to accomplish this task. In practice, positive reviews are far more abundant than negative ones. This imbalance poses challenges to learning-based methods, and data undersampling and oversampling are popular preprocessing techniques for addressing the problem. A combination of sampling techniques and learning methods, such as Artificial Neural Networks (ANN) or Support Vector Machines (SVM), has been successfully adopted as a classification approach in many areas, but the sentiment classification literature has not explored ANN in studies that involve sampling methods to balance data. Even the performance of SVM, which is widely used as a sentiment learner, has rarely been addressed in the context of a preceding sampling method. This paper addresses document-level sentiment analysis with unbalanced data and focuses on empirically assessing the performance of ANN in the context of undersampling the (majority) set of positive reviews. We adopted the performance of SVM as a baseline, since some studies have indicated that SVM are less subject to the class imbalance problem. Results are produced in terms of a traditional bag-of-words model with popular feature selection and weighting methods. Our experiments indicated that SVM are more stable than ANN in highly unbalanced (80%) data scenarios. However, when random undersampling discards information, ANN outperform SVM or produce comparable results.
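
As an illustration of the preprocessing step the abstract refers to, the following is a minimal sketch of random undersampling of the majority class. It is not the authors' implementation (their supplemental code is in Matlab). The function name `undersample_majority`, the `ratio` and `seed` parameters, and the toy data are hypothetical, and the toy data assumes that "80%" means 80% of the reviews are positive.

```python
import random

def undersample_majority(docs, labels, majority_label, ratio=1.0, seed=0):
    """Randomly discard majority-class examples.

    Keeps all minority examples and at most ratio * (minority count)
    majority examples, chosen uniformly at random without replacement.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    majority = [i for i, lab in enumerate(labels) if lab == majority_label]
    minority = [i for i, lab in enumerate(labels) if lab != majority_label]
    n_keep = min(len(majority), round(ratio * len(minority)))
    # Sort the kept indices so the original document order is preserved.
    keep = sorted(rng.sample(majority, n_keep) + minority)
    return [docs[i] for i in keep], [labels[i] for i in keep]

# Toy corpus (assumed): 8 positive reviews and 2 negative reviews.
docs = [f"review {i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2

balanced_docs, balanced_labels = undersample_majority(docs, labels, "pos")
```

With `ratio=1.0` the result contains as many positive as negative reviews; the discarded positive reviews are the information loss that the abstract mentions when comparing ANN and SVM.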

Author Comment

This is a submission to PeerJ Computer Science for review.

Supplemental Information

Matlab code (.m and .mat) and raw data

DOI: 10.7287/peerj.preprints.26618v1/supp-1