An empirical analysis of machine learning models for automated essay grading

Computer Science, Ashoka University, Sonepat, Haryana, India
DOI
10.7287/peerj.preprints.3518v1
Subject Areas
Artificial Intelligence, Computational Linguistics, Data Mining and Machine Learning
Keywords
Automated Essay Grading, Information Retrieval, Machine Learning, Text Classification
Copyright
© 2018 Madala et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Madala DSV, Gangal A, Krishna S, Goyal A, Sureka A. 2018. An empirical analysis of machine learning models for automated essay grading. PeerJ Preprints 6:e3518v1

Abstract

Background. Automated Essay Scoring (AES) is an area that falls at the intersection of computing and linguistics. AES systems conduct a linguistic analysis of a given essay or prose and then estimate the writing skill or essay quality in the form of a numeric score or a letter grade. AES systems are useful to schools, universities and testing companies for efficiently and effectively scaling the task of grading a large number of essays.

Methods. We propose an approach for automatically grading a given essay based on 9 surface-level and deep linguistic features, 2 feature selection and ranking techniques and 4 text classification algorithms. We conduct a series of experiments on publicly available, manually graded and annotated essay data and demonstrate the effectiveness of our approach. We investigate the performance of two different feature selection techniques, (1) RELIEF and (2) Correlation-based Feature Subset Selection (CFS), with three different machine learning classifiers (kNN, SVM and Linear Regression). We also apply feature normalization and scaling.
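A minimal sketch of this kind of pipeline in scikit-learn is shown below; it is not the authors' released code. SelectKBest with mutual information stands in for the RELIEF/CFS selection steps (neither is part of scikit-learn), the build_pipeline helper and the random placeholder data are introduced here purely for illustration, and kNN can be swapped for an SVM or regression classifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

def build_pipeline(k_features: int = 5) -> Pipeline:
    """Scale features, keep the k most informative ones, then classify with kNN."""
    return Pipeline([
        ("scale", StandardScaler()),                                # feature normalization and scaling
        ("select", SelectKBest(mutual_info_classif, k=k_features)), # stand-in for RELIEF / CFS
        ("clf", KNeighborsClassifier(n_neighbors=5)),               # swap in SVC or LogisticRegression
    ])

# Placeholder data for illustration only: 200 essays, 9 features, grades 1-6.
X = np.random.rand(200, 9)
y = np.random.randint(1, 7, size=200)
print(cross_val_score(build_pipeline(), X, y, cv=5).mean())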

Results. Our results indicate that features such as word count with respect to the word limit, appropriate use of vocabulary, relevance of the terms in the essay to the given topic and coherence between sentences and paragraphs are good predictors of essay score. Our analysis reveals that not all features are equally important; a few features are more relevant and better correlated with the target class. We conduct experiments with k-nearest neighbour, logistic regression and support vector machine based classifiers. Our results on 4075 essays across multiple topics and grade score ranges are encouraging, with an accuracy of 73% to 93%.
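As a rough illustration of how such surface-level features can be computed (the function, its arguments prompt and word_limit, and the feature names are hypothetical and not taken from the paper), one can derive a length ratio against the word limit, a type-token ratio as a proxy for vocabulary use, and a TF-IDF cosine similarity to the prompt as a measure of topic relevance:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def essay_features(essay: str, prompt: str, word_limit: int) -> dict:
    words = essay.split()
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform([prompt, essay])
    return {
        # word count relative to the allowed word limit
        "length_ratio": len(words) / word_limit,
        # type-token ratio as a rough proxy for vocabulary use
        "vocab_richness": len(set(w.lower() for w in words)) / max(len(words), 1),
        # cosine similarity between the essay and the prompt as topic relevance
        "topic_relevance": cosine_similarity(matrix[0], matrix[1])[0, 0],
    }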

Discussion. Our experiments and approach are based on Grade 7 to Grade 10 essays and can be generalized to essays from other grades and levels after context-specific customization. A few features are more relevant and important than others, and it is the interplay or combination of multiple feature values that determines the final score. We observe that different classifiers result in different accuracy.

Author Comment

This is a submission to PeerJ Computer Science for review.

Supplemental Information

Source Code

All our source code (Python files) is uploaded to a public GitHub repository. Following is the link to our GitHub repository: https://github.com/ashoka-university/CS309-IR-Monsoon-2017-Off-by-One

DOI: 10.7287/peerj.preprints.3518v1/supp-1

Processed Dataset

Our complete dataset is uploaded to Figshare. Following is the link to our dataset repository on Figshare: https://doi.org/10.6084/m9.figshare.5765727.v1

DOI: 10.7287/peerj.preprints.3518v1/supp-2