An empirical analysis of machine learning models for automated essay grading
- Subject Areas
- Artificial Intelligence, Computational Linguistics, Data Mining and Machine Learning
- Keywords
- Automated Essay Grading, Information Retrieval, Machine Learning, Text Classification
- Copyright
- © 2018 Madala et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- Madala et al. 2018. An empirical analysis of machine learning models for automated essay grading. PeerJ Preprints 6:e3518v1 https://doi.org/10.7287/peerj.preprints.3518v1
Abstract
Background. Automated Essay Scoring (AES) is an area that falls at the intersection of computing and linguistics. AES systems conduct a linguistic analysis of a given essay or prose and then estimate the writing skill or essay quality in the form of a numeric score or a letter grade. AES systems help schools, universities and testing companies scale the task of grading large numbers of essays efficiently and effectively.
Methods. We propose an approach for automatically grading a given essay based on 9 surface-level and deep linguistic features, 2 feature selection and ranking techniques, and 4 text classification algorithms. We conduct a series of experiments on publicly available, manually graded and annotated essay data and demonstrate the effectiveness of our approach. We investigate the performance of two different feature selection techniques, (1) RELIEF and (2) Correlation-based Feature Subset Selection (CFS), with three different machine learning classifiers (kNN, SVM and logistic regression). We also apply feature normalization and scaling.
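The repository linked under Supplemental Information holds the authors' actual implementation; as an illustration only, the sketch below shows what a RELIEF-style feature ranking followed by min-max scaling and the three named classifiers could look like in scikit-learn. The feature matrix, grade labels, number of retained features, and random seeds are all placeholders, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def relief_weights(X, y, n_samples=100, seed=0):
    """Classic RELIEF feature weighting.

    For each sampled instance, find its nearest hit (same class) and
    nearest miss (different class); a feature gains weight when it
    differs more on the miss than on the hit.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    m = min(n_samples, len(X))
    for i in rng.choice(len(X), size=m, replace=False):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the instance itself
        same, other = y == y[i], y != y[i]
        hit = X[np.where(same)[0][np.argmin(d[same])]]
        miss = X[np.where(other)[0][np.argmin(d[other])]]
        w += (X[i] - miss) ** 2 - (X[i] - hit) ** 2
    return w / m

# Stand-in data: 200 essays x 9 features, 4 grade bands (placeholders).
rng = np.random.default_rng(42)
X = rng.random((200, 9))
y = rng.integers(0, 4, size=200)

top5 = np.argsort(relief_weights(X, y))[::-1][:5]  # keep top-ranked features
Xs = MinMaxScaler().fit_transform(X[:, top5])      # normalization/scaling step
Xtr, Xte, ytr, yte = train_test_split(Xs, y, test_size=0.25, random_state=0)

for clf in (KNeighborsClassifier(), SVC(), LogisticRegression(max_iter=1000)):
    print(type(clf).__name__, clf.fit(Xtr, ytr).score(Xte, yte))
```

CFS would slot in at the same point as `relief_weights`: instead of weighting features individually, it scores feature subsets by high feature-to-class and low feature-to-feature correlation.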
Results. Our results indicate that features such as word count with respect to the word limit, appropriate use of vocabulary, relevance of the essay's terms to the given topic, and coherence between sentences and paragraphs are good predictors of essay score. Our analysis reveals that not all features are equally important: a few features are more relevant and better correlated with the target class. We conduct experiments with k-nearest neighbour, logistic regression and support vector machine based classifiers. Our results on 4075 essays across multiple topics and a range of grade scores are encouraging, with accuracies of 73% to 93%.
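To make two of these predictors concrete, here is a minimal sketch, assuming a hypothetical 500-word limit and a naive regex sentence splitter (neither taken from the paper): a word-count-to-limit ratio, and a crude coherence score computed as the mean TF-IDF cosine similarity between consecutive sentences.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def word_count_ratio(essay, word_limit=500):
    """Word count relative to an assumed limit, capped at 1.0."""
    return min(len(essay.split()) / word_limit, 1.0)

def coherence(essay):
    """Mean TF-IDF cosine similarity between consecutive sentences."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    if len(sentences) < 2:
        return 0.0
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sims = [cosine_similarity(tfidf[i], tfidf[i + 1])[0, 0]
            for i in range(len(sentences) - 1)]
    return float(np.mean(sims))

essay = ("Essays need structure. A clear structure helps readers. "
         "Readers reward clarity.")
print(word_count_ratio(essay), coherence(essay))
```

Feature values like these would then be normalized and fed to the classifiers above.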
Discussion. Our experiments and approach are based on Grade 7 to Grade 10 essays; they can be generalized to essays from other grades and levels after context-specific customization. A few features are more relevant and important than others, and it is the interplay or combination of multiple feature values that determines the final score. We observe that different classifiers yield different accuracies.
Author Comment
This is a submission to PeerJ Computer Science for review.
Supplemental Information
Source Code
All our source code (Python files) is uploaded to a public GitHub repository: https://github.com/ashoka-university/CS309-IR-Monsoon-2017-Off-by-One
Processed Dataset
Our full dataset is uploaded to Figshare: https://doi.org/10.6084/m9.figshare.5765727.v1