Wrangling categorical data in R

Amelia McNamara; Nicholas J Horton

doi:10.7287/peerj.preprints.3163v2

Amelia McNamara¹, Nicholas J Horton²

¹ Program in Statistical & Data Sciences, Smith College, Northampton, Massachusetts, United States

² Department of Mathematics and Statistics, Amherst College, Amherst, Massachusetts, United States

Published: 2017-08-30
Accepted: 2017-08-30

Subject Areas: Computer Education, Data Science, Scientific Computing and Simulation, Social Computing
Keywords: statistical computing, data derivation, data science, data management

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: McNamara A, Horton NJ. 2017. Wrangling categorical data in R. PeerJ Preprints 5:e3163v2 https://doi.org/10.7287/peerj.preprints.3163v2

Abstract

Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and another from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.

Author Comment

This version contains updated citations to other articles in this collection.