Assessing the reproducibility of discriminant function analyses

Rose L Andrew; Arianne YK Albert; Sebastien Renaut; Diana J Rennison; Dan G Bock; Tim Vines

doi:10.7287/peerj.preprints.832v1

Rose L Andrew^1,2, Arianne YK Albert^3, Sebastien Renaut^2,4, Diana J Rennison^2, Dan G Bock^2, Tim Vines^2,5

1 School of Environmental and Rural Science, University of New England, Armidale, NSW, Australia

2 Biodiversity Research Centre, University of British Columbia, Vancouver, BC, Canada

3 Women’s Health Research Institute, BC Women’s Hospital and Health Centre, Vancouver, BC, Canada

4 Institut de recherche en biologie végétale, Département de sciences biologiques, Université de Montréal, Montreal, QC, Canada

5 Molecular Ecology Editorial Office, Vancouver, BC, Canada

Published: 2015-02-13
Accepted: 2015-02-13

Subject Areas: Evolutionary Studies, Taxonomy, Zoology, Science Policy
Keywords: data curation, repeatability, data archiving, statistics

Copyright: © 2015 Andrew et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Cite this article: Andrew RL, Albert AY, Renaut S, Rennison DJ, Bock DG, Vines T. 2015. Assessing the reproducibility of discriminant function analyses. PeerJ PrePrints 3:e832v1 https://doi.org/10.7287/peerj.preprints.832v1

Abstract

Data are the foundation of empirical research, yet all too often the datasets underlying published papers are unavailable, incorrect, or poorly curated. This is a serious issue, because future researchers are then unable to validate published results or reuse data to explore new ideas and hypotheses. Even when data files are securely stored and accessible, they must also be accompanied by accurate labels and identifiers. To assess how often problems with metadata or data curation affect the reproducibility of published results, we attempted to reproduce Discriminant Function Analyses (DFAs) from the field of organismal biology. DFA is a commonly used statistical analysis that has changed little since its inception almost eight decades ago, and therefore provides an excellent case study for testing reproducibility. Of the 100 papers we initially surveyed, 14 were excluded because they did not present the common types of quantitative result from their DFA, used complex and unique data transformations, or gave insufficient details of their DFA. Of the remaining 86 datasets, there were 16 for which we were unable to confidently relate the dataset we received to the one used in the published analysis. The reasons included incomprehensible or absent variable labels, the DFA being performed on an unspecified subset of the data, and incomplete datasets. We focused on reproducing three common summary statistics from DFAs: the percent variance explained, the percentage correctly assigned, and the largest discriminant function coefficient. The reproducibility of the first two was high (20 of 25 and 43 of 59 datasets, respectively), whereas our success rate with the discriminant function coefficients was lower (15 of 36 datasets). Considering all three summary statistics together, we were able to completely reproduce 46 (66%) of 70 datasets. While our results are encouraging, they highlight the fact that science still has some way to go before we have the carefully curated and reproducible research that the public expects.
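
The counts reported in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the authors' code (their reanalysis was done in R). The variable names are illustrative, and treating the 70 datasets as the 86 remaining minus the 16 that could not be matched is an inference from the figures above, not something the abstract states outright.

```python
# Counts copied from the abstract; all names here are illustrative only.
surveyed = 100   # papers initially surveyed
excluded = 14    # papers excluded before reanalysis
unmatched = 16   # datasets that could not be related to the published analysis

remaining = surveyed - excluded      # datasets left after exclusions
analysable = remaining - unmatched   # assumed basis of the "70 datasets" figure

# (datasets reproduced, datasets attempted) for each summary statistic
results = {
    "percent variance explained": (20, 25),
    "percentage correctly assigned": (43, 59),
    "largest discriminant function coefficient": (15, 36),
    "all reported statistics": (46, analysable),
}


def rate(reproduced, attempted):
    """Return the reproduction rate as a percentage, rounded to one decimal."""
    return round(100 * reproduced / attempted, 1)


rates = {name: rate(*counts) for name, counts in results.items()}

for name, value in rates.items():
    print(f"{name}: {value}%")
```

On these counts the overall rate rounds to the 66% quoted in the abstract, and the discriminant function coefficient is the only statistic reproduced in fewer than half of the datasets that reported it.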

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

R code used for reanalyzing the discriminant functions

DOI: 10.7287/peerj.preprints.832v1/supp-1
