Problems in using p-curve analysis and text-mining to detect rate of p-hacking
- Published
- Accepted
- Subject Areas
- Science Policy, Statistics
- Keywords
- Reproducibility, P-hacking, Simulation, Ghost variables, Text-mining, P-curve
- Copyright
- © 2015 Bishop et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
- Cite this article
- 2015. Problems in using p-curve analysis and text-mining to detect rate of p-hacking. PeerJ PrePrints 3:e1266v3 https://doi.org/10.7287/peerj.preprints.1266v3
Abstract
Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. We argue that binomial tests on the p-curve are not robust enough to be used for this purpose. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that a p-curve suggestive of p-hacking can be obtained if researchers misapply parametric tests to data that depart from normality, even when no p-hacking occurs. We go on to show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: A significant bump in the p-curve just below .05 is not necessarily evidence of p-hacking, and lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis.
Author Comment
This manuscript has been totally rewritten in the light of reviewer comments. We have added further analysis and description of the impact of correlated variables on the p-curve, have added an analysis of the impact of non-normality of underlying variables, have included a power analysis for p-curve analysis based on binomial tests on extreme p-value bins, and have deleted a section reanalysing the text-mined data of Head et al using different criteria for inclusion of studies.