Thank you for the comments. We have just uploaded a revised paper to PeerJ for consideration for publication, which will be available in preprint form once it has passed checks.
It is helpful to have clarification of what you see as the main contribution of your paper. It was reassuring to see your comment: 'Perhaps we should have been clearer that we underestimate p-hacking, and the extent of p-hacking might be even worse than suggested by our data. But you never provide an argument that challenges our main conclusion, namely “There is evidence of p-hacking in much of the literature”. '
In our preprint, the focus was on the idea that your analysis provided information on the extent and consequences of p-hacking. The impression given by your paper is that, though p-hacking occurs, it is not a serious problem, e.g. "Given recent concerns about the lack of reproducibility of findings (e.g., [49] but see [50]) and the possibility that many published results are false [2], our results are reassuring." Also, the end of your Abstract stated: "while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses."
You are correct in noting that we focused exclusively on the large text-mined dataset of p-values from Results sections, which has different characteristics from the meta-analyses. We have now added a mention of the latter analysis, as well as of the analysis of Abstracts.
We note that you have added qualifiers to some of your conclusions. Nevertheless, we remain dubious that any meaningful quantification of the amount of either p-hacking or evidential value can come from this kind of p-curve analysis, as explained more clearly in our revised manuscript. In your reply you state that you deliberately left out any quantitative measurement of the amount of evidential value, yet when describing your approach you stated: 'To quantify "evidential value" (i.e., if there is evidence that the true effect size is nonzero) and p-hacking, we constructed p-curves from the p-values we obtained'.
This response is in two parts. First, we respond to specific points you raised in your comments. Second, we note some new concerns that have arisen from additional analyses we have conducted.
A. Response to specific points in your commentary
1-2. Head et al. did detect p-hacking.
We did not discuss the 'bump' below .05 because we shared the concern of Hartgerink (2015) that this might be a spurious consequence of p-values being reported to different numbers of decimal places. We had explained this at the top of p. 11. We note that you are not impressed by the arguments of Hartgerink, and we have added some further analysis of our own to address this point (see below).
Note, however, that this point is not central to our argument, which is that, while a bump under .05 is hard to explain except by p-hacking, you can have a great deal of p-hacking and no bump. This is a key point, and you would seem to agree with it.
Simulations
It is reassuring that you reproduced the results of our simulation study. Where we differ is in the interpretation of the left-skewed p-curve that you get when variables are highly intercorrelated. Our point here is that (a) this gives a continuous slope, rather than a 'bump' just below .05; (b) the slope is slight enough that it would not usually give a significant difference on a binomial test when the two bins just below .05 are compared; and (c) data in those bins are noisy – we managed to get a significant binomial test on one run from uncorrelated data, where the true slope is flat.
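For concreteness, the kind of bin comparison we have in mind is sketched below in R. This is only an illustration: the function name, the bin boundaries and the one-sided alternative are our assumptions, not a claim about the exact test implemented in your scripts.

    # Sketch of the binomial comparison of the two bins just below .05
    bump_test <- function(pvals, lower = 0.04, mid = 0.045, upper = 0.05) {
      in_lower <- sum(pvals > lower & pvals < mid)    # bin further from .05
      in_upper <- sum(pvals >= mid & pvals < upper)   # bin just below .05
      # one-sided test of whether more than half of these cases sit in the bin nearest .05
      binom.test(in_upper, in_upper + in_lower, p = 0.5, alternative = "greater")
    }

On flat simulated data, e.g. bump_test(runif(10000, 0, 0.05)), this should usually be non-significant, which is why the occasional significant run we obtained from uncorrelated data illustrates point (c).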
We had actually started out by modelling p-hacking with ghost variables using the model you show in Figure 2 of your comment, i.e. researchers pick the smallest p-value to report. We did not proceed with this model, however, as our introspection was that this is not a plausible account of how people p-hack. Suppose you had 3 out of 10 variables with p < .05 – would you really throw away two of them and report only the most extreme? We have just seen an in-press paper by Simonsohn et al. (2015) that makes a similar point, and which we now cite.
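For reference, that 'report only the smallest p-value' model can be sketched in a few lines of R, as below. All parameter values and the correlation structure are our own illustrative assumptions.

    # Ghost-variable model: k intercorrelated variables, null true, only the minimum p reported
    simulate_min_p <- function(nsim = 5000, n = 20, k = 10, rho = 0.8) {
      replicate(nsim, {
        common <- rnorm(n)                                   # shared component inducing intercorrelation
        dat <- sapply(1:k, function(i) sqrt(rho) * common + sqrt(1 - rho) * rnorm(n))
        ps <- apply(dat, 2, function(x) t.test(x)$p.value)   # one-sample t-tests; the null is true for every variable
        min(ps)                                              # the researcher reports only the smallest p-value
      })
    }
    set.seed(1)
    minp <- simulate_min_p()
    hist(minp[minp < .05], breaks = seq(0, .05, by = .005))  # p-curve of 'significant' results under this model

As explained above, we set this model aside not because it is hard to simulate, but because we doubt it describes how researchers actually behave.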
We have done some further analyses, as you have, simulating cases where there is a true effect, and hope to report these later, as the results are somewhat counterintuitive. We are not convinced, though, that the 'bump' in the lower panel of your Figure 3 would be detectable in a realistic dataset, but that is obviously something that both of us could consider in future work. Our main conclusions are that:
a) Though a bump below .05 is good evidence of p-hacking, lack of a bump is not good evidence of lack of p-hacking – your simulation and ours both confirm that.
b) Furthermore, while evidential value is associated with a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of evidential value when there is no control over the type of p-values entered into the analysis.
You suggest we should recast our paper as a more general critique of p-curve analysis, but, while we agree that point (a) applies to Simonsohn et al. (as they acknowledge), point (b) is specific to Head et al. (and any other analyses that do not control which p-values are entered into the analysis).
We have added Lakens to the Bibliography; thanks for noting the omission.
Comments 3-4. Independence of data points
You are correct to have picked us up on the confusing way we described this. The argument we wanted to make is that, for some papers, you are picking randomly from a range of different statistical tests, whereas in others your selection may include a set of p-values that are all effectively indicators of the same effect – if so, the selection will be biased to include that effect. We were not thinking so much of the p-hacking that you mention in point 4 as of cases where, for instance, you do a regression analysis in which successive variables are dropped, or where you follow up a significant interaction with further tests. On reflection, we accept this is not a relevant argument, as it is not clear whether and how it might affect the p-curve, so we have removed it.
Double-dipping remains a real concern, however, because this is a case where p-hacking, far from yielding a 'bump' just below .05, can give highly significant but spurious p-values that will look as if they have evidential value.
Comment 5
As with previous comments, you emphasise that, yes, you did find p-hacking. We now state up front that this is a valid conclusion from your study. But this misses the thrust of our argument, which is that you simply cannot quantify and compare rates of evidential value and p-hacking from data like these.
B. New analyses
Is there really a 'bump' below .05?
As noted above, we had found the analysis by Hartgerink convincing, but your misgivings led us to consider this further. As we note in our revised ms, it really is very difficult to know what to do with results that are reported to two decimal places: in particular, a value of .04 could be a genuine exact p-value, or it could have been rounded down. Our solution was to redo the analysis, excluding any study that reported any p-value to only two decimal places. This gives a distribution of p-values with a notch at .04, .05, etc. (see our new Figure 3). If, however, we exclude those exact values when comparing bins, we think we then have an unbiased comparison. We did that and found that the proportion of cases in the upper bin was reduced, but the test for three subject areas and for the complete set of papers was still significant. Thus this analysis did support your conclusion that there is a small effect of p-hacking (the overall proportion in the upper bin was 52%).
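To make the exclusion rule concrete, here is a minimal sketch in R; the data frame pvals_df and its column names are hypothetical stand-ins, and our actual code is the version we will place on the Open Science Framework.

    # digits after the decimal point in each p-value as reported (string form assumed)
    decimals <- function(p_string) nchar(sub("^[^.]*\\.", "", p_string))
    # papers reporting any p-value to only two decimal places
    two_dp_papers <- unique(pvals_df$paper_id[decimals(pvals_df$p_string) <= 2])
    # drop those papers entirely, then drop the exact values that create the 'notch'
    cleaned <- subset(pvals_df,
                      !(paper_id %in% two_dp_papers) &
                      !(p_value %in% c(0.04, 0.045, 0.05)))

The remaining p-values in the two bins below .05 can then be compared exactly as before.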
We would add that being forced to look at the data in such detail also impressed upon us just how rare it was to find p-values close to .05. This really seemed surprising to us in absolute terms: e.g. of around 1300 papers in Psychology, only 51 had p-values in this range. In part this could be because many papers report only p < .05. We say a little about this in the revised paper.
As you will gather from our comments above, we would not want to draw the conclusion from our analyses that p-hacking does not occur: we simply think that (a) the text-mined dataset is not capable of testing for the 'bump', because of the numerous uncontrolled sources of variation, and (b) substantial p-hacking is compatible with lack of a bump.
Minor point re bootstrapped means
We were interested in how the means in different bins varied according to which p-values were selected. We tried to look at this using your scripts, but the program that we downloaded from Dryad (analysis.r) did not behave as expected. We tracked the problem down to the function one.random.pvalue.per.paper.r: the version on Dryad gives the same set of p-values every time it is run, always picking the first p-value for a paper rather than a random one. We wrote a new version of the function (sketched below) so that we could both replicate your intended analysis and re-run it with a different data specification (as described above, excluding papers reporting p-values to 2 decimal places). This does not make much difference to the outcomes, but we mention it for completeness, and we will add our code to the Open Science Framework (see link in our revised paper).
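A minimal sketch of the behaviour we intended is below; this is an illustration only (the column name paper_id is assumed), not the exact code from Dryad or from our OSF version.

    one.random.pvalue.per.paper <- function(df) {
      # df has one row per extracted p-value, with a paper identifier column
      picked <- lapply(split(df, df$paper_id), function(paper_rows) {
        paper_rows[sample(nrow(paper_rows), 1), ]   # a randomly chosen row, not always the first
      })
      do.call(rbind, picked)
    }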
Final comment
As we have worked through the many analyses, we have become increasingly appreciative of the heroic efforts in data-mining and analysis that have gone into your paper. We are also most grateful to you for making your scripts and data publicly available; this is invaluable for those who wish to reconstruct what you did and consider alternative approaches. We do, however, think that the more we look at the data, the harder it is to justify using such material for drawing conclusions about rates of p-hacking vs. evidential value.