Thank you for the comments. We have just uploaded a revised paper to PeerJ for consideration for publication, which will be available in preprint form once it has passed checks.
It is helpful to have clarification of what you see as the main contribution of your paper. It was reassuring to see your comment: 'Perhaps we should have been clearer that we underestimate p-hacking, and the extent of p-hacking might be even worse than suggested by our data. But you never provide an argument that challenges our main conclusion, namely “There is evidence of p-hacking in much of the literature”. '
In our preprint, the focus was on the idea that your analysis provided information on the extent and consequences of p-hacking. The impression given by your paper is that, though p-hacking occurs, it is not a serious problem, e.g. "Given recent concerns about the lack of reproducibility of findings (e.g., [49] but see [50]) and the possibility that many published results are false [2], our results are reassuring." Also, the end of your Abstract stated: "while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses."
You are correct in noting that we focused exclusively on the large text-mined dataset of p-values from Results sections, which has different characteristics from the meta-analyses. We have now added a mention of the latter analysis, as well as of the analysis of Abstracts.
We note that you have added qualifiers to some of your conclusions. Nevertheless, we remain dubious that any meaningful quantification of the amount of either p-hacking or evidential value can come from this kind of p-curve analysis, as explained more clearly in our revised manuscript. In your reply you state that you deliberately left out any quantitative measurement of the amount of evidential value, yet when describing your approach you stated: 'To quantify "evidential value" (i.e., if there is evidence that the true effect size is nonzero) and p-hacking, we constructed p-curves from the p-values we obtained'.
This response is in two parts. First, we respond to specific points you raised in your comments. Second, we note some new concerns that have arisen from additional analyses we have conducted.
A. Response to specific points in your commentary
1-2. Head et al. did detect p-hacking.
We did not discuss the 'bump' below .05 because we shared the concern of Hartgerink (2015) that this might be a spurious consequence of p-values being reported to different numbers of decimal places. We had explained this at the top of p. 11. We note that you are not impressed by the arguments of Hartgerink, and we have added some further analysis of our own to address this point (see below).
Note, however, that this point is not central to our argument, which is that, while a bump under .05 is hard to explain except by p-hacking, you can have a great deal of p-hacking and no bump. This is a key point, and you would seem to agree with it.
Simulations
It is reassuring that you reproduced the results of our simulation study. Where we differ is in the interpretation of the left-skewed p-curve that you get when variables are highly intercorrelated. Our point here is that (a) this gives a continuous slope, rather than a 'bump' just below .05; (b) the slope is slight enough that it would not usually give a significant difference on a binomial test when the two bins just below .05 are compared; and (c) data in those bins are noisy – we managed to get a significant binomial test on one run from uncorrelated data, where the true slope is flat.
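For concreteness, the kind of bin comparison we have in mind is sketched below in R. This is only an illustration: the function name, the bin boundaries and the one-sided alternative are our assumptions, not a claim about the exact test implemented in your scripts.

    # Sketch of the binomial comparison of the two bins just below .05
    bump_test <- function(pvals, lower = 0.04, mid = 0.045, upper = 0.05) {
      in_lower <- sum(pvals > lower & pvals < mid)    # bin further from .05
      in_upper <- sum(pvals >= mid & pvals < upper)   # bin just below .05
      # one-sided test of whether more than half of these cases sit in the bin nearest .05
      binom.test(in_upper, in_upper + in_lower, p = 0.5, alternative = "greater")
    }

On flat simulated data, e.g. bump_test(runif(10000, 0, 0.05)), this should usually be non-significant, which is why the occasional significant run we obtained from uncorrelated data illustrates point (c).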
We had actually started out by modelling p-hacking with ghost variables using the model you show in Figure 2 of your comment, i.e. researchers pick the smallest p-value to report. We did not proceed with this model, however, as our introspection was that this is not a plausible account of how people p-hack. Suppose you had 3 out of 10 variables with p < .05 – would you really throw away two of them and report only the most extreme? We have just seen an in-press paper by Simonsohn et al. (2015) that makes a similar point, and which we now cite.
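For reference, that 'report only the smallest p-value' model can be sketched in a few lines of R, as below. All parameter values and the correlation structure are our own illustrative assumptions.

    # Ghost-variable model: k intercorrelated variables, null true, only the minimum p reported
    simulate_min_p <- function(nsim = 5000, n = 20, k = 10, rho = 0.8) {
      replicate(nsim, {
        common <- rnorm(n)                                   # shared component inducing intercorrelation
        dat <- sapply(1:k, function(i) sqrt(rho) * common + sqrt(1 - rho) * rnorm(n))
        ps <- apply(dat, 2, function(x) t.test(x)$p.value)   # one-sample t-tests; the null is true for every variable
        min(ps)                                              # the researcher reports only the smallest p-value
      })
    }
    set.seed(1)
    minp <- simulate_min_p()
    hist(minp[minp < .05], breaks = seq(0, .05, by = .005))  # p-curve of 'significant' results under this model

As explained above, we set this model aside not because it is hard to simulate, but because we doubt it describes how researchers actually behave.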
We have done some further analyses, as you have, simulating cases where there is a true effect, and hope to report these later, as the results are somewhat counterintuitive. We are not convinced, though, that the 'bump' in the lower panel of your Figure 3 would be detectable in a realistic dataset, but that is obviously something that both of us could consider in future work. Our main conclusions are that:
a) Though a bump below .05 is good evidence of p-hacking, lack of a bump is not good evidence of lack of p-hacking – your simulation and ours both confirm that.
b) Furthermore, while evidential value is associated with a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of evidential value when there is no control over the type of p-values entered into the analysis.
You suggest we should recast our paper as a more general critique of p-curve analysis, but, while we agree that point (a) applies to Simonsohn et al. (as they acknowledge), point (b) is specific to Head et al. (and any other analyses that do not control which p-values are entered into the analysis).
We have added Lakens to the Bibliography; thanks for noting the omission.
Comments 3-4. Independence of data points
You are correct to have picked us up on the confusing way we described this. The argument we wanted to make is that, for some papers, you are picking randomly from a range of different statistical tests, whereas in others your selection may include a set of p-values that are all effectively indicators of the same effect – if so, the selection will be biased to include that effect. We were not thinking so much of the p-hacking that you mention in point 4 as of cases where, for instance, you do a regression analysis in which successive variables are dropped, or where you follow up a significant interaction with further tests. On reflection, we accept this is not a relevant argument, as it is not clear whether and how it might affect the p-curve, so we have removed it.
Double-dipping remains a real concern, however, because this is a case where p-hacking, far from yielding a 'bump' just below .05, can give highly significant but spurious p-values that will look as if they have evidential value.
Comment 5
As with previous comments, you emphasise that, yes, you did find p-hacking. We now state up front that this is a valid conclusion from your study. But this misses the thrust of our argument, which is that you simply cannot quantify and compare rates of evidential value and p-hacking from data like these.
B. New analyses
Is there really a 'bump' below .05?
As noted above, we had found the analysis by Hartgerink convincing, but your misgivings led us to consider this further. As we note in our revised ms, it really is very difficult to know what to do with results that are reported to two decimal places: in particular, a value of .04 could be a genuine exact p-value, or it could have been rounded down. Our solution was to redo the analysis, excluding any study that reported any p-value to only two decimal places. This gives a distribution of p-values with a notch at .04, .05, etc. (see our new Figure 3). If, however, we exclude those exact values when comparing bins, we think we then have an unbiased comparison. We did that and found that the proportion of cases in the upper bin was reduced, but the test for three subject areas and for the complete set of papers was still significant. Thus this analysis did support your conclusion that there is a small effect of p-hacking (the overall proportion in the upper bin was 52%).
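To make the exclusion rule concrete, here is a minimal sketch in R; the data frame pvals_df and its column names are hypothetical stand-ins, and our actual code is the version we will place on the Open Science Framework.

    # digits after the decimal point in each p-value as reported (string form assumed)
    decimals <- function(p_string) nchar(sub("^[^.]*\\.", "", p_string))
    # papers reporting any p-value to only two decimal places
    two_dp_papers <- unique(pvals_df$paper_id[decimals(pvals_df$p_string) <= 2])
    # drop those papers entirely, then drop the exact values that create the 'notch'
    cleaned <- subset(pvals_df,
                      !(paper_id %in% two_dp_papers) &
                      !(p_value %in% c(0.04, 0.045, 0.05)))

The remaining p-values in the two bins below .05 can then be compared exactly as before.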
We would add that being forced to look at the data in such detail also impressed upon us just how rare it was to find p-values close to .05. This really seemed surprising to us in absolute terms: e.g. of around 1300 papers in Psychology, only 51 had p-values in this range. In part this could be because many papers report only p < .05. We say a little about this in the revised paper.
As you will gather from our comments above, we would not want to draw the conclusion from our analyses that p-hacking does not occur: we simply think that (a) the text-mined dataset is not capable of testing for the 'bump', because of the numerous uncontrolled sources of variation, and (b) substantial p-hacking is compatible with lack of a bump.
Minor point re bootstrapped means
We were interested in how the means in different bins varied according to which p-values were selected. We tried to look at this using your scripts, but the program that we downloaded from Dryad (analysis.r) did not behave as expected. We tracked the problem down to the function one.random.pvalue.per.paper.r: the version on Dryad gives the same set of p-values every time it is run, always picking the first p-value for a paper rather than a random one. We wrote a new version of the function (sketched below) so that we could both replicate your intended analysis and re-run it with a different data specification (as described above, excluding papers reporting p-values to 2 decimal places). This does not make much difference to the outcomes, but we mention it for completeness, and we will add our code to the Open Science Framework (see link in our revised paper).
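A minimal sketch of the behaviour we intended is below; this is an illustration only (the column name paper_id is assumed), not the exact code from Dryad or from our OSF version.

    one.random.pvalue.per.paper <- function(df) {
      # df has one row per extracted p-value, with a paper identifier column
      picked <- lapply(split(df, df$paper_id), function(paper_rows) {
        paper_rows[sample(nrow(paper_rows), 1), ]   # a randomly chosen row, not always the first
      })
      do.call(rbind, picked)
    }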
Final comment
As we have worked through the many analyses, we have become increasingly appreciative of the heroic efforts in data-mining and analysis that have gone into your paper. We are also most grateful to you for making your scripts and data publicly available; this is invaluable for those who wish to reconstruct what you did and consider alternative approaches. We do, however, think that the more we look at the data, the harder it is to justify using such material for drawing conclusions about rates of p-hacking vs. evidential value.