Distributions of p-values smaller than .05 in psychology: what is going on?
 Academic Editor
 Melania Pintilie
 Subject Areas
 Psychiatry and Psychology, Science Policy, Statistics
 Keywords
 p-values, NHST, QRP, Caliper test, Data peeking
 Copyright
 © 2016 Hartgerink et al.
 License
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 Hartgerink et al. 2016. Distributions of p-values smaller than .05 in psychology: what is going on? PeerJ 4:e1935 https://doi.org/10.7717/peerj.1935
Abstract
Previous studies provided mixed findings on peculiarities in p-value distributions in psychology. This paper examined 258,050 test results across 30,710 articles from eight high-impact journals to investigate the existence of a peculiar prevalence of p-values just below .05 (i.e., a bump) in the psychological literature, and a potential increase thereof over time. We indeed found evidence for a bump just below .05 in the distribution of exactly reported p-values in the journals Developmental Psychology, Journal of Applied Psychology, and Journal of Personality and Social Psychology, but the bump did not increase over the years and disappeared when using recalculated p-values. We found clear and direct evidence for the QRP “incorrect rounding of p-values” (John, Loewenstein & Prelec, 2012) in all psychology journals. Finally, we also investigated monotonic excess of p-values, an effect of certain QRPs that has been neglected in previous research, and developed two measures to detect it by modeling the distributions of statistically significant p-values. Using simulations and applying the two measures to the retrieved test results, we argue that, although one of the measures suggests the use of QRPs in psychology, it is difficult to draw general conclusions about QRPs based on modeling of p-value distributions.
Introduction
A set of p-values can be informative about the underlying effects that are investigated, but can also be indicative of potential research biases or questionable research practices (QRPs). In the absence of QRPs, the distribution of significant p-values can be expected to have a certain shape. Under the null hypothesis all p-values are equally probable (i.e., they follow a uniform distribution). If there truly is an effect, smaller p-values are more likely than larger p-values (i.e., the distribution decreases monotonically in the p-value). Consequently, because some hypotheses are false and some are true, the distribution of observed p-values arises from a mixture of uniform and right-skewed distributions and should also decrease monotonically.^{1} QRPs may have various effects on the p-value distribution. Figure 1 shows the p-value distribution of statistical tests both with data peeking (solid lines) and without data peeking. Data peeking (also known as optional stopping) refers to conducting intermediate significance tests during data collection (Armitage, McPherson & Rowe, 1969). Data peeking greatly affects the p-value distribution in all panels, as can be seen by comparing the ‘true’ and ‘data-peeked’ p-value distributions. Figure 1A, which is obtained after data peeking in studies with standardized effect size d = 0, shows a ‘bump’ in the distribution. A bump corresponds to the part of the p-value distribution that makes it no longer monotonically decreasing. Figure 1B also shows a bump for data peeking in studies with d = 0.2. However, Fig. 1C shows no bump but merely monotonic excess, i.e., an increase in the frequency of p-values below .05 in the absence of a bump. Consequently, data peeking may lead either to monotonic excess or to a bump in the distribution of p-values. There are other known QRPs in the analysis of data (John, Loewenstein & Prelec, 2012), but these have different effects on the p-value distribution and do not necessarily lead to a bump, as shown in Fig. 1.
In this paper we attempt to answer two questions: (1) Does a bump or monotonic excess of p-values below .05 exist in psychology? and (2) Did evidence for a bump increase over time in psychology? We chose to focus on psychology because of the availability of an extensive database of statistical results in psychology (used in Nuijten et al., 2015) and because discussions on research practices are particularly salient in this discipline (e.g., Pashler & Wagenmakers, 2012; John, Loewenstein & Prelec, 2012; Simmons, Nelson & Simonsohn, 2011; Wagenmakers et al., 2012; Asendorpf et al., 2013).
How QRPs relate to distributions of p-values
QRPs are defined as practices that are detrimental to the research process (Panel on Scientific Responsibility and the Conduct of Research, 1992), with a recent focus on those that “increase the likelihood of finding support for a false hypothesis” (John, Loewenstein & Prelec, 2012, p. 524). Several QRPs related to significance testing are known to affect the p-values of statistical tests and consequently the decisions based on these tests. Specifically, particular QRPs may yield results that are just significant and can create a bump of p-values, such as the ad hoc exclusion of outliers (Bakker & Wicherts, 2014), repeatedly sampling new participants and checking the results (i.e., data peeking; Armitage, McPherson & Rowe, 1969), including various combinations of covariates until a significant result is reached, operationalizing a measure in different ways until significance is reached (Simmons, Nelson & Simonsohn, 2011), or the selective reporting of p-values (Franco, Malhotra & Simonovits, 2015). These QRPs have been used by many researchers at least once in their career. For instance, data peeking and the ad hoc exclusion of outliers were admitted by 63% and 38% of psychological researchers, respectively (John, Loewenstein & Prelec, 2012). On the other hand, other QRPs mainly yield very small and (clearly) significant p-values, such as analyzing multiple conditions or correlated variables and selecting only the smallest p-value out of this set of analyses (R Van Aert, J Wicherts & M Van Assen, 2016, unpublished data; Ulrich & Miller, 2015), and do not lead to a bump. To summarize, different QRPs affect the distribution of statistically significant p-values differently.
However, there are at least two problems with using p-value distributions to examine the prevalence of QRPs. First, as we argued above, not all QRPs lead to a bump of p-values just below .05. Hence, examining the distribution of p-values just below .05 will not inform us about the prevalence of QRPs that do not aim to obtain just-significant results but yield mainly small and clearly significant p-values (R Van Aert, J Wicherts & M Van Assen, 2016, unpublished data; Ulrich & Miller, 2015). Second, the QRPs yielding just-significant results do not necessarily result in a non-monotonic p-value distribution, that is, a distribution with a bump. For instance, consider Fig. 1, which shows the results of simulations of data peeking, a QRP known to result in mainly just-significant p-values (Armitage, McPherson & Rowe, 1969; Lakens, 2015b; Wagenmakers, 2007). Figure 1 illustrates that data peeking may result in non-monotonic excess (i.e., a bump; A and B), but can also cause monotonic excess (C), even if all researchers use data peeking. Specifically, if all underlying effects are genuinely and substantially different from zero (C), data peeking will generally not lead to a bump below .05. In the present paper, we therefore examine the peculiar prevalence of p-values just below .05 by investigating the presence of both a bump and monotonic excess in distributions of statistically significant results.
Previous findings
Masicampo & Lalande (2012) found a bump of p-values just below .05 in three main psychology journals (i.e., Journal of Personality and Social Psychology, JPSP; Journal of Experimental Psychology: General, JEPG; Psychological Science, PS), which, as we saw, could be explained by research biases due to QRPs. The observation of a bump was one of several signals of a crisis of confidence in research findings in psychological science (Pashler & Wagenmakers, 2012; Ferguson, 2015). Leggett et al. (2013) later corroborated this bump of p-values for JPSP and JEPG, and observed that it was larger in 2005 than in 1965. Considering that research biases can lead to an overemphasis on statistical significance, this result suggested that the state of psychology may even have deteriorated over the years. Additional corroboration in samples of published articles from various fields was provided by Head et al. (2015), who documented the bump of p-values below .05 in 1,048,575 articles across 16 disciplines including psychology. Ginsel et al. (2015) found similarly biased reporting of p-values in medical abstracts, but noted the variety of potential causes (e.g., publication bias, fraud, selective reporting).
At the same time, other studies failed to find a bump of p-values below .05 (Jager & Leek, 2014; Krawczyk, 2015; Vermeulen et al., 2015). Reanalysis of the original data by Lakens (2015b) and ourselves indicated that the results may have been confounded by publication bias (Masicampo & Lalande, 2012) and tendencies to round p-values (Head et al., 2015). Publication bias refers to the fact that the probability of getting published is higher for statistically significant results than for statistically nonsignificant results (Gerber et al., 2010; Franco, Malhotra & Simonovits, 2014). Publication bias only changes the p-value distribution above .05 and cannot cause a bump. Krawczyk (2015) analyzed a sample of around 5,000 psychology articles and found no bump in p-values that were recalculated on the basis of reported test statistics and degrees of freedom (cf. Bakker & Wicherts, 2011). However, he did observe a bump for reported p-values. This highlights an important difference between reported and recalculated p-values, and stresses the need to distinguish the two types of results when studying signs of questionable research practices.
Extensions of previous studies
In answering our research questions, we extend previous studies on four dimensions. First, we eliminate the distortive effects of publication bias on the p-value distribution by inspecting only statistically significant results. Second, we use a large dataset of p-values from entire articles instead of only p-values from abstracts (as in Jager & Leek, 2014; De Winter & Dodou, 2015). Third, we distinguish between reported and recalculated p-value distributions for the same set of test results and show that this distinction affects the answers to the two questions because of common mismatches (Bakker & Wicherts, 2011). Fourth, we fit analytic models to p-value distributions to investigate the existence of monotonic excess as shown in Fig. 1C, whereas previous research only investigated whether there was non-monotonic excess (i.e., a bump).
Publication bias distorts the p-value distribution, but distortions caused by this bias should not be confounded with distortions caused by other QRPs. Publication bias refers to the selective publication of disproportionate amounts of statistically significant outcomes (Gerber et al., 2010; Franco, Malhotra & Simonovits, 2014). Publication bias contributes to a higher frequency of p-values just below .05 relative to the frequency of p-values just above .05, but only does so by decreasing the frequency of p-values larger than .05. Masicampo & Lalande (2012) and De Winter & Dodou (2015) indeed found this relatively higher frequency, which is more readily explained by publication bias. QRPs that lead to a bump affect only the distribution of p-values smaller than .05 (Lakens, 2015b). We focus only on the distribution of significant p-values, because this distribution is directly affected by QRPs that cause a bump or monotonic excess. Publication bias affects this distribution only indirectly, through QRPs used to obtain statistically significant results, but not directly, because publication bias lowers the frequency of observed nonsignificant p-values.
The second extension is the use of more extensive data for psychology than previously used to inspect QRPs that cause a bump or monotonic excess, improving our ability to examine the prevalence of QRPs. Masicampo & Lalande (2012) and Leggett et al. (2013) manually collected p-values from a relatively small set of full research articles (3,627 and 3,701 p-values, respectively), whereas Jager & Leek (2014) and De Winter & Dodou (2015) used automated extraction of p-values from only the abstracts of research papers. However, p-values from abstracts are not representative of the population of p-values from the entire paper (Benjamini & Hechtlinger, 2014; Ioannidis, 2014), even though some have argued otherwise (Pautasso, 2010). Our large-scale inspection of full-text articles is similar to the papers by Head et al. (2015) and Krawczyk (2015).
Third, we examine the prevalence of QRPs that cause a bump or monotonic excess by investigating both reported and the accompanying recalculated p-values. Not all previous studies distinguished between reported and recalculated p-values. This distinction is relevant because reported p-values are subject to reporting biases such as rounding errors, which are particularly relevant around the .05 threshold. Such reporting biases result in inaccurate p-value distributions. For example, there is evidence that reporting errors that affect statistical significance occur in approximately 10–15% of papers in psychology (i.e., gross inconsistencies; Bakker & Wicherts, 2011; García-Berthou & Alcaraz, 2004; Nuijten et al., 2015; Veldkamp et al., 2014). The advantage of analyzing recalculated p-values is that they contain more decimals than typically reported and that they correct reporting errors. Some previous studies analyzed reported p-values (De Winter & Dodou, 2015; Jager & Leek, 2014; Head et al., 2015), whereas others looked at recalculated p-values (Masicampo & Lalande, 2012) or a mix of reported and recalculated p-values (Leggett et al., 2013). Only Krawczyk (2015) used both reported and recalculated p-values for a subset of the data (approximately 27,000 of the 135,000 results were recalculated), and found that the peculiar prevalence below .05 disappeared when the recalculated data were used. Hence, the distinction between reported and recalculated p-values allows us to distinguish between peculiarities due to reporting errors and peculiarities due to QRPs such as data peeking.
Fourth, we examine the prevalence of p-values just below .05 by taking into account various models to test and explain characteristics of p-value distributions. We applied tests and fitted models to p-values below .05 in two ways. We first applied the nonparametric Caliper test (Gerber et al., 2010), comparing the frequency of p-values in an interval just below .05 to the frequency in the adjacent lower interval; a higher frequency in the interval closest to .05 is evidence for QRPs that seek to obtain just-significant results. The Caliper test has also been applied to examine publication bias, by comparing just-significant to just-nonsignificant p-values (Kühberger, Fritz & Scherndl, 2014), and to detect QRPs (Head et al., 2015). However, the Caliper test can only detect a bump, not monotonic excess, as illustrated by the distributions of p-values in Fig. 1. Therefore, we also attempted to model the distribution of significant p-values in order to investigate all forms of excess (i.e., both a bump and monotonic excess), and we illustrate the results and difficulties of this approach.
In short, this paper studies the distribution of significant p-values in four ways. First, we verified whether a bump is present in reported p-values just below .05 with the Caliper test. Second, to examine how reporting errors might influence p-value distributions around .05, we analyzed only the recalculated p-values corresponding to those reported as .05. Third, we used the Caliper test to examine whether a bump is present in recalculated p-values and whether evidence for a bump changed over time. Finally, we modeled the distribution of significant recalculated p-values in an attempt to also detect monotonic excess of p-values below .05.
Data and Methods
Data
We investigated the p-value distribution of research papers in eight high-impact psychology journals (also used in Nuijten et al., 2015). These eight journals were selected for their high impact across different subfields of psychology and their availability within the Tilburg University subscriptions. This selection also encompasses the journals covered by Masicampo & Lalande (2012) and Leggett et al. (2013). A summary of the downloaded articles is included in Table 1.
Journal  Acronym  Timespan  Articles downloaded  Articles with extracted results (%)  APA results extracted 

Developmental Psychology  DP  1985–2013  3,381  2,607 (77%)  37,658 
Frontiers in Psychology  FP  2010–2013  2,126  702 (33%)  10,149 
Journal of Applied Psychology  JAP  1985–2013  2,782  1,638 (59%)  15,134 
Journal of Consulting and Clinical Psychology  JCCP  1985–2013  3,519  2,413 (69%)  27,429 
Journal of Experimental Psychology: General  JEPG  1985–2013  1,184  821 (69%)  18,921 
Journal of Personality and Social Psychology  JPSP  1985–2013  5,108  4,346 (85%)  101,621 
Public Library of Science  PLOS  2000–2013  10,303  2,487 (24%)  31,539 
Psychological Science  PS  2003–2013  2,307  1,681 (73%)  15,654 
Total  30,710  16,695 (54%)  258,105 
For these journals, our sample included articles published from 1985 through 2013 that were available in HTML format. For the PLOS journals, HTML versions of articles were downloaded automatically with the rplos package (v0.3.8; Chamberlain, Boettiger & Ram, 2015). This package allows an R user to search the PLOS database as one would search for an article on the website.^{2} We used this package to retrieve search results that include the subject ‘psychology’ for (part of) an article. For all other journals, HTML versions of articles were downloaded manually by the first author.
APA test results were extracted from the downloaded articles with the R package statcheck (v1.0.1; Epskamp & Nuijten, 2015). The only requirement for this package to operate is a supply of HTML (or PDF) files of the articles to be scanned; statcheck then extracts all test results reported according to the standards of the American Psychological Association (APA; American Psychological Association, 2010). This format is defined as test results reported in the following order: the test statistic and degrees of freedom (encapsulated in parentheses), followed by the p-value (e.g., t(85) = 2.86, p = .005). This style has been prescribed by the APA since at least 1983 (American Psychological Association, 1983; American Psychological Association, 2001), with the only relevant revision being the precision of the reported p-value, which changed from two to three decimal places in the sixth edition from 2010. statcheck extracts t, F, χ^{2}, Z, and r results reported in APA style. Additional details on the validity of the statcheck package can be found in Nuijten et al. (2015).
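To illustrate the kind of pattern matching such extraction involves, a toy extractor for APA-formatted results might look as follows. This is our own simplified sketch, not statcheck's actual implementation; chi-square results are omitted for brevity.

```python
import re

# Simplified pattern for APA-style results, e.g., "t(85) = 2.86, p = .005".
APA_TEST = re.compile(
    r"(?P<stat>[tFrZ])"                # test statistic label
    r"\((?P<df>[^)]*)\)\s*"            # degrees of freedom in parentheses
    r"=\s*(?P<value>-?\d*\.?\d+),\s*"  # test statistic value
    r"p\s*(?P<rel>[<=>])\s*"           # exact (=) or inexact (<, >) p-value
    r"(?P<p>\d*\.\d+)"                 # reported p-value
)

def extract_apa(text):
    """Return (statistic, df, value, relation, p) tuples found in text."""
    return [(m["stat"], m["df"], float(m["value"]), m["rel"], float(m["p"]))
            for m in APA_TEST.finditer(text)]

results = extract_apa("As predicted, t(85) = 2.86, p = .005, and F(1, 38) = 4.21, p < .05.")
```

The `rel` group is what distinguishes exactly (p = …) from inexactly (p < …) reported p-values, a distinction used throughout the analyses below.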
From the 30,710 downloaded papers, statcheck extracted 258,105 test results. We removed 55 results because these were impossible test results (i.e., F(0, 55) = ⋯ or r > 1). The final dataset thus included 258,050 test results. The extracted test results can have four different formats, where test statistics or p-values are reported either exactly (e.g., p = .042) or inexactly (e.g., p < .05). Table 2 shows the composition of the dataset when split across (in)exactly reported p-values and (in)exactly reported test statistics.
Exact test statistic  Inexact test statistic  

Exact p-value  68,776  274  69,050 (27%) 
Inexact p-value  187,617  1,383  189,000 (73%) 
256,393 (99.36%)  1,657 (0.64%)  258,050 (100%) 
From this dataset, we selected six subsets throughout our analyses to investigate our research questions regarding a bump below .05. We analyzed (i) all reported p-values (N = 258,050) for a bump in their distribution just below .05. Subsequently we analyzed (ii) only exactly reported p-values (N = 69,050). It is possible that reporting or rounding errors occurred among the reported p-values. To investigate the degree to which this happens at p = .05, we analyzed (iii) exactly reported test statistics accompanied by an exactly reported p-value of .05 (i.e., p = .05). This subset contains 2,470 results. To attenuate the effect of rounding errors and other factors influencing the reporting of p-values (e.g., Ridley et al., 2007), we also investigated the recalculated p-value distribution with (iv) p-values that were accompanied by exactly reported test statistics (N = 256,393). To investigate whether evidence for a bump differs between inexactly and exactly reported p-values, (v) 68,776 exactly reported test statistics with exactly reported p-values were analyzed. Finally, we used (vi) all recalculated p-values in 0–.05 for t, r, and F(df_{1} = 1) values to model the effect size distribution underlying these p-values, to investigate evidence of both a bump and monotonic excess.
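To illustrate what recalculation buys, a reported result can be recomputed to more decimals from its test statistic. The sketch below handles only the Z case; recalculating t, F, or χ² results requires the corresponding distribution functions (e.g., from scipy), which we omit here.

```python
from statistics import NormalDist

def p_from_z(z):
    """Two-sided p-value recalculated from a reported Z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A result reported as "Z = 2.86, p = .005" recalculates to more decimals
# (approximately .0042), which is what makes recalculated p-values usable
# in the narrow Caliper intervals below.
p = p_from_z(2.86)
```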
Methods
We used the Caliper test and two new measures to examine whether the observed p-value distribution shows evidence for a bump or monotonic excess below .05. We applied the two measures to the observed p-value distribution and examined their performance in detecting a bump or monotonic excess using a simulation study of data peeking. Data peeking was chosen because it is one of the most frequently used and well-known QRPs. Below, we explain the Caliper test and how the p-value distributions are modeled with the two new measures, and we describe the design of the simulation study in more detail.
Caliper test
In order to test for a bump of p-values just below .05, we applied the Caliper test (e.g., Gerber et al., 2010; Kühberger, Fritz & Scherndl, 2014). This proportion test compares the frequencies of p-values in two intervals, such as .04–.045 and .045–.05. Let Pr denote the proportion of p-values in the interval .045–.05. Then, independent of the population effect sizes underlying the p-values, Pr should not exceed .5 in any situation, because the p-value distribution should be monotonically decreasing. Hence, Pr > .5 signifies a bump of p-values just below .05.
We carried out one-tailed binomial proportion tests, with H_{0}: Pr ≤ .5 and H_{1}: Pr > .5. For example, if 40 and 60 p-values are observed in the intervals .04–.045 and .045–.05, respectively, then Pr = .6 and the binomial test yields p = .0284, suggesting evidence for a bump below .05. We applied the Caliper test to the reported p-values (subsets one through three as described in the previous section) and recalculated p-values (subsets four and five), both for the entire dataset and for each of the eight psychology journals.
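The worked example above can be reproduced with an exact one-tailed binomial test. This is a minimal pure-Python sketch, not the code used in the paper:

```python
from math import comb

def caliper_test(x, n):
    """One-tailed binomial test of H0: Pr <= .5 vs. H1: Pr > .5, where x is
    the count of p-values in the interval just below .05 and n is the total
    count across both compared intervals. Returns (Pr, p-value)."""
    # exact binomial upper tail: P(X >= x) under X ~ Binomial(n, .5)
    p_value = sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n
    return x / n, p_value

pr, p = caliper_test(60, 100)  # the worked example: Pr = .6, p ≈ .0284
```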
The Caliper test requires specifying the width of the intervals that are to be compared. For reported p-values, we selected the intervals (.03875–.04] and [.04875–.05) because there is a strong preference to report p-values to the second decimal in research papers (see also Hartgerink, 2015). For recalculated p-values we used the same interval width as Masicampo & Lalande (2012) and Leggett et al. (2013), namely .00125, corresponding to a comparison of the intervals (.0475–.04875] and [.04875–.05). Note that rounding is not a problem for recalculated p-values. Considering that some journals might show small frequencies of p-values in these intervals, we also carried out Caliper tests with interval widths of .0025, .005, and .01. Note that, on the one hand, increasing the interval width increases the statistical power of the Caliper test because more p-values are included in the test; on the other hand, it also decreases power because Pr is negatively related to interval width whenever p-values correspond to tests of nonzero population effects. In other words, a bump just below .05 will tend more and more towards a monotonically decreasing distribution as the binwidth increases.
To verify whether evidence for a bump of p-values increased over time, we fitted a linear trend to the proportion Pr of the Caliper test with binwidths .00125, .0025, .005, and .01. We computed these proportions for each year separately, both for the total dataset and per journal. Time was centered at the start of data collection, which was 1985 except for PLOS (2000), PS (2006; due to zero p-values in the considered interval for preceding years), and FP (2010). The value .5 was subtracted from all Pr values, such that the intercept of the trend corresponds to the bump of p-values at the start of data collection, where 0 means no bump. A positive linear trend signifies an increase over time in the bump of p-values below .05.
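The trend fitting described above amounts to ordinary least squares on yearly proportions. A minimal pure-Python sketch (the yearly Pr values below are made up for illustration; the paper's own fitting code may differ):

```python
def linear_trend(years, prs, start_year):
    """Least-squares intercept and slope of (Pr - .5) regressed on years
    since the start of data collection. A positive slope indicates the bump
    grew over time; an intercept of 0 means no bump at the start."""
    x = [y - start_year for y in years]   # centered time
    z = [pr - 0.5 for pr in prs]          # 0 means no bump
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    slope = (sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
             / sum((xi - mx) ** 2 for xi in x))
    return mz - slope * mx, slope

# made-up yearly proportions for illustration:
b0, b1 = linear_trend([1985, 1995, 2005], [0.52, 0.55, 0.58], 1985)
```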
Measures based on p-value distributions
Figure 1 demonstrates that the effect of data peeking on the shape of the p-value distribution (i.e., a bump or just monotonic excess) depends on the true effect size. The distribution after data peeking does not monotonically decrease for d = 0 or d = 0.2 (A and B), whereas it does decrease monotonically for d = 0.5 (C). Consequently, the Caliper test will signal a bump of p-values for d = 0 (i.e., it will detect a bump), but not for d = 0.5.
We examined how we may be able to detect both a bump and monotonic excess of p-values below .05. Figure 1 indicates that, for p-values close to zero (e.g., ≤.00125), the p-value distributions with data peeking (solid lines) closely match the p-value distributions without data peeking (dashed lines). In other words, data peeking in studies with initially nonsignificant p-values rarely results in tiny significant p-values, but more often in p-values larger than .00125. The basic idea of this analysis is therefore to estimate the ‘true’ effect size distribution using only these tiny p-values (i.e., ≤.00125), assuming that none or only a very small proportion of them were affected by data peeking. We note that we selected the .00125 cutoff rather arbitrarily. Other, more liberal cutoffs (e.g., .01, in case of a smaller set of statistically significant p-values) or more conservative cutoffs (e.g., .0001, in case of a very large dataset such as ours) can be selected.
We examined the performance of two measures to detect a bump or monotonic excess of p-values below .05. The first method compares effect sizes estimated from p-values smaller than .00125 to effect sizes estimated from all p-values smaller than .05. The idea of this first method is that increasing the frequency of just-significant p-values decreases the effect size estimate. Indeed, the more right-skewed the p-value distribution, the higher the effect size estimate when studies’ sample sizes are held constant (Simonsohn, Nelson & Simmons, 2014; Van Assen, Van Aert & Wicherts, 2015). According to the first method, there is evidence suggestive of data peeking (or other QRPs leading to a bump of p-values just below .05) if the effect size estimate based on all p-values is considerably lower than the estimate based only on p-values ≤.00125.
The second method yields a measure of excess of p-values just below .05, capturing either a bump or monotonic excess, by comparing the observed frequency of p-values in the interval .00125–.05 to the predicted frequency of p-values in that interval. This prediction is based on the effect size estimated using the p-values smaller than .00125. If the ratio of observed over expected p-values is larger than 1, referred to as statistic D, this could indicate data peeking. Statistic D is calculated as (1)$D=\frac{p_{.00125}^{o}}{1-p_{.00125}^{o}}\times \frac{1-p_{.00125}^{e}}{p_{.00125}^{e}}$ with $p_{.00125}^{o}$ and $p_{.00125}^{e}$ representing the observed and expected proportions of p-values lower than .00125, respectively. Note that D is an odds ratio.
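Eq. (1) is straightforward to compute; the proportions in the example below are made up for illustration only:

```python
def excess_statistic(p_obs, p_exp):
    """Statistic D of Eq. (1): the odds ratio of the observed versus the
    expected proportion of p-values below .00125."""
    return (p_obs / (1 - p_obs)) * ((1 - p_exp) / p_exp)

# illustrative, made-up proportions of p-values below .00125:
D = excess_statistic(0.25, 0.20)  # (0.25/0.75) * (0.80/0.20) = 4/3
```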
For both measures, the expected p-value distribution needs to be derived and compared to the observed p-value distribution. The expected p-value distribution was derived by minimizing the χ^{2} statistic as a function of mean effect δ and standard deviation τ, where it was assumed that the true effect size (the Fisher-transformed correlation, ρ_{F}) is normally distributed with parameters δ and τ. We only considered nonnegative values of δ because we only fitted our model to observed positive effects. See File S1 for the technical details.
Design of simulation study
To examine the potential of the two measures to detect data peeking, their performance was examined on simulated data with and without data peeking. We used a two-group between-subjects design with 24 participants per group (n_{k} = 24) and compared the group means using a t-test. The performance of both measures was examined as a function of true effect size δ (0; 0.2; 0.5; 0.8) and heterogeneity τ (0; 0.15). In the data peeking conditions, data were simulated as follows: means and variances per group were simulated and a two-sample t-test was conducted. If this t-test was statistically significant (i.e., p ≤ .05), the p-value was stored; otherwise the data peeking procedure was started. In this procedure, one third of the original sample size was added to the data before conducting another two-sample t-test. This was repeated until a statistically significant result was obtained or three rounds of additional sampling had taken place (see osf.io/x5z6u for the functions used in the simulation). The simulations were stopped when 1,000,000 studies with a p-value below .1 were obtained for each combination of δ and τ.
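The data peeking procedure described above can be sketched as follows. This is our own illustration, not the osf.io/x5z6u code; for simplicity the t distribution is approximated by the standard normal, which is reasonable at the degrees of freedom involved here (df ≥ 46), and heterogeneity τ is omitted.

```python
import random
from statistics import NormalDist, mean, stdev

def two_sample_p(a, b):
    """Two-sided p-value of a pooled two-sample t-test; the t distribution
    is approximated by the standard normal."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    t = (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(t)))

def peeked_study(delta, n=24, rounds=3):
    """One simulated study with data peeking: test after n participants per
    group and, while nonsignificant, add n // 3 participants per group up
    to `rounds` times, mirroring the design described above."""
    step = n // 3
    a = [random.gauss(delta, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    p = two_sample_p(a, b)
    for _ in range(rounds):
        if p <= .05:
            break
        a += [random.gauss(delta, 1) for _ in range(step)]
        b += [random.gauss(0, 1) for _ in range(step)]
        p = two_sample_p(a, b)
    return p
```

Collecting the p-values of many such studies and histogramming those below .05 reproduces the qualitative patterns of Fig. 1: a bump for δ = 0 and merely monotonic excess for large δ.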
Results and Discussion
In this section, we report the results of our analyses for the subsets in the following order: all reported p-values (258,050 results), exactly reported p-values (69,050 results), p-values erroneously reported as equal to .05 (2,470 results), all recalculated p-values based on exactly reported test statistics (256,393 results), recalculated p-values based on exactly reported test statistics and exactly reported p-values (68,776 results), and the modeling of p-value distributions based on recalculated p-values in 0–.00125 and 0–.05 (54,561 and 127,509 results, respectively). These analyses apply the Caliper test to investigate evidence of a possible bump below .05. Subsequently, the results of the two measures are presented based on all recalculated p-values.
Reported p-values
Figure 2 shows the distribution of all reported p-values (i.e., 258,050; white bars) and exactly reported p-values (i.e., 69,050; blue bars). Results of the Caliper test indicate that (i) there is a bump just below .05 when considering all reported p-values in bins .03875–.04 versus .04875–.05 (N = 45,667, Pr = 0.905, p < .001) and (ii) there is less evidence for a bump when considering only exactly reported p-values (N = 4,900, Pr = 0.547, p < .001). The difference in bumps between these two subsets can be explained by the number of p-values that are reported as <.05, which is 86% of all p-values reported as exactly equal to .05 and 14% of all reported p-values.
To investigate whether the observed bump below .05 across exactly reported p-values originates from one or multiple journals, we performed the Caliper test on the exactly reported p-values per journal. Table 3 shows the results of these tests. The results indicate that there is sufficient and reliable evidence for a bump below .05 (i.e., Pr > .5) for the journals DP and JPSP, and sufficient evidence but debatable reliability for JAP, where the results depend on the binwidth. The other five journals show no evidence for a bump below .05 in exactly reported p-values at all. In other words, the bump below .05 in exactly reported p-values is mainly driven by the journals DP, JAP, and JPSP.
Binwidth  0.00125  0.0025  0.005  0.01
  x  N  Pr  p  x  N  Pr  p  x  N  Pr  p  x  N  Pr  p
All  2,682  4,900  0.547  <.001  2,881  5,309  0.543  <.001  3,308  6,178  0.535  <.001  4,218  8,129  0.519  < .001 
DP  319  531  0.601  <.001  336  567  0.593  <.001  383  653  0.587  <.001  464  843  0.55  0.002 
FP  96  193  0.497  0.557  105  227  0.463  0.884  141  304  0.464  0.906  215  458  0.469  0.912 
JAP  78  131  0.595  0.018  82  137  0.599  0.013  85  154  0.552  0.113  101  183  0.552  0.092 
JCCP  246  517  0.476  0.874  267  562  0.475  0.889  308  641  0.48  0.848  395  823  0.48  0.882 
JEPG  147  285  0.516  0.318  159  310  0.513  0.346  195  375  0.52  0.235  258  509  0.507  0.395 
JPSP  1,252  2,097  0.597  <.001  1,310  2,207  0.594  <.001  1,408  2,399  0.587  <.001  1,623  2,869  0.566  <.001 
PLOS  307  649  0.473  0.921  366  760  0.482  0.854  489  1,000  0.489  0.766  744  1,558  0.478  0.964 
PS  237  497  0.477  0.859  256  539  0.475  0.886  299  652  0.459  0.984  418  886  0.472  0.957 
Notes:
x, frequency of p-values in the interval from .05 − binwidth to .05; N, total frequency of p-values across both intervals in the comparison; Pr, x∕N; p, p-value of the binomial test.
Significant results (α = .05, one-tailed) indicate an excess of p-values just below .05 and are reported in bold.
The Caliper test results for reported p-values indicate two things: (i) including inexactly reported p-values has a large impact on the p-value distribution, and (ii) a bump below .05 is also found when considering only exactly reported p-values. Because inexact reporting causes excess at particular p-values (e.g., the significance threshold .05; Ridley et al., 2007), we recommend inspecting only exactly reported p-values when examining p-value distributions.
Considering only exactly reported p-values, there is sufficient evidence for a bump below .05 in the journals DP, JAP, and JPSP, but not in the remaining five journals (i.e., FP, JCCP, JEPG, PLOS, PS). A tentative explanation of the bump of p-values just below .05 in DP, JAP, and JPSP may be that QRPs aimed at obtaining barely significant results are more frequent in the fields these journals cover. However, another explanation may be that scientists in these fields are more prone to report p-values just below .05 exactly (e.g., to emphasize that they are really smaller than .05) than p-values considerably smaller than .05.
Recalculated p-value distributions
Recalculated p-values when reported as p = .05
Results for reported p-values remain inconclusive with regard to the distribution of p-values, due to potential rounding or reporting errors (Bakker & Wicherts, 2011; Nuijten et al., 2015; Veldkamp et al., 2014). Rounding and errors could result in an overrepresentation of p-values ≤ .05. To investigate the plausibility of this notion, we inspected the recalculated p-values whenever p = .05 was reported (i.e., 2,470 values). Figure 3 indicates that p-values reported as .05 show remarkable spread when recalculated, suggesting that the reported p-value is frequently rounded or incorrect, assuming that the reported test statistics are correct. More specifically, 67.45% of p-values reported as .05 were larger than .05 when recalculated and 32.55% were smaller. This percentage does not vary greatly across journals (range 58.8%–73.4%, compared to 67.45% overall). Taking rounding into account (i.e., widening the range of correct p-values to .045–.055), these percentages become 13.81% and 7.85%, respectively, meaning that at least 21.66% of the p-values reported as .05 were reported incorrectly. In comparison, p-values reported as p = .04, p = .03, or p = .02 show smaller proportions of downward rounding than p = .05 (i.e., 53.33%, 54.32%, and 50.38%, respectively, compared to 67.45%). When taking potential rounding errors into account, the discrepancy remains but becomes smaller (i.e., 11.74%, 9.57%, and 8.03%, respectively, compared to 13.81%). These results provide direct evidence for the QRP “incorrect rounding of p-value” (John, Loewenstein & Prelec, 2012), which contributes to a bump or monotonic excess just below .05.
The discrepancy between recalculated p-values and p-values reported as equal to .05 highlights the importance of using recalculated p-values when underlying effect distributions are estimated, as in p-uniform and p-curve (Van Assen, Van Aert & Wicherts, 2015; Simonsohn, Nelson & Simmons, 2014). Reported p-values can substantially distort the p-value distribution, such that results become biased if we rely on them alone. Such a discrepancy indicates potential rounding, erroneous reporting, or strategic reporting of p-values. These distortions can be (partially) corrected by recalculating p-values from the reported test statistics. Additionally, potential distortions of the distribution at the third decimal place, due to rounding of p-values to the second decimal (Hartgerink, 2015), are also resolved by recalculating p-values. We continue with recalculated p-values in the following analyses.
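Recalculation itself is straightforward: a p-value is recomputed from the reported test statistic and its degrees of freedom. A hedged sketch of such a statcheck-style computation (the function name and the assumption of two-tailed tests are ours):

```python
from scipy import stats

def recalc_p(stat_type, value, df1, df2=None):
    """Recompute a p-value from a reported test statistic (two-tailed)."""
    if stat_type == "t":
        return 2 * stats.t.sf(abs(value), df1)
    if stat_type == "F":
        return stats.f.sf(value, df1, df2)
    if stat_type == "r":  # correlation: convert to t with df = n - 2
        t = value * (df1 / (1 - value ** 2)) ** 0.5
        return 2 * stats.t.sf(abs(t), df1)
    if stat_type == "chi2":
        return stats.chi2.sf(value, df1)
    raise ValueError(f"unsupported statistic: {stat_type}")
```

A result reported as “t(28) = 2.05, p = .05” recalculates to p ≈ .0498: correctly rounded, but counted at .05 rather than at its exact third-decimal position, which is exactly the distortion that recalculation removes.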
Recalculated p-values
Figure 4 shows the distribution of all recalculated p-values (i.e., the set of 256,393 results) and of recalculated p-values for which the reported p-value was exact (i.e., the set of 68,776 results). The recalculated p-value distribution is markedly smoother than the reported p-value distribution (see Fig. 2) due to the absence of rounded p-values.
Inspecting all recalculated p-values, we did not observe a bump just below .05, N = 2,808, Pr = .5, p = 0.508. When we analyzed the recalculated p-values per journal (Table 4), there was no evidence for a bump below .05 in any of the journals. Additionally, we inspected the recalculated p-values that resulted from exactly reported p-values. For this subset we did observe a bump below .05, N = 809, Pr = 0.564, p < .001 (blue histogram in Fig. 4) for the smallest binwidth (i.e., .00125), but this effect was not robust across larger binwidths, as shown in Table 5. This table also specifies the results per journal, with sufficient evidence of a bump only in JPSP. This finding, however, was observed only for binwidths .00125 and .0025, not for larger binwidths. Considering the results from the recalculated p-values, there is sparse evidence for the presence of a bump below .05, as opposed to the previously claimed widespread evidence (Masicampo & Lalande, 2012; Leggett et al., 2013; Head et al., 2015). Moreover, interpretation of the bump for JPSP is not straightforward; it may also be that authors in JPSP are more prone to report exact test statistics when the p-value is just below .05 than when it is considerably smaller than .05.
Binwidth  0.00125  0.0025  0.005  0.01
  x  N  Pr  p  x  N  Pr  p  x  N  Pr  p  x  N  Pr  p
All  1,404  2,808  0.5  0.508  2,808  5,761  0.487  0.973  5,761  11,824  0.487  0.997  11,824  25,142  0.47  >.999 
DP  184  382  0.482  0.779  382  829  0.461  0.989  829  1,710  0.485  0.9  1,710  3,579  0.478  0.996 
FP  30  69  0.435  0.886  69  172  0.401  0.996  172  376  0.457  0.956  376  799  0.471  0.955 
JAP  73  145  0.503  0.5  145  270  0.537  0.124  270  556  0.486  0.765  556  1,168  0.476  0.952 
JCCP  160  308  0.519  0.265  308  633  0.487  0.763  633  1,267  0.5  0.522  1,267  2,706  0.468  >.999 
JEPG  81  164  0.494  0.593  164  332  0.494  0.608  332  683  0.486  0.778  683  1,535  0.445  >.999 
JPSP  640  1,268  0.505  0.379  1,268  2,557  0.496  0.668  2,557  5,174  0.494  0.802  5,174  10,976  0.471  >.999 
PLOS  125  260  0.481  0.752  260  541  0.481  0.828  541  1,170  0.462  0.995  1,170  2,544  0.46  >.999 
PS  111  212  0.524  0.268  212  427  0.496  0.577  427  888  0.481  0.88  888  1,835  0.484  0.919 
Notes:
x, frequency of p-values in the interval from .05 − binwidth to .05; N, total frequency of p-values across both intervals in the comparison; Pr, x∕N; p, p-value of the binomial test.
Significant results (α = .05, one-tailed) indicate an excess of p-values just below .05 and are reported in bold.
Binwidth  0.00125  0.0025  0.005  0.01
  x  N  Pr  p  x  N  Pr  p  x  N  Pr  p  x  N  Pr  p
All  456  809  0.564  <.001  809  1,617  0.5  0.5  1,617  3,403  0.475  0.998  3,403  7,402  0.46  1 
DP  46  87  0.529  0.334  87  185  0.47  0.811  185  358  0.517  0.281  358  756  0.474  0.932 
FP  15  27  0.556  0.351  27  87  0.31  >.999  87  192  0.453  0.915  192  437  0.439  0.995 
JAP  8  20  0.4  0.868  20  29  0.69  0.031  29  65  0.446  0.839  65  141  0.461  0.844 
JCCP  43  78  0.551  0.214  78  161  0.484  0.682  161  364  0.442  0.988  364  780  0.467  0.971 
JEPG  27  50  0.54  0.336  50  98  0.51  0.46  98  209  0.469  0.834  209  479  0.436  0.998 
JPSP  184  305  0.603  <.001  305  547  0.558  0.004  547  1,117  0.49  0.764  1,117  2,451  0.456  >.999 
PLOS  76  149  0.51  0.435  149  323  0.461  0.926  323  698  0.463  0.978  698  1,470  0.475  0.975 
PS  57  93  0.613  0.019  93  187  0.497  0.558  187  400  0.468  0.912  400  888  0.45  0.999 
Notes:
x, frequency of p-values in the interval from .05 − binwidth to .05; N, total frequency of p-values across both intervals in the comparison; Pr, x∕N; p, p-value of the binomial test.
Significant results (α = .05, one-tailed) indicate an excess of p-values just below .05 and are reported in bold.
Excessive significance over time
The regression results for the development of a bump below .05 over time, based on recalculated p-values, are shown in Table 6. They indicate no evidence for a linear relation between publication year and the degree to which a bump of p-values below .05 is present, across the different binwidths (only results for binwidth .00125 are presented; results for the other binwidths are available at http://osf.io/96kbc/). For PLOS there is some evidence of a minor increase of a bump over the years (b = .072, p = .039), but this result is not robust across binwidths .0025, .005, and .01. These results contrast with those of Leggett et al. (2013), who found a linear relation between time and the degree to which a bump occurred for JEPG and JPSP. Hence, our findings for the period 1985–2013 do not replicate the reported increase of a bump below .05 in psychology for the period 1965–2005 (Leggett et al., 2013). In other words, the Caliper test results indicate that, generally speaking, there is no evidence for an increasing prevalence of p-values just below .05, nor of QRPs causing such a bump, in psychology.
Journal  Timespan  Coefficient  Estimate  SE  t  p
All  1985–2013  Intercept  0.007  0.017  0.392  0.698 
All  Years (centered)  −0.001  0.001  −0.492  0.627  
DP  1985–2013  Intercept  −0.043  0.056  −0.769  0.448 
DP  Years (centered)  0.001  0.003  0.193  0.849  
FP  2010–2013  Intercept  −0.182  0.148  −1.233  0.343 
FP  Years (centered)  0.055  0.079  0.694  0.560  
JAP  1985–2013  Intercept  0.041  0.081  0.504  0.619 
JAP  Years (centered)  −0.001  0.005  −0.208  0.837  
JCCP  1985–2013  Intercept  0.077  0.058  1.315  0.200 
JCCP  Years (centered)  −0.006  0.004  −1.546  0.134  
JEPG  1985–2013  Intercept  −0.022  0.124  −0.176  0.862 
JEPG  Years (centered)  0.001  0.007  0.097  0.924  
JPSP  1985–2013  Intercept  −0.002  0.027  −0.062  0.951 
JPSP  Years (centered)  0.000  0.002  −0.005  0.996  
PLOS  2006–2013  Intercept  −0.382  0.114  −3.344  0.016 
PLOS  Years (centered)  0.072  0.027  2.632  0.039  
PS  2003–2013  Intercept  0.081  0.078  1.045  0.323 
PS  Years (centered)  −0.009  0.013  −0.669  0.520 
Notes:
Significant results (α = .05, twotailed) are reported in bold.
τ = 0  τ = .15
p-values  δ = 0, ρ_{F} = 0  δ = .2, ρ_{F} = .099  δ = .5, ρ_{F} = .247  δ = .8, ρ_{F} = .390  δ = 0, ρ_{F} = 0  δ = .2, ρ_{F} = .099  δ = .5, ρ_{F} = .247  δ = .8, ρ_{F} = .390
Without data peeking  0–1  ρ̂_{F}  0  0.103  0.258  0.413  0  0.103  0.258  0.413
τ̂_{ρF}  0  0  0  0  0.077  0.077  0.077  0.077
0–.05  ρ̂_{F}  0  0.103  0.258  0.413  0  0.103  0.258  0.413
τ̂_{ρF}  0  0  0  0.001  0.077  0.077  0.077  0.077
Misfit χ^{2}  0  0  0  0  0  0  0  0
0–.00125  ρ̂_{F}  0  0.103  0.258  0.413  0.1  0.107  0.259  0.413
τ̂_{ρF}  0  0  0  0.001  0.025  0.076  0.077  0.077
Misfit χ^{2}  0  0  0  0  0  0  0  0
D  1  1  1  1  1.205  1.006  1.003  1.001
With data peeking  0–.05  ρ̂_{F}  0  0  0.117  0.345  0  0  0.075  0.360
τ̂_{ρF}  0  0  0  0.038  0  0.055  0.137  0.091
Misfit χ^{2}  126,267.4  50,298.4  696.6  101.6  14,867.6  1,209.5  576.3  340.6
N  759,812  811,296  936,517  994,974  434,660  525,023  707,650  889,681
0–.00125  ρ̂_{F}  0  0.075  0.218  0.366  0.066  0.161  0.283  0.402
τ̂_{ρF}  0  0  0  0  0.036  0  0  0.012
Misfit χ^{2}  6.9  3.2  7.1  11.8  2  1.9  2.6  2.1
N  9,729  21,576  95,615  350,482  14,791  34,530  124,991  366,875
D  1.977  1.976  1.835  1.166  1.628  1.620  1.472  1.164
Notes:
ρ̂_{F}, estimated population effect size; τ̂_{ρF}, estimated population heterogeneity; Misfit χ^{2}, misfit of the estimates based on p-values in 0–.05 or 0–.00125, respectively (bold indicates p < .05); N, number of results included in estimation; D, comparison of observed and expected p-value frequencies.
Results of two measures based on modeling p-value distributions
Simulation study
Table 7 shows the results of the two measures for data simulated with and without data peeking. The column headers show the mean effect size (i.e., δ) and heterogeneity (i.e., τ) of the simulated conditions, with the corresponding ρ_{F} and τ_{ρF} on the Fisher-transformed correlation scale. The first set of rows shows the results for the data simulated without data peeking, which we discuss first.
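For intuition, the data-peeking QRP can be sketched as optional stopping: test after an initial batch of observations and, if the result is not significant, collect more data and test again. The sample sizes, number of peeks, and two-sample t-test below are illustrative assumptions of ours, not the exact simulation design used here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeked_p(delta, start=25, step=25, peeks=3):
    """p-value at which a data-peeking researcher stops testing."""
    x = rng.normal(delta, 1.0, start)  # "treatment" group
    y = rng.normal(0.0, 1.0, start)    # "control" group
    for _ in range(peeks):
        p = stats.ttest_ind(x, y).pvalue
        if p < 0.05:
            return p                   # stop at the first significant look
        x = np.concatenate([x, rng.normal(delta, 1.0, step)])
        y = np.concatenate([y, rng.normal(0.0, 1.0, step)])
    return stats.ttest_ind(x, y).pvalue

# Even with no true effect, peeking inflates the rate of significant results
ps = [peeked_p(delta=0.0) for _ in range(2000)]
rate = sum(p < 0.05 for p in ps) / len(ps)
```

With four looks at the data, the false-positive rate climbs well above the nominal .05, and the extra significant p-values necessarily land close to the threshold, distorting the p-value distribution in the way the two measures try to detect.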
The results for the data without data peeking tell us (i) whether the effect size distribution parameters can be accurately recovered using only very small (≤.00125) or small p-values (≤.05), and (ii) whether both measures accurately signal the absence of data peeking. Note that ρ_{F} is slightly overestimated due to categorizing the p-value distribution into 40 categories: the estimates based on all p-values (i.e., ρ̂_{F}, first row) are slightly larger than the population parameter (i.e., ρ_{F}, column headers).
As to the first question, on accurate parameter estimation: whenever there is no heterogeneity (i.e., τ_{ρF} = 0), both ρ_{F} and τ_{ρF} are accurately recovered. When heterogeneity is nonzero, the parameters were also accurately recovered, except when ρ_{F} = 0. Here, ρ_{F} was overestimated (equal to .1) and τ_{ρF} underestimated (.025 rather than the true .077), while at the same time the misfit was negligible.
The latter result, that the effect is overestimated under heterogeneity when ρ_{F} = 0, is explained by the fact that a p-value distribution can be accurately modeled with an infinite range of negatively correlated values of ρ_{F} and τ_{ρF}. An increase in ρ_{F} yields a more right-skewed distribution, which is hardly distinguishable from the right-skewed distribution caused by an increase in τ_{ρF}. Hence almost identical p-value distributions can be generated with (δ, τ) and some other values (δ^{∗}, τ^{∗}), with δ^{∗} > δ and at the same time τ^{∗} < τ, or δ^{∗} < δ and at the same time τ^{∗} > τ. The similar effects of both parameters on the fitted p-value distribution already hint at potential problems for both measures, because their performance depends on accurate estimates of these parameters.
With respect to the second question, whether the measures accurately signal the absence of data peeking: the first measure does so in both homogeneous and heterogeneous conditions, whereas the second measure correctly signals absence only under homogeneity. The first measure signals data peeking if the estimate of ρ_{F} is smaller when based on p ≤ .05 than when based on p ≤ .00125. As noted above, effect size estimates were identical to population effect sizes under homogeneity, and equal or larger when based on p ≤ .00125 under heterogeneity. This suggests that the first measure behaves well when there is no data peeking (but see the conclusion section). The second measure, D, performed well (i.e., was equal to 1) under homogeneity, but incorrectly suggested data peeking under heterogeneity. For instance, D = 1.205 for ρ_{F} = 0 and τ = .15, which suggests that 20.5% more p-values were observed in the interval .00125–.05 than expected based on the ρ̂_{F} estimate, even though no data peeking occurred. The explanation for this breakdown of D is that the parameters of the effect size distribution were not accurately recovered: based on small p-values, the average effect size was overestimated and heterogeneity underestimated. This yields a lower expected frequency of higher p-values (between .00125 and .05), thereby falsely suggesting data peeking.
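A concrete sketch of how a D-type measure can be computed, under simplifying assumptions of our own (a two-sided z-test of a Fisher-transformed correlation, one common sample size n, and a homogeneous true effect, i.e., τ = 0; the paper's version instead works with each result's own degrees of freedom and a fitted effect size distribution):

```python
import numpy as np
from scipy import stats

def bin_probs(rho_f, n):
    """P(p <= .00125) and P(.00125 < p <= .05) for a two-sided z-test."""
    ncp = rho_f * np.sqrt(n - 3)  # mean of the Fisher-z test statistic
    def p_le(alpha):
        zcrit = stats.norm.isf(alpha / 2)
        return stats.norm.sf(zcrit - ncp) + stats.norm.cdf(-zcrit - ncp)
    return p_le(0.00125), p_le(0.05) - p_le(0.00125)

def d_measure(p_values, rho_f_hat, n):
    """Observed / expected count in (.00125, .05], anchored on p <= .00125."""
    tiny = sum(p <= 0.00125 for p in p_values)
    mid = sum(0.00125 < p <= 0.05 for p in p_values)
    pr_tiny, pr_mid = bin_probs(rho_f_hat, n)
    expected_mid = tiny * pr_mid / pr_tiny  # scaled from the "clean" bin
    return mid / expected_mid               # > 1 suggests excess below .05
```

When p-values are simulated from the same model with no QRP, D comes out close to 1; excess mass in (.00125, .05], as data peeking produces, pushes D above 1.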
The last rows present the results obtained when data peeking does occur. First, consider the estimates of ρ_{F} and the performance of the first measure. The estimates of ρ_{F} confirm that data peeking results in underestimation, particularly if the average true effect size is not large (i.e., δ = .2 or .5). Moreover, the downward bias of ρ_{F} is smaller when it is estimated on p-values ≤ .00125 than on p-values ≤ .05, so the first measure accurately signals data peeking. For instance, if ρ_{F} = .099 and τ = 0, ρ̂_{F} = .075 when based on p-values ≤ .00125 and ρ̂_{F} = 0 when based on p-values ≤ .05. Together with the good performance of this measure in the absence of data peeking, these results suggest that the first measure may be useful for detecting data peeking in practice.
Next, consider the estimates of τ_{ρF} and the performance of D. As in the conditions without data peeking, heterogeneity is grossly underestimated when using p-values ≤ .00125. Hence D cannot be expected to perform well under data peeking. Although the D-values seem to correctly signal data peeking in all conditions and decrease, as expected, when the effect size increases, they do not correspond to the actual amount of data peeking. For instance, consider the condition with δ = .5 and τ = .15; of the 582,659 simulated p-values in the interval .00125–.05, 106,241 were obtained through data peeking, which yields a true D = 1.223, very different from the estimated D = 1.472 in Table 7.
Finally, consider the (mis)fit of the estimated p-value distribution. Despite the considerable downward bias in the heterogeneity estimate τ̂_{ρF}, the simulated p-value distribution is mostly well approximated by the expected p-value distribution, as indicated by the small values of the χ^{2} statistic for p-values in 0–.00125. Hence, good fit again does not imply accurate parameter estimates. The misfit of the estimated distribution for p-values ≤ .05 is indicated by large χ^{2} values, particularly when the p-value distribution is not monotonically decreasing (as is the case for, e.g., δ = 0).
To conclude, this simulation study showed that under true homogeneity both measures can accurately signal both the absence and the presence of data peeking. Under true heterogeneity, however, heterogeneity is underestimated and the performance of D breaks down, while the results suggest that comparing estimates of the average effect size (the first measure) may still accurately signal both the absence and the presence of data peeking.
Applied to data of eight psychology journals
Figure 5 depicts the observed p-value distribution and the expected p-value distribution corresponding to the effect size distribution fitted on p-values ≤ .00125. The estimates were effect size ρ̂_{F} = 0 and heterogeneity τ̂_{ρF} = .183 for p-values ≤ .05, and ρ̂_{F} = .149 and τ̂_{ρF} = .106 for p-values ≤ .00125. Note that we considered only nonnegative values of δ in the estimation procedure. Misfit between the observed and expected p-value distributions for p ≤ .00125 was minor (χ^{2} = 4.1), indicating that the observed p-values ≤ .00125 were well approximated by the estimated effect size distribution.
Our first measure suggests practices leading to a monotonic excess of p-values below .05, because the estimated effect size based on all significant p-values (i.e., 0) is much smaller than the supposedly more accurate estimate based on only the very small p-values (i.e., .149). Moreover, assuming that effect sizes are normally distributed with ρ_{F} = 0 and τ_{ρF} = .183, combined with the degrees of freedom of the observed effects, implies that only 27.5% of all effects would be statistically significant. However, 74.7% of all reported p-values were statistically significant; this difference may at least partly be caused by other factors, such as publication bias. It is highly unlikely that the average true effect size underlying statistically significant results in psychology is truly zero. It remains undecided, however, whether this very low estimate is mainly due to QRPs producing a downward bias in the effect size estimate, or to a misspecification of the model, an issue we revisit later in the paper.
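The share of significant results implied by a fitted effect size distribution can be approximated by averaging power over draws from that distribution. A Monte Carlo sketch under our own simplifications (a two-sided z-test on the Fisher scale and a single illustrative sample size, so it will not reproduce the 27.5% figure, which uses each result's own degrees of freedom):

```python
import numpy as np
from scipy.stats import norm

def expected_significant(rho_f, tau, n, draws=100_000, seed=0):
    """Mean two-sided power at alpha = .05 over N(rho_f, tau) effects."""
    rng = np.random.default_rng(seed)
    effects = rng.normal(rho_f, tau, draws)  # true effects, Fisher scale
    ncp = effects * np.sqrt(n - 3)           # noncentrality of the z test
    zcrit = norm.isf(0.025)
    power = norm.sf(zcrit - ncp) + norm.cdf(-zcrit - ncp)
    return power.mean()
```

With ρ_F = 0, τ = .183 and, say, n = 50, only roughly a fifth to a quarter of results come out significant, far below the 74.7% of significant p-values actually observed.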
For the second measure, which compares the ratio of observed to expected p-values below .05, we found D = .701. This does not suggest data peeking but rather underreporting of p-values (29.9%) in the interval .00125–.05. The simulation results, however, already demonstrated that D performs badly under effect size heterogeneity. Since heterogeneity underlies the observed data, we conclude that the measure D is not useful for investigating evidence of a bump or monotonic excess of p-values.
Limitations and Conclusion
Before concluding, some limitations of our method of collecting p-values need to be addressed. First, statcheck (Epskamp & Nuijten, 2015; Nuijten et al., 2015), the R package used to collect the data, extracts all APA-style test results reported in the text of an article, but not those reported in tables. Hence, our selection of results is potentially not representative of all reported results and systematically excludes results that are not reported to APA standards. Second, our analysis assumed that test statistics other than p-values were accurately reported. If test statistics and degrees of freedom are incorrectly reported, the recalculated p-values are wrong as well. We identified some erroneous test statistics (e.g., df_{1} = 0 and r > 1), but we do not know how often errors in reported test statistics and degrees of freedom occur, nor how such errors may have affected our results. We assumed that errors were made in the p-values rather than in the test statistics, given the overemphasis on p-values in current-day research.
In light of conflicting findings and interpretations, we aimed to provide final answers to the questions (1) whether a bump or monotonic excess of p-values below .05 exists in psychology and (2) whether evidence for a bump increased over time in psychology. Answering these questions may inform us about the prevalence of QRPs in psychology and its development over time. Using statcheck, we extracted and analyzed 258,050 test results conforming to APA style from 30,710 articles in eight high-impact journals in psychology, and distinguished between inexactly reported p-values, exactly reported p-values, and recalculated p-values. The basic idea underlying our analyses is that QRPs distort the p-value distribution. We argued that only some QRPs yield an excess of p-values just below .05, and showed that QRPs sometimes yield a bump and sometimes only a monotonic excess of p-values just below .05. We used the Caliper test to test for a bump, and proposed two measures to examine monotonic excess.
Starting with the existence of a bump in psychology, we drew the following conclusions. First, inexactly reported p-values are not useful for analyses of p-value distributions. Second, a bump in exactly reported p-values does exist in the psychology journals DP, JAP, and JPSP. QRPs leading to just-significant p-values can explain these bumps, but we cannot rule out the explanation that scientists publishing in these journals are more prone to exactly report p-values just below .05 (e.g., to emphasize that they are really smaller than .05) than p-values considerably smaller than .05. Third, contradicting Leggett et al. (2013), neither the bump nor the evidence for it increased over the years. Fourth, when analyzing only the p-values reported as exactly equal to .05, we obtained clear and direct evidence for the QRP “incorrect rounding of p-value” (John, Loewenstein & Prelec, 2012). Evidence of this QRP, which contributed to the bump in exactly reported p-values in psychology, was found in all psychology journals. Fifth, after removing reporting errors and analyzing the recalculated p-values, evidence of a bump was found only for JPSP. Again, this may have been caused by QRPs or by scientists being more prone to report all test statistics when p-values are just below .05 than when they are considerably smaller than .05.
The conclusions obtained with the two measures of a bump and monotonic excess are not satisfactory. First, the performance of both measures depends on accurately recovering the parameters of the effect size distribution, which turned out to be difficult: estimates of effect size heterogeneity and average effect size are highly correlated and unstable when based on only statistically significant findings. Second, simulations showed that one of the measures, D, does not accurately assess the QRP data peeking when effect sizes are heterogeneous. Third, even though the performance of the other measure (i.e., the difference between effect size estimates based on contaminated and supposedly uncontaminated p-values) is affected by these estimation problems, it correctly signaled data peeking in the simulations. Fourth, when applied to the observed distribution of significant p-values in psychology, this measure found evidence of a monotonic excess of p-values: the average effect size estimate based on all these p-values was 0, which seems very unrealistic and suggests the use of QRPs in psychology leading to p-values just below .05.
Notwithstanding this outcome, suggesting QRPs that cause monotonic excess, we do not consider it direct evidence of such QRPs in psychology. Lakens (2015, p. 3) suggests that “it is essential to use a model of p-value distributions before drawing conclusions about the underlying reasons for specific distributions of p-values extracted from the scientific literature.” We explicitly modeled the effect size distribution and, by using the degrees of freedom of the test results, also modeled the effects’ power and the p-value distribution. But we fear this is not, and cannot be, sufficient. First of all, we could not accurately recover the effect size distribution under heterogeneity in our simulation study, even when all assumptions of our model were met. This rendered measure D unfruitful under heterogeneity and severely limited the usefulness of the measure that compares estimated average effect sizes. Second, devising other models may yield other results and thereby other interpretations (Benjamini & Hechtlinger, 2014; Goodman, 2014; Lakens, 2015a; De Winter & Dodou, 2015).
Results of all the aforementioned models are most likely not robust to violations of their assumptions. For instance, we assume a normal distribution of true effect sizes. This assumption is surely violated, since the reported p-values arise from a mixture of many different types of effects, such as very large effects (manipulation checks), effects corresponding to main hypotheses, and zero effects (‘control’ variables). Additionally, consider the QRPs themselves: we examined the effect of only one QRP, data peeking, in one of its limited variants. Other QRPs also increase the prevalence of p-values just below .05, such as trying multiple operationalizations of a measure and selecting the first significant one. Still other QRPs even increase the frequency of very small p-values (R Van Aert, J Wicherts & M Van Assen, 2016, unpublished data). We deem it impossible to accurately model QRPs and their effects, considering the difficulties we demonstrated even when modeling the p-value distribution generated by a single, clearly defined QRP. To conclude, we fear that Gelman & O’Rourke (2014) may be right in suggesting that drawing conclusions about any QRP based on modeling p-value distributions obtained from automatically extracted results is unfruitful.
On the other hand, we do recommend modeling effect size and p-value distributions of results that all intend to test the same hypothesis, to prevent contamination by irrelevant test results (Bishop & Thompson, 2016; Simonsohn, Simmons & Nelson, 2015). Examples of methods that focus on similar results are p-uniform (Van Assen, Van Aert & Wicherts, 2015) and p-curve (Simonsohn, Nelson & Simmons, 2014), which model statistically significant test statistics pertaining to one specific effect and estimate the effect size based on these statistics while correcting for publication bias. Further research should reveal whether both methods can also be used to detect and correct for p-hacking in the context of estimating one particular effect size. Preliminary results suggest, however, that detecting and correcting for p-hacking based on statistics alone is rather challenging (R Van Aert, J Wicherts & M Van Assen, 2016, unpublished data).