The error-rate estimates that were obtained from the simulations appear to have been computed incorrectly. Those estimates—which are called “Proportion of type I per comparison error rates” in the Figure 3 caption, “Proportion of type I error” in the Figure 3 graphs, and “type I PCER” or simply “type I error” in some other places in the paper—are presumably intended to estimate the per-comparison Type I error rate (PCER). But an examination of the Figure 3 graphs (https://doi.org/10.7717/peerj.10387/fig-3) and the simulation R code (https://github.com/stevemidway/MultipleComparisons/blob/master/simulation_function.R) reveals that the computations are not correct.
First, consider the Figure 3 graphs. Notice that in all the “many groups” cases (labeled HSMG and LSMG in the graphs), the bars are roughly 3 times the height they should be. For instance, because the LSD procedure conducts each test at the (unadjusted) .05 level, LSD’s PCER should be .05 by definition—yet the bars for LSD in the “many groups” cases reach roughly .15 instead.
The source of the computation error appears to be in line 198 of the simulation code:
prop_type1[i,1]<-round(length(which(sig_diffs[,,i]<0.05))/(sims*groups),3)
If I understand the program correctly, i indexes the examined procedures (i=1 for “tukey,” i=2 for “snk,” etc.), sig_diffs is an array of p-values (in which each row is for a given iteration, each column is for a given pairwise combination of groups, and each “page” is for a given procedure), sims is the number of iterations, and groups is the number of groups in the design (either 3 or 7). Thus, for each procedure, that line of code computes a value (rounded to 3 decimal places) equal to the total number of statistically significant p-values divided by the product of the number of iterations and the number of groups. That is not the correct formula for estimating the PCER; it appears the authors should have replaced groups by num_combs (the number of pairwise combinations of groups) in that line of code, as follows:
prop_type1[i,1]<-round(length(which(sig_diffs[,,i]<0.05))/(sims*num_combs),3)
Or to perform the same computation more simply, and without the unnecessary rounding:
prop_type1[i, 1] <- mean(sig_diffs[, , i] < 0.05)
In other words, the estimated PCER should simply be the proportion of p-values that were less than .05 for the given procedure.
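To see the equivalence concretely, here is a minimal sketch in R; the array dimensions and the uniform random numbers standing in for p-values are invented purely for illustration and are not the paper’s simulated values:

# A toy array with the same layout as sig_diffs (iterations x pairwise
# combinations x procedures), filled with made-up uniform "p-values"
set.seed(1)
sims <- 100
num_combs <- 21
sig_diffs <- array(runif(sims * num_combs * 9), dim = c(sims, num_combs, 9))
i <- 1  # any procedure index

round(length(which(sig_diffs[, , i] < 0.05)) / (sims * num_combs), 3)  # corrected line 198
mean(sig_diffs[, , i] < 0.05)                                          # same value, before rounding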
For the “few groups” simulations, the computation ends up being correct despite the incorrect formula—but only because the number of groups (namely, 3) happens to be equal to the number of pairwise combinations of groups in that case. For the “many groups” simulations, the number of groups (namely, 7) is not equal to the number of pairwise comparisons of groups (namely, 21), making the estimate 3 times what it is supposed to be (which explains the aforementioned peculiarities in the Figure 3 graphs).
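The inflation factor is just the ratio of the number of pairwise combinations to the number of groups, which is easy to check:

# Factor by which dividing by the number of groups (instead of the number of
# pairwise combinations) inflates the estimated PCER
choose(3, 2) / 3  # "few groups":  3/3  = 1, so no inflation (by coincidence)
choose(7, 2) / 7  # "many groups": 21/7 = 3, so the estimates are tripled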
Even if the computation were corrected, there would be no apparent purpose in estimating the PCER. After all, as noted by Reviewer 2 in the peer review (https://doi.org/10.7287/peerj.10387v0.1/reviews/2), the aim of multiple-testing procedures is to control some overall error rate (e.g., the experimentwise Type I error rate). Moreover, for any single-step multiple-testing procedure, there is no need to estimate the PCER via simulation, because the PCER is, by definition, simply the comparisonwise alpha level used in the tests. For instance, the LSD procedure uses an unadjusted (.05) comparisonwise alpha level, so its PCER is .05. And Bonferroni uses a comparisonwise alpha level of .05/m (where m is the number of comparisons), so its PCER is .05/m; e.g., if there are 3 comparisons, Bonferroni’s PCER is .05/3 ≈ .0167. Incidentally, it is pointless to imply a distinction between the PCERs of “LSD Bonferroni” and “t-test Bonferroni,” since the adjustment is the same in both cases. The same goes for “LSD Šidák” versus “t-test Šidák.”
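To make that concrete, here is a short sketch of the arithmetic (not a simulation) for the two designs in the paper:

# Per-comparison alpha levels, and hence PCERs, of the single-step procedures,
# obtained directly rather than estimated by simulation
m_few  <- choose(3, 2)  # 3 pairwise comparisons ("few groups")
m_many <- choose(7, 2)  # 21 pairwise comparisons ("many groups")

0.05                         # LSD / unadjusted t-tests, in either design
0.05 / m_few                 # Bonferroni, few groups:  ~.0167
0.05 / m_many                # Bonferroni, many groups: ~.0024
1 - (1 - 0.05)^(1 / m_few)   # Sidak, few groups:       ~.0170
1 - (1 - 0.05)^(1 / m_many)  # Sidak, many groups:      ~.0024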
The problems with the computations are compounded by the low number of iterations (100), which makes the estimates imprecise even when the computations are performed correctly. Notice, for instance, that in Figure 3A each procedure’s PCER appears considerably higher in the “low sample-size, few groups” (LSFG) case than in the “high sample-size, few groups” (HSFG) case. But that should not be: LSD’s PCER should be .05 regardless of sample size, and Bonferroni’s PCER should be .05/3 ≈ .0167 regardless of sample size (because there are 3 comparisons in the “few groups” simulations). Thus, the authors’ references to differences between the low and high sample-size simulations (including their observation of the “reverse trend” in the unbalanced case) are essentially meaningless, because those differences apparently reflect mere “noise,” i.e., random variation from one batch of iterations to another. That can be confirmed simply by running the simulation program several times, graphing the results each time, and observing how inconsistent the purported trends are from one run to another (though the provided simulation program does not run without errors unless each=8 is corrected to each=9 in lines 296 and 326, since there are 9 examined procedures). In any case, the number of iterations would need to be orders of magnitude higher to obtain reasonably precise, stable estimates.