Problems with the decision diagram and the recommendations

There are several problems with the decision diagram in Figure 7 and with the corresponding recommendations in the text.

First of all, the diagram exempts all planned orthogonal comparisons from any adjustment whatsoever (which has no apparent justification; see https://files.eric.ed.gov/fulltext/EJ1083896.pdf). In fact, unadjusted orthogonal tests produce higher experimentwise Type I error rates than unadjusted positively-dependent tests. If 3 orthogonal tests are conducted at .05, then there is roughly a 1 in 7 chance of at least one Type I error (when the null hypotheses are true): 1 - (1 - .05)^3 = .143 ≈ 1/7. And the fact that the tests were "planned" does not alter that mathematics. That level of Type I error inflation does not seem generally acceptable, and the inflation only worsens as the number of orthogonal tests increases.
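To make the inflation concrete, here is a minimal Python sketch (purely illustrative, not from the paper) evaluating 1 - (1 - alpha)^k for a few values of k:

```python
# Experimentwise (familywise) Type I error rate when k independent
# (orthogonal) tests are each conducted unadjusted at the same alpha.
def experimentwise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (3, 5, 10, 20):
    print(f"{k:2d} orthogonal tests at .05: {experimentwise_error(k):.3f}")
# 3 -> 0.143 (the roughly-1-in-7 figure above); 5 -> 0.226;
# 10 -> 0.401; 20 -> 0.642
```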

For "unplanned" comparisons, there is no generally valid multiple-comparisons adjustment. That is because adjustment is based on the number of tests—a number that may be misrepresented when tests are unplanned. Indeed, if the researcher is allowed to "snoop" at the data before choosing which tests to formally conduct, then the researcher is effectively conducting a (possibly indeterminate) number of informal tests during the snooping process. Thus, to control the experimentwise Type I error rate, one must adjust not only for the formally conducted tests, but also for all the tests that would have been conducted if they had looked more promising when snooping. Imagine for example that there are 100 different potential tests the researcher could potentially decide to conduct, but after peeking at the data the researcher decides to conduct only the 2 tests that look most promising. Certainly adjusting only for those 2 tests (e.g., by using a Bonferroni-adjusted alpha level of .025) would not adequately control Type I error, since by chance alone one would expect 2.5 out of 100 tests to be significant at the .025 level.

One could argue that in some unplanned-testing situations, the total number of "sensible" potential tests is finite, in which case it is possible to meaningfully adjust the unplanned tests (e.g., by adjusting for all 100 potential tests in the above example, or by using Scheffé tests to adjust for all possible linear combinations). But in that case, one is effectively conducting planned tests (by planning for every potential test).
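Continuing the simulation sketch above (reusing p, alpha, and n_potential from that block), adjusting for all 100 potential tests does control the error rate, precisely because every potential test has effectively been planned for:

```python
# Bonferroni over all 100 potential tests: threshold .05/100 = .0005.
any_rejection_full = (p < alpha / n_potential).any(axis=1)
print(f"Experimentwise Type I error rate: {any_rejection_full.mean():.3f}")
# ~0.049 -- close to the nominal .05
```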

Thus, the heuristic recommendation that planned orthogonal comparisons be unadjusted and other comparisons be adjusted does not appear justifiable. A better heuristic would be something like the following: Planned comparisons typically should be adjusted whether orthogonal or not (unless there is a specific reason to do otherwise), and unplanned comparisons typically should not be trusted (except for informal exploratory work) even if adjusted. That said, as you rightly note in the paper, there is no "one size fits all" approach that covers all multiple-testing situations.

There are other issues with the diagram as well. For instance, there are important differences between the "independent of ANOVA" methods (which, incidentally, are not actually "independent" of ANOVA in the statistical sense) that go unmentioned. For example, unlike the Tukey and Tukey–Kramer methods, Bonferroni and Šidák can be applied to unequal-variance tests (e.g., Welch's t-tests). In fact, contrary to what the diagram implies, it is the equal-variance assumption, not the normality assumption, that distinguishes procedures like Tukey and Tukey–Kramer from Games–Howell (though Games–Howell has been shown to be liberal in some cases; e.g., see https://www.tandfonline.com/doi/pdf/10.1080/00949650903219935).
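As a concrete sketch of that point (assuming scipy and statsmodels are available; the data and group labels are made up), Bonferroni and Šidák operate only on the resulting p-values and so combine readily with Welch's t-tests:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pairwise Welch (unequal-variance) t-tests, then Bonferroni/Sidak
# adjustment -- a combination Tukey/Tukey-Kramer cannot accommodate.
rng = np.random.default_rng(1)
groups = {"A": rng.normal(0, 1, 30),
          "B": rng.normal(0, 3, 30),   # deliberately unequal variances
          "C": rng.normal(0, 5, 30)}

pairs = [("A", "B"), ("A", "C"), ("B", "C")]
pvals = [stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
         for a, b in pairs]

_, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
_, p_sidak, _, _ = multipletests(pvals, alpha=0.05, method="sidak")
for (a, b), pb, ps in zip(pairs, p_bonf, p_sidak):
    print(f"{a} vs {b}: Bonferroni p = {pb:.3f}, Sidak p = {ps:.3f}")
```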

Also, as noted in my comment on Table 2, there is no reason procedures such as Bonferroni, sequential Bonferroni (i.e., Holm), and Šidák cannot be applied to nonparametric tests (such as Mann–Whitney–Wilcoxon tests).
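For instance, a minimal sketch (again assuming scipy and statsmodels; the data are made up) applying Holm to Mann–Whitney–Wilcoxon p-values:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pairwise Mann-Whitney-Wilcoxon tests, then Holm (sequential
# Bonferroni) adjustment. The adjustment uses only the p-values,
# so it is agnostic to whether the underlying tests are parametric.
rng = np.random.default_rng(2)
groups = [rng.exponential(scale=s, size=25) for s in (1.0, 1.2, 2.0)]

pairs = [(0, 1), (0, 2), (1, 2)]
pvals = [stats.mannwhitneyu(groups[i], groups[j]).pvalue for i, j in pairs]

reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for (i, j), p_adj, r in zip(pairs, p_holm, reject):
    print(f"group {i} vs group {j}: Holm-adjusted p = {p_adj:.3f}, reject = {r}")
```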
