From a quick glance at the code (it is great that it is available, though it could use more whitespace), it looks like the p-values are computed from a linear model, which assumes normal and homoscedastic errors (at least for the p-values to be correct in small samples). Without each of these assumptions (linearity, homoscedastic errors, and normal errors), how many of the "positive" cases remain so? For example, which were significant with a non-parametric test, which were significant when using the bootstrap or sandwich standard errors with OLS, and which were significant when using a robust regression? It is not clear that the instructions to students addressed this.
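
To make the suggested checks concrete, here is a minimal sketch of how some of them could be run for a single slope. It is not the authors' code: the function name `ols_slope_checks`, the simulated data, and the choice of the HC3 sandwich estimator, a pairs bootstrap, and a permutation test are all my own assumptions, and it uses only NumPy. Note that the permutation test strictly tests independence of `x` and `y`, which is a stronger null than a zero slope.

```python
import numpy as np

def ols_slope_checks(x, y, n_boot=2000, n_perm=2000, seed=0):
    """Fit y ~ 1 + x by OLS and report several uncertainty checks for the slope.

    Hypothetical helper for illustration only; not taken from the reviewed code.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), x])   # design matrix with intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical standard error: relies on homoscedastic errors.
    sigma2 = resid @ resid / (n - 2)
    se_classical = np.sqrt(sigma2 * XtX_inv[1, 1])

    # HC3 sandwich standard error: allows heteroscedastic errors.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    w = (resid / (1.0 - leverage)) ** 2
    cov_hc3 = XtX_inv @ (X.T * w) @ X @ XtX_inv
    se_hc3 = np.sqrt(cov_hc3[1, 1])

    # Pairs bootstrap: resample (x, y) rows, then take a 95% percentile interval.
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        boot[b] = np.polyfit(x[idx], y[idx], 1)[0]
    boot_ci = np.percentile(boot, [2.5, 97.5])

    # Permutation test: shuffle y to break any association with x.
    perm = np.array([np.polyfit(x, rng.permutation(y), 1)[0]
                     for _ in range(n_perm)])
    p_perm = (1 + np.sum(np.abs(perm) >= abs(beta[1]))) / (n_perm + 1)

    return {
        "slope": beta[1],
        "se_classical": se_classical,
        "se_hc3": se_hc3,
        "boot_ci": (boot_ci[0], boot_ci[1]),
        "p_perm": p_perm,
    }

# Made-up heteroscedastic data, purely to exercise the function.
demo_rng = np.random.default_rng(1)
x = demo_rng.uniform(0, 10, 60)
y = 0.5 * x + demo_rng.normal(0, 0.2 + 0.3 * x)
result = ols_slope_checks(x, y)
for name, value in result.items():
    print(name, value)
```

If a case counted as "positive" under the classical standard error stops being so under the sandwich error, the bootstrap interval, or the permutation p-value, that is a sign the original conclusion leans on the modelling assumptions.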