Thanks for the link! However, on first inspection, I think the word "demonstrate" is a bit strong in light of your new data - the power of the new study is much too low to conclude anything concrete. In short, I believe you have only narrowed down the effect size to somewhere between d = -0.66 and d = 0.44.
In the experiment in which you tested for contact pheromones by comparing workers housed with a caged queen with queenless workers (i.e. the most relevant part for the debate here), the sample size was 36 vs 42 workers. This is a lot less than in any previous queen pheromone study I can think of (e.g. my quickie PeerJ bumblebee study had 502 workers), and so I think that your negative result could very well be due to low power rather than the real absence of an effect of caged queens on worker oocytes.
I also suspect that your statistical power is not really 90-100%, as you claim in the paper. Measuring a highly variable trait with a smallish sample size should not be giving you such large power values, and I suspect there is a mistake in your power calculations (as in the Proc B paper). My guess is that you are doing the power calculation at the wrong experimental level.
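To illustrate why the level matters, here is a quick toy simulation (my own made-up cage and worker numbers, not your actual design): workers that share a cage are not independent, so any power or significance calculation done at the level of individual workers overstates how much information the experiment really contains.
set.seed(1)
n_cages   <- 6    # hypothetical number of cages per treatment
n_workers <- 7    # hypothetical number of workers per cage
sim_once <- function() {
  dat <- data.frame(
    Treatment = rep(c("Caged queen", "control"), each = n_cages * n_workers),
    Cage      = rep(1:(2 * n_cages), each = n_workers))
  cage_effect <- rnorm(2 * n_cages, sd = 0.5)              # between-cage variation
  dat$Oocyte <- cage_effect[dat$Cage] + rnorm(nrow(dat))   # note: NO true treatment effect
  cage_means <- tapply(dat$Oocyte, dat$Cage, mean)
  cage_trt   <- rep(c("Caged queen", "control"), each = n_cages)
  c(per_worker = t.test(Oocyte ~ Treatment, data = dat)$p.value,   # ignores the cage structure
    per_cage   = t.test(cage_means ~ cage_trt)$p.value)            # respects the cage structure
}
res <- replicate(2000, sim_once())
rowMeans(res < 0.05)  # per-worker 'significance' rate is well above the nominal 5%; per-cage rate is ~5%
The same logic applies to power: a power figure computed per worker will look much higher than what the experiment can actually deliver.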
More importantly, post-hoc power analysis is not a useful method for judging the reliability of negative results - see Shinichi Nakagawa's papers on this topic. See also Nakagawa and Cuthill’s review of effect size and its confidence intervals, which is a substantially better way for you to show that you have high confidence in a negative result.
To get to the bottom of this, I took the liberty of downloading your data and calculating effect size and its confidence intervals, for the most pertinent result (i.e. experiment 2, where you look at the effect of a caged queen vs a no-queen control on worker oocyte size). The R code is below. As you can see, your experiment has enough power to tell us, with 95% confidence, that the effect size of a caged queen on worker oocyte size is between -0.66 and 0.44 (as measured by Cohen’s d). This means that your data are equally consistent with a very large negative effect, no effect, or a large positive effect. Thus, the sample size is too low to measure the effect size of the queen or her pheromone with enough precision.
Incidentally, there is also something odd about the distribution of the response variable in the control group - it’s weirdly bimodal, and I wonder whether you randomly allocated the bees to the treatment and control groups (given that you said you didn’t do so in the Proc B paper). See the ggplot graph I made.
If you’re interested, I am more than happy to give constructive criticism on experimental design ideas or statistical analysis on any of your future work. I agree that current data do not conclusively settle the matter of what regulates reproduction in bumblebee colonies, and more experiments are needed - but it’s not useful to keep doing underpowered or confounded experiments. Happy to advise where I can.
library(lme4)
library(ggplot2)
b <- read.csv("~/Desktop/new bumblebees.csv", stringsAsFactors = F) # You’ll need to get the Fig2aoocyte tab into a csv file and put it on the desktop
b <- b[b$Treatment %in% c("Caged queen", "control"), ] # Remove the free queen treatment, we don't need it here as it is not as informative with regards to the effects of contact pheromones
# Fit a model - for simplicity I'll assume the response variable is Gaussian (so we don’t need a GLMM)
# Note that I scaled the response variable, so it is expressed in standard deviations. This means the fixed effect of Treatment will be expressed in standard deviations too, i.e. it is effect size (as measured by Cohen's d)
# I left out 'trial' as it was not significant
model <- lmer(scale(Oocyte.size) ~ Treatment + (1|Cage.identity), data = b)
# Our best estimate of the effect size of a caged queen is -0.10, meaning she reduces oocyte size by 1/10th of a standard deviation.
# Also, there is plenty of variation between cages in worker oocyte size (~20% of the variance is between cages)
summary(model)
# You can get a crude estimate of the effect size confidence limits as the estimate +/- 1.96*SE
-0.10225 + 1.96*0.27152   # Upper 95% CI: 0.43
-0.10225 - 1.96*0.27152   # Lower 95% CI: -0.63
# You can get a slightly better estimate using bootstrapping
boot.straps <- bootMer(model, function(x) getME(x, "beta"), nsim = 1000) # bootstrap 95% confidence limits for the effect size
as.numeric(quantile(boot.straps$t[,2], probs = c(0.025, 0.975))) # The bootstrap 95% CIs on effect size are: -0.66 to 0.44
# So, we know that the effect size of treatment is somewhere between -0.66 and 0.44. This means the experiment's power is too low to say anything useful about the effect of a caged queen on worker oocyte size.
# The raw data is also pretty weird looking - there is a surprisingly clear cluster of non-reproducing control bees. Maybe they were just unhealthy, and they bring the mean down in the control? I checked, and those bees are not from a particular trial.
ggplot(b, aes(x = Treatment, y = Oocyte.size, colour = as.factor(Cage.identity), shape = as.factor(Trial))) + geom_jitter()
     
Hi Etya, I think you are missing my point here. My R code shows that you have not been able to measure the difference in means between the control and caged-queen groups with enough precision to conclude anything one way or the other. According to your small dataset, you can only determine that the difference in means is somewhere between -0.66 and +0.44 standard deviations. Thus it makes no sense to say there is no trend in the data.
I'll try to put this another way in case it is clearer: even a very large difference in means has a high chance of going undetected in a small experiment, because the risk of a 'false negative' is high when sample size (and therefore statistical power) is low. As a rule of thumb, I recommend aiming for a sample size at least as large as in the previous studies on this topic (e.g. my stuff or Tom's) - at least 50 (but preferably >100) workers in each treatment (NB I did the dissections for that 500-worker PeerJ study in 2 days). As my graph of your new data shows, your dataset contains a lot of unusual bees (e.g. that cluster of non-reproducing control bees I mentioned) - this high within-treatment variance means that you need a decent sample size to measure the difference in treatment means properly. That is, high variance in the data reduces statistical power, so you need to compensate by adding more samples.
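For a rough sense of the numbers, here is a back-of-the-envelope power calculation in base R (this assumes a simple two-group comparison with workers treated as independent; accounting for cage-level clustering would push the required numbers up further):
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)  # ~64 workers per group needed to detect d = 0.5
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8)  # ~175 workers per group needed to detect d = 0.3
power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = 0.05)       # with ~40 workers per group, power to detect d = 0.5 is only ~0.6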
While I'm here, I'd also suggest that you always do your ovary dissections blind to treatment (apparently not done in the new paper). Not working blind can introduce a lot of conscious or unconscious bias, especially when the response variable is subjective and the researcher has an interest in getting a particular outcome (say, trying to replicate a result that was the subject of a recent commentary). See here - lack of blindness is a big problem in our field: http://journals.plos.org/plosbiology/article?id=10.1371%2Fjournal.pbio.1002190. And here is a social-insect-specific article on the same topic: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0053548