Thanks for the feedback, it's much appreciated.
My approach (and the approach that I recommend) is to let the SVM algorithm do the classification.
I agree that this is an effective method for identifying sets of stimuli that provide subjects with very few shortcuts. In effect, your approach reduces the number of ways in which the classifier's training phase may be tailored to the task. While generating normalized stimuli is an important approach (in that it yields classification under extreme conditions), the study of pictorial discrimination must also grapple with behavior when the stimuli are plausible and photographic. If, for example, subjects rely on specific features, that's a part of the phenomenon that we would like to understand. So, eventually, studies using non-normalized stimuli become necessary, and the question then arises of how best to pursue them.
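For concreteness, here is a minimal sketch of one way such an SVM screening step could look, assuming features have already been extracted; the feature representation, sample sizes, and chance criterion are placeholders rather than the procedure actually used, and scikit-learn is assumed:

```python
# Sketch: screen candidate stimuli with an SVM trained on simple image features.
# If a cross-validated SVM cannot tell the categories apart from such features,
# the stimulus set offers few low-level shortcuts. Data here are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 256))       # stand-in for extracted image features
y = rng.integers(0, 2, size=200)      # stand-in for the two category labels

scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f} (chance = 0.50)")
# Near-chance accuracy: no obvious feature-based shortcut in this set.
# Accuracy well above chance: the set offers shortcuts and should be revised.
```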
The usual way to assess this is to compute the (say, 95%) confidence interval of accuracy with the binomial distribution.
Let us suppose that accuracy is indeed 70%, with a 95% confidence interval that spans only +/- 1%. For a given accuracy, the width of the confidence interval is determined by the number of trials: for such a narrow CI, you'd need about 10,000 trials, 7,000 of which were correct. Now, the question is: Of those 7,000 correct responses, how many were guesses? If subjects know the correct answer with perfect certainty on 4,000 of the 10,000 trials and guess at random between the two alternatives on the remaining 6,000, their expected accuracy will still be 70% (0.4 + 0.5 × 0.6 = 0.7). A confidence interval for the probability provides no clue as to which responses are guesses and which are made with considerable confidence (and, in practice, responses will probably fall along a gradient of confidence rather than splitting neatly into two groups). Put another way, responding is an unknown mixture of responses made with varying levels of subjective confidence, and because computing a confidence interval relies on the assumption of exchangeability between observations, there is no way to infer precisely what that mixture is from the responses alone.
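To put numbers on this (a sketch only, using the counts from the example above; statsmodels is assumed, and the Wilson interval is one common choice):

```python
# Sketch: the binomial CI pins down overall accuracy precisely, but it cannot
# tell apart different mixtures of guessing and certainty that produce it.
from statsmodels.stats.proportion import proportion_confint

n_trials, n_correct = 10_000, 7_000
lo, hi = proportion_confint(n_correct, n_trials, alpha=0.05, method="wilson")
print(f"accuracy = {n_correct / n_trials:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

# Two very different response processes with the same expected accuracy:
certain_plus_guesses = (4_000 * 1.0 + 6_000 * 0.5) / n_trials   # 0.70
uniformly_uncertain = (10_000 * 0.7) / n_trials                 # 0.70
print(certain_plus_guesses, uniformly_uncertain)  # the CI cannot distinguish them
```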
If we exclude the "background" category, I do not think that the results in Figure 1 (left) support your conclusion.
I agree, because the "background" category is central to the argument we're making: Of the stimulus categories included, it is the only one that is defined in abstract terms (i.e. by the absence of a foreground object). The point we are making with that figure is that the bag-of-features algorithm (which is powerful, but not capable of sophisticated abstraction) reveals its limitations only when it is trained on a large number of categories simultaneously. When it is trained on only a few, "background" is identified consistently, despite the algorithm's lack of abstract cognition, and it is therefore unwise to treat that high accuracy on the "background" category as indicative of conceptual abstraction. Training many categories in parallel reveals the relative strengths and weaknesses of the classifier much more effectively than training them in pairs or groups of three.
The studies of concept learning that I know of make the following claim: the discrimination performance can't be explained by chance, and we can conclude that subjects are able to discriminate the concept under study. For this claim, it really doesn't matter whether the performance is 90% [89, 91] or 55% [54, 56]. It doesn't matter whether the subject made mistakes because it was tired, because the experimental conditions were not ideal, or because "responding is an unknown mixture of responses made with varying levels of subjective confidence". As long as one gets non-chance performance, one can conclude that on at least some trials, with some non-chance accuracy, the subject was able to discriminate the concept.
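In practice, that claim usually rests on a one-sided binomial test against the chance level; a minimal sketch (with placeholder counts, SciPy assumed):

```python
# Sketch: "performance can't be explained by chance" as a one-sided binomial
# test against the chance level. Counts and chance level are placeholders.
from scipy.stats import binomtest

n_trials, n_correct, chance = 200, 118, 0.5   # e.g. a two-alternative task
result = binomtest(n_correct, n_trials, p=chance, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.2f}, one-sided p = {result.pvalue:.4f}")
# A small p-value licenses only the weak conclusion above: on at least some
# trials the subject discriminated better than chance, whether the point
# estimate is 0.55 or 0.90.
```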
You are right that a problem arises if a concept category is set against a simple category. Then the simple category can be used to identify the concept as its complement. In such a case, signal detection theory may be useful: one should look specifically at the performance for the concept category.
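A minimal sketch of such an analysis (placeholder counts; the standard equal-variance Gaussian model is assumed):

```python
# Sketch: signal detection theory for the binary task, treating the concept
# category as "signal" and the simple category as "noise". Counts are placeholders.
from scipy.stats import norm

hits, misses = 70, 30                       # responses to concept-category stimuli
false_alarms, correct_rejections = 40, 60   # responses to simple-category stimuli

hit_rate = hits / (hits + misses)
fa_rate = false_alarms / (false_alarms + correct_rejections)
d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)             # sensitivity
criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))  # response bias
print(f"hit rate = {hit_rate:.2f}, false-alarm rate = {fa_rate:.2f}, "
      f"d' = {d_prime:.2f}, c = {criterion:.2f}")
# Looking at hits on concept trials separately from false alarms on
# simple-category trials examines performance for the concept category directly.
```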
I gave it some more thought. I think your idea is sound, though I would put it in more general and formal terms with the help of statistical learning theory. Each simple category is defined by a vector in a multidimensional feature space. The concept category is some irregular cloud/blob of convoluted shape. If you have one simple and one concept category, you can always define a hyperplane that discriminates well between the two (just take the vector that defines the simple category as the normal vector of the hyperplane). With multiple simple categories - especially if these are orthogonal - it becomes increasingly difficult to find a simple classification rule, i.e. a hyperplane. Viewed this way, you could draw a nice picture that shows the blob, some simple categories, and how it becomes impossible to create a hyperplane when the number of categories is high.
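A toy simulation of this picture might look as follows (a sketch only; dimensionality, distances, spreads, and sample sizes are all arbitrary, and scikit-learn is assumed):

```python
# Sketch: a tight "simple" cluster along one feature axis is easy to cut off
# from an irregular "concept" blob with a hyperplane, but as more orthogonal
# simple clusters are added, a single hyperplane no longer works well.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
dim, dist, n_concept = 20, 4.0, 200

def make_data(n_simple):
    concept = rng.normal(scale=1.0, size=(n_concept, dim))  # blob around the origin
    clusters = []
    for i in range(n_simple):                                # tight cluster at dist * e_i
        c = rng.normal(scale=0.2, size=(n_concept // n_simple, dim))
        c[:, i] += dist
        clusters.append(c)
    X = np.vstack([concept] + clusters)
    y = np.array([1] * n_concept + [0] * (len(X) - n_concept))  # 1 = concept
    idx = rng.permutation(len(y))
    return X[idx], y[idx]

for k in (1, 2, 5, 10, 20):   # number of orthogonal simple categories
    X, y = make_data(k)
    acc = cross_val_score(LinearSVC(max_iter=10_000), X, y, cv=5).mean()
    print(f"{k:2d} simple categories: linear-classifier accuracy = {acc:.2f}")
# Accuracy declines as orthogonal simple categories are added: no single
# hyperplane separates the concept blob from all of them at once.
```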
In any case, I would, if not discuss, then at least mention in the report that confidence intervals and signal detection theory do not resolve the problems with binary classification tasks.