This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
In this opinion piece, we outline two shortcomings in experimental design that limit the claims that can be made about concept learning in animals. On the one hand, most studies of concept learning train too few concepts in parallel to support general claims about the subjects' capacity for subsequent abstraction. On the other hand, even studies that train many stimulus categories in parallel test only one or two stimuli at a time, allowing even a simplistic learning rule to succeed by making informed guesses. To demonstrate these shortcomings, we include simulations performed using an off-the-shelf image classifier. These simulations demonstrate that, when either training or testing is overly simplistic, a classification algorithm that is incapable of abstraction nevertheless achieves levels of performance that have been described in the literature as proof of concept learning in animals.
This updated version contains minor modifications for clarification purposes.
When training requires that only two categories be distinguished, the classifier (and therefore the organism) need only identify some difference that distinguishes them, and nothing more. The result is a tailor-made classifier: tailored to the specifics of the binary training paradigm, and optimized solely for that dichotomous discrimination.
As a psychologist, I use binary classification tasks quite often. My approach (and the approach that I recommend) is to let an SVM algorithm do the classification. If it performs above chance, I look at the minima and maxima (i.e., the typical TRUE or FALSE stimulus) of the SVM classifier. If I find that the SVM exploits some stimulus idiosyncrasies (what you call "specifics of the binary training paradigm"), I improve the stimuli. I iterate this process until I obtain stimuli on which the SVM performs at chance. Then I conclude that there can't be any "tailor-made classifier", since the SVM, a state-of-the-art classifier, doesn't find one.
However, it is difficult to assess what proportion of correct responses are genuine classifications and what proportion are merely lucky guesses. An accuracy of 70% on a binary test could mean that the subject is guessing at random on more than half of the trials.
The usual way to assess this is to compute the (say, 95%) confidence interval of the accuracy using the binomial distribution. You then get a statement like "accuracy is 70% [40, 90]" or "accuracy is 70% [69, 71]" (in "x [l, u]", x is the mean, l is the lower bound and u is the upper bound). The latter case excludes the possibility of guessing.
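To make this concrete, here is a minimal sketch (the function name `binomial_ci` is invented for illustration; it uses the normal approximation to the binomial, whereas an exact Clopper-Pearson interval behaves similarly at these sample sizes) showing how the interval narrows as the trial count grows:

```python
import math

def binomial_ci(k, n, z=1.96):
    """Approximate 95% confidence interval for accuracy, using the
    normal approximation to the binomial distribution."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# 7 correct out of 10 trials vs. 7,000 correct out of 10,000 trials
print("n=10:     %.2f [%.2f, %.2f]" % binomial_ci(7, 10))        # wide interval
print("n=10000:  %.3f [%.3f, %.3f]" % binomial_ci(7000, 10000))  # narrow interval
```

With 10 trials the interval is roughly [0.42, 0.98]; with 10,000 trials it shrinks to roughly [0.691, 0.709].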
Regarding your simulations, I assume that the airplane and motorbike images also contained background. In that case the category "background" overlaps with the other categories, and the poor performance on the background category is not surprising. I assume behavioral experiments would not use overlapping categories. If we exclude the "background" category, I do not think that the results in the left panel of figure 1 support your conclusion.
My approach (and the approach that I recommend) is to let an SVM algorithm do the classification.
I agree that this is an effective method for identifying sets of stimuli that provide subjects with very few shortcuts. In effect, your approach reduces the number of ways in which the classifier's training phase may be tailored to the task. While generating normalized stimuli is an important approach (in that it yields classification under extreme conditions), the study of pictorial discrimination must also grapple with behavior when the stimuli are plausible and photographic. If, for example, subjects rely on specific features, that is part of the phenomenon we would like to understand. So, eventually, studies using non-normalized stimuli become necessary, and the question then arises of how best to pursue them.
The usual way to assess this is to compute the (say, 95%) confidence interval of the accuracy using the binomial distribution.
Let us suppose that accuracy is indeed 70%, with a 95% confidence interval that spans only +/- 1%. Given the observed accuracy, the width of the confidence interval is determined by the number of trials: for such a narrow CI, you would need about 10,000 trials, 7,000 of which were correct. Now, the question is: how many of those 7,000 correct responses were guesses? If subjects know the correct answer with perfect certainty on 4,000 of those trials, and guess at random on the remaining 6,000, then their accuracy will be 70% (0.4 + 0.3). A confidence interval for the probability provides no clue as to which responses are guesses and which are made with considerable confidence (and, in practice, responses will probably instead fall along a gradient of confidence). Put another way, responding is an unknown mixture of responses made with varying levels of subjective confidence, and because computing a confidence interval relies on the assumption of exchangeability between observations, there is no way to infer precisely what that distribution is from the responses alone.
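The mixture scenario above can be simulated in a few lines (a minimal sketch; the split of 4,000 certain trials and 6,000 coin-flip guesses is the hypothetical from the text, not data from any experiment):

```python
import random

random.seed(1)
n_known, n_guess = 4000, 6000  # trials answered with certainty vs. pure guesses

correct = n_known  # certain trials are always correct
correct += sum(random.random() < 0.5 for _ in range(n_guess))  # coin-flip guesses

accuracy = correct / (n_known + n_guess)
print(f"overall accuracy: {accuracy:.3f}")  # about 0.70
```

The resulting 70% is indistinguishable, from accuracy alone, from a subject who classifies every trial with uniform 70% competence.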
If we exclude the "background" category, I do not think that the results in the left panel of figure 1 support your conclusion.
I agree, because the "background" category is central to the argument we are making: of the stimulus categories included, it is the only one that is defined in abstract terms (i.e., by the absence of a foreground object). The point we are making with that figure is that the bag-of-features algorithm (which is powerful, but not capable of sophisticated abstraction) reveals its limitations only when trained on a large number of categories simultaneously. When trained on only a few, it identified "background" consistently, despite its lack of abstract cognition, and it is therefore unwise to treat high accuracy on the "background" category as indicative of conceptual abstraction. Training many categories in parallel reveals the relative strengths and weaknesses of the classifier much more effectively than training them in pairs or triplets.
The studies of concept learning that I know make the following claim: the discrimination performance cannot be explained by chance, so we can conclude that subjects are able to discriminate the concept under study. For this claim, it really does not matter whether the performance is 90% [89, 91] or 55% [54, 56]. It does not matter whether the subject made mistakes because it was tired, because the experimental conditions were not ideal, or because "responding is an unknown mixture of responses made with varying levels of subjective confidence". As long as one obtains above-chance performance, one can conclude that on at least some trials, with some above-chance accuracy, the subject was able to discriminate the concept.
You are right that a problem arises if the concept category is set against a single simple category: the simple category can then be used to identify the concept as its complement. In such a case, signal detection theory may be useful; one should look specifically at the performance on the concept category.
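As a sketch of how signal detection theory separates sensitivity from response bias (treating correct "concept" responses as hits and simple-category stimuli misclassified as "concept" as false alarms; the rates below are invented for illustration):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index d' = z(hits) - z(false alarms)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Two subjects, both at 70% overall accuracy (with equal base rates),
# but with different sensitivity to the concept category:
print(d_prime(0.70, 0.30))  # unbiased responding, d' about 1.05
print(d_prime(0.95, 0.55))  # liberal bias toward "concept", d' about 1.52
```

Both subjects score 70% overall, yet d' reveals that the second is genuinely more sensitive to the concept and is merely paying for a liberal criterion with false alarms.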
I gave it some more thought. I think your idea is sound, though I would put it in more general and formal terms with the help of statistical learning theory. Each simple category is defined by a vector in a multidimensional feature space. The concept category is some irregular cloud/blob of convoluted shape. If you have one simple category and one concept category, you can always define a hyperplane that discriminates well between the two (just take the vector that defines the simple category as the normal vector of the hyperplane). With multiple simple categories, especially if these are orthogonal, it becomes increasingly difficult to find a simple classification rule, i.e., a hyperplane. Viewed this way, you could draw a nice picture that shows the blob and some simple categories, and how it becomes impossible to create a hyperplane when the number of categories is high.
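This geometric picture can also be checked numerically. In the rough sketch below (the synthetic data and the plain logistic-regression fit are invented for illustration; any linear classifier would serve), a single hyperplane easily separates the "concept" blob from one simple category, but fails once a second, opposed simple category is added:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Concept" category: an irregular ring around the origin
theta = rng.uniform(0, 2 * np.pi, 200)
radius = rng.uniform(1.0, 2.0, 200)
concept = np.c_[radius * np.cos(theta), radius * np.sin(theta)]

# Simple categories: tight clusters far from the origin, on opposite sides
simple_a = rng.normal([5.0, 0.0], 0.3, (100, 2))
simple_b = rng.normal([-5.0, 0.0], 0.3, (100, 2))

def best_hyperplane_accuracy(X, y, epochs=3000, lr=0.1):
    """Fit a single hyperplane by logistic regression (plain gradient
    descent) and return its training accuracy."""
    Xb = np.c_[X, np.ones(len(X))]  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return float(np.mean((Xb @ w > 0) == y))

# One simple category vs. the concept: a single hyperplane separates them
X1 = np.vstack([concept, simple_a])
y1 = np.r_[np.zeros(200), np.ones(100)]
acc_one = best_hyperplane_accuracy(X1, y1)

# Two opposed simple categories vs. the concept: no hyperplane suffices
X2 = np.vstack([concept, simple_a, simple_b])
y2 = np.r_[np.zeros(200), np.ones(200)]
acc_two = best_hyperplane_accuracy(X2, y2)

print(f"one simple category:   {acc_one:.2f}")  # near-perfect
print(f"two simple categories: {acc_two:.2f}")  # capped well below 1.0
```

With the clusters placed on opposite sides of the blob, the best any hyperplane can do in the second task is to sacrifice one cluster or part of the ring, so accuracy is bounded well below 100%.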
In any case, I would, if not discuss, then at least mention in the report that confidence intervals and signal detection theory do not resolve the problems with binary classification tasks.