Combining active learning suggestions
 Academic Editor
 Sebastian Ventura
 Subject Areas
 Data Mining and Machine Learning
 Keywords
 Active learning, Bandit, Rank aggregation, Benchmark, Multiclass classification
 Copyright
 © 2018 Tran et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 Tran et al. (2018) Combining active learning suggestions. PeerJ Computer Science 4:e157 https://doi.org/10.7717/peerjcs.157
Abstract
We study the problem of combining active learning suggestions to identify informative training examples by empirically comparing methods on benchmark datasets. Many active learning heuristics for classification problems have been proposed to help us pick which instance to annotate next. But what is the optimal heuristic for a particular source of data? Motivated by the success of methods that combine predictors, we combine active learners with bandit algorithms and rank aggregation methods. We demonstrate that a combination of active learners outperforms passive learning in large benchmark datasets and removes the need to pick a particular active learner a priori. We discuss challenges to finding good rewards for bandit approaches and show that rank aggregation performs well.
Introduction
Recent advances in sensors and scientific instruments have led to an increasing use of machine learning techniques to manage the data deluge. Supervised learning has become a widely used paradigm in many big data applications. This relies on building a training set of labeled examples, which is time-consuming as it requires manual annotation from human experts.
The most common approach to producing a training set is passive learning, where we randomly select an instance from a large pool of unlabeled data to annotate, and we continue doing this until the training set reaches a certain size or until the classifier makes sufficiently good predictions. Depending on how the underlying data is distributed, this process can be quite inefficient. Alternatively we can exploit the current set of labeled data to identify more informative unlabeled examples to annotate. For instance we can pick examples near the decision boundary of the classifier, where the class probability estimates are uncertain (i.e., we are still unsure which class the example belongs to).
Many active learning heuristics have been developed to reduce the labeling bottleneck without sacrificing the classifier performance. These heuristics actively choose the most informative examples to be labeled based on the predicted class probabilities. “Overview of Active Learning” describes two families of algorithms in detail: uncertainty sampling and version space reduction.
In this paper, we present a survey of how we can combine suggestions from various active learning heuristics. In supervised learning, combining predictors is a well-studied problem. Many techniques such as AdaBoost (Freund & Schapire, 1996) (which averages predictions from a set of models) and decision trees (Breiman et al., 1984) (which select one model for making predictions in each region of the input space) have been shown to perform better than just using a single model. Inspired by this success, we propose to combine active learning suggestions with bandit and rank aggregation methods in “Combining Suggestions.”
The use of bandit algorithms to combine active learners has been studied before (Baram, El-Yaniv & Luz, 2004; Hsu & Lin, 2015). Borda count, a simple rank aggregation method, has been used in the context of multi-task learning for linguistic annotations (Reichart et al., 2008), where we have one active learner selecting examples to improve the performance of multiple related tasks (e.g., part-of-speech tagging and named entity recognition). Borda count has also been used in multi-label learning (Reyes, Morell & Ventura, 2018) to combine uncertainty information from multiple labels. As far as we know, other aggregation methods have not been explored and our work is the first time that social choice theory is used to rank and aggregate suggestions from multiple active learners.
This paper makes the following two main contributions:
We empirically compare four bandit and three rank aggregation algorithms in the context of combining active learning heuristics. We apply these algorithms to 11 benchmark datasets from the UCI Machine Learning Repository (Lichman, 2013) and a large dataset from the Sloan Digital Sky Survey (SDSS) (Alam et al., 2015). The experimental setup and discussion are described in “Experimental Protocol, Results, and Discussion.”
We propose two metrics for evaluation: the mean posterior balanced accuracy (MPBA) and the strength of an algorithm. The MPBA extends the metric proposed in Brodersen et al. (2010) from the binary to the multiclass setting. This is an accuracy measure that takes class imbalance into account. The strength measure is a variation on the deficiency measure used in Baram, El-Yaniv & Luz (2004) which evaluates the performance of an active learner or combiner, relative to passive learning. The main difference between our measure and that of Baram, El-Yaniv & Luz (2004) is that ours assigns a higher number for better active learning methods and ensures that it is upper-bounded by 1 for easier comparison across datasets.
Overview of Active Learning
In this paper we consider the binary and multiclass classification settings where we would like to learn a classifier h, which is a function that maps some feature space $\mathcal{X}\subseteq {\mathbb{R}}^{d}$ to a probability distribution over a finite label space 𝒴: (1) $$h:\mathcal{X}\to p(\mathcal{Y})$$
In other words, we require that the classifier produces class probability estimates for each unlabeled example. For instance, in logistic regression with only two classes, i.e., $\mathcal{Y}=\{0,1\}$, we can model the probability that an object with feature vector x belongs to the positive class with (2) $$h(\mathit{x};\boldsymbol{\theta})=\mathbb{P}(y=1\mid \mathit{x};\boldsymbol{\theta})=\frac{1}{1+e^{-\boldsymbol{\theta}^{T}\mathit{x}}}$$ and the optimal weight vector θ is learned in training. We can further consider kernel logistic regression, where the feature space $\mathcal{X}$ is the feature space corresponding to a given kernel, allowing for nonlinear decision functions.
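As a small illustration of Eq. (2), the sketch below evaluates the binary logistic model for one feature vector and returns the full distribution over the two classes. The function names, the weight vector and the feature vector are hypothetical and chosen only for this example; learning θ is not shown.

```python
import math

def positive_class_probability(x, theta):
    """P(y = 1 | x; theta) for binary logistic regression, as in Eq. (2)."""
    z = sum(t * xi for t, xi in zip(theta, x))  # inner product theta^T x
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(x, theta):
    """The distribution [P(y=0 | x), P(y=1 | x)] that the classifier h maps x to."""
    p1 = positive_class_probability(x, theta)
    return [1.0 - p1, p1]

# Hypothetical weights and feature vector, for illustration only.
theta = [0.5, -1.0]
x = [2.0, 1.0]  # theta^T x = 0, so x lies exactly on the decision boundary
boundary_probs = class_probabilities(x, theta)
print(boundary_probs)
```

A point on the decision boundary receives equal probability for both classes, which is the kind of example that the heuristics discussed later tend to favor.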
In active learning, we use the class probability estimates from a trained classifier to estimate a score of informativeness for each unlabeled example. In pool-based active learning, where we select an object from a pool of unlabeled examples at each time step, we require that some objects have already been labeled. In practice this normally means that we label a small random sample at the beginning. These become the labeled training set ${\mathcal{L}}_{T}\subset \mathcal{X}\times \mathcal{Y}$ and the rest form the unlabeled set $\mathcal{U}\subset \mathcal{X}$.
Now consider the problem of choosing the next example in 𝒰 for querying. Labeling can be a very expensive task, because it requires using expensive equipment or human experts to manually examine each object. Thus, we want to be smart in choosing the next example. This motivates us to come up with a rule s(x; h) that gives each unlabeled example a score based only on its feature vector x and the current classifier h. Recall that the classifier produces p(𝒴), a probability estimate for each class. We use these probability estimates from the classifier over the unlabeled examples to calculate the scores: (3) $$s:p(\mathcal{Y})\to \mathbb{R}$$ The value of s(x; h) indicates the informativeness of example x, where bigger is better. We would then label the example with the largest value of s(x; h). This will be our active learning rule r: (4) $$r(\mathcal{U};h)=\underset{x\in \mathcal{U}}{\mathrm{arg}\mathrm{max}}s(x;h)$$
Algorithm 1 outlines the standard pool-based active learning setting.
Input: unlabeled set 𝒰, labeled training set ℒ_{T}, classifier h(x), and active learner $r(\mathcal{U};h)$.
repeat
Select the most informative candidate x_{*} from 𝒰 using the active learning rule $r(\mathcal{U};h)$.
Ask the expert to label x_{*}. Call the label y_{*}.
Add the newly labeled example to the training set: ${\mathcal{L}}_{T}\leftarrow {\mathcal{L}}_{T}\cup \{({\mathit{x}}_{*},{y}_{*})\}$.
Remove the newly labeled example from the unlabeled set: $\mathcal{U}\leftarrow \mathcal{U}\setminus \{{\mathit{x}}_{*}\}$.
Retrain the classifier h(x) using ℒ_{T}.
until we have enough training examples.
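The loop in Algorithm 1 can be sketched as follows. This is a minimal illustration, not the experimental code: the classifier, score function and oracle are passed in as callables, and the toy one-dimensional classifier, the labeling rule inside `oracle` and the query budget are all hypothetical stand-ins.

```python
import math

def pool_based_active_learning(unlabeled, labeled, train, score, oracle, budget):
    """Sketch of Algorithm 1; stops after `budget` queries or when the pool is empty."""
    unlabeled, labeled = list(unlabeled), list(labeled)
    h = train(labeled)
    for _ in range(budget):
        if not unlabeled:
            break
        # Active learning rule r(U; h): index of the highest-scoring candidate.
        best = max(range(len(unlabeled)), key=lambda i: score(h(unlabeled[i])))
        x_star = unlabeled.pop(best)        # remove x* from U
        y_star = oracle(x_star)             # ask the expert for the label
        labeled.append((x_star, y_star))    # add (x*, y*) to L_T
        h = train(labeled)                  # retrain the classifier
    return h, labeled, unlabeled

# --- Toy ingredients (all hypothetical, for illustration only) ---
def train_threshold(labeled):
    """1-D classifier: a logistic curve centred midway between the class means."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    mid = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    def h(x):
        p1 = 1.0 / (1.0 + math.exp(-(x - mid)))
        return [1.0 - p1, p1]
    return h

def least_confidence(probs):
    """Sign-flipped probability of the most likely class (bigger is better)."""
    return -max(probs)

def oracle(x):
    """Stands in for the human expert; the true boundary is at 3.4."""
    return 1 if x > 3.4 else 0

pool = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
seed_labels = [(-1.0, 0), (7.0, 1)]
h, labeled, remaining = pool_based_active_learning(
    pool, seed_labels, train_threshold, least_confidence, oracle, budget=2)
queried = [x for x, _ in labeled[len(seed_labels):]]
print(queried)
```

In this toy run the two queries fall on either side of the true boundary, which is the behavior that motivates picking examples near the decision boundary rather than sampling at random.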
Coming up with an optimal rule is itself a difficult problem, but there have been many attempts to derive good heuristics. Five common ones, which we shall use in our experiments, are described in “Uncertainty Sampling” and “Version Space Reduction.”
There are also heuristics that involve minimizing the variance or maximizing the classifier certainty of the model (Schein & Ungar, 2007), but they are computationally expensive. For example, in the variance minimization heuristic, the score of a candidate example is the expected reduction in the model variance if that example were in the training set. To compute this reduction, we first need to give the example each of the possible labels, add it to the training set, and update the classifier. This is expensive to run since in each iteration, the classifier needs to be retrained k × U times, where k is the number of classes and U is the size of the unlabeled pool. There are techniques to speed this up such as using online training or assigning a score to only a small subset of the unlabeled pool. Preliminary experiments showed that these heuristics do not perform as well as the simpler ones (Tran, 2015), so we do not consider them in this paper.
A more comprehensive treatment of these active learning heuristics can be found in Settles (2012).
Uncertainty sampling
Lewis & Gale (1994) introduced uncertainty sampling, where we select the instance whose class membership the classifier is least certain about. These tend to be points that are near the decision boundary of the classifier. Perhaps the simplest way to quantify uncertainty is the least confidence heuristic (Culotta & McCallum, 2005), where we pick the candidate whose most likely label the classifier is most uncertain about: (5) $${r}_{LC}(\mathcal{U};h)=\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\underset{y\in \mathcal{Y}}{\mathrm{max}}\,p(y\mid \mathit{x};h)\right\}$$ where p(y | x; h) is the probability that the object with feature vector x belongs to class y under classifier h. For consistency, we have flipped the sign of the score function so that the candidate with the highest score is picked.
A second option is to calculate the entropy (Shannon, 1948), which measures the amount of information needed to encode a distribution. Intuitively, the closer the class probabilities of an object are to a uniform distribution, the higher its entropy will be. This gives us the heuristic of picking the candidate with the highest entropy of the distribution over the classes: (6) $${r}_{HE}(\mathcal{U};h)=\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\sum _{y\in \mathcal{Y}}p(y\mid \mathit{x};h)\log [p(y\mid \mathit{x};h)]\right\}$$ As a third option we can pick the candidate with the smallest margin, which is defined as the difference between the two highest class probabilities (Scheffer, Decomain & Wrobel, 2001): (7) $${r}_{SM}(\mathcal{U};h)=\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\left(\underset{y\in \mathcal{Y}}{\mathrm{max}}\,p(y\mid \mathit{x};h)-\underset{z\in \mathcal{Y}\setminus \{{y}^{*}\}}{\mathrm{max}}\,p(z\mid \mathit{x};h)\right)\right\}$$ where ${y}^{*}=\underset{y\in \mathcal{Y}}{\mathrm{arg\,max}}\,p(y\mid \mathit{x};h)$ and we again flip the sign of the score function. Since the sum of all probabilities must be 1, the smaller the margin is, the harder it is to differentiate between the two most likely labels.
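The three uncertainty sampling scores can be written directly from Eqs. (5) to (7). The sketch below is illustrative only: the function names and the three class distributions in `pool_probs` are made up, and each score takes the class probabilities that a classifier would produce for one candidate.

```python
import math

def least_confidence_score(probs):
    """Eq. (5): sign-flipped probability of the most likely class."""
    return -max(probs)

def entropy_score(probs):
    """Eq. (6): Shannon entropy of the class distribution (0 log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin_score(probs):
    """Eq. (7): sign-flipped gap between the two highest class probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return -(top - second)

def select(pool_probs, score):
    """Active learning rule: index of the candidate with the largest score."""
    return max(range(len(pool_probs)), key=lambda i: score(pool_probs[i]))

# Hypothetical class probabilities for three candidates in a 3-class problem.
pool_probs = [
    [0.90, 0.05, 0.05],   # confidently classified
    [0.50, 0.49, 0.01],   # two classes nearly tied
    [0.40, 0.30, 0.30],   # close to uniform
]
choices = {
    "confidence": select(pool_probs, least_confidence_score),
    "entropy": select(pool_probs, entropy_score),
    "margin": select(pool_probs, margin_score),
}
print(choices)
```

On this made-up pool, least confidence and entropy both prefer the near-uniform candidate while the margin heuristic prefers the candidate with two nearly tied classes, so the heuristics need not agree on which example to query.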
An extension to the above three heuristics is to weight the score with the information density so that we give more importance to instances in regions of high density: (8) $${s}_{ID}(\mathit{x};h)=\left(\frac{1}{U}\sum _{k=1}^{U}\text{sim}(\mathit{x},{\mathit{x}}^{(k)})\right)s(\mathit{x};h)$$ where h is the classifier, s(x; h) is the original score function of the instance with feature vector x, U is the size of the unlabeled pool, and sim(x, x^{(k)}) is the similarity between x and another instance x^{(k)} using the Gaussian kernel with parameter γ: (9) $$\text{sim}(\mathit{x},{\mathit{x}}^{(k)})=\exp \left(-\gamma \Vert \mathit{x}-{\mathit{x}}^{(k)}{\Vert }^{2}\right)$$
The information density weighting was proposed by Settles & Craven (2008) to discourage the active learner from picking outliers. Although the class membership of outliers might be uncertain, knowing their labels would probably not affect the classifier performance on the data as a whole.
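A minimal sketch of the information density weighting in Eqs. (8) and (9) is given below. The function names, the value of γ and the four-point pool are hypothetical. The example assumes a non-negative base score (such as entropy), so that a larger density weight always increases the final score; for sign-flipped scores the interaction with the weight would need separate care.

```python
import math

def gaussian_similarity(x, x_k, gamma):
    """Eq. (9): Gaussian kernel similarity between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_k))
    return math.exp(-gamma * sq_dist)

def information_density(x, pool, gamma):
    """Average similarity between x and every instance in the unlabeled pool."""
    return sum(gaussian_similarity(x, x_k, gamma) for x_k in pool) / len(pool)

def density_weighted_score(x, pool, base_score, gamma=1.0):
    """Eq. (8): base score s(x; h) multiplied by the information density of x.

    Assumes base_score >= 0; gamma=1.0 is an arbitrary illustrative default.
    """
    return information_density(x, pool, gamma) * base_score

# Hypothetical pool: three clustered points and one outlier.
pool = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
same_base_score = 1.0  # pretend the classifier is equally unsure about both points
clustered = density_weighted_score(pool[0], pool, same_base_score)
outlier = density_weighted_score(pool[3], pool, same_base_score)
print(clustered, outlier)
```

With identical base scores, the clustered point ends up with roughly three times the weighted score of the outlier, which is the intended effect of discouraging outlier queries.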
Version space reduction
Instead of focusing on the uncertainty of individual predictions, we could instead try to constrain the size of the version space, thus allowing the search for the optimal classifier to be more precise. The version space is defined as the set of all possible classifiers that are consistent with the current training set. To quantify the size of this space, we can train a committee of B classifiers, ℬ = {h_{1}, h_{2}, …, h_{B}}, and measure the disagreement among the members of the committee about an object’s class membership. Ideally, each member should be as different from the others as possible but still be in the version space (Melville & Mooney, 2004). In order to have this diversity, we give each member only a subset of the training examples. Since there might not be enough training data, we need to use bootstrapping and select samples with replacement. Hence this method is often called Query by Bagging (QBB).
One way to measure the level of disagreement is to calculate the margin using the class probabilities estimated by the committee (Melville & Mooney, 2004): (10) $${r}_{\text{QBBM}}(\mathcal{U};h)=\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\left(\underset{y\in \mathcal{Y}}{\mathrm{max}}\,p(y\mid \mathit{x};\mathcal{B})-\underset{z\in \mathcal{Y}\setminus \{{y}^{*}\}}{\mathrm{max}}\,p(z\mid \mathit{x};\mathcal{B})\right)\right\}$$ where (11) $${y}^{*}=\underset{y\in \mathcal{Y}}{\mathrm{arg\,max}}\,p(y\mid \mathit{x};\mathcal{B})$$ (12) $$p(y\mid \mathit{x};\mathcal{B})=\frac{1}{B}\sum _{b=1}^{B}p(y\mid \mathit{x};{h}_{b})$$
This looks similar to one of the uncertainty sampling heuristics, except now we use p(y | x; ℬ) instead of p(y | x; h). That is, we first average out the class probabilities predicted by the members before minimizing the margin. McCallum & Nigam (1998) offered an alternative disagreement measure which involves picking the candidate with the largest mean Kullback–Leibler (KL) divergence from the average: (13) $${r}_{\text{QBBKL}}(\mathcal{U};h)=\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{\frac{1}{B}\sum _{b=1}^{B}{D}_{\text{KL}}({p}_{b}\parallel {p}_{\mathcal{B}})\right\}$$ where D_{KL}(p_{b}‖p_{ℬ}) is the KL divergence from p_{ℬ} (the probability distribution that is averaged across the committee ℬ) to p_{b} (the distribution predicted by a member h_{b} ∈ ℬ): (14) $${D}_{\text{KL}}({p}_{b}\parallel {p}_{\mathcal{B}})=\sum _{y\in \mathcal{Y}}p(y\mid \mathit{x};{h}_{b})\,\ln \frac{p(y\mid \mathit{x};{h}_{b})}{p(y\mid \mathit{x};\mathcal{B})}$$ For convenience, we summarize the five heuristics discussed above in Table 1.
| Abbreviation | Heuristic | Objective function |
| --- | --- | --- |
| confidence | Least confidence | $\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\max_{y\in \mathcal{Y}}p(y\mid \mathit{x};h)\right\}$ |
| entropy | Highest entropy | $\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\sum_{y\in \mathcal{Y}}p(y\mid \mathit{x};h)\log [p(y\mid \mathit{x};h)]\right\}$ |
| margin | Smallest margin | $\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\left(\max_{y\in \mathcal{Y}}p(y\mid \mathit{x};h)-\max_{z\in \mathcal{Y}\setminus \{{y}^{*}\}}p(z\mid \mathit{x};h)\right)\right\}$ |
| qbbmargin | Smallest QBB margin | $\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{-\left(\max_{y\in \mathcal{Y}}p(y\mid \mathit{x};\mathcal{B})-\max_{z\in \mathcal{Y}\setminus \{{y}^{*}\}}p(z\mid \mathit{x};\mathcal{B})\right)\right\}$ |
| qbbkl | Largest QBB KL | $\underset{\mathit{x}\in \mathcal{U}}{\mathrm{arg\,max}}\left\{\frac{1}{B}\sum_{b=1}^{B}{D}_{\text{KL}}({p}_{b}\parallel {p}_{\mathcal{B}})\right\}$ |
Note:
Notations: p(y | x; h) is the probability that an object with feature vector x has label y under classifier h, ℬ is the set of B classifiers {h_{1}, h_{2}, …, h_{B}}, 𝒴 is the set of possible labels, y^{*} is the most certain label, 𝒰 is the set of unlabeled instances, D_{KL}(p‖q) is the Kullback–Leibler divergence of p from q, and p_{ℬ} is the class distribution averaged across classifiers in ℬ. For consistency, we flip the sign of the score for heuristics that use minimization, so that we can always take the argmax to get the best candidate.
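The two QBB scores can be sketched from the class distributions predicted by the committee members for a single candidate. The function names and the two toy committees are hypothetical, and training the bagged committee itself is not shown.

```python
import math

def committee_average(member_probs):
    """p(y | x; B): class distribution averaged over the B committee members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[y] for p in member_probs) / n_members for y in range(n_classes)]

def qbb_margin_score(member_probs):
    """Sign-flipped margin of the averaged distribution."""
    top, second = sorted(committee_average(member_probs), reverse=True)[:2]
    return -(top - second)

def kl_divergence(p, q):
    """D_KL(p || q), skipping terms where p assigns zero probability."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def qbb_kl_score(member_probs):
    """Mean KL divergence of each member's distribution from the committee average."""
    avg = committee_average(member_probs)
    return sum(kl_divergence(p, avg) for p in member_probs) / len(member_probs)

# Two hypothetical committees of B = 2 members scoring one candidate each.
agreeing = [[0.5, 0.5], [0.5, 0.5]]      # both members are individually unsure
disagreeing = [[0.9, 0.1], [0.1, 0.9]]   # members are confident but contradict each other
print(qbb_margin_score(agreeing), qbb_margin_score(disagreeing))
print(qbb_kl_score(agreeing), qbb_kl_score(disagreeing))
```

Both committees have the same averaged distribution and therefore the same QBB margin, but only the second has a positive KL score, which illustrates how the two measures capture different notions of disagreement.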
Combining Suggestions
Out of the five heuristics discussed, which one should we use in practice when we would like to apply active learning to a particular problem? There have been some attempts in the literature to do a theoretical analysis of their performance. Proofs are, however, scarce, and the ones that are available normally hold only under restrictive assumptions. For example, Freund et al. (1997) showed that the query by committee algorithm (a slight variant of our two QBB heuristics) guarantees an exponential decrease in the prediction error with the training size, but only when there is no noise. In general, whether any of these heuristics is guaranteed to beat passive learning is still an open question.
Even though we do not know which one is the best, we can still combine suggestions from all of the heuristics. This can be thought of as the problem of prediction with expert advice, where each expert is an active learning heuristic. In this paper we explore two different approaches: we can either consider the advice of only one expert at each time step (with bandit algorithms), or we can aggregate the advice of all the experts (with social choice theory).
Combining suggestions with bandit theory
First let us turn our attention to the multi-armed bandit problem in probability theory (Berry & Fristedt, 1985). The colorful name originates from the situation where a gambler stands in front of a slot machine with R levers. When pulled, each lever gives out a reward according to some unknown distribution. The goal of the game is to come up with a strategy that can maximize the gambler’s lifetime rewards. In the context of active learning, each lever is a heuristic with a different ability to identify the candidate whose labeling information is most valuable.
The main problem in multi-armed bandits is the trade-off between exploring random heuristics and exploiting the best heuristic so far. There are many situations in which we find our previously held beliefs to be completely wrong. By always exploiting, we could miss out on the best heuristic. On the other hand, if we explore too much, it could take us a long time to reach the desired accuracy.
Bandit algorithms do not need to know the internal workings of the heuristics, but only the reward received from using any of them. At each time step, we receive a reward from a heuristic, and based on the history of all the rewards, the bandit algorithm can decide on which heuristic to pick next. Formally, we need to learn the function (15) $$b:{({J}_{\mathcal{R}}\times [0,1])}^{n}\to {J}_{\mathcal{R}}$$ where b is the bandit algorithm, the reward is normalized between 0 and 1, J_{ℛ} is the index set over the set of heuristics ℛ, and n is the time horizon.
What would be an appropriate reward w in this setting? We propose using the incremental increase in the performance on the test set after the candidate is added to the training set. This, of course, means that we need to keep a separate labeled test set around, just for the purpose of computing the rewards. We could, as is common practice in machine learning, use cross validation or bootstrap on ℒ_{T} to estimate the generalization performance. However, for simplicity of presentation we use a separate test set ℒ_{S}.
Figure 1 and Algorithm 2 outline how bandits can be used in pool-based active learning. The only difference between the bandit algorithms lies in the Select function that selects which heuristic to use, and the Update function that updates the algorithm’s selection parameters when receiving a new reward.
There have been some attempts to combine active learning suggestions in the literature. Baram, El-Yaniv & Luz (2004) used the EXP4 multi-armed bandit algorithm to automate the selection process. They proposed a reward called the classification entropy maximization, which can be shown to grow at a similar rate to the true accuracy in binary classification with support vector machines (SVMs). We will not compare our results directly with those in Baram, El-Yaniv & Luz (2004) since we would like to evaluate algorithms that can work with both binary and multiclass classification. Our experiments also use logistic regression which produces probability estimates directly, rather than SVMs which can only produce unnormalized scores. Hsu & Lin (2015) studied an improved version of EXP4, called EXP4.P, and used importance weighting to estimate the true classifier performance using only the training set. In this paper, we empirically compare the following four bandit algorithms: Thompson sampling, OCUCB, kl-UCB, and EXP3++.
Input: unlabeled set 𝒰, labeled training set ℒ_{T}, labeled test set ℒ_{S}, classifier h, desired training size n, set of active learning heuristics ℛ, and bandit algorithm b with two functions Select and Update.
while |ℒ_{T}| < n do
Select a heuristic r_{*} ∈ ℛ according to Select.
Select the most informative candidate x_{*} from 𝒰 using the chosen heuristic r_{*}(𝒰; h).
Ask the expert to label x_{*}. Call the label y_{*}.
Add the newly labeled example to the training set: ${\mathcal{L}}_{T}\leftarrow {\mathcal{L}}_{T}\cup \{({x}_{*},{y}_{*})\}$.
Remove the newly labeled example from the unlabeled set: $\mathcal{U}\leftarrow \mathcal{U}\setminus \{{x}_{*}\}$.
Retrain the classifier h(x) using ℒ_{T}.
Run the updated classifier on the test set ℒ_{S} to compute the increase in the performance w.
Update the parameters of b with Update(w).
end
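The control flow of Algorithm 2 can be sketched as below. Several things here are assumptions made for illustration: the bandit is any object with `select()` and `update(arm, reward)` methods (the arm index is passed explicitly, unlike Update(w) in the pseudocode), `UniformBandit` is a placeholder that is not one of the four algorithms compared in this paper, and the reward is the raw change in test performance without the normalization to [0, 1].

```python
import random

class UniformBandit:
    """Placeholder bandit (hypothetical): picks a heuristic uniformly at random
    and ignores the rewards. Used only to exercise the loop."""
    def __init__(self, n_arms, seed=0):
        self.n_arms = n_arms
        self.rng = random.Random(seed)
    def select(self):
        return self.rng.randrange(self.n_arms)
    def update(self, arm, reward):
        pass

def bandit_active_learning(unlabeled, labeled, test_set, train, evaluate,
                           heuristics, bandit, oracle, n):
    """Sketch of Algorithm 2: a bandit chooses which heuristic scores the pool."""
    unlabeled, labeled = list(unlabeled), list(labeled)
    h = train(labeled)
    performance = evaluate(h, test_set)
    while len(labeled) < n and unlabeled:
        arm = bandit.select()                       # choose a heuristic r*
        score = heuristics[arm]
        best = max(range(len(unlabeled)), key=lambda i: score(h(unlabeled[i])))
        x_star = unlabeled.pop(best)                # remove x* from U
        labeled.append((x_star, oracle(x_star)))    # label it and add to L_T
        h = train(labeled)                          # retrain
        new_performance = evaluate(h, test_set)
        bandit.update(arm, new_performance - performance)  # reward w
        performance = new_performance
    return h, labeled

# Toy usage with stand-in components (all hypothetical).
def train(labeled):
    return lambda x: [0.5, 0.5]      # a classifier that is always unsure

def evaluate(h, test_set):
    return 0.0                       # constant performance, so every reward is 0

heuristics = [lambda p: -max(p), lambda p: 0.0]
h, labeled = bandit_active_learning(
    unlabeled=[1, 2, 3, 4], labeled=[(0, 0)], test_set=[], train=train,
    evaluate=evaluate, heuristics=heuristics, bandit=UniformBandit(2),
    oracle=lambda x: 1, n=3)
print(labeled)
```

Keeping the bandit behind a two-method interface mirrors the observation in the text that the bandit algorithms differ only in how they select a heuristic and how they update after a reward.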
Thompson sampling
The oldest bandit algorithm is Thompson sampling (Thompson, 1933) which solves the exploration-exploitation trade-off from a Bayesian perspective.
Let W_{i} be the reward of heuristic r_{i} ∈ ℛ. Observe that even with the best heuristic, we still might not score perfectly due to having a poor classifier trained on finite data. Conversely, a bad heuristic might be able to pick an informative candidate due to pure luck. Thus there is always a certain level of randomness in the reward received. Let us treat the reward W_{i} as a normally distributed random variable with mean 𝜈_{i} and variance ${\tau }_{i}^{2}$: (16) $$({W}_{i}\mid {\nu }_{i})\sim \mathcal{N}({\nu }_{i},{\tau }_{i}^{2})$$
If we knew both 𝜈_{i} and τ_{i}^{2} for all heuristics, the problem would become trivially easy since we just need to always use the heuristic that has the highest mean reward. In practice, we do not know the true mean of the reward 𝜈_{i}, so let us add a second layer of randomness and assume that the mean itself follows a normal distribution: (17) $${\nu }_{i}\sim \mathcal{N}({\mu }_{i},{\sigma }_{i}^{2})$$
To make the problem tractable, let us assume that the variance τ_{i}^{2} in the first layer is a known constant. The goal now is to find a good algorithm that can estimate μ_{i} and σ_{i}^{2}.
We start with a prior on μ_{i} and σ_{i}^{2} for each heuristic r_{i}. The choice of prior does not usually matter in the long run. Since initially we do not have any information about the performance of each heuristic, the appropriate prior value for μ_{i} is 0, i.e., there is no evidence (yet) that any of the heuristics offers an improvement to the performance.
In each round, we draw a random sample 𝜈_{i}′ from the normal distribution $\mathcal{N}({\mu }_{i},{\sigma }_{i}^{2})$ for each i and select heuristic r_{*} that has the highest sampled value of the mean reward: (18) $${r}_{*}=\underset{i}{\mathrm{arg\,max}}\,{\nu }_{i}^{\prime }$$
We then use this heuristic to select the object that is deemed to be the most informative, add it to the training set, and retrain the classifier. Next we use the updated classifier to predict the labels of objects in the test set. Let w be the reward observed. We now have a new piece of information that we can use to update our prior belief about the mean μ_{*} and the variance σ_{*}^{2} of the mean reward. Using Bayes’ theorem, we can show that the posterior distribution of the mean reward remains normal, (19) $$({\nu }_{*}\mid {W}_{*}=w)\sim \mathcal{N}({\mu }_{*}^{\prime },{{\sigma }_{*}^{\prime }}^{2})$$ with the following new mean and variance: (20) $${\mu }_{*}^{\prime }=\frac{{\mu }_{*}{\tau }_{*}^{2}+w{\sigma }_{*}^{2}}{{\sigma }_{*}^{2}+{\tau }_{*}^{2}}\qquad {{\sigma }_{*}^{\prime }}^{2}=\frac{{\sigma }_{*}^{2}{\tau }_{*}^{2}}{{\sigma }_{*}^{2}+{\tau }_{*}^{2}}$$
Algorithm 3 summarizes the Select and Update functions used in Thompson sampling.
function Select()
for i ∈ {1, 2, …, R} do
𝜈_{i}′ ← draw a sample from $\mathcal{N}({\mu }_{i},{\sigma }_{i}^{2})$
end
Select the heuristic with the highest sampled value: ${r}_{*}\leftarrow \underset{i}{\mathrm{arg\,max}}\,{\nu }_{i}^{\prime }$
function Update(w)
${\mu }_{*}\leftarrow \frac{{\mu }_{*}{\tau }_{*}^{2}+w{\sigma }_{*}^{2}}{{\sigma }_{*}^{2}+{\tau }_{*}^{2}}\qquad {\sigma }_{*}^{2}\leftarrow \frac{{\sigma }_{*}^{2}{\tau }_{*}^{2}}{{\sigma }_{*}^{2}+{\tau }_{*}^{2}}$
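Algorithm 3 translates almost line by line into code. The sketch below is one possible rendering, not the implementation used in the experiments: the class name is hypothetical, the prior variance and the known reward variance τ² both default to 1.0 purely for illustration (the text only fixes the prior mean at 0), and `update` takes the arm index explicitly.

```python
import random

class GaussianThompsonSampling:
    """Thompson sampling with normal rewards of known variance tau^2."""

    def __init__(self, n_arms, prior_mean=0.0, prior_var=1.0, reward_var=1.0, seed=None):
        self.mu = [prior_mean] * n_arms       # mu_i: mean of the belief about nu_i
        self.sigma2 = [prior_var] * n_arms    # sigma_i^2: variance of that belief
        self.tau2 = reward_var                # tau^2: assumed known and shared
        self.rng = random.Random(seed)

    def select(self):
        """Sample nu'_i from N(mu_i, sigma_i^2) for each arm and return the argmax."""
        samples = [self.rng.gauss(m, s2 ** 0.5) for m, s2 in zip(self.mu, self.sigma2)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, w):
        """Posterior update of Eq. (20) for the arm that produced reward w."""
        mu, s2, t2 = self.mu[arm], self.sigma2[arm], self.tau2
        self.mu[arm] = (mu * t2 + w * s2) / (s2 + t2)
        self.sigma2[arm] = (s2 * t2) / (s2 + t2)

ts = GaussianThompsonSampling(n_arms=2, seed=0)
ts.update(0, 0.5)   # heuristic 0 produced a reward of 0.5
print(ts.mu, ts.sigma2)
```

After one observation the belief about the first heuristic moves halfway towards the observed reward and its variance halves, while the untouched heuristic keeps its prior; this shrinking variance is what gradually shifts the sampler from exploration to exploitation.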
Upper confidence bounds
Next we consider the Upper Confidence Bound (UCB) algorithms which use the principle of “optimism in the face of uncertainty.” In choosing which heuristic to use, we first estimate the upper bound of the reward (that is, we make an optimistic guess) and pick the one with the highest bound. If our guess turns out to be wrong, the upper bound of the chosen heuristic will decrease, making it less likely to get selected in the next iteration.
There are many different algorithms in the UCB family, e.g., UCB1-TUNED and UCB2 (Auer, Cesa-Bianchi & Fischer, 2002a), VUCB (Audibert, Munos & Szepesvári, 2009), OCUCB (Lattimore, 2015), and kl-UCB (Cappé et al., 2013). They differ only in the way the upper bound is calculated. In this paper, we only consider the last two. In Optimally Confident UCB (OCUCB), Lattimore (2015) suggests that we pick the heuristic that maximizes the following upper bound: (21) $${r}_{*}=\underset{i}{\mathrm{arg\,max}}\left(\overline{{w}_{i}}+\sqrt{\frac{\alpha }{{T}_{i}(t)}\ln \left(\frac{\psi n}{t}\right)}\right)$$ where $\overline{{w}_{i}}$ is the average of the rewards from r_{i} that we have observed so far, t is the time step, T_{i}(t) is the number of times we have selected heuristic r_{i} before step t, and n is the maximum number of steps that we are going to take. There are two tunable parameters, α and ψ, which the author suggests setting to 3 and 2, respectively.
In kl-UCB, Cappé et al. (2013) suggest that we can instead consider the KL divergence between the distribution of the current estimated reward and that of the upper bound. In the case of normally distributed rewards with known variance σ^{2}, the chosen heuristic would be (22) $${r}_{*}=\underset{i}{\mathrm{arg\,max}}\left(\overline{{w}_{i}}+\sqrt{2{\sigma }^{2}\frac{\ln ({T}_{i}(t))}{t}}\right)$$
Algorithms 4 and 5 summarize these two UCB approaches. Note that the size of the reward w is not used in Update (w) of UCB, except to select the best arm.
function Select()
${r}_{*}\leftarrow \underset{i}{\mathrm{arg\,max}}\left(\overline{{w}_{i}}+\sqrt{\frac{3}{{T}_{i}(t)}\ln \left(\frac{2n}{t}\right)}\right)$
function Update(w)
$t\leftarrow t+1$
${T}_{*}(t)\leftarrow {T}_{*}(t-1)+1$
function Select()
${r}_{*}\leftarrow \underset{i}{\mathrm{arg\,max}}\left(\overline{{w}_{i}}+\sqrt{2{\sigma }^{2}\frac{\ln ({T}_{i}(t))}{t}}\right)$
function Update(w)
$t\leftarrow t+1$
${T}_{*}(t)\leftarrow {T}_{*}(t-1)+1$
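The OCUCB selection rule of Eq. (21) can be sketched as follows. Some details are assumptions added to make the sketch runnable and are not stated in the text: every heuristic is tried once before the bound is used (so that T_i(t) > 0), the time step starts at 1, the logarithm's argument is clamped at 1 in case t grows beyond ψn, and `update` takes the arm index explicitly and also accumulates the reward needed for the running average.

```python
import math

class OCUCB:
    """Sketch of Optimally Confident UCB with the suggested alpha = 3, psi = 2."""

    def __init__(self, n_arms, horizon, alpha=3.0, psi=2.0):
        self.n = horizon                 # n: maximum number of steps
        self.alpha, self.psi = alpha, psi
        self.t = 1                       # time step (assumed to start at 1)
        self.counts = [0] * n_arms       # T_i(t): times each heuristic was chosen
        self.sums = [0.0] * n_arms       # running sum of rewards per heuristic

    def upper_bound(self, i):
        """Average reward plus the exploration bonus of Eq. (21)."""
        mean = self.sums[i] / self.counts[i]
        ratio = max(1.0, self.psi * self.n / self.t)   # guard: keep the log non-negative
        return mean + math.sqrt(self.alpha / self.counts[i] * math.log(ratio))

    def select(self):
        # Assumption: play each heuristic once first so every count is positive.
        for i, count in enumerate(self.counts):
            if count == 0:
                return i
        return max(range(len(self.counts)), key=self.upper_bound)

    def update(self, arm, w):
        self.t += 1
        self.counts[arm] += 1
        self.sums[arm] += w

bandit = OCUCB(n_arms=2, horizon=10)
first = bandit.select()
bandit.update(first, 1.0)      # heuristic 0 earns a reward of 1.0
second = bandit.select()
bandit.update(second, 0.0)     # heuristic 1 earns nothing
third = bandit.select()
print(first, second, third)
```

Once both heuristics have been tried the same number of times their exploration bonuses are equal, so the one with the higher average reward is chosen; the bonus only changes the ordering when the counts differ.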
EXP3++
The exponential-weight algorithm for exploration and exploitation (EXP3) was first developed by Auer et al. (2002b) to solve the non-stochastic bandit problem where we make no statistical assumptions about the reward distribution. This is also often known as the adversarial setting, where we have an adversary who generates an arbitrary sequence of rewards for each heuristic in advance. Like Thompson sampling, the algorithm samples from a probability distribution at each step to pick a heuristic. Here however, we construct the distribution with exponential weighting (hence the name EXP3). We shall test Seldin & Slivkins (2014)’s EXP3++ algorithm (see Algorithm 6). This is a generalization of the original EXP3 and it has been shown to perform well in both the stochastic (where the reward of each heuristic follows an unknown reward distribution) and the adversarial regime.
function Select()
$\beta =\frac{1}{2}\sqrt{\frac{\ln R}{tR}}$
for i ∈ {1, 2, …, R} do
${\xi }_{i}=\frac{18\,\ln {(t)}^{2}}{t\,\min {\left(1,\frac{1}{t}({L}_{i}-\min (L))\right)}^{2}}$
${\epsilon }_{i}=\min \left(\frac{1}{2R},\beta ,{\xi }_{i}\right)$
${\rho }_{i}=\frac{{e}^{-\beta {L}_{i}}}{\sum _{j}{e}^{-\beta {L}_{j}}}$
end
${r}_{*}\leftarrow $ draw a sample from ℛ with probability distribution ρ.
function Update(w)
$t\leftarrow t+1$
${T}_{*}(t)\leftarrow {T}_{*}(t-1)+1$
${L}_{*}\leftarrow {L}_{*}+\frac{1-w}{(1-\sum _{j}{\epsilon }_{j}){\rho }_{*}+{\epsilon }_{*}}$
Combining suggestions with social choice theory
A drawback of the above bandit methods is that at each iteration, we could only use one suggestion from one particular heuristic. EXP4 and EXP4.P algorithms can take advice from all heuristics by maintaining a weight on each of them. However, being a bandit method, they require designing a reward scheme. If the reward is the performance on a test set, we would need to keep around a separate subset of the data, which is expensive and sometimes impossible to obtain in practice. This leads us to social choice theory, which can combine suggestions like EXP4 and EXP4.P, while not needing the concept of a reward. Originally developed by political scientists like Nicolas de Condorcet and JeanCharles de Borda, this field of study is concerned with how we aggregate preferences of a group of people to determine, for example, the winner in an election (List, 2013). It has the nice property that everyone (or in our context, every active learning heuristic) has a voice.
For each heuristic, we assign a score to every candidate with the score function s(x, h) like before. We are neither interested in the actual raw scores nor the candidate with the highest score. Instead, we only need a ranking of the candidates, which is achieved by a function $k(s,\mathcal{U})$ that provides a ranking of the unlabeled examples according to their scores. For example, k could assign the candidate with the highest score a rank of 1, the next best candidate a rank of 2, and so on. An aggregation function c will then combine all the rankings into a combined ranking, $$c:\sigma(J_{\mathcal{U}})^{R}\to \sigma(J_{\mathcal{U}}) \tag{23}$$ where $\sigma(J_{\mathcal{U}})$ is a permutation over the index set of the unlabeled pool 𝒰 and R is the number of heuristics. From these we can pick the highest-ranked candidate to annotate. See Table 2 for an example.
Score  s(x; h)  0.1  0.9  0.3  0.8 
Rank  k(s, U)  4  1  3  2 
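A ranking function of this kind can be sketched in a few lines of NumPy. The function name `rank_candidates` is hypothetical, and breaking ties by position in the pool is an assumption that the text does not specify.

```python
import numpy as np


def rank_candidates(scores):
    """Return ranks where the highest score gets rank 1 (ties broken by position)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")    # candidate indices, best first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks


ranks = rank_candidates([0.1, 0.9, 0.3, 0.8])     # the scores shown in Table 2
```

Applied to the scores in Table 2, this reproduces the ranks 4, 1, 3, 2.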
The main difference between this approach and the bandit algorithms is that we do not consider the reward history when combining the rankings. Here each heuristic is assumed to always have an equal weight. A possible extension, which is not considered in this paper, is to use the past performance to reweight the heuristics before aggregating at each step. Figure 2 and Algorithm 7 provide an overview of how social choice theory is used in pool-based active learning.
Input: unlabeled set 𝒰, labeled training set ℒ_{T}, classifier h, set of active learning suggestions ℛ, ranking function k, and rank aggregator c. 
repeat: 
for r ∈ ℛ do 
Rank all the candidates in 𝒰 with k. 
end 
Aggregate all the rankings into one ranking using the aggregator c. 
Select the highest-ranked candidate x_{*} from 𝒰.
Ask the expert to label x_{*}. Call the label y_{*}. 
Add the newly labeled example to the training set: ${\mathcal{L}}_{T}\leftarrow {\mathcal{L}}_{T}\cup \{({x}_{*}\mathrm{,}{y}_{*})\}$. 
Remove the newly labeled example from the unlabeled set: $\mathcal{U}\leftarrow \mathcal{U}\backslash \left\{{x}_{*}\right\}$. 
Retrain the classifier h(x) using ℒ_{T}. 
until we have enough training examples. 
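The loop above can be sketched in Python as follows. All names are hypothetical: `fit` stands in for retraining the classifier, `oracle` for the human expert, each heuristic returns one score per candidate (higher meaning more informative), and `borda_winner` is one possible aggregator. The toy threshold classifier exists only to keep the example self-contained and is not the model used in the experiments.

```python
import numpy as np


def borda_winner(rankings):
    """Position of the candidate with the most Borda points."""
    n = len(rankings[0])
    points = np.zeros(n)
    for order in rankings:                       # order: candidate positions, best first
        points[order] += np.arange(n, 0, -1)     # best gets n points, worst gets 1
    return int(np.argmax(points))


def active_learning_loop(X_pool, oracle, fit, heuristics, aggregate, initial, n_queries):
    labeled = list(initial)
    labels = [oracle(i) for i in labeled]
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    model = fit(X_pool[labeled], np.array(labels))
    for _ in range(min(n_queries, len(unlabeled))):
        candidates = X_pool[unlabeled]
        # One ranking per heuristic, best candidate first.
        rankings = [np.argsort(-h(model, candidates), kind="stable") for h in heuristics]
        chosen = unlabeled.pop(aggregate(rankings))
        labeled.append(chosen)
        labels.append(oracle(chosen))            # ask the expert for the label
        model = fit(X_pool[labeled], np.array(labels))
    return model, labeled


# Toy usage on 1-D data whose true boundary is at 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = (X[:, 0] > 0).astype(int)


def fit(X_train, y_train):
    # Stand-in classifier: a threshold halfway between the two class means.
    return 0.5 * (X_train[y_train == 0].mean() + X_train[y_train == 1].mean())


def margin_like(threshold, X_cand):
    return -np.abs(X_cand[:, 0] - threshold)     # closest to the boundary scores highest


def passive(threshold, X_cand):
    return rng.random(len(X_cand))               # random scores, i.e., passive learning


start = [int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]   # one example per class
model, labeled = active_learning_loop(
    X, lambda i: y[i], fit, [margin_like, passive], borda_winner, start, n_queries=10)
```

Passing the aggregator as an argument means the same loop can be reused with any of the aggregation rules discussed next.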
The central question in social choice theory is how we can come up with a good preference aggregation rule. We shall examine three aggregation rules: Borda count, the geometric mean, and the Schulze method.
In the simplest approach, Borda count, we assign an integer point to each candidate. The lowest-ranked candidate receives a point of 1, and each candidate receives one more point than the candidate below. To aggregate, we simply add up all the points each candidate receives from every heuristic. The candidate with the most points is declared the winner and is to be labeled next. We can think of Borda count, then, as ranking the candidates according to the arithmetic mean of their points.
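As a small illustration, the following sketch computes Borda points for the three rankings used in Table 3(a). The function name `borda_count` and the representation of a ranking as a string of candidate labels are choices made for this example only.

```python
def borda_count(rankings):
    """Sum Borda points: last place earns 1 point, first place earns len(ranking)."""
    totals = {}
    for ranking in rankings:                 # each ranking lists candidates, best first
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            totals[candidate] = totals.get(candidate, 0) + (n - position)
    return totals


rankings = ["BACD", "ACBD", "BDCA"]          # the three heuristics of Table 3(a)
totals = borda_count(rankings)
winner = max(totals, key=totals.get)
```

The totals agree with the Borda column of Table 3(b), with candidate B collecting the most points.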
An alternative approach is to use the geometric mean, where instead of adding up the points, we multiply them. Bedö & Ong (2016) showed that the geometric mean maximizes the Spearman correlation between the ranks. Note that this method requires the ranks to be scaled so that they lie strictly between 0 and 1. This can be achieved by simply dividing the ranks by (U + 1), where U is the number of candidates.
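The geometric mean can be sketched in the same way. This example assumes that the quantities being multiplied are the Borda-style points (higher is better), scaled by the number of candidates plus one as described above. The function name `geometric_mean_scores` is hypothetical, and the product is accumulated in log space to avoid underflow when there are many heuristics.

```python
import math


def geometric_mean_scores(rankings):
    """Geometric mean of scaled points per candidate (rankings list best first)."""
    log_totals = {}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scaled = (n - position) / (n + 1)        # lies strictly between 0 and 1
            log_totals[candidate] = log_totals.get(candidate, 0.0) + math.log(scaled)
    return {c: math.exp(s / len(rankings)) for c, s in log_totals.items()}


scores = geometric_mean_scores(["BACD", "ACBD", "BDCA"])   # rankings of Table 3(a)
winner = max(scores, key=scores.get)
```

Scaling and taking the cube root do not change the order of the candidates, so the result is consistent with the unscaled products in Table 3(b): B wins, and A and C tie.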
The third approach we consider is the Schulze method (Schulze, 2011). Out of the three methods considered, this is the only one that fulfills the Condorcet criterion, i.e., the winner chosen by the algorithm is also the winner when compared individually with each of the other candidates. However, the Schulze method is more computationally intensive since it requires examining all pairs of candidates. First we compute the number of heuristics that prefer candidate x_{i} to candidate x_{j}, for all possible pairs (x_{i}, x_{j}). Let us call this d(x_{i}, x_{j}). Let us also define a path from candidate x_{i} to x_{j} as the sequence of candidates, {x_{i}, x_{1}, x_{2}, …, x_{j}}, that starts with x_{i} and ends with x_{j}, where, as we move along the path, the number of heuristics that prefer the current candidate over the next candidate must be strictly decreasing. Intuitively, the path is the rank of a subset of candidates, where x_{i} is the highest-ranked candidate and x_{j} is the lowest-ranked.
Associated with each path is a strength p, which is the minimum of d(x_{i}, x_{j}) for all consecutive x_{i} and x_{j} along the path. The core part of the algorithm involves finding the path of maximal strength from each candidate to every other. Let us call p(x_{i}, x_{j}) the strength of the strongest path between x_{i} and x_{j}. Candidate x_{i} is a potential winner if $p(x_{i}, x_{j}) \ge p(x_{j}, x_{i})$ for all other x_{j}. This problem has a similar flavor to the problem of finding the shortest path. In fact, the implementation uses a variant of the Floyd–Warshall algorithm to find the strongest path. This is the most efficient implementation that we know of, taking cubic time in the number of candidates.
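The strongest-path computation can be sketched as below. This follows the commonly used "winning votes" formulation of the Schulze method, in which a direct link from x to y counts only when more heuristics prefer x to y than the reverse. That is an assumption about the variant, so the strengths for losing pairs can differ from those listed in Table 3(c), although the winner is the same. The function names are hypothetical.

```python
def schulze_strengths(rankings):
    """Strongest-path strengths p[x][y] from rankings that list candidates best first."""
    candidates = sorted(rankings[0])
    # d[x][y]: number of rankings that place x above y.
    d = {x: {y: 0 for y in candidates} for x in candidates}
    for ranking in rankings:
        for i, x in enumerate(ranking):
            for y in ranking[i + 1:]:
                d[x][y] += 1
    # Start from direct pairwise wins, then widen paths as in Floyd-Warshall.
    p = {x: {y: (d[x][y] if d[x][y] > d[y][x] else 0) for y in candidates}
         for x in candidates}
    for k in candidates:                         # k is the intermediate candidate
        for i in candidates:
            for j in candidates:
                if len({i, j, k}) == 3:
                    p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))
    return p


def schulze_winners(rankings):
    p = schulze_strengths(rankings)
    return [x for x in p if all(p[x][y] >= p[y][x] for y in p if y != x)]


rankings = ["BACD", "ACBD", "BDCA"]              # the three heuristics of Table 3(a)
p = schulze_strengths(rankings)
winners = schulze_winners(rankings)
```

The triple loop is what gives the cubic running time, and the table `p` is the quadratic-size structure that dominates memory use.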
We end this section with a small illustration of how the three aggregation algorithms work in Table 3.
(a) An example of how the three heuristics rank four candidates A, B, C, and D. For instance, heuristic r_{1} considers B to be the highest-ranked candidate, followed by A, C, and D.

Heuristic  Ranking 
r_{1}  B A C D 
r_{2}  A C B D 
r_{3}  B D C A 
(b) Aggregated ranking with Borda count and geometric mean. The scores are determined by the relative ranking in each heuristic. For example, A is ranked second by r_{1}, first by r_{2}, and last by r_{3}, thus giving us a score of 3, 4, and 1, respectively. In both methods, candidate B receives the highest aggregated score.

Candidate  Borda count  Geometric mean 
A  3 + 4 + 1 = 8  3 × 4 × 1 = 12 
B  4 + 2 + 4 = 10  4 × 2 × 4 = 32 
C  2 + 3 + 2 = 7  2 × 3 × 2 = 12 
D  1 + 1 + 3 = 5  1 × 1 × 3 = 3 
(c) Aggregated ranking with the Schulze method. The table shows the strongest path strength p(x_{i}, x_{j}) between all pairs of candidates. For example, p(B, D) = 3 because the path B → D is the strongest path from B to D, where three heuristics prefer B over D. Candidate B is the winner since p(B, A) > p(A, B), p(B, C) > p(C, B), and p(B, D) > p(D, B).  

From/To  A  B  C  D 
A  –  1  2  2 
B  2  –  2  3 
C  1  1  –  2 
D  1  0  1  –
Experimental Protocol
We use 11 classification datasets taken from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/) (Lichman, 2013), with a large multiclass classification dataset which we extracted from the SDSS project (DOI 10.5281/zenodo.58500) (Alam et al., 2015). The code for the experiments can be found on our GitHub repository (https://github.com/chengsoonong/mclasssky). Table 4 shows the size and the number of classes in each dataset, along with the proportion of the samples belonging to the majority class and the maximum achievable performance using logistic regression. These datasets were chosen such that we have an equal number of binary and multiclass datasets, and a mixture of small and large datasets.
Dataset  Size  No. of classes  No. of features  Majority class (%)  Max performance (MPBA) (%) 

Glass  214  6  10  33  65 
Ionosphere  351  2  34  64  89 
Iris  150  3  4  33  90 
Magic  19,020  2  11  65  84 
Miniboone  129,596  2  50  72  88 
Pageblock  5,473  5  10  90  79 
Pima  733  2  8  66  71 
SDSS  2,801,002  3  11  61  90 
Sonar  208  2  60  53  78 
Vehicle  846  4  18  26  81 
Wine  178  3  13  40  94 
WPBC  194  2  34  76  58 
Note:
The following datasets are from the UCI Machine Learning Repository: glass, ionosphere, iris, magic, miniboone, pageblock, pima, sonar, vehicle, wine, and wpbc. In particular, the vehicle dataset comes from the Turing Institute, Glasgow, Scotland. The sdss dataset was extracted from Data Release 12 of SDSS-III.
For each dataset, we use Scikit-learn (Pedregosa et al., 2011) to train a logistic regression model using a 10-fold stratified shuffled cross-validation. Here “stratified” means that the proportion of the classes remains constant in each split. We standardize all features to have zero mean and unit variance. Although all examples have already been labeled, we simulate the active learning task by assuming that certain examples do not have any labels. For each fold, the unlabeled pool size is 70% of data up to a maximum of 10,000 examples, and the test pool consists of the remaining examples up to a maximum of 20,000. We assume all test examples are labeled. We initialize the classifier by labeling 10 random instances and using them as the initial training set. The heuristics are fast enough such that we can assign a score to every unlabeled instance at every time step. We use logistic regression with a Gaussian kernel approximation and an L2 regularizer. In the binary case, the loss function is $$L=\frac{1}{2}\boldsymbol{\theta}^{T}\boldsymbol{\theta}+C\sum_{i=1}^{n}\ln\left(1+\exp\left(-y_{i}\,\boldsymbol{\theta}^{T}f(x_{i})\right)\right) \tag{24}$$ where x_{i} is the feature vector of the ith example, y_{i} ∈ {0, 1} is the label of x_{i}, and n is the training size. The term $\frac{1}{2}\boldsymbol{\theta}^{T}\boldsymbol{\theta}$ is the regularization term to ensure that the weight vector θ is not too large, and C is a regularization hyperparameter in [10^{−6}, 10^{6}] which we find using grid search. To speed up the training time while using the Gaussian kernel, we approximate the feature map of the kernel with Random Kitchen Sinks (Rahimi & Recht, 2008), transforming the raw features x_{i} into a fixed 100-dimensional feature vector f (x_{i}). In the multiclass case, we use the One-vs-Rest strategy, where for every class we build a binary classifier that determines whether a particular example belongs to that class or not.
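To make the kernel approximation concrete, here is a minimal NumPy sketch of random Fourier features in the style of Rahimi & Recht for the Gaussian kernel exp(−γ‖x − y‖²). It is not the code used in the experiments. The function name, the default seed, and the value of γ in the usage line are assumptions for illustration.

```python
import numpy as np


def random_fourier_features(X, gamma, n_components=100, seed=0):
    """Map X so that z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Frequencies are drawn from the Fourier transform of the Gaussian kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)    # random phases
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)


X_demo = np.random.default_rng(1).normal(size=(6, 3))
features = random_fourier_features(X_demo, gamma=0.5)       # 100-dimensional f(x)
```

A linear model trained on `features` then behaves approximately like a kernel model, at a cost that grows with the number of components rather than with the number of training examples. The approximation becomes more accurate as `n_components` increases.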
For the QBB algorithms, we train a committee of seven classifiers, where each member is given a sample of at most 100 examples that have already been labeled.
For the bandit algorithms, we use the increase in the MPBA on the test set as the reward. The MPBA can be thought of as the expected value of the average recall, where we treat the recall as a random variable that follows a Beta distribution. Compared to the raw accuracy score, this metric takes into account class imbalance. This is because we first calculate the recall in each class and then take the average, thus giving each class an equal weight. Refer to Appendix A for the derivation of the MPBA, which extends Brodersen et al. (2010)’s formula from the binary to the multiclass setting.
In total, we test 17 query strategies. This includes passive learning, eight active learning heuristics, five bandit algorithms, and three aggregation methods. The bandit algorithms include the four described in “Combining Suggestions with Bandit Theory” and a baseline called explore which simply selects a random heuristic at each time step. In other words, we ignore the rewards and explore 100% of the time. For all bandit and rank aggregation methods, we take advice from six representative experts: passive, confidence, margin, entropy, qbbmargin, and qbbkl. We have not explored how adding the heuristics with information density weighting to the bandits would impact the performance. See Table 5 for a list of abbreviations associated with the methods.
Abbreviation  Type  Description 

passive  Heuristic  Passive learning 
confidence  Heuristic  Least confidence heuristic 
wconfidence  Heuristic  Least confidence heuristic with information density weighting 
margin  Heuristic  Smallest margin heuristic 
wmargin  Heuristic  Smallest margin heuristic with information density weighting 
entropy  Heuristic  Highest entropy heuristic 
wentropy  Heuristic  Highest entropy heuristic with information density weighting 
qbbmargin  Heuristic  Smallest QBB margin heuristic 
qbbkl  Heuristic  Largest QBB KL-divergence heuristic
explore  Bandit  Bandit algorithm with 100% exploration 
thompson  Bandit  Thompson sampling 
ocucb  Bandit  Optimally confident UCB algorithm
klucb  Bandit  kl-UCB algorithm
exp3++  Bandit  EXP3++ algorithm 
borda  Aggregation  Aggregation with Borda count 
geometric  Aggregation  Aggregation with the geometric mean 
schulze  Aggregation  Aggregation with the Schulze method 
Given that there are 12 datasets, each with 17 learning curves, we need a measure that can summarize in one number how well a particular heuristic or policy does. Building on Baram, El-Yaniv & Luz (2004)’s deficiency measure, we define the strength of an active learner or a combiner relative to passive learning as $$\text{Strength}(h;m)=1-\frac{\sum_{t=1}^{n}\left(m(\max)-m(h,t)\right)}{\sum_{t=1}^{n}\left(m(\max)-m(\text{passive},t)\right)} \tag{25}$$ where m is a chosen metric (e.g., accuracy rate, MPBA), m(max) is the best possible performance^{1}, and m(h, t) is the performance achieved using the first t examples selected by heuristic h. We can think of the summation as the area between the best possible performance line and the learning curve of h. The better the heuristic is, the faster it would approach this maximum line, and thus the smaller the area. Finally, so that we can compare the performance across datasets, we normalize the measure with the area obtained from using just passive learning. Refer to Fig. 3 for a visualization of the strength measure.
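The strength measure is straightforward to compute from two learning curves. In the sketch below, the function name and the toy curves are illustrative only. Each curve holds the metric value after t = 1, …, n labeled examples.

```python
import numpy as np


def strength(curve, passive_curve, m_max):
    """Strength of a learner relative to passive learning, as in Eq. (25)."""
    curve = np.asarray(curve, dtype=float)
    passive_curve = np.asarray(passive_curve, dtype=float)
    return 1.0 - np.sum(m_max - curve) / np.sum(m_max - passive_curve)


passive_curve = [0.5, 0.6, 0.7, 0.8]
active_curve = [0.6, 0.7, 0.8, 0.8]
value = strength(active_curve, passive_curve, m_max=0.9)
```

Here the area above the passive curve is 1.0 and the area above the active curve is 0.7, giving a strength of 0.3. A value of zero means no improvement over passive learning, and a negative value means the learner is worse.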
We evaluate the algorithm performance with two metrics: the accuracy score and the MPBA. The accuracy score is the percentage of instances in the test set where the predicted label matches the true label. If a dataset has a dominant class, then the accuracy score of instances within that class will also dominate the overall accuracy score. The MPBA, on the other hand, puts an equal weight on each class and thus favors algorithms that can predict the label of all classes equally well.
The heuristics with information density weighting and Thompson sampling have a few additional hyperparameters. To investigate the effect of these hyperparameters, we pick one multiclass dataset (glass) and one binary dataset (ionosphere). Both of these are small enough to allow us to iterate through many hyperparameter values quickly. With wconfidence, wmargin, and wentropy, we set γ in the Gaussian kernel to be the inverse of the 95th percentile of all pairwise distances. This appears to work well, as shown in Fig. 4. For thompson, the prior values for μ, σ^{2} and the value of τ^{2} seem to have little effect on the final performance (see Fig. 5). We set the initial μ to 0.5, the initial σ^{2} to 0.02, and τ^{2} to 0.02.
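The choice of γ can be sketched as follows. The text does not state which distance is used, so Euclidean distance is an assumption here, as is the function name.

```python
import numpy as np


def gamma_from_pairwise_distances(X, percentile=95):
    """Inverse of the given percentile of all pairwise Euclidean distances."""
    diffs = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = distances[np.triu_indices(len(X), k=1)]   # each unordered pair once
    return 1.0 / np.percentile(upper, percentile)


X_demo = np.array([[0.0], [1.0], [3.0]])              # pairwise distances 1, 3, and 2
gamma = gamma_from_pairwise_distances(X_demo)
```

This version materializes all pairwise differences, so for a large pool it would be sensible to estimate the percentile from a random subsample instead.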
Results
Figures 6 and 7 show the strengths of all methods that we consider, while Figs. 8 and 9 provide selected learning curves. Plots for the six small datasets with fewer than 500 examples (glass, ionosphere, iris, sonar, wine, and wpbc) are shown in Figs. 6 and 8. Plots for the two medium-sized datasets (pima and vehicle) and the four large datasets (magic, miniboone, pageblocks, and sdss) are shown in Figs. 7 and 9. Each figure contains two subfigures, one reporting the raw accuracy score and the other showing the MPBA score.
Active learning methods generally beat passive learning in four of the six small datasets—glass, ionosphere, iris, and wine. This can be seen by the fact that the boxplots are mostly above the zero line in Fig. 6. For sonar and wpbc, the results are mixed—active learning has little to no effect here. The wpbc dataset is particularly noisy—our classifier cannot achieve an MPBA score greater than 60% (see Fig. 8). Thus it is not surprising that active learning does not perform well here since there is not much to learn to begin with.
The advantage of active learning becomes more apparent with the larger datasets like magic, miniboone, pageblocks, and sdss. Here there is a visible gap between the passive learning curve and the active learning curve for most methods. For instance, using a simple heuristic such as confidence in the pageblocks dataset results in an average MPBA score of 74% after 1,000 examples, while passive learning achieves only 67% (see Fig. 9F).
Out of the eight active learning heuristics tested, the heuristics with the information density weighting (wconfidence, wmargin, and wentropy) generally perform worse than the ones without the weighting. qbbkl performs the best in pageblocks while it can barely beat passive learning in other datasets. The remaining heuristics—confidence, margin, entropy, and qbbmargin—perform equally well in all datasets.
We find no difference in performance between the bandit algorithms and the rank aggregation methods. Combining active learners does not seem to hurt the performance, even if we include a poorly performing heuristic such as qbbkl.
For bandit algorithms, it is interesting to note that thompson favors certain heuristics a lot more than others, while the behavior of exp3++, ocucb, and klucb is almost indistinguishable from explore, where we explore 100% of the time (see Fig. 10). Changing the initial values of μ, σ^{2}, and τ^{2} changes the order of preference slightly, but overall, which heuristics thompson picks seems to correlate with the heuristic performance. For example, as shown in Fig. 11, passive and qbbkl tend to get chosen less often than others in the ionosphere dataset.
Discussion
The experimental results allow us to answer the following questions:

Can active learning beat passive learning? Yes, active learning can perform much better than passive learning, especially when the unlabeled pool is large (e.g., sdss, miniboone, pageblock). When the unlabeled pool is small, the effect of active learning becomes less apparent, as there are now fewer candidates to choose from. This can be seen in Fig. 12, where we show that artificially reducing the unlabeled pool results in a reduction in the final performance. At the same time, having a small test set also makes the gap between the active learning curve and the passive learning curve smaller (see Figs. 12C and 12F). This further contributes to the poorer performance on the smaller datasets. In any case, when a dataset is small, we can label everything so active learning is usually not needed.
Can active learning degrade performance? Yes, there is no guarantee that active learning will always beat passive learning. For example, wentropy actually slows down the learning in many datasets. However, this only happens with certain heuristics, like those using the information density weighting.
What is the best single active learning heuristic? All of confidence, margin, entropy, and qbbmargin have a similar performance. However, confidence is perhaps the simplest to compute and thus is a good default choice in practice.

What are the challenges in using bandit algorithms?

Designing a good reward scheme is difficult. This paper uses the increase in the classifier performance as the reward. However this type of reward is nonstationary (i.e., it gets smaller after each step as learning saturates) and the rewards will thus eventually go to zero.
In practice, we do not have a representative test set that can be used to compute the reward. As a workaround, Hsu & Lin (2015) computed the reward on the training set and then used importance weighting to remove any potential bias. For this to work, we need to ensure that every training example and every active learning suggestion have a nonzero probability of being selected in each step.
Finally, some bandit algorithms such as Thompson sampling assume that the reward follows a certain distribution (e.g., Gaussian). However, this assumption is unrealistic.


What are the challenges in using rank aggregation algorithms?

We need to compute the scores from all heuristics at every time step. This might not be feasible if there are too many heuristics or if we include heuristics that require a large amount of compute power (e.g., variance minimization).
The Schulze method uses O(n^{2}) space, where n is the number of candidates. This might lead to memory issues if we need to rank a large number of candidates from the unlabeled pool.
Before aggregating the rankings, we throw away the score magnitudes, which could cause a loss of information.
Unlike bandit algorithms, all of the rank aggregators always give each heuristic an equal weight.


Which method should I use in practice to combine active learners? Since there is no difference in performance between various combiners, we recommend using a simple rank aggregator like Borda count or geometric mean if we do not want to select a heuristic a priori. Rank aggregators do not need a notion of a reward: we simply give all suggestions an equal weight when combining. Thus we neither need to keep a separate test set, nor do we need to worry about designing a good reward scheme.
Our investigation has a few limitations. Firstly, we empirically compare algorithms that only work with single-label classification problems. Nowadays, many problems require multi-label learning, in which each example is allowed to be in more than one class. Our methods can be extended to work with multi-label datasets with the following modifications. We first need a multi-label classifier. This can be as simple as a collection of binary classifiers, each of which produces the probability that an example belongs to a particular class. For each class, we can use an active learning heuristic to assign a score to each unlabeled example as before. However, now we need to aggregate the scores among the classes. As suggested by Reyes, Morell & Ventura (2018), we can use any aggregation method like Borda count to combine these scores. In effect, the multi-label learning problem adds an extra layer of aggregation into the pipeline.
Another limitation of our methods is that our active learning methods are myopic. That is, in each iteration, we only pick one instance to give to a human expert for labeling. In many practical applications like astronomy, batch-mode active learning is preferred, as it is much more cost-efficient to obtain multiple labels simultaneously. One naive extension is to simply choose the m highest-ranked objects using our current methods. However, it is possible to have two unlabeled objects whose class membership we are currently uncertain about, but because they have very similar feature vectors, labeling only one of them would allow us to predict the label of the other one easily. More sophisticated batch-mode active learning approaches have been proposed to take into account other factors such as the diversity of a batch and the representativeness of each batch example. These approaches include looking at the angles between hyperplanes in support vector machines (Brinker, 2003), using cluster analysis (Xu, Akella & Zhang, 2007), and using an evolutionary algorithm (Reyes & Ventura, 2018). How to aggregate suggestions from these approaches is an interesting problem for future work.
Conclusion
In this paper we compared 16 active learning methods with passive learning. Our three main findings are: active learning is better than passive learning; combining active learners does not in general degrade the performance; and social choice theory provides more practical algorithms than bandit theory since we do not need to design a reward scheme.
Appendix A: Posterior Balanced Accuracy
Most real-world datasets are unbalanced. In the SDSS dataset, for example, there are 4.5 times as many galaxies as quasars. The problem of class imbalance is even more severe in the pageblocks dataset, where one class makes up 90% of the data and the remaining four classes only make up 10%. An easy fix is to undersample the dominant class when creating the training and test sets. This, of course, means that the size of these sets is limited by the size of the minority class.
When we do not want to alter the underlying class distribution or when larger training and test sets are desired, we need a performance measure that can correct for the class imbalance. Brodersen et al. (2010) show that the posterior balanced accuracy distribution can overcome the bias in the binary case. We now extend this idea to the multiclass setting.
Suppose we have k classes. For each class i between 1 and k, there are N_{i} objects in the universe. Given a classifier, we can predict the label of every object and compare our prediction with the true label. Let G_{i} be the number of objects in class i that are correctly predicted. Then we define the recall A_{i} of class i as $$A_{i}=\frac{G_{i}}{N_{i}} \tag{26}$$ The problem is that it is not feasible to get the actual values of G_{i} and N_{i} since that would require us to obtain the true label of every object in the universe. Thus we need a method to estimate these quantities when we only have a sample. Initially we have no information about G_{i} and N_{i}, so we can assume that each A_{i} follows a uniform prior distribution between 0 and 1. This is the same as a Beta distribution with shape parameters α = β = 1: $$A_{i}\sim \text{Beta}(1,1) \tag{27}$$
The probability density function (PDF) of A_{i} is then $$\begin{aligned} f_{A_{i}}(a) &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,a^{\alpha-1}(1-a)^{\beta-1} \\ &\propto a^{1-1}(1-a)^{1-1} \end{aligned} \tag{28}$$ where Γ(α) is the gamma function.
After we have trained the classifier, suppose we have a test set containing n_{i} objects in class i. Running the classifier on this test set is the same as conducting k binomial experiments, where, in the ith experiment, the sample size is n_{i} and the probability of success is simply A_{i}. Let g_{i} be the number of correctly labeled objects belonging to class i in the test set. Then, conditional on the recall rate A_{i}, g_{i} follows a binomial distribution: $$(g_{i}\mid A_{i})\sim \text{Bin}(n_{i},A_{i}) \tag{29}$$
The probability mass function of $(g_{i}\mid A_{i}=a)$ is thus $$\begin{aligned} p_{g_{i}\mid A_{i}}(g_{i}) &= \binom{n_{i}}{g_{i}}\,a^{g_{i}}(1-a)^{n_{i}-g_{i}} \\ &\propto a^{g_{i}}(1-a)^{n_{i}-g_{i}} \end{aligned} \tag{30}$$
In the Bayesian setting, Eq. (28) is the prior and Eq. (30) is the likelihood. To get the posterior PDF, we simply multiply the prior with the likelihood: $$\begin{aligned} f_{A_{i}\mid \mathbf{g}}(a) &\propto f_{A_{i}}(a)\times p_{g_{i}\mid A_{i}}(g_{i}) \\ &\propto a^{1-1}(1-a)^{1-1}\times a^{g_{i}}(1-a)^{n_{i}-g_{i}} \\ &= a^{1+g_{i}-1}(1-a)^{1+n_{i}-g_{i}-1} \end{aligned} \tag{31}$$ Thus, with respect to the binomial likelihood function, the Beta distribution is conjugate to itself. The posterior recall rate A_{i} also follows a Beta distribution, now with parameters $$(A_{i}\mid g_{i})\sim \text{Beta}(1+g_{i},\,1+n_{i}-g_{i}) \tag{32}$$ Our goal is to have a balanced accuracy rate, A, that puts an equal weight on each class. One way to achieve this is to take the average of the individual recalls: $$\begin{aligned} A &= \frac{1}{k}\sum_{i=1}^{k}A_{i} \\ &= \frac{1}{k}A_{T} \end{aligned} \tag{33}$$ Here we have defined A_{T} to be the sum of the individual recalls. We call $(A\mid \mathbf{g})$ the posterior balanced accuracy, where g = (g_{1}, …, g_{k}). Most of the time, we simply want to calculate its expected value: $$\begin{aligned} \mathbb{E}[A\mid \mathbf{g}] &= \frac{1}{k}\,\mathbb{E}[A_{T}\mid \mathbf{g}] \\ &= \frac{1}{k}\int a\cdot f_{A_{T}\mid \mathbf{g}}(a)\,da \end{aligned} \tag{34}$$ Let us call this the MPBA. Note that there is no closed form solution for the PDF $f_{A_{T}\mid \mathbf{g}}(a)$. However, assuming that A_{T} is a sum of k independent Beta random variables, $f_{A_{T}\mid \mathbf{g}}(a)$ can be approximated by numerically convolving k Beta distributions. The independence assumption is reasonable here, since there should be little to no correlation between the individual recall rates. For example, knowing that a classifier is really good at recognizing stars does not tell us much about how well that classifier can recognize galaxies.
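Although the density has no closed form, the expected value in Eq. (34) does not require the convolution: by linearity of expectation it is the average of the posterior means, and a Beta(1 + g_{i}, 1 + n_{i} − g_{i}) variable has mean (1 + g_{i})/(2 + n_{i}). The sketch below uses this shortcut. The function name `mpba` and the example counts are illustrative, and this is not necessarily how the accompanying repository computes the quantity.

```python
def mpba(correct, totals):
    """Expected posterior balanced accuracy under independent Beta(1, 1) priors.

    correct[i] is the number of correctly labeled test objects in class i (g_i)
    and totals[i] is the number of test objects in class i (n_i).
    """
    posterior_means = [(1 + g) / (2 + n) for g, n in zip(correct, totals)]
    return sum(posterior_means) / len(posterior_means)


# Imbalanced toy example: 100 objects in one class, 10 in the other.
balanced = mpba(correct=[90, 5], totals=[100, 10])
raw_accuracy = (90 + 5) / (100 + 10)
```

In this toy example the raw accuracy is about 0.86, while the balanced figure is about 0.70 because the poorly recognized minority class carries the same weight as the majority class.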
Having the knowledge of $f_{A\mid \mathbf{g}}(a)$ will allow us to make violin plots, construct confidence intervals and do hypothesis tests. To get an expression for this, let us first rewrite the cumulative distribution function as $$\begin{aligned} F_{A\mid \mathbf{g}}(a) &= \mathbb{P}(A\le a\mid \mathbf{g}) \\ &= \mathbb{P}\left(\frac{1}{k}A_{T}\le a\mid \mathbf{g}\right) \\ &= \mathbb{P}(A_{T}\le ka\mid \mathbf{g}) \\ &= F_{A_{T}\mid \mathbf{g}}(ka) \end{aligned} \tag{35}$$
Differentiating Eq. (35) with respect to a, we obtain the PDF of $(A\mid \mathbf{g})$: $$\begin{aligned} f_{A\mid \mathbf{g}}(a) &= \frac{\partial}{\partial a}F_{A_{T}\mid \mathbf{g}}(ka) \\ &= \frac{\partial}{\partial a}(ka)\cdot \frac{\partial}{\partial (ka)}F_{A_{T}\mid \mathbf{g}}(ka) \\ &= k\cdot f_{A_{T}\mid \mathbf{g}}(ka) \end{aligned} \tag{36}$$ A Python implementation for the posterior balanced accuracy can be found on our GitHub repository (https://github.com/chengsoonong/mclasssky).
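The numerical convolution and the change of variables in Eq. (36) can be sketched with NumPy as follows. This is an independent illustration rather than the repository code. The grid size, the use of cell midpoints, and the function names are all assumptions.

```python
import math

import numpy as np


def beta_pdf(a, alpha, beta):
    """Beta density on interior grid points, evaluated in log space."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return np.exp(log_norm + (alpha - 1) * np.log(a) + (beta - 1) * np.log1p(-a))


def posterior_balanced_accuracy_pdf(correct, totals, n_grid=1000):
    """Approximate the density of A = A_T / k by convolving k Beta posteriors."""
    h = 1.0 / n_grid
    a = (np.arange(n_grid) + 0.5) * h               # cell midpoints in (0, 1)
    pmf_sum = np.array([1.0])
    for g, n in zip(correct, totals):
        pmf = beta_pdf(a, 1 + g, 1 + n - g) * h     # posterior mass per cell
        pmf_sum = np.convolve(pmf_sum, pmf / pmf.sum())
    k = len(correct)
    # Entry s of pmf_sum sits at A_T = (s + k/2) * h; dividing by k gives A.
    grid = (np.arange(len(pmf_sum)) + k / 2) * h / k
    density = pmf_sum / (h / k)                     # same as k * f_{A_T}(k a)
    return grid, density


grid, density = posterior_balanced_accuracy_pdf(correct=[90, 5], totals=[100, 10])
dx = grid[1] - grid[0]
mean = float(np.sum(grid * density) * dx)
```

The resulting `grid` and `density` arrays can be passed to a plotting routine or accumulated with `np.cumsum(density) * dx` to read off interval endpoints.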