Semi-supervised oblique predictive clustering trees
 Academic Editor: Sebastian Ventura
 Subject Areas: Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Mining and Machine Learning
 Keywords: Semi-supervised learning, Oblique decision trees, Predictive clustering trees, Structured output prediction
 Copyright: © 2021 Stepišnik and Kocev
 Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article: Stepišnik and Kocev (2021). Semi-supervised oblique predictive clustering trees. PeerJ Computer Science 7:e506 https://doi.org/10.7717/peerjcs.506
Abstract
Semi-supervised learning combines supervised and unsupervised learning approaches to learn predictive models from both labeled and unlabeled data. It is most appropriate for problems where labeled examples are difficult to obtain but unlabeled examples are readily available (e.g., drug repurposing). Semi-supervised predictive clustering trees (SSL-PCTs) are a prominent method for semi-supervised learning that achieves good performance on various predictive modeling tasks, including structured output prediction tasks. The main issue, however, is that the learning time scales quadratically with the number of features. In contrast to axis-parallel trees, which only use individual features to split the data, oblique predictive clustering trees (SPYCTs) use linear combinations of features. This makes the splits more flexible and expressive and often leads to better predictive performance. With a carefully designed criterion function, we can use efficient optimization techniques to learn oblique splits. In this paper, we propose semi-supervised oblique predictive clustering trees (SSL-SPYCTs). We adjust the split learning to take unlabeled examples into account while remaining efficient. The main advantage over SSL-PCTs is that the proposed method scales linearly with the number of features. The experimental evaluation confirms the theoretical computational advantage and shows that SSL-SPYCTs often outperform SSL-PCTs and supervised SPYCTs in both single-tree and ensemble settings. We also show that SSL-SPYCTs are better at producing meaningful feature importance scores than supervised SPYCTs when the amount of labeled data is limited.
Introduction
The most common tasks in machine learning are supervised and unsupervised learning. In supervised learning, we are presented with a set of examples described with their properties (i.e., descriptive variables or features) as well as with a target property (i.e., output variables, target variables, or labels). The goal of a supervised learning method is to learn a mapping from the descriptive values to the output values that generalizes well to examples that were not used for learning. In unsupervised learning, on the other hand, no output values are provided for the examples. Instead, unsupervised methods aim to extract some underlying structure of the examples (e.g., discover clusters of similar examples, learn low dimensional representations, etc.).
Semi-supervised learning combines these two approaches (Chapelle, Schölkopf & Zien, 2006). We are presented with a set of examples, where a (smaller) part of them are associated with output values (labeled examples) and a (larger) part of them are not (unlabeled examples). Semi-supervised methods learn a mapping from examples to the output values (like supervised methods), but also include unlabeled examples in the learning process (like unsupervised methods). The semi-supervised approach is typically used when labeled examples are too scarce for supervised methods to learn a model that generalizes well and, at the same time, unlabeled examples are relatively easy to obtain. This often happens in problems from the life sciences, where labeling the examples requires wet-lab experiments that are time-consuming and expensive. For example, consider the problem of discovering a new drug for a certain disease. Testing the effects of compounds on the progression of the disease requires screening experiments, so labeling the examples (compounds) is expensive. On the other hand, millions of unlabeled compounds are described in online databases. Ideally, a semi-supervised method can take a handful of labeled compounds, combine them with the unlabeled compounds, and learn a model that predicts the effect of a compound on the disease progression, facilitating the discovery of a novel drug.
The most common approaches to semi-supervised learning are wrapper methods (Van Engelen & Hoos, 2020), such as self-training (Kang, Kim & Cho, 2016), where a model iteratively labels the unlabeled examples and includes these pseudo-labels in the learning set in the next iteration. Alternatively, in co-training (Zhou & Li, 2007), two models iteratively label the data for each other. Typically, these two models are different or at least learn on different views of the data. Among the intrinsically semi-supervised methods (Van Engelen & Hoos, 2020), semi-supervised predictive clustering trees (Levatić, 2017) are a prominent method. They can be used to solve a variety of predictive tasks, including multi-target regression and (hierarchical) multi-label classification (Levatić, 2017; Levatić et al., 2017; Levatić et al., 2018; Levatić et al., 2020). They achieve good predictive performance and, as a bonus, the learned models can be interpreted, either by inspecting the learned trees or by calculating feature importances from ensembles of trees (Petković, Džeroski & Kocev, 2020). However, the method scales poorly with data dimensionality: model learning can take a very long time on datasets with many features or targets.
Standard decision/regression trees (Breiman et al., 1984) split the data based on the features in a way that minimizes the impurity of the target in the resulting clusters (e.g., variance for regression, entropy for classification). In the end nodes (leaves), predictions for the target are made. Predictive clustering trees (PCTs; Blockeel, Raedt & Ramon, 1998; Blockeel et al., 2002) generalize standard trees by differentiating between three types of attributes: features, clustering attributes, and targets. Features are used to divide the examples; these are the attributes encountered in the split nodes. Clustering attributes are used to calculate the heuristic that guides the search for the best split at a given node, and targets are predicted in the leaves. The role of the targets in standard trees is therefore split between the clustering attributes and the targets in PCTs. In theory, the clustering attributes can be selected independently of the features and the targets. However, the learned tree should make accurate predictions for the targets, so minimizing the impurity of the clustering attributes should help minimize the impurity of the targets. This attribute differentiation gives PCTs a lot of flexibility. They have been used for predicting various structured outputs (Kocev et al., 2013), including multi-target regression, multi-label classification, and hierarchical multi-label classification. Embeddings of the targets have been used as clustering attributes in order to reduce the time complexity of tree learning (Stepišnik & Kocev, 2020a). Semi-supervised PCTs use both targets and features as clustering attributes. This makes the leaves homogeneous in both the input and the output space, which allows unlabeled examples to influence the learning process.
PCTs use individual features to split the data, which means the split hyperplanes in the input space are axis-parallel. SPYCTs (Stepišnik & Kocev, 2020b; Stepišnik & Kocev, 2020) are a redesign of standard PCTs that uses linear combinations of features to achieve oblique splits of the data, i.e., the split hyperplanes can be arbitrary. The potential advantage of oblique splits compared to axis-parallel splits is presented in Fig. 1. SPYCTs offer state-of-the-art predictive performance, scale better with the number of clustering attributes, and can exploit sparse data to speed up computation.
In this paper, we propose SPYCTs for semi-supervised learning. We follow the same semi-supervised approach as regular PCTs, which includes the features in the heuristic function used to evaluate the quality of a split. This makes the improved scaling of SPYCTs over PCTs especially beneficial, which is the main motivation for our proposal. We modify the oblique split learning objective functions of SPYCTs to account for missing target values. We evaluate the proposed approach on multiple benchmark datasets for different predictive modeling tasks.
In the remainder of the paper, we first describe the proposed semi-supervised methods and present the experimental setting for their evaluation. Next, we present and discuss the results of our experiments and, finally, conclude the paper with several take-home messages.
Method description
In this section, we present our proposal for semi-supervised learning of SPYCTs (SSL-SPYCTs). We start by introducing the notation used in the manuscript. Let X^{l} ∈ ℝ^{L×D} and X^{u} ∈ ℝ^{U×D} be the matrices containing the D features of the L labeled and U unlabeled examples, respectively. Let Y ∈ ℝ^{L×T} be the matrix containing the T targets associated with the L labeled examples, and let X = [(X^{l})^{T}(X^{u})^{T}]^{T} ∈ ℝ^{(L+U)×D} be the matrix combining the features of both labeled and unlabeled examples. Finally, let p ∈ ℝ^{D+T} be the vector of clustering weights, used to assign different priorities to different clustering attributes (features and targets) when learning a split.
There are two variants of SPYCTs that learn the split hyperplanes in different ways.

The SVM variant first groups the examples into two clusters based on the clustering attributes using k-means clustering, and then learns a linear SVM on the features, with the cluster indicators as targets, to approximate this split.

The gradient variant uses a fuzzy membership indicator to define a differentiable objective function which measures the impurity on both sides of the split hyperplane. The hyperplane is then optimized with gradient descent to minimize the impurity.
The basis of the semi-supervised approach is to use both features and targets as clustering attributes, so that unlabeled examples influence the learning process through the heuristic score calculation despite their missing target values. For the SVM variant, this means that examples are clustered based on both target and feature values. For the gradient variant, the split is optimized to minimize the impurity of both features and targets on each side of the hyperplane. An overview of the SSL-SPYCT learning algorithm is presented in Algorithm 1. The weights w ∈ ℝ^{D} and bias b ∈ ℝ define the split hyperplane, and they are obtained differently for each SPYCT variant, as follows.
 SVM variant

The first step is to cluster the examples into two groups using k-means clustering. The initial centroids are selected randomly from the labeled examples. Since the clustering is performed based on both features and targets, the cluster centroids consist of a feature part and a target part, i.e., ${c}^{0}=\left[{X}_{i,:}^{l}\;\;{Y}_{i,:}\right]\in {\mathbb{R}}^{D+T},\quad {c}^{1}=\left[{X}_{j,:}^{l}\;\;{Y}_{j,:}\right]\in {\mathbb{R}}^{D+T}.$ Next, we calculate the weighted squared Euclidean distance of each example to the two centroids. For unlabeled examples, we only calculate the distance to the feature part of the centroids: $d\left(i,j\right)=\sum _{k=1}^{D}{p}_{k}{\left({X}_{j,k}-{c}_{k}^{i}\right)}^{2}+\alpha \sum _{k=1}^{T}{p}_{D+k}{\left({Y}_{j,k}-{c}_{D+k}^{i}\right)}^{2},$ where i ∈ {0, 1} is the cluster indicator, 1 ≤ j ≤ L + U is the example index, and α = 1 if the example is labeled (i.e., j ≤ L) and α = 0 if it is unlabeled. The examples are split into two clusters according to the closer centroid. In the case of ties in the distance, the examples are assigned uniformly at random to a cluster. Let s ∈ {0, 1}^{L+U} be the vector indicating cluster membership. The new centroids are then the means of the examples assigned to each cluster.
The feature parts of the new centroids are calculated over all examples assigned to a cluster, whereas the target parts are calculated using only the labeled examples, i.e., ${c}_{j}^{i}=\frac{\sum _{k=1}^{L+U}\mathbb{I}\left[{s}_{k}=i\right]{X}_{k,j}}{\sum _{k=1}^{L+U}\mathbb{I}\left[{s}_{k}=i\right]},\quad \text{if } 1\le j\le D,$ and ${c}_{j}^{i}=\frac{\sum _{k=1}^{L}\mathbb{I}\left[{s}_{k}=i\right]{Y}_{k,j-D}}{\sum _{k=1}^{L}\mathbb{I}\left[{s}_{k}=i\right]},\quad \text{if } D<j\le D+T.$ This procedure is repeated for a specified number of iterations. After the final clusters are determined, a linear SVM is used to approximate the split based on the features. Specifically, the following optimization problem is solved: $\min_{w,b}\|w\|_{1}+C\sum _{k=1}^{L+U}\max{\left(0,1-\bar{s}_{k}\left({X}_{k,:}\cdot w+b\right)\right)}^{2},$ where $\bar{s}_{k}=2{s}_{k}-1$ maps the cluster indicators to ±1 and the parameter C ∈ ℝ determines the strength of the regularization.
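To make the semi-supervised distance calculation concrete, the step where unlabeled examples are compared to a centroid only through its feature part can be sketched as follows. This is a minimal NumPy illustration; the function name and array layout are ours, not taken from the reference implementation.

```python
import numpy as np

def ssl_distances(X, Y, centroid, p, L):
    """Weighted squared distances of all examples to one centroid.

    X: (L+U, D) features of labeled + unlabeled examples (labeled first)
    Y: (L, T) targets of the labeled examples
    centroid: (D+T,) vector, feature part followed by target part
    p: (D+T,) clustering weights
    L: number of labeled examples
    """
    D = X.shape[1]
    cx, cy = centroid[:D], centroid[D:]
    # Feature part of the distance: computed for every example.
    d = ((X - cx) ** 2 * p[:D]).sum(axis=1)
    # Target part: added only for the labeled examples (alpha = 1).
    d[:L] += ((Y - cy) ** 2 * p[D:]).sum(axis=1)
    return d
```

Running this against two centroids and taking the argmin reproduces the cluster assignment step described above.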
 Gradient variant

We start with randomly initialized weights (w) and bias (b) and calculate the fuzzy membership vector s = σ(Xw + b) ∈ [0, 1]^{L+U}. The value s_{i} tells us how much the corresponding example belongs to the “positive” group, whereas the value 1 − s_{i} tells us how much it belongs to the “negative” group. To calculate the impurity of a group, we calculate the weighted variance of every feature and every target. For the targets, only labeled examples are used in the calculation. The weighted variance of a vector v ∈ ℝ^{n} with weights a ∈ ℝ^{n} is defined as $var\left(v,a\right)=\frac{\sum _{i}^{n}{a}_{i}{\left({v}_{i}-mean\left(v,a\right)\right)}^{2}}{A}=mean\left({v}^{2},a\right)-mean{\left(v,a\right)}^{2},$ where $A={\sum}_{i}^{n}{a}_{i}$ is the sum of the weights and $mean\left(v,a\right)=\frac{1}{A}{\sum}_{i}^{n}{a}_{i}{v}_{i}$ is the weighted mean of v. The impurity of the positive group is then calculated as $imp\left(s,p\right)=\sum _{k=1}^{D}{p}_{k}\,var\left({X}_{:,k},s\right)+\sum _{k=1}^{T}{p}_{D+k}\,var\left({Y}_{:,k},s\right).$ To get the impurity of the negative group, imp(1 − s, p), we simply replace the fuzzy membership weights s with 1 − s. The split fitness function we wish to optimize is then $f\left(w,b\right)=S\cdot imp\left(s,p\right)+\left(L+U-S\right)\cdot imp\left(1-s,p\right),$ where s = σ(Xw + b) and S = ∑_{i}s_{i}. The terms S and L + U − S represent the sizes of the positive and negative subsets and are added to guide the split search towards balanced splits. The final optimization problem for learning the split hyperplane is $\min_{w,b}\|w\|_{\frac{1}{2}}+C\,f\left(w,b\right),$ where C again controls the strength of the regularization. The objective function is differentiable, so we can efficiently solve the problem using the Adam (Kingma & Ba, 2014) gradient descent optimization method.
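The fuzzy impurity objective above can be sketched in a few lines of NumPy. This is an illustrative evaluation of the fitness function f(w, b) only (the actual method also adds the regularization term and optimizes with Adam); the function names are ours, and the targets use only the first L membership weights, as in the text.

```python
import numpy as np

def weighted_variance(v, a):
    """var(v, a) = mean(v^2, a) - mean(v, a)^2 with weights a."""
    A = a.sum()
    m = (a * v).sum() / A
    m2 = (a * v ** 2).sum() / A
    return m2 - m ** 2

def split_fitness(w, b, X, Y, p, L):
    """Fuzzy split fitness f(w, b) for the gradient variant (sketch).

    X: (L+U, D) features, Y: (L, T) targets of labeled examples,
    p: (D+T,) clustering weights, L: number of labeled examples.
    """
    D = X.shape[1]
    s = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # fuzzy memberships sigma(Xw + b)

    def imp(weights):
        # Features: all examples contribute; targets: labeled examples only.
        total = sum(p[k] * weighted_variance(X[:, k], weights) for k in range(D))
        total += sum(p[D + k] * weighted_variance(Y[:, k], weights[:L])
                     for k in range(Y.shape[1]))
        return total

    S = s.sum()
    n = X.shape[0]
    # Subset sizes S and (L+U-S) push the search towards balanced splits.
    return S * imp(s) + (n - S) * imp(1.0 - s)
```

Since every operation here is differentiable, the same expression written with an autodiff library (e.g., PyTorch) can be minimized with gradient descent, as the method does.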
The clustering weights for the targets are uniform for binary classification, multi-class classification, multi-label classification, regression, and multi-target regression tasks. For hierarchical multi-label classification, the weights of target labels positioned lower in the hierarchy are smaller. This gives more importance to labels higher in the hierarchy when splitting the examples.
Features and clustering attributes are standardized to mean 0 and standard deviation 1 prior to learning each split. For the features, this is done to make split learning more stable. For the clustering attributes, this is performed before the application of the clustering weights, so that only clustering weights control the relative influences of the different clustering attributes on the objective function.
We also introduce a parameter ω that determines the degree of supervision. The clustering weights corresponding to the features (p_{i} for 1 ≤ i ≤ D) are scaled so that their sum is 1 − ω, and the clustering weights corresponding to the targets (p_{i} for D < i ≤ D + T) are scaled so that their sum is ω. This enables us to control the relative importance of features and targets when splitting the data. The borderline values of ω (0 and 1) yield the extreme behaviors in terms of the amount of supervision. Setting ω to 0 means that the target impurity is ignored and tree construction is effectively unsupervised, i.e., without supervision. Conversely, setting ω to 1 means that the feature impurity is ignored when learning splits, so the unlabeled examples do not affect the split selection; tree construction in this case is fully supervised.
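The ω rescaling is a simple normalization and can be sketched as follows (a minimal illustration with a function name of our own choosing):

```python
import numpy as np

def scale_clustering_weights(p, D, omega):
    """Rescale clustering weights so the feature part (first D entries)
    sums to 1 - omega and the target part sums to omega."""
    p = np.asarray(p, dtype=float).copy()
    p[:D] *= (1.0 - omega) / p[:D].sum()
    p[D:] *= omega / p[D:].sum()
    return p
```

With ω = 0 the target weights vanish (unsupervised splitting); with ω = 1 the feature weights vanish (fully supervised splitting), matching the borderline behaviors described above.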
The splitting of the examples (i.e., the tree construction) stops when at least one of the following stopping criteria is reached. We can specify the minimum number of examples required in a leaf node (at least one labeled example is always required, otherwise predictions cannot be made). We can also require a split to reduce the impurity by a specified amount, or specify the maximum depth of the tree.
After the splitting stops, a leaf node is created. The prototype of the targets of the remaining examples is calculated and stored as the prediction for the examples reaching that leaf. Since the targets in structured output prediction are represented as tuples/vectors, the prototypes are calculated as column-wise mean values of the targets (Y). They can be used directly as predictions (in regression problems), used to calculate the majority class (in binary and multi-class classification), or used to predict all labels with a mean above a certain threshold (in hierarchical and flat multi-label classification).
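The leaf prediction step can be sketched as follows. This is an illustrative fragment, assuming a 0/1 label matrix for the classification case; the function names are ours.

```python
import numpy as np

def leaf_prototype(Y):
    """Column-wise mean of the targets remaining in a leaf."""
    return Y.mean(axis=0)

def predict_multilabel(prototype, threshold=0.5):
    """Multi-label prediction: all labels whose prototype value
    exceeds the threshold."""
    return prototype > threshold
```

For regression the prototype is returned directly, and for (multi-class) classification the class with the largest prototype value is taken as the majority class.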
The time complexity of learning a split in standard PCTs is $O(DN \log N + NDK)$ (Kocev et al., 2013), where K is the number of clustering attributes. For the SVM and gradient variants of SPYCTs, the time complexities are $O(N(I_{c}K + I_{o}D))$ and $O(NI_{o}(D + K))$, respectively (Stepišnik & Kocev, 2020), where I_{o} is the number of optimization iterations for w and b, and I_{c} is the number of clustering iterations (SVM variant). When learning the SSL variants (SSL-PCTs and SSL-SPYCTs), the clustering attributes consist of both features and targets, therefore K = D + T. This means that SSL-PCTs scale quadratically with the number of features, whereas both variants of SSL-SPYCTs scale linearly. SSL-SPYCTs are therefore much more computationally efficient, and can additionally take advantage of sparse data by performing calculations with sparse matrices. Our implementation of the proposed method is freely licensed and available for use and download at https://gitlab.com/TStepi/spyct.
Experimental design
We evaluated our approach on 30 benchmark datasets for different predictive modeling tasks: binary classification (BC), multi-class classification (MCC), multi-label classification (MLC), hierarchical multi-label classification (HMLC), single-target regression (STR), and multi-target regression (MTR). The datasets are freely available and were obtained from the following repositories: openml (https://www.openml.org), mulan (http://mulan.sourceforge.net/datasets.html), dtaics (https://dtai.cs.kuleuven.be/clus/hmcens/) and ktijs (http://kt.ijs.si/DragiKocev/PhD/resources/doku.php?id=hmc_classification). The selected datasets have diverse properties in terms of application domains, number of examples, number of features, and number of targets. Their properties and sources are presented in Table 1.
dataset  source  task  N  D  T 

bioresponse  openml  BC  3751  1776  1 
mushroom  openml  BC  8124  22  1 
phoneme  openml  BC  5404  5  1 
spambase  openml  BC  4601  57  1 
speeddating  openml  BC  8378  120  1 
cardiotocography  openml  MCC  2126  35  10 
gesture  openml  MCC  9873  32  5 
isolet  openml  MCC  7797  617  26 
mfeatpixel  openml  MCC  2000  240  10 
plantstexture  openml  MCC  1599  64  100 
bibtex  mulan  MLC  7395  1836  159 
birds  mulan  MLC  645  260  19 
bookmarks  mulan  MLC  87856  2150  208 
delicious  mulan  MLC  16105  500  983 
scene  mulan  MLC  2407  294  6 
ara_interpro_GO  dtaics  HMLC  11763  2815  630 
diatoms  ktijs  HMLC  3119  371  398 
enron  ktijs  HMLC  1648  1001  56 
imclef07d  ktijs  HMLC  11006  80  46 
yeast_seq_FUN  dtaics  HMLC  3932  478  594 
cpmp2015  openml  STR  2108  23  1 
pol  openml  STR  15000  48  1 
qsar197  openml  STR  1243  1024  1 
qsar12261  openml  STR  1842  1024  1 
satellite_image  openml  STR  6435  36  1 
atp1d  mulan  MTR  337  411  6 
enb  mulan  MTR  768  8  2 
oes97  mulan  MTR  334  263  16 
rf2  mulan  MTR  9125  576  8 
scm1d  mulan  MTR  9803  280  16 
We focus on the comparison of our proposed SSL-SPYCT method with the original supervised SPYCT method and with semi-supervised learning of axis-parallel PCTs: SSL-PCTs (Levatić, 2017). These two baselines are, respectively, the supervised and semi-supervised methods most closely related to the proposed approach. For completeness, we also include supervised PCTs in the comparison. Note that SPYCTs and PCTs are the only available methods able to address all of the structured output prediction tasks in a uniform manner. We evaluate the methods in the single-tree setting and in bagging ensembles (Breiman, 1996) of 50 trees.
For SPYCTs, we use the same configuration as in Stepišnik & Kocev (2020c). Tree depth is not limited, leaves only need to contain 1 (labeled) example, and splits are accepted if they reduce the impurity by at least 5% in at least one of the subsets. The maximum number of optimization iterations is set to 100 for both variants, and the SVM variant uses at most 10 clustering iterations. The strength of the regularization (C) is set to 10. For the gradient variant, the Adam optimizer uses the parameters β_{1} = 0.9, β_{2} = 0.999, and ϵ = 10^{−8}. These are the default values from the PyTorch (https://pytorch.org/docs/1.1.0/_modules/torch/optim/adam.html) library.
For the semi-supervised methods, we select the ω parameter with 3-fold internal cross-validation on the training set, choosing the best value from the set {0, 0.25, 0.5, 0.75, 1}. We investigate the influence of the number of labeled examples L on the performance of the semi-supervised methods by setting L to the following numbers of available labeled examples: {25, 50, 100, 250, 500}. We evaluate the methods with a slightly modified 10-fold cross-validation corresponding to an inductive evaluation setting. First, a dataset is divided into 10 folds. One fold is used as the test set. From the other 9 folds, L examples are randomly selected as labeled examples, and the rest are used as unlabeled examples. This process is repeated 10 times, so that each fold is used once as the test set. On the two MTR datasets that have fewer than 500 examples (atp1d and oes97), experiments with L = 500 are not performed.
To measure the predictive performance of the methods on STR and MTR datasets, we use the coefficient of determination ${R}^{2}\left(y,\hat{y}\right)=1-\frac{{\sum}_{i}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}{{\sum}_{i}{\left({y}_{i}-\bar{y}\right)}^{2}},$ where y is the vector of true target values, $\bar{y}$ is their mean, and $\hat{y}$ is the vector of predicted values. For MTR problems, we calculate the mean of the per-target R^{2} scores. For BC and MCC tasks, we use the F1 score, macro-averaged in the MCC case.
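The R^{2} computation and its per-target averaging for MTR can be sketched directly from the formula (an illustrative NumPy fragment; function names are ours):

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination R^2 for a single target."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mtr_r2(Y, Y_hat):
    """Mean of per-target R^2 scores for multi-target regression."""
    return np.mean([r2_score(Y[:, t], Y_hat[:, t]) for t in range(Y.shape[1])])
```

Note that R^{2} is 1 for perfect predictions and 0 for a model that always predicts the mean of the true values.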
Methods solving MLC and HMLC tasks typically return a score for each label and each example, with a higher score meaning that an example is more likely to have that label. Let y ∈ {0, 1}^{n×l} be the matrix of label indicators and $\hat{y}\in {\mathbb{R}}^{n\times l}$ the matrix of label scores returned by a method. We measured the performance of the methods with the weighted label ranking average precision $LRAP\left(y,\hat{y}\right)=\frac{1}{n}\sum _{i=0}^{n-1}\sum _{j:{y}_{ij}=1}\frac{{w}_{j}}{{W}_{i}}\frac{{L}_{ij}}{{R}_{ij}},$ where ${L}_{ij}=\left|\left\{k:{y}_{ik}=1\wedge {\hat{y}}_{ik}\ge {\hat{y}}_{ij}\right\}\right|$ is the number of real labels assigned to example i that the method ranked at least as high as label j, ${R}_{ij}=\left|\left\{k:{\hat{y}}_{ik}\ge {\hat{y}}_{ij}\right\}\right|$ is the number of all labels ranked at least as high as label j, w_{j} is the weight assigned to label j, and W_{i} is the sum of the weights of all labels assigned to example i. For the MLC datasets, we put equal weights on all labels, whereas for the HMLC datasets, we weighted each label with 0.75^{d}, with d being the depth of the label in the hierarchy (Kocev et al., 2013). For hierarchies that are directed acyclic graphs, the depth of a node is calculated as the average depth of its parent nodes plus one. The same weights are also used as the clustering weights for the targets, for all methods.
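The weighted LRAP measure can be implemented directly from its definition. The following is a straightforward (unoptimized) NumPy sketch with naming of our own; with uniform label weights it reduces to the standard label ranking average precision.

```python
import numpy as np

def weighted_lrap(y, scores, label_weights):
    """Weighted label ranking average precision.

    y: (n, l) binary label indicators
    scores: (n, l) label scores returned by a method
    label_weights: (l,) weight w_j per label
    """
    n, l = y.shape
    total = 0.0
    for i in range(n):
        relevant = np.flatnonzero(y[i])          # labels assigned to example i
        W = label_weights[relevant].sum()        # W_i: weight mass of its labels
        for j in relevant:
            higher = scores[i] >= scores[i, j]   # labels ranked at least as high
            L_ij = np.sum(higher & (y[i] == 1))  # ... that are real labels
            R_ij = np.sum(higher)                # ... among all labels
            total += (label_weights[j] / W) * (L_ij / R_ij)
    return total / n
```

A perfect ranking (all real labels scored above all others) yields a score of 1.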
Results and Discussion
Predictive performance comparison
We first present the results obtained on the rf2 dataset in Fig. 2. Here, the semi-supervised approach outperforms supervised learning for both SPYCT variants. This is the case in both single-tree and ensemble settings, and for all considered numbers of labeled examples. These results demonstrate the potential of the proposed SSL methods.
For a high-level comparison of the predictive performance of the proposed SSL methods and the baselines, we use average ranking diagrams (Demšar, 2006). The results are presented in Fig. 3. The first observation is that SSL-SPYCT-GRAD achieves the best rank for all numbers of labeled examples, in both single-tree and ensemble settings. The only exception is single trees with 25 labeled examples, where it has the second-best rank, just slightly behind SSL-SPYCT-SVM. Additionally, SSL-SPYCT-SVM also ranks better than both its supervised variant and SSL-PCT for all values of L, in both single-tree and ensemble settings. For standard PCTs, the semi-supervised version performed better than the supervised version in the single-tree setting with very few labeled examples (L = 25, 50); otherwise, their performances were similar. This is consistent with previous studies (Levatić et al., 2017; Levatić et al., 2018; Levatić et al., 2020).
Next, we dig deeper into the comparison of the SSL-SPYCT variants to the supervised SPYCTs and to SSL-PCTs. We performed pairwise comparisons among the competing pairs with sign tests (Demšar, 2006) on the number of wins. An algorithm “wins” on a dataset if its performance, averaged over the 10 cross-validation folds, is better than the performance of its competitor. The maximum number of wins is therefore 30 (28 for L = 500). Tables 2 and 3 present the results for the single-tree and ensemble settings, respectively.
The results show that in the single-tree setting, SSL-SPYCTs tend to perform better than their supervised counterparts, though the difference is rarely statistically significant. When used in ensembles, the improvement of the SSL-SPYCT-SVM variant over its supervised counterpart is small. With the gradient variant, the improvement is greater, except for the largest number of labeled examples. Compared to SSL-PCTs, the improvements are generally greater. This holds for single trees and especially for ensembles, where the differences are almost always statistically significant. As the average ranking diagrams in Fig. 3 already suggested, the gradient variant is especially successful.
Overall, the results also show that SPYCTs are a more difficult baseline to beat than SSL-PCTs. This is especially true in ensembles, where the studies of SSL-PCTs show that the improvement over supervised PCT ensembles is negligible (Levatić et al., 2017; Levatić et al., 2018; Levatić et al., 2020). On the other hand, our results show that SSL-SPYCT-GRAD can improve even the ensemble performance. Another important observation is that the supervised variants never significantly outperform the SSL variants. This confirms that dynamically selecting the ω parameter prevents scenarios where unlabeled examples are detrimental to the predictive performance and supervised learning works better.
Learning time comparison
To compare the learning times of the proposed SSL methods and SSL-PCTs, we selected one large dataset for each predictive task. We focused on large datasets, where the differences highlight the scalability of the methods with respect to the numbers of features and targets. We compare the learning times of tree ensembles, as they also serve as a (more reliable) comparison of the learning times of single trees. Fig. 4 shows the learning times on the selected datasets. The results confirm our theoretical analysis and show that the proposed SSL-SPYCTs are learned significantly faster than SSL-PCTs. The differences are especially large on datasets with many features and/or targets (e.g., ara_interpro_GO). The learning times are most similar on the gesture dataset, which has only 32 features, so the theoretical advantage of SSL-SPYCTs is less accentuated. Notwithstanding, the proposed methods are faster on this dataset as well.
#wins  L = 25  L = 50  L = 100  L = 250  L = 500

1-SSL-SPYCT-GRAD vs. 1-SPYCT-GRAD  20  19  21  19  14
1-SSL-SPYCT-SVM vs. 1-SPYCT-SVM  20  18  18  17  20
1-SSL-SPYCT-GRAD vs. 1-SSL-PCT  21  18  22  18  20
1-SSL-SPYCT-SVM vs. 1-SSL-PCT  22  18  18  19  16
#wins  L = 25  L = 50  L = 100  L = 250  L = 500

50-SSL-SPYCT-GRAD vs. 50-SPYCT-GRAD  24  19  19  20  14
50-SSL-SPYCT-SVM vs. 50-SPYCT-SVM  16  16  19  15  16
50-SSL-SPYCT-GRAD vs. 50-SSL-PCT  25  22  22  21  21
50-SSL-SPYCT-SVM vs. 50-SSL-PCT  23  21  22  21  15
Investigating the ω parameter
The ω parameter controls the influence of the unlabeled examples on the learning process. Fig. 5 shows the distributions of the ω values selected with the internal 3-fold cross-validation. We can see that the selected values varied greatly; sometimes different values were chosen even for different folds of the same dataset. This confirms the need to determine ω with internal cross-validation for each dataset separately. Additionally, we notice that larger ω values tend to be selected with more labeled examples, and by ensembles compared to single trees. With larger numbers of labeled examples, it makes sense that the model can rely more heavily on the labeled part of the data, so the unlabeled examples are not as beneficial. For ensembles, this indicates that they can extract more useful information from a few labeled examples than single trees can; whereas this seems clear for larger datasets, bootstrapping on few examples is not obviously beneficial. The fact that ensembles tend to select larger ω values (especially the SVM variant) also explains why the differences in predictive performance between the supervised and semi-supervised variants are smaller in ensembles than in single trees. We also investigated whether the selected ω values were influenced by the predictive modeling task (regression vs. classification, single target vs. multiple targets), but we found no noticeable differences between the ω distributions.
Investigating feature importances
We can extract feature importance scores from learned SPYCT trees (Stepišnik & Kocev, 2020). The importances are calculated based on the absolute values of the weights assigned to individual features in all the split nodes of a tree (or an ensemble of trees). For a single oblique PCT, they are calculated as follows: $imp\left(T\right)={\sum}_{s\in T}\frac{{s}_{n}}{N}\frac{|{s}_{w}|}{\parallel {s}_{w}{\parallel}_{1}},$ where s iterates over the split nodes of tree T, s_{w} is the weight vector defining the split hyperplane (with |s_{w}| denoting its element-wise absolute value), s_{n} is the number of learning examples that were present in the node, and N is the total number of learning examples. The contribution of each node to the final feature importance scores is weighted according to the number of examples that were used to learn the split. This puts more emphasis on the weights higher in the tree, which affect more examples. To get the feature importance scores of an ensemble, we simply average the feature importances of the individual trees in the ensemble.
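The importance aggregation can be sketched as follows. This is an illustrative fragment assuming a tree is available as a list of (example count, weight vector) pairs for its split nodes; that representation and the function names are ours, not the reference implementation's API.

```python
import numpy as np

def tree_importances(splits, N, D):
    """Feature importances of one oblique tree.

    splits: iterable of (s_n, s_w) pairs, where s_n is the number of
            learning examples in the split node and s_w is the (D,) weight
            vector defining the split hyperplane
    N: total number of learning examples, D: number of features
    """
    imp = np.zeros(D)
    for s_n, s_w in splits:
        # Each node contributes its normalized absolute weights,
        # scaled by the fraction of examples it was learned on.
        imp += (s_n / N) * np.abs(s_w) / np.abs(s_w).sum()
    return imp

def ensemble_importances(trees, N, D):
    """Average per-tree importances over an ensemble."""
    return np.mean([tree_importances(t, N, D) for t in trees], axis=0)
```

Nodes near the root, which contain more examples, thus dominate the scores, as described above.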
These scores tell us how much the model relies on individual features and can also be used to identify important features for a given task. We investigated whether SSLSPYCTs are more successful than supervised SPYCTs at identifying important features in problems with limited labeled data. To do this, we followed the setup from Stepišnik & Kocev (2020c) and added random features (noise) to the datasets. For each original feature, we added a random one, so that the total number of features was doubled. The values of the added features were independently sampled from a standard normal distribution. We then learned SPYCTs and SSLSPYCTs and compared the extracted feature importances.
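The augmentation step is straightforward; a minimal sketch (the function name and seeding are our own, not part of the original setup):

```python
import numpy as np

def add_noise_features(X, seed=None):
    """Append one standard-normal random feature per original feature,
    doubling the feature count of the data matrix X."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(X.shape)  # same shape as X, i.i.d. N(0, 1)
    return np.hstack([X, noise])
```

A good importance scoring should then assign the appended half of the features scores near zero.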
Figure 6 presents the results on the qsar197 dataset. For convenience, we also show the predictive performances of the SPYCT and SSLSPYCT methods. A good feature importance scoring would set the scores of the random features (orange) to zero, whereas some real features (blue) would have noticeably higher scores. Low scores for many real features are not concerning, as datasets often include features that are not very useful for predicting the target. This example shows that SSLSPYCTs can be better at identifying useful features than supervised SPYCTs. The difference here is greater with the gradient variant, especially with 50–250 labeled examples. This is also reflected in the predictive performance of the methods.
In general, the quality of the feature importance scores obtained from a model was correlated with the model's predictive performance. This is expected and means that the conclusions here mirror those regarding predictive performance. In terms of feature importance scores, SSLSPYCTs are often similar to supervised SPYCTs, but there are several examples (e.g., Fig. 6) where they are significantly better and worth the extra effort.
Conclusion
In this paper, we propose semisupervised learning of oblique predictive clustering trees. We follow the approach of standard semisupervised predictive clustering trees and adapt both the SVM and gradient variants of SPYCTs to make them capable of learning from unlabeled examples. The main motivation for the proposed methods was the improved computational scaling of SPYCTs compared to PCTs, which is especially pronounced in the proposed SSL approach, where the features are also taken into account when evaluating the splits.
We experimentally evaluated the proposed methods on 30 benchmark datasets for various predictive modeling tasks in both single-tree and ensemble settings. The experiments confirmed the substantial theoretical computational advantage the proposed SSLSPYCT methods have over standard SSLPCTs. The results also showed that the proposed methods often achieve better predictive performance than both supervised SPYCTs and SSLPCTs. The performance edge was preserved even in ensemble settings, where SSLPCTs typically did not outperform supervised PCTs. Finally, we demonstrated that SSLSPYCTs can be significantly better at producing meaningful feature importance scores.
The main drawback of SSLSPYCTs (which is shared with SSLPCTs) is the requirement to determine the ω parameter dynamically with internal cross-validation. This increases the learning time compared to supervised learning, but it guards against cases where introducing unlabeled examples into the learning process hurts the predictive performance. We investigated the selected values of ω and found that higher values tend to be selected when more labeled data is available, and by ensembles compared to single trees. However, the selected values still varied considerably, which confirms the need for dynamic selection of ω.
For future work, we plan to investigate SPYCTs in boosting ensembles for both supervised and semisupervised learning. Variants of gradient boosting (Friedman, 2001) have proven especially successful in many applications recently. We will also try to improve the interpretability of the learned models with Shapley additive explanations (SHAP; Lundberg et al., 2020). Because our method is tree-based, we might be able to calculate the Shapley values efficiently, similarly to how they are calculated for axis-parallel tree methods.
Supplemental Information
Raw results of our experiments
Each row shows the results of one method on one dataset with one number of labeled examples. The results are averaged over the 10 folds of the cross-validation and contain model sizes (number of leaf nodes), learning times, and predictive performance (the measure depends on the task; see the paper for details).
The archive also includes a small Python function that we used to generate the data subsets for the experiments.