Models with different loop and length types were tuned individually
Tuned parameters and the average cross-validation (CV) accuracy of the trained models. The column names roughly correspond to the parameters tuned with the caret package in R; for example, “Interaction Depth” corresponds to the “interaction.depth” variable.
GBM model accuracies exceed the blindBLAST accuracies for most of the loop and length types
The colored lines show the average GBM accuracy (y-axis) from 3-repeat 10-fold cross-validation (CV); the black line shows the average blindBLAST accuracy from the same CV. Error bars correspond to the standard deviations of the accuracies across individual folds. As the number of decision trees (# trees) and the complexity of each weak learner, measured as its number of branches (interaction.depth), increase, the GBM models achieve greater accuracy than blindBLAST for most loop and length types. The shrinkage, i.e. the degree to which each tree contributes to the model during boosting, is not shown; it is set to max(0.01, 0.1 × min(1, n_l/10000)), where n_l is the number of training cases for each loop and length type. Compared with blindBLAST, the best model achieves higher mean accuracy and generally lower variance.
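The shrinkage rule quoted in this legend can be written out directly. The following is a minimal Python sketch of that formula only, not the authors' R code; the function name `shrinkage_for`, the argument name `n_l` (the training-set size), and the example sizes are hypothetical.

```python
def shrinkage_for(n_l: int) -> float:
    """Shrinkage from the legend's rule: max(0.01, 0.1 * min(1, n_l / 10000))."""
    return max(0.01, 0.1 * min(1.0, n_l / 10000))


# Hypothetical training-set sizes, for illustration only.
for n_l in (500, 5000, 20000):
    print(n_l, shrinkage_for(n_l))
```

Under this rule, loop and length types with fewer than 1000 training cases receive the floor value 0.01, and those with 10000 or more receive the cap 0.1.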
Loop and length types with sparse data give the highest accuracy standard deviations
In A and D, the loops with low accuracies and high accuracy standard deviations are those with small numbers of CDR members. In B, among loops with relatively small accuracy standard deviations, the loops with lower accuracy are also those with more clusters. In C, the x-axis is the ratio of the number of CDR members in the most populated cluster to the number in the second most populated cluster, and therefore measures how balanced the cluster sizes are within a loop and length type. Among loops with small accuracy standard deviations and sufficiently many members, more balanced loop and length types tend to have lower accuracy.
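The balance metric plotted on the x-axis of panel C can be illustrated with a short Python sketch. The function name `top_two_ratio` and the cluster labels below are hypothetical, and the sketch assumes each CDR member carries exactly one cluster label.

```python
from collections import Counter


def top_two_ratio(cluster_labels):
    """Ratio of the largest cluster's member count to the second largest."""
    counts = sorted(Counter(cluster_labels).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two clusters to form a ratio")
    return counts[0] / counts[1]


# Hypothetical cluster labels for one loop and length type.
labels = ["c1"] * 60 + ["c2"] * 30 + ["c3"] * 10
print(top_two_ratio(labels))
```

A ratio close to 1 indicates that the two largest clusters are of similar size (a balanced type); a large ratio indicates that one cluster dominates.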
Models with different loop and length types should be tuned individually
Final tuned parameters and the average CV accuracy of the trained models. The columns “interaction.depth”, “n.trees”, “shrinkage”, and “n.minobsinnode” are the tuned parameters.
Misclassification types by blindBLAST performance group
The first column, “error count”, is the number of error cases in the 3-repeat 10-fold CV results, averaged over the repeats. The number in each row corresponds to the misclassification specified by the query cluster and the template cluster. The “mean simu error count” and sd (standard deviation) are values from the random assignment simulation with the same number of test cases and template candidate cases. The last column is the significance value derived from the empirical distribution of the error count. Misclassifications shown in the same color are pairs in which the query and template clusters are swapped.
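The legend does not spell out the simulation procedure, so the following Python sketch is only one plausible reading, stated under explicit assumptions: each query is assigned a template drawn uniformly at random from the candidate pool, the error count is the number of queries assigned to a given template cluster, and the significance value is the one-sided fraction of simulated counts at least as large as the observed count. All function names, the pool composition, and the observed count are hypothetical.

```python
import random
from statistics import mean, stdev


def simulate_error_counts(n_queries, template_clusters, target_cluster,
                          n_sims=1000, seed=0):
    """Simulate random template assignment (assumed procedure, not the authors').

    For each simulation, every query draws one template uniformly at random
    and we count how many draws land in `target_cluster`.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        hits = sum(1 for _ in range(n_queries)
                   if rng.choice(template_clusters) == target_cluster)
        counts.append(hits)
    return counts


def empirical_p_value(observed, simulated):
    """One-sided empirical p-value with an add-one correction (an assumption)."""
    exceed = sum(1 for c in simulated if c >= observed)
    return (exceed + 1) / (len(simulated) + 1)


# Hypothetical pool: 80 templates in cluster "A", 20 in cluster "B".
templates = ["A"] * 80 + ["B"] * 20
sims = simulate_error_counts(n_queries=50, template_clusters=templates,
                             target_cluster="B")
print(mean(sims), stdev(sims), empirical_p_value(observed=25, simulated=sims))
```

The mean and standard deviation of `sims` play the role of the “mean simu error count” and sd columns; whether the original analysis used a one-sided or two-sided comparison is not stated in the legend.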
GBM may not be better at recovering the sparse clusters than blindBLAST
Error counts were extracted for misclassifications whose query clusters have fewer than 50 samples. Of the 31 extracted clusters, 16 have fewer error counts (that is, fewer failed recoveries) with GBM than with blindBLAST, while 15 have more, so the numbers of improved and worsened cases are approximately equal.