This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. Homology modeling performs better for non-H3 CDRs because most non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites use homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and in whether and how they use structural clusters. While RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, two other approaches, PIGS and Kotai Antibody Builder, use sequence-based rules to assign CDR sequences to clusters. Manually curated sequence rules can identify better structural templates, but because their curation requires extensive literature search and human effort, they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. We argue that the GBM method offers simpler feature selection and immediate integration of new data compared with manual curation of sequence rules. We compare the classification results of the GBM method with those of RosettaAntibody in a 3-repeat 10-fold cross-validation scheme on the cluster-annotated antibody database PyIgClassify, and we observe an improvement in classification accuracy from 78.8±0.2% to 85.1±0.2%.
We find that the GBM models can reduce errors in specific cluster-membership misclassifications when the clusters involved have relatively abundant data. Based on the factors identified, we suggest that methods able to enrich structural classes with sparse data could further improve prediction accuracy in future studies.
This is a submission to PeerJ for review.
Models with different loop and length types were tuned individually
The tuned parameters are listed with the average cross-validation (CV) accuracy of the trained models. The column names roughly correspond to the parameters tuned by the caret package in R, e.g. “Interaction Depth” corresponds to the “interaction.depth” variable.
GBM model accuracies exceed the blindBLAST accuracies for most of the loop and length types
The colored lines show the average GBM accuracy (y-axis) over the 3-repeat 10-fold cross-validation. The black line is the average blindBLAST accuracy over the same 3-repeat 10-fold CV. The error bars correspond to the standard deviations of the accuracies across individual folds. As the number of decision trees (# trees) and the complexity of each weak learner, expressed as the number of branches (interaction.depth), increase, the GBM models achieve greater accuracy than blindBLAST for most loop and length types. The shrinkage, i.e. the degree to which each tree contributes its knowledge to the model in the boosting process, is not shown; it is set to max(0.01, 0.1 × min(1, n_l/10000)), where n_l is the number of cases in the training set for each loop and length type. Compared with blindBLAST, the best model achieves higher mean accuracy and generally lower model variance.
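The shrinkage rule quoted in this caption can be written out as a small function. The following is a minimal Python sketch, not the authors' code (the study itself refers to the caret package in R); the function name `shrinkage_for_training_size` and the example sizes are hypothetical and serve only to illustrate how the formula behaves for small, medium and large training sets.

```python
def shrinkage_for_training_size(n_train: int) -> float:
    """Shrinkage as stated in the caption: max(0.01, 0.1 * min(1, n / 10000)).

    n_train is the number of training cases for one loop and length type.
    The function name is hypothetical; only the formula comes from the text.
    """
    if n_train < 0:
        raise ValueError("n_train must be non-negative")
    return max(0.01, 0.1 * min(1.0, n_train / 10000))


if __name__ == "__main__":
    # Illustrative sizes only: a sparse, a medium and a large training set.
    for n in (500, 5000, 20000):
        print(n, shrinkage_for_training_size(n))
```

Under this rule, small training sets sit at the floor of 0.01, the value grows linearly with the number of cases between 1,000 and 10,000, and it is capped at 0.1 beyond that.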
Loop and length types with sparse data give the highest accuracy standard deviations
In A and D, the loops with low accuracies and high accuracy standard deviations correspond to loops with small numbers of CDR members. In B, among loops with relatively small accuracy standard deviations, those with lower accuracy are also those with more clusters. In C, the x-axis is the ratio between the number of CDR members in the most populated cluster and the number in the second most populated cluster; it can therefore serve as a metric of how balanced the cluster sizes are within a loop and length type. Among loops with small accuracy standard deviations and sufficiently large member sizes, a more balanced loop and length type tends to have lower accuracy.
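The balance metric on the x-axis of panel C can be illustrated with a short calculation. This is a minimal Python sketch under the assumption that each CDR in a loop and length type carries a single cluster label; the function name `top_two_cluster_ratio` and the example labels and counts are made up for illustration and are not data from the study.

```python
from collections import Counter
from typing import Iterable


def top_two_cluster_ratio(cluster_labels: Iterable[str]) -> float:
    """Size of the most populated cluster divided by the size of the second.

    A value near 1 means the two largest clusters are of similar size
    (balanced); a large value means one cluster dominates.
    Returns infinity when only one cluster is present (an assumption of
    this sketch; the caption does not cover that case).
    """
    sizes = sorted(Counter(cluster_labels).values(), reverse=True)
    if not sizes:
        raise ValueError("no cluster labels given")
    if len(sizes) < 2:
        return float("inf")
    return sizes[0] / sizes[1]


if __name__ == "__main__":
    # Hypothetical labels for one loop and length type.
    labels = ["cluster-1"] * 6 + ["cluster-2"] * 3 + ["cluster-3"] * 1
    print(top_two_cluster_ratio(labels))
```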
Misclassification types by blindBLAST performance group
The first column, “error count”, is the number of error cases in the 3-repeat 10-fold CV results, averaged over the repeats. The number in each row corresponds to the misclassification specified by the query cluster and the template cluster. The “mean simu error count” and “sd” columns are values from the random assignment simulation with the same numbers of test cases and template candidate cases. The last column is the significance value derived from the empirical distribution of the error count. Misclassifications shown in the same color are pairs with the order of the clusters switched.
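One way to picture the random assignment simulation is sketched below in Python. The caption does not specify the sampling scheme, so several details are assumptions of this sketch: each query is given a template drawn uniformly at random (with replacement) from the template pool, the significance is computed one-sided as the fraction of simulated counts at least as large as the observed count, and an add-one correction is applied. The function names and the example inputs are hypothetical.

```python
import random
import statistics
from typing import List, Sequence


def simulate_error_counts(n_queries: int,
                          template_clusters: Sequence[str],
                          wrong_cluster: str,
                          n_sim: int = 1000,
                          seed: int = 0) -> List[int]:
    """Count, per simulation, how many queries get a template from wrong_cluster.

    Assumption: templates are drawn uniformly at random with replacement.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sim):
        picks = rng.choices(template_clusters, k=n_queries)
        counts.append(sum(1 for c in picks if c == wrong_cluster))
    return counts


def empirical_p_value(observed: float, simulated: Sequence[int]) -> float:
    """One-sided empirical p-value with add-one correction (an assumption)."""
    at_least = sum(1 for s in simulated if s >= observed)
    return (at_least + 1) / (len(simulated) + 1)


if __name__ == "__main__":
    # Hypothetical template pool: 80 templates in cluster A, 20 in cluster B.
    pool = ["A"] * 80 + ["B"] * 20
    sims = simulate_error_counts(n_queries=30, template_clusters=pool,
                                 wrong_cluster="B")
    print("mean simulated error count:", statistics.mean(sims))
    print("sd:", statistics.stdev(sims))
    print("p-value for an observed count of 12:", empirical_p_value(12, sims))
```

The mean and standard deviation of the simulated counts play the role of the “mean simu error count” and sd columns; whether the original significance test was one-sided or two-sided is not stated in the caption.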
GBM may not be better than blindBLAST at recovering the sparse clusters
The error counts of misclassifications whose query clusters have fewer than 50 samples are extracted. Of the 31 extracted clusters, 16 have fewer errors (failures to recover the cluster) using GBM than using blindBLAST, while 15 have more errors using GBM, so the numbers of better and worse cases are approximately equal.