Non-H3 CDR template selection in antibody modeling through machine learning

Xiyao Long; Jeliazko R Jeliazkov; Jeffrey J Gray

doi:10.7287/peerj.preprints.26996v1

Xiyao Long¹, Jeliazko R Jeliazkov², Jeffrey J Gray¹,²,³,⁴

1 Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, Maryland, United States

2 Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, Maryland, United States

3 Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University, Baltimore, Maryland, United States

4 Institute for Nanobiotechnology, The Johns Hopkins University, Baltimore, Maryland, United States

Published: 2018-06-20
Accepted: 2018-06-20

Subject Areas: Bioinformatics, Computational Biology
Keywords: protein structure, structure prediction, Rosetta, antibodies

Copyright: © 2018 Long et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Long X, Jeliazkov JR, Gray JJ. 2018. Non-H3 CDR template selection in antibody modeling through machine learning. PeerJ Preprints 6:e26996v1 https://doi.org/10.7287/peerj.preprints.26996v1

Abstract

Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. Homology modeling performs better for non-H3 CDRs because most non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites use homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and in whether and how they use structural clusters. While RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, two other approaches, PIGS and Kotai Antibody Builder, use sequence-based rules to assign CDR sequences to clusters. Manually curated sequence rules can identify better structural templates, but because their curation requires extensive literature search and human effort, they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. We argue that the GBM method offers simpler feature selection and immediate integration of new data compared with manual curation of sequence rules. We compare the classification results of the GBM method with those of RosettaAntibody in a 3-repeat 10-fold cross-validation scheme on the cluster-annotated antibody database PyIgClassify, and we observe an improvement in classification accuracy from 78.8±0.2% to 85.1±0.2%. We find that the GBM models can reduce the errors in specific cluster membership misclassifications if the clusters involved have relatively abundant data. Based on the factors identified, we suggest that methods that enrich structural classes with sparse data may further improve prediction accuracy in future studies.
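The 3-repeat 10-fold cross-validation scheme mentioned in the abstract can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline (which, per the supplemental captions, was run in R). The function names, the round-robin stratification, the per-fold standard deviation, the toy cluster labels and the majority-class stand-in classifier are all assumptions made for the example; a real run would plug a GBM or blindBLAST predictor into `predict`.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev

def repeated_stratified_folds(labels, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, test_indices), spreading each cluster across folds."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        offset = 0
        for label in sorted(by_cluster):
            members = by_cluster[label][:]
            rng.shuffle(members)
            # Deal members round-robin so every fold gets a share of the cluster.
            for i, idx in enumerate(members):
                folds[(offset + i) % n_folds].append(idx)
            offset += len(members)
        for fold, test_idx in enumerate(folds):
            yield repeat, fold, sorted(test_idx)

def cross_validate(labels, predict, n_folds=10, n_repeats=3, seed=0):
    """Return (mean, sd) of per-fold accuracy.

    predict(train_idx, test_idx) must return one predicted label per test index.
    """
    all_idx = set(range(len(labels)))
    accuracies = []
    for _, _, test_idx in repeated_stratified_folds(labels, n_folds, n_repeats, seed):
        train_idx = sorted(all_idx - set(test_idx))
        predicted = predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return mean(accuracies), stdev(accuracies)

# Toy data (hypothetical): 80 loops in one cluster, 20 in another.
labels = ["cluster-1"] * 80 + ["cluster-2"] * 20

def majority(train_idx, test_idx):
    """Stand-in classifier: always predict the most common training cluster."""
    top = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return [top] * len(test_idx)

acc_mean, acc_sd = cross_validate(labels, majority)
print(f"accuracy = {acc_mean:.3f} +/- {acc_sd:.3f}")
```

Stratifying the folds keeps sparse clusters represented in both training and test sets, which matters here because the abstract ties classification errors to clusters with little data.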

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Models with different loop and length types were tuned individually

The tuned parameters and the average CV accuracy of the trained models. The column names roughly correspond to the parameters tuned by the caret package in R, e.g. “Interaction Depth” corresponds to the “interaction.depth” variable.

DOI: 10.7287/peerj.preprints.26996v1/supp-1


GBM model accuracies exceed the blindBLAST accuracies for most of the loop and length types

The colored lines show the average GBM accuracy over the 3-repeat 10-fold cross-validation; the black line shows the average blindBLAST accuracy over the same CV. The error bars correspond to the standard deviations of the accuracies of individual folds. As the number of decision trees (# trees) and the complexity of each weak learner, expressed as its number of branches (interaction.depth), increase, the GBM models achieve greater accuracy than blindBLAST for most loop and length types. The shrinkage, i.e. the degree to which each tree contributes to the model during boosting, is not shown; it is set to max(0.01, 0.1 × min(1, nl/10000)), where nl is the number of training cases for each loop and length type. Compared with blindBLAST, the best model achieves higher mean accuracy and generally lower variance.
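The shrinkage rule in this caption can be written out directly. The sketch below only evaluates the stated formula; the function name and the example training-set sizes are illustrative assumptions, not values from the study.

```python
def shrinkage(n_train):
    """Shrinkage from the caption: max(0.01, 0.1 * min(1, nl / 10000))."""
    return max(0.01, 0.1 * min(1.0, n_train / 10000))

# Hypothetical training-set sizes, chosen only to show the three regimes.
for n_train in (500, 5000, 20000):
    print(n_train, shrinkage(n_train))
```

Under this rule the value is floored at 0.01 for training sets of 1,000 cases or fewer, grows linearly with the number of cases up to 10,000, and is capped at 0.1 beyond that.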

DOI: 10.7287/peerj.preprints.26996v1/supp-2


Loop and length types with sparse data give the highest accuracy standard deviations

In A and D, the loops with low accuracies and high accuracy standard deviations are those with small numbers of CDR members. In B, among loops with relatively small accuracy standard deviations, those with lower accuracy are also those with more clusters. In C, the x-axis is the ratio of the number of CDR members in the most populated cluster to the number in the second most populated cluster, and can therefore serve as a measure of how balanced the cluster sizes are within a loop and length type. Among loops with small accuracy standard deviations and sufficiently many members, more balanced loop and length types tend to have lower accuracy.
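The balance metric described for panel C is a simple ratio of cluster sizes. A minimal sketch, assuming cluster assignments are available as a flat list of labels; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def imbalance_ratio(cluster_labels):
    """Size of the most populated cluster divided by the size of the second."""
    top_two = Counter(cluster_labels).most_common(2)
    if len(top_two) < 2:
        raise ValueError("need at least two clusters to compute the ratio")
    return top_two[0][1] / top_two[1][1]

# Toy loop and length type with three clusters of 90, 30 and 5 members.
toy_labels = ["a"] * 90 + ["b"] * 30 + ["c"] * 5
ratio = imbalance_ratio(toy_labels)
```

A ratio near 1 means the two largest clusters are of similar size (a balanced type), while a large ratio means one cluster dominates.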

DOI: 10.7287/peerj.preprints.26996v1/supp-3


Models with different loop and length types should be tuned individually

The final tuned parameters and the average CV accuracy of the trained models. The column names “interaction.depth”, “n.trees”, “shrinkage” and “n.minobsinnode” are the tuned parameters.

DOI: 10.7287/peerj.preprints.26996v1/supp-4


Misclassification types by blindBLAST performance group

The first column, “error count”, is the number of error cases in the 3-repeat 10-fold CV results, averaged over the repeats. Each row corresponds to the misclassification specified by the query cluster and the template cluster. The “mean simu error count” and “sd” columns are values from a random assignment simulation with the same numbers of test cases and template candidate cases. The last column is the significance value derived from the empirical distribution of the error count. Misclassifications shown in the same color are pairs with the order of the clusters switched.
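One way to read the random assignment simulation is sketched below. The caption does not spell out the procedure, so everything here is an assumption made for illustration: each query is paired with a template drawn uniformly from the candidate pool, the count of one (query cluster, template cluster) pair is recorded per simulation, and the significance is taken as a one-sided upper-tail empirical p-value. Function names, cluster labels and counts are hypothetical.

```python
import random
from statistics import mean, stdev

def simulate_error_counts(query_labels, template_labels, pair, n_sim=1000, seed=0):
    """Simulated counts of queries in pair[0] randomly assigned a pair[1] template."""
    rng = random.Random(seed)
    query_cluster, template_cluster = pair
    counts = []
    for _ in range(n_sim):
        count = 0
        for q in query_labels:
            # Assumption: templates are drawn uniformly from the candidate pool.
            if q == query_cluster and rng.choice(template_labels) == template_cluster:
                count += 1
        counts.append(count)
    return counts

def empirical_p_value(observed, simulated):
    """Fraction of simulated counts >= observed, with a +1 correction."""
    exceed = sum(c >= observed for c in simulated)
    return (exceed + 1) / (len(simulated) + 1)

# Toy data: 50 queries per cluster; the template pool is 90% "A" and 10% "B".
queries = ["A"] * 50 + ["B"] * 50
templates = ["A"] * 450 + ["B"] * 50
simulated = simulate_error_counts(queries, templates, pair=("A", "B"))
sim_mean, sim_sd = mean(simulated), stdev(simulated)
p_value = empirical_p_value(observed=20, simulated=simulated)
```

Comparing an observed error count with `sim_mean` and `sim_sd` indicates whether a given cluster pair is confused more often than random template assignment would produce.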

DOI: 10.7287/peerj.preprints.26996v1/supp-5

GBM may not be better at recovering the sparse clusters than blindBLAST

Error counts were extracted for misclassifications whose query clusters have fewer than 50 samples. Of the 31 extracted clusters, 16 have fewer errors with GBM than with blindBLAST, while 15 have more errors with GBM, so the numbers of better and worse cases are approximately equal.

DOI: 10.7287/peerj.preprints.26996v1/supp-6

Additional Information

Competing Interests

JJG is an unpaid board member of the Rosetta Commons. Under institutional participation agreements between the University of Washington, acting on behalf of the Rosetta Commons, Johns Hopkins University may be entitled to a portion of revenue received on licensing Rosetta software including programs described here. As a member of the Scientific Advisory Board of Cyrus Biotechnology, JJG is granted stock options. Cyrus Biotechnology distributes the Rosetta software, which may include methods described in this paper.

Author Contributions

Xiyao Long conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Jeliazko R Jeliazkov conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Jeffrey J Gray conceived and designed the experiments, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Data Deposition

The following information was supplied regarding data availability:

The code is available on GitHub in the repository xlong2/machine-learning-cdr: https://github.com/xlong2/machine-learning-cdr.

Funding
XL, JRJ and JJG were funded by NIH R01-GM078221. JRJ was additionally funded by NIH F31-GM123616. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
