"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.

A peer-reviewed article of this Preprint also exists.


Supplemental Information

Models with different loop and length types were tuned individually

The tuned parameters and the average cross-validation (CV) accuracy of the trained models. The column names roughly correspond to the parameters tuned with the caret package in R; for example, “Interaction Depth” corresponds to the “interaction.depth” variable.

DOI: 10.7287/peerj.preprints.26996v1/supp-1

GBM model accuracies exceed the blindBLAST accuracies for most of the loop and length types

Each colored line shows the average GBM accuracy over 3-repeat 10-fold cross-validation (CV). The black line shows the average blindBLAST accuracy over 3-repeat 10-fold CV. The error bars correspond to the standard deviations of the accuracies for individual folds. As the number of decision trees (# trees) and the complexity of each weak learner, measured as its number of branches (interaction.depth), increase, the GBM models achieve greater accuracy than blindBLAST for most loop and length types. The shrinkage (the degree to which each tree contributes to the model during boosting) is not shown; it is set to max(0.01, 0.1*min(1, nl/10000)), where nl is the number of cases in the training set for each loop and length type. Compared with blindBLAST, the best model achieves higher mean accuracy and generally lower model variance.
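The shrinkage rule in this caption can be written as a one-line function. The following is a minimal Python sketch, not the authors' code (the caption indicates the models were tuned in R); the function name `shrinkage` and the argument name `n_train` are illustrative assumptions, with `n_train` standing in for nl.

```python
def shrinkage(n_train: int) -> float:
    """Shrinkage as stated in the caption: max(0.01, 0.1 * min(1, nl / 10000)).

    n_train is a hypothetical name for nl, the number of training cases
    for one loop and length type.
    """
    return max(0.01, 0.1 * min(1.0, n_train / 10000))


if __name__ == "__main__":
    # Small training sets hit the 0.01 floor; large ones are capped at 0.1.
    for n in (500, 5000, 20000):
        print(n, shrinkage(n))
```

Under this rule the shrinkage scales linearly with training-set size between 1,000 and 10,000 cases and is clamped outside that range.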

DOI: 10.7287/peerj.preprints.26996v1/supp-2

Loop and length types with sparse data give the highest accuracy standard deviations

In A and D, the loops with low accuracies and high accuracy standard deviations are those with small numbers of CDR members. In B, among loops with relatively small accuracy standard deviations, the loops with lower accuracy are also those with more clusters. In C, the x-axis is the ratio between the number of CDR members in the most populated cluster and the number in the second most populated cluster; it therefore serves as a metric of how balanced the cluster sizes are within a loop and length type. For loops with small accuracy standard deviations and sufficiently many members, a more balanced loop and length type tends to have lower accuracy.
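The balance metric on the x-axis of panel C can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the authors' implementation: the function name `cluster_balance_ratio` and the input format (one cluster label per CDR member) are hypothetical.

```python
from collections import Counter


def cluster_balance_ratio(cluster_labels):
    """Ratio of the largest cluster's size to the second largest cluster's size.

    cluster_labels: one (hypothetical) cluster label per CDR member of a
    single loop and length type. A ratio near 1 means the two most populated
    clusters are balanced; a large ratio means one cluster dominates.
    """
    top_two = Counter(cluster_labels).most_common(2)
    if len(top_two) < 2:
        raise ValueError("at least two clusters are needed to compute the ratio")
    return top_two[0][1] / top_two[1][1]


if __name__ == "__main__":
    labels = ["A"] * 6 + ["B"] * 3 + ["C"]
    print(cluster_balance_ratio(labels))
```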

DOI: 10.7287/peerj.preprints.26996v1/supp-3

Models with different loop and length types should be tuned individually

The final tuned parameters and the average CV accuracy of the trained models. The column names “interaction.depth”, “n.trees”, “shrinkage”, and “n.minobsinnode” are the tuned parameters.

DOI: 10.7287/peerj.preprints.26996v1/supp-4

Misclassification types by blindBLAST performance group

The first column, “error count”, is the number of error cases in the 3-repeat 10-fold CV results, averaged over the repeats. The number in each row corresponds to the misclassification specified by the query cluster and the template cluster. The “mean simu error count” and “sd” columns are values from the random assignment simulation with the same number of test cases and template candidate cases. The last column is the significance value derived from the empirical distribution of the error count. Misclassifications shown in the same color are pairs with the order of the clusters switched.
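One way to read the random assignment simulation is sketched below in Python. The caption does not give the procedure in detail, so everything here is an assumption made for illustration: templates are drawn uniformly with replacement from the candidate pool, the significance value is taken as the fraction of simulated counts at least as large as the observed count, and all function and argument names are hypothetical.

```python
import random
import statistics


def simulate_error_counts(query_clusters, template_clusters,
                          query_cluster, template_cluster,
                          n_sim=1000, seed=0):
    """Simulated counts of one misclassification type under random assignment.

    Each test case receives a template drawn uniformly at random (assumption);
    a hit is a test case in query_cluster paired with a template in
    template_cluster.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sim):
        hits = 0
        for q in query_clusters:
            t = rng.choice(template_clusters)
            if q == query_cluster and t == template_cluster:
                hits += 1
        counts.append(hits)
    return counts


def empirical_summary(observed, sim_counts):
    """Mean, standard deviation, and one-sided empirical significance value."""
    mean = statistics.mean(sim_counts)
    sd = statistics.stdev(sim_counts)
    # Fraction of simulations with a count at least as large as observed
    # (the direction of the test is an assumption).
    p_value = sum(c >= observed for c in sim_counts) / len(sim_counts)
    return mean, sd, p_value


if __name__ == "__main__":
    queries = ["A"] * 20 + ["B"] * 5
    templates = ["A"] * 40 + ["B"] * 10
    sims = simulate_error_counts(queries, templates, "B", "A")
    print(empirical_summary(observed=5, sim_counts=sims))
```

A two-sided test, or a simulation that excludes each query from its own candidate pool, would be equally plausible readings of the caption.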

DOI: 10.7287/peerj.preprints.26996v1/supp-5

GBM may not be better than blindBLAST at recovering the sparse clusters

Error counts were extracted for misclassifications whose query clusters have fewer than 50 samples. Of the 31 extracted clusters, 16 have fewer errors with GBM than with blindBLAST, while 15 have more errors with GBM than with blindBLAST, so the numbers of better and worse cases are approximately equal.

DOI: 10.7287/peerj.preprints.26996v1/supp-6

Additional Information

Competing Interests

JJG is an unpaid board member of the Rosetta Commons. Under institutional participation agreements between the University of Washington, acting on behalf of the Rosetta Commons, Johns Hopkins University may be entitled to a portion of revenue received on licensing Rosetta software including programs described here. As a member of the Scientific Advisory Board of Cyrus Biotechnology, JJG is granted stock options. Cyrus Biotechnology distributes the Rosetta software, which may include methods described in this paper.

Author Contributions

Xiyao Long conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Jeliazko R Jeliazkov conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Jeffrey J Gray conceived and designed the experiments, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Data Deposition

The following information was supplied regarding data availability:

The code is available on Github in the repository named xlong2/machine-learning-cdr; the URL is


XL, JRJ and JJG were funded by NIH R01-GM078221. JRJ was additionally funded by NIH F31-GM123616. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
