2018 YPIC Challenge: A case study in characterizing an unknown protein sample

Department of Genome Sciences, University of Washington, Seattle, WA, United States
Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp / Antwerp University Hospital, Edegem, Belgium
DOI
10.7287/peerj.preprints.27802v1
Subject Areas
Bioinformatics, Molecular Biology
Keywords
mass spectrometry, proteomics, spectral clustering, de novo, spectral networking
Copyright
© 2019 Pino et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Pino L, Lin A, Bittremieux W. 2019. 2018 YPIC Challenge: A case study in characterizing an unknown protein sample. PeerJ Preprints 7:e27802v1

Abstract

For the 2018 YPIC Challenge, contestants were invited to decipher two unknown English questions encoded by a synthetic protein expressed in Escherichia coli. In addition to deciphering the questions, contestants were asked to determine the protein's 3D structure and to detect any post-translational modifications introduced by the host organism. We present how we analyzed this unknown sample using a tryptic digest and data acquisition with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules. Subsequently, spectral clustering was used to generate high-quality consensus spectra and condense the acquired MS/MS spectral data. De novo spectrum identification was used to determine the English questions encoded by the synthetic protein, and any post-translational modifications introduced by E. coli on the synthetic protein were detected using spectral networking. Although the synthetic protein sample for the 2018 YPIC Challenge is not of biological interest, the experimental and computational strategy presented here can be applied directly to samples for which no protein sequence information is available or whose identity is unknown. All software and code used for the bioinformatics analysis are available as open source, and a self-contained Jupyter notebook is provided to fully recreate the analysis.

Author Comment

This is a preprint submission to PeerJ Preprints.

Supplemental Information

Simulated peptide length for alternative proteases

Length of simulated peptides for various corpora digested in silico with alternative proteases, including chymotrypsin, Glu-C, Lys-C, Arg-C, Asp-N, pepsin, and a combined digestion with trypsin. NeXtProt is a database of human proteins, whereas Cryptonomicon and 50 Shades of Grey are two English fiction novels.

DOI: 10.7287/peerj.preprints.27802v1/supp-1
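
The kind of simulation described in this supplement can be illustrated with a minimal sketch of an in silico digestion. This is not the authors' code: the cleavage rules below are simplified textbook rules, the helper names (text_to_sequence, digest, mean_peptide_length) are hypothetical, and treating every letter of an English text as a residue is an assumption made only for illustration (pepsin and combined digestions are omitted for brevity).

```python
import re
from statistics import mean

# Simplified cleavage rules (assumptions for illustration only). Each
# pattern matches the zero-width position at which the protease cuts.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",       # after K or R, not before P
    "lys-c": r"(?<=K)",                 # after K
    "arg-c": r"(?<=R)(?!P)",            # after R, not before P
    "glu-c": r"(?<=E)",                 # after E
    "asp-n": r"(?=D)",                  # before D
    "chymotrypsin": r"(?<=[FWY])(?!P)", # after F, W or Y, not before P
}


def text_to_sequence(text):
    """Keep only the letters of a text, upper-cased, as a pseudo-sequence."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def digest(sequence, protease):
    """Split a sequence at every cleavage site (no missed cleavages)."""
    if not sequence:
        return []
    sites = [m.start() for m in re.finditer(CLEAVAGE_RULES[protease], sequence)]
    # Ignore sites at the termini so that no empty peptides are produced.
    bounds = [0] + [s for s in sites if 0 < s < len(sequence)] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def mean_peptide_length(sequence, protease):
    """Average length of the peptides produced by one protease."""
    return mean(len(peptide) for peptide in digest(sequence, protease))


if __name__ == "__main__":
    sequence = text_to_sequence("What is the first question?")
    for protease in CLEAVAGE_RULES:
        print(protease, digest(sequence, protease))
```

Comparing mean_peptide_length across proteases for a protein database and for English text gives a rough sense of which enzyme yields peptides of a length suitable for MS/MS analysis of a given corpus.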