2018 YPIC Challenge: A case study in characterizing an unknown protein sample
- Subject Areas
- Bioinformatics, Molecular Biology
- Keywords
- mass spectrometry, proteomics, spectral clustering, de novo, spectral networking
- Copyright
- © 2019 Pino et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2019. 2018 YPIC Challenge: A case study in characterizing an unknown protein sample. PeerJ Preprints 7:e27802v2 https://doi.org/10.7287/peerj.preprints.27802v2
Abstract
For the 2018 YPIC Challenge, contestants were invited to try to decipher two unknown English questions encoded by a synthetic protein expressed in Escherichia coli. In addition to deciphering the questions, contestants were asked to determine the 3D structure and detect any post-translational modifications left by the host organism.
We present our experimental and computational strategy to characterize this sample by identifying the unknown protein sequence and detecting the presence of post-translational modifications. The sample was acquired with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules, after which spectral clustering was used to generate high-quality consensus spectra. De novo spectrum identification was used to determine the synthetic protein sequence, and any post-translational modifications introduced by E. coli on the synthetic protein were analyzed via spectral networking. This workflow resulted in a de novo sequence coverage of 70%, on par with sequence database searching performance. Additionally, the spectral networking analysis indicated that no systematic modifications were introduced on the synthetic protein by E. coli.
The strategy presented here can be directly used to analyze samples for which no protein sequence information is available or when the identity of the sample is unknown. All software and code to perform the bioinformatics analysis are available as open source, and self-contained Jupyter notebooks are provided to fully recreate the analysis.
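The sequence coverage figure quoted above can be read as the fraction of protein residues covered by at least one identified peptide. The following is a minimal sketch of that calculation, not the authors' implementation: the function name `sequence_coverage` is hypothetical, the toy sequence and peptides are made up for illustration, and exact substring matching is an assumption (in practice, de novo results may need isoleucine and leucine to be treated as equivalent).

```python
def sequence_coverage(protein, peptides):
    """Return the fraction of residues covered by at least one peptide.

    Hypothetical helper: peptides are located by exact substring matching,
    and every occurrence of a peptide in the protein is counted.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Continue searching for further (possibly overlapping) matches.
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Toy data for illustration only; this is not the challenge protein.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
toy_peptides = ["ACDEF", "DEF", "KLMNPQR"]
toy_coverage = sequence_coverage(toy_protein, toy_peptides)
```

Overlapping peptides such as `ACDEF` and `DEF` contribute each residue only once, so redundant identifications do not inflate the coverage.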
Author Comment
Changes: validation of the results using sequence database searching.
Supplemental Information
Simulated peptide length for alternative proteases
Length of simulated peptides for various corpora using alternative proteases including chymotrypsin, Glu-C, Lys-C, Arg-C, Asp-N, pepsin, and a combined digestion with trypsin. neXtProt is a database of human proteins, whereas Cryptonomicon and 50 Shades of Grey are two English fiction novels.
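A simulation of this kind can be sketched as an in silico digestion followed by a tally of peptide lengths. The example below is an illustration under stated assumptions rather than the code behind the supplemental figure: the cleavage rules are deliberately simplified, pepsin is omitted because its specificity is too broad to capture in one pattern, and the names `CLEAVAGE_RULES`, `digest`, and `peptide_lengths` are hypothetical.

```python
import re

# Simplified cleavage rules (assumptions for illustration; real enzyme
# specificity is more nuanced). Each pattern matches the zero-width
# position at which the chain is cut.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",       # after K or R, not before P
    "Lys-C": r"(?<=K)",                 # after K
    "Arg-C": r"(?<=R)(?!P)",            # after R, not before P
    "Glu-C": r"(?<=E)",                 # after E
    "Asp-N": r"(?=D)",                  # before D
    "chymotrypsin": r"(?<=[FWY])(?!P)",  # after F, W, or Y, not before P
}


def digest(sequence, protease):
    """Fully digest a sequence (no missed cleavages) into peptides."""
    pattern = CLEAVAGE_RULES[protease]
    sites = [m.start() for m in re.finditer(pattern, sequence)]
    # Drop cut sites at the termini, which would yield empty peptides.
    inner = [s for s in sites if 0 < s < len(sequence)]
    bounds = [0] + inner + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def peptide_lengths(sequence, protease):
    """Return the length of every peptide produced by the digestion."""
    return [len(peptide) for peptide in digest(sequence, protease)]


# Toy sequence for illustration only.
toy_sequence = "AKRPDEK"
tryptic_peptides = digest(toy_sequence, "trypsin")
asp_n_peptides = digest(toy_sequence, "Asp-N")
```

Applying `peptide_lengths` to each sequence in a corpus and pooling the results gives a length distribution per protease. How English text is mapped to amino acid letters is left out here, since the caption does not specify it.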