MATLAB software for extracting protein name and sequence information from FASTA formatted proteome file

Unaffiliated researcher, Singapore, Singapore
DOI
10.7287/peerj.preprints.27856v2
Subject Areas
Biochemistry, Bioinformatics, Computational Biology, Genomics, Molecular Biology
Keywords
proteome, FASTA, nucleotide sequence, character array, protein database, amino acid sequence, parse information, molecular weight, MATLAB, molecular cloning
Copyright
© 2019 Ng
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Ng W. 2019. MATLAB software for extracting protein name and sequence information from FASTA formatted proteome file. PeerJ Preprints 7:e27856v2

Abstract

The FASTA file format is commonly used to distribute proteome information, especially proteomes obtained from UniProt. Although MATLAB can read FASTA files with the built-in function fastaread, important information such as protein name and organism name remains enmeshed in a character array. This makes it difficult to extract protein names automatically from a FASTA proteome file when building a database whose fields include protein name and amino acid sequence. The objective of this work was to develop MATLAB software that automatically extracts protein name and amino acid sequence information from a FASTA proteome file and assigns them to a new database with the following fields: protein name, amino acid sequence, number of amino acid residues, molecular weight of the protein and a nucleotide sequence encoding the protein. The number of amino acid residues is obtained by applying the MATLAB built-in function length to the amino acid sequence of each protein. The final two fields are provided by the MATLAB built-in functions molweight and aa2nt, respectively. Molecular weight is useful for a variety of applications, while the nucleotide sequence is essential for gene synthesis in molecular cloning. Finally, the software includes an error check function that detects letters in the amino acid sequence that do not belong to the 20 natural amino acids. Sequences containing such letters would be invalid inputs to molweight and aa2nt and are therefore not processed. Collectively, this work yields a new protein database with improved functionality that could support a variety of biology workflows ranging from sequence alignment to molecular cloning.
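
The workflow described in the abstract can be illustrated with a minimal sketch for a single record. This is not the author's implementation: the header and sequence are made up, the regular expression assumes a UniProt-style header in which the protein name sits between the identifier and the "OS=" tag, and the field names of the entry struct are hypothetical. The calls to molweight and aa2nt are guarded so that the sketch still runs when those functions are not on the path.

```matlab
% Hypothetical UniProt-style header and sequence (made up for illustration).
% In practice fastaread would supply these as the Header and Sequence fields.
header   = 'sp|P00000|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EX1 PE=1 SV=1';
sequence = 'MKTAYIAKQR';

% Assumed header layout: "<identifier> <protein name> OS=<organism> XX=...".
% Token 1 is the protein name, token 2 is the organism name.
tok = regexp(header, '^\S+\s+(.*?)\s+OS=(.*?)(?:\s+[A-Z]{2}=|$)', 'tokens', 'once');

% One database record; the field names are illustrative, not the author's.
entry = struct('ProteinName', '', 'Organism', '', 'Sequence', sequence, ...
    'NumResidues', length(sequence), 'IsValid', false, ...
    'MolWeight', NaN, 'Nucleotide', '');
if ~isempty(tok)
    entry.ProteinName = tok{1};
    entry.Organism    = tok{2};
end

% Error check: accept only the 20 standard one-letter amino acid codes.
standardAA    = 'ACDEFGHIKLMNPQRSTVWY';
entry.IsValid = all(ismember(upper(sequence), standardAA));

% Compute the last two fields only for valid sequences, and only if the
% functions named in the abstract are available on the path.
if entry.IsValid && exist('molweight', 'file') && exist('aa2nt', 'file')
    entry.MolWeight  = molweight(sequence);
    entry.Nucleotide = aa2nt(sequence);
end
```

Because most amino acids are encoded by more than one codon, a nucleotide sequence obtained by back-translation is only one of many sequences that encode the same protein.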

Author Comment

This version revises the previous manuscript and incorporates software with improved efficiency.

Supplemental Information

MATLAB software for creating a proteome database

This zip file contains the MATLAB function files of the software.

DOI: 10.7287/peerj.preprints.27856v2/supp-1