In light of the rapid decrease in fossil fuel reserves and an increasing demand for energy, novel methods are required to explore alternative biofuel production processes to alleviate these pressures. A wide variety of molecules which can be used either as biofuels or as biofuel precursors are produced using microbial enzymes. However, the common challenges in the industrial implementation of enzyme catalysis for biofuel production are the unavailability of a comprehensive biofuel enzyme resource, the low efficiency of known enzymes, and the limited availability of enzymes which can function under the extreme conditions of industrial processes.
We have developed a comprehensive database of known enzymes with proven or potential applications in biofuel production through text mining of PubMed abstracts and other publicly available information. A total of 131 enzymes with a role in biofuel production were identified and classified into six enzyme classes and four broad application categories namely ‘Alcohol production’, ‘Biodiesel production’, ‘Fuel Cell’ and ‘Alternate biofuels’. A prediction tool ‘Benz’ was developed to identify and classify novel homologues of the known biofuel enzyme sequences from sequenced genomes and metagenomes. ‘Benz’ employs a hybrid approach incorporating HMMER 3.0 and RAPSearch2 programs to provide high accuracy and high speed for prediction.
Using the Benz tool, 153,754 novel homologues of biofuel enzymes were identified from 23 diverse metagenomic sources. The comprehensive data of curated biofuel enzymes, their novel homologues identified from diverse metagenomes, and the hybrid prediction tool Benz are presented as a web server which can be used for the prediction of biofuel enzymes from genomic and metagenomic datasets. The database and the Benz tool are publicly available at
The global increase in energy demand and the decline in the available stock of fossil fuels have become a challenge that requires a search for alternate sources of fuels and energy. In this scenario, enzyme-catalyzed conversion of biomass to biofuels provides an ideal source of clean, ecologically friendly and sustainable energy (
The third major category of biofuels includes the microbial and enzymatic fuel cells. Microbial fuel cells are devices in which microbes grow on the organic content and generate electric current. In contrast, enzymatic fuel cells utilize cell-free enzymes as electrodes to achieve the same functionality. In previous studies, enzymatic fuel cells have been constructed using enzyme-mediated redox reactions at either or both of the electrodes (
Despite the merits of producing biofuels using enzymes, only a limited number of enzymes have been commercially exploited. This is primarily because efficient enzymes that can perform the desired conversion are unavailable, and because known enzymes often cannot adapt to application conditions that are not optimal for them, which leads to a decrease in efficiency (
Therefore, to look for efficient and novel variants of enzymes involved in the different steps of biofuel production (referred to as ‘biofuel enzymes’ in the subsequent text), the first task is to expand our knowledgebase by exploring the naturally occurring biofuel enzymes and to search for homologues of these enzymes from natural environments. The application of high-throughput sequencing technologies has revealed the sequences of thousands of genomes, which promise to facilitate the discovery and identification of novel biofuel enzymes. Furthermore, metagenomics has developed into a culture-independent approach that explores the diversity and complexity of microbial genomes in their natural environments and provides information on novel genes and pathways from as yet unculturable organisms. Thus, genome sequencing and metagenomics would assist in increasing the enzyme repertoire by revealing novel biofuel enzymes, and can also provide functional variants of the existing enzymes. The availability of multiple metagenomic databases provides a useful opportunity to discover novel homologues of existing biofuel enzymes.
Several studies and databases have reported enzymes that can be used for biofuel production (
The flowchart of the methodology used for the construction of BioFuelDB is shown in
The initial database of enzymes was constructed by searching NCBI PubMed for the available ‘English’ abstracts containing the terms ‘biofuel AND enzyme’, ‘biodiesel AND enzyme’, ‘alcohol AND enzyme’, ‘ethanol AND enzyme’, ‘methanol AND enzyme’, ‘fuel cell AND enzyme’ and ‘alternate biofuel’. The retrieved abstracts were imported into a MySQL Database version 14.14 (
Protein sequences for the curated list of all 131 enzymes were obtained from the SwissProt database. Sequences marked as ‘putative’, ‘probable’ or ‘hypothetical’ were removed from this initial set of sequences. The enzymes for which no SwissProt sequences remained after the removal of such sequences were discarded. SwissProt, which is a curated protein sequence database and provides high-quality annotation, was the preferred source for extracting the sequences over TrEMBL, which is a computer-annotated supplement to SwissProt. However, for enzymes with fewer than five sequences available in the SwissProt database, sequences from TrEMBL were included. After following the above two steps, all enzymes with at least five representative sequences were included, and this database was termed the Primary database. This Primary database consisted of 8,263 sequences representing 131 selected enzymes.
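The curation rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used to build the Primary database: the function names, the `(sequence_id, description)` record layout and the choice to apply the same keyword filter to TrEMBL records are assumptions made for the example.

```python
# Hypothetical sketch of the Primary database curation rules.
# Record layout and all names are assumptions, not the authors' code.

EXCLUDED_MARKERS = ("putative", "probable", "hypothetical")
MIN_SEQUENCES = 5  # minimum number of representative sequences per enzyme


def is_confident(description):
    """Return True if the description carries none of the excluded markers."""
    text = description.lower()
    return not any(marker in text for marker in EXCLUDED_MARKERS)


def build_primary(swissprot, trembl):
    """Both arguments map enzyme name -> list of (sequence_id, description)."""
    primary = {}
    for enzyme, records in swissprot.items():
        kept = [rec for rec in records if is_confident(rec[1])]
        if not kept:
            # No SwissProt sequence survived the filter: discard the enzyme.
            continue
        if len(kept) < MIN_SEQUENCES:
            # Fewer than five SwissProt sequences: supplement from TrEMBL.
            # Filtering TrEMBL with the same keywords is an assumption.
            kept = kept + [rec for rec in trembl.get(enzyme, [])
                           if is_confident(rec[1])]
        if len(kept) >= MIN_SEQUENCES:
            primary[enzyme] = kept
    return primary
```

In this sketch an enzyme is dropped either when no confident SwissProt sequence remains or when, even after supplementing from TrEMBL, fewer than five sequences are available.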
Information about the reaction(s) catalyzed by the enzyme, its substrate(s), product(s), KEGG Orthology, and KEGG Pathways was obtained from the KEGG database (
The Benz tool was developed to identify novel homologues of the known biofuel enzyme sequences from sequenced genomes and metagenomes. This tool employs a hybrid approach incorporating HMMER 3.0 and RAPSearch2 programs to provide high accuracy and high speed for prediction (
Two test datasets were constructed to evaluate the performance of the Benz program. In the first test dataset, a database of test sequences was prepared from the ‘hypothetical’, ‘probable’ and ‘putative’ sequences which were discarded during the preparation of the Primary database. This dataset consisted of 25,630 protein sequences. The Benz output was compared with the known annotation of the sequences as well as with the results of BLAST (blastp program with e-value cutoff 1e-6). To construct the second test dataset, ORFs from three different metagenomes (MG-RAST ids: mgm4466309, mgm4516289 and mgm4559623) were downloaded from the MG-RAST web server. A total of 549,870 ORF sequences from the three metagenomes were analyzed using Benz, and the results were compared with those of a BLAST search. The following standard parameters were used for assessing the efficiency of the program.
where ‘
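For reference, the four performance measures named in the evaluation (sensitivity, specificity, accuracy and MCC) can be computed from confusion-matrix counts as in the sketch below. It uses the textbook definitions of these measures and assumes they correspond to the parameters used here; the function name and the choice to return 0.0 for undefined ratios are assumptions of the example.

```python
import math


def performance(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    tp, tn, fp, fn: true positives, true negatives, false positives,
    false negatives. Undefined ratios are reported as 0.0 (an assumption).
    """
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient (MCC)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only, not results from this study.
    for name, value in performance(tp=90, tn=80, fp=20, fn=10).items():
        print(f"{name}: {value:.4f}")
```

MCC is the more informative of the four when the positive and negative classes differ greatly in size, which is typical when only a small fraction of ORFs are biofuel enzymes.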
A total of 23 metagenomic datasets consisting of 22,470,288 ORFs were downloaded from MG-RAST and analyzed using “Benz” for the discovery of biofuel enzymes. These 23 selected metagenomes include sequences from diverse environments including marine, extreme saline, fresh water, aquatic, grasslands, hot springs, coral reef, extreme aquatic habitat (drilling), forests, village, activated sludge, hydrothermal vents, lakes, anthropogenic terrestrial biome and cropland.
The distribution of the 131 enzymes across the four application-based categories summarizes the availability of enzymes for the production of the various types of biofuels. Of the four categories, ‘Alcohol production’ contains the highest number of enzymes (74), followed by ‘Biodiesel’ (30) and ‘Fuel Cell’ (27), while ‘Alternate biofuels’ contains the lowest number (19) (
Application category | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 | Total enzymes |
---|---|---|---|---|---|---|---|
Alcohol production | 20 | 8 | 33 | 10 | 2 | 1 | 74 |
Biodiesel | 4 | 13 | 7 | 4 | 0 | 2 | 30 |
Fuel cell | 19 | 0 | 3 | 2 | 1 | 2 | 27 |
Alternate biofuels | 5 | 7 | 4 | 3 | 0 | 0 | 19 |
Analysis of the enzyme distribution across the various EC classes in the Primary database reveals some interesting observations. EC5 and EC6, representing ‘Isomerases’ and ‘Ligases’, contained the fewest enzymes (three and four, respectively) in the database, whereas the highest number of enzymes belonged to the EC1 and EC3 classes, representing ‘Oxidoreductases’ and ‘Hydrolases’ (44 and 42, respectively) (
In the case of the first test dataset, out of 24,009 sequences, Benz classified 23,014 sequences as biofuel enzymes, of which 14,456 were classified as ‘consensus’ results. From these consensus results, average accuracy values of 95.56% and 92.20% were observed for the enzyme classes and application categories, respectively (
In the case of dataset 2, the predictions of Benz, RAPSearch and HMMER were compared with the results of a BLAST search, since the annotations of the sequences in this dataset were unknown. Out of 549,870 sequences, BLAST classified 16,678 sequences as biofuel enzymes, whereas Benz classified 23,317 as biofuel enzymes. As the BLAST annotation was available for only 16,678 sequences, the performance of Benz was measured only on this set of sequences, and the other predictions made by Benz could not be evaluated. Benz classified 7,292 sequences as ‘consensus’ sequences, with average accuracies of 98.64% and 97.89% for the EC classes and application categories, respectively (
As is evident from the performance evaluation of Benz on test datasets 1 and 2, the consensus results are reliable, with high sensitivity, specificity, accuracy, and MCC values, although this performance comes at the cost of percentage prediction. RAPSearch2 also performs well while maintaining a high percentage prediction value. HMMER3, however, displays variable performance on different datasets as well as across different application categories and EC classes. Thus, ‘consensus’ results can be considered more reliable, whereas HMMER3 predictions can be used to detect more divergent enzyme variants. Further, the availability of a profile-based search gives users an alternative approach for searching for biofuel enzymes.
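The idea of reporting ‘consensus’ results from two independent searches can be illustrated with a short sketch. It assumes that a prediction counts as ‘consensus’ when the profile-based (HMMER) and similarity-based (RAPSearch2) searches assign the same enzyme to a query; that definition, the function name and the status labels are assumptions of this example, and the code operates on already parsed hits rather than calling either program.

```python
def combine_predictions(hmm_hits, rap_hits):
    """Merge two prediction tables into one table with an agreement status.

    hmm_hits, rap_hits: dicts mapping query id -> predicted enzyme label,
    already filtered by each tool's own e-value cutoff (assumed done upstream).
    Returns a dict mapping query id -> (status, hmm_label, rap_label).
    """
    combined = {}
    for query in sorted(set(hmm_hits) | set(rap_hits)):
        hmm = hmm_hits.get(query)
        rap = rap_hits.get(query)
        if hmm is not None and hmm == rap:
            status = "consensus"       # both searches agree
        elif hmm is not None and rap is not None:
            status = "conflict"        # both report a hit, labels differ
        elif hmm is not None:
            status = "hmm_only"        # profile search only
        else:
            status = "rapsearch_only"  # similarity search only
        combined[query] = (status, hmm, rap)
    return combined


if __name__ == "__main__":
    # Hypothetical query ids and labels, for illustration only.
    hmm = {"orf1": "lipase", "orf2": "laccase", "orf3": "cellulase"}
    rap = {"orf1": "lipase", "orf2": "catalase", "orf4": "lipase"}
    for query, row in combine_predictions(hmm, rap).items():
        print(query, *row, sep="\t")
```

Keeping the single-tool hits alongside the consensus hits mirrors the trade-off described above: the consensus subset is smaller but more reliable, while the remaining hits retain candidates that only one search detects.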
A total of 23 metagenomes from a variety of biomes were downloaded from the MG-RAST server (
Metagenome ID | Metagenome source | Total ORFs | Alcohol | Biodiesel | Fuel cell | Others | Total biofuels |
---|---|---|---|---|---|---|---|
mgm4440324 | Marine biome | 36,701 | 88 | 22 | 54 | 23 | 153 |
mgm4440329 | Hypersaline | 150,513 | 175 | 37 | 93 | 36 | 278 |
mgm4441050 | Hypersaline | 3,517 | 18 | 6 | 23 | 3 | 44 |
mgm4441102 | Hydrothermal vent | 368,502 | 2,316 | 742 | 2,070 | 1,118 | 5,231 |
mgm4443684 | Freshwater | 388,210 | 1,594 | 497 | 1,430 | 568 | 3,394 |
mgm4448052 | Aquatic biome | 414,473 | 1,995 | 654 | 1,674 | 920 | 4,401 |
mgm4449252 | Grassland | 78,039 | 801 | 224 | 475 | 261 | 1,433 |
mgm4460449 | Hot spring | 762,819 | 5,456 | 1,928 | 4,388 | 2,419 | 11,972 |
mgm4466309 | Coral reef | 176,426 | 1,414 | 493 | 1,041 | 457 | 2,922 |
mgm4467029 | Large lake | 376,200 | 2,029 | 810 | 1,155 | 1,836 | 5,012 |
mgm4477803 | Lake | 5,383,950 | 9,814 | 2,911 | 7,907 | 2,706 | 19,965 |
mgm4478241 | Extreme aquatic habitat (Drilling) | 222,722 | 2,702 | 417 | 1,597 | 736 | 4,028 |
mgm4479942 | Village biome | 83,867 | 761 | 216 | 529 | 274 | 1,455 |
mgm4487639 | Forest biome | 457,998 | 882 | 215 | 451 | 152 | 1,455 |
mgm4494621 | Activated sludge | 5,054,731 | 14,444 | 4,727 | 12,546 | 3,063 | 28,882 |
mgm4516289 | Aquatic biome | 253,233 | 1,490 | 526 | 1,010 | 674 | 3,085 |
mgm4523306 | Anthropogenic terrestrial biome | 3,007 | 18 | 2 | 5 | 1 | 24 |
mgm4527699 | Cropland biome | 755,188 | 7,616 | 1,160 | 2,573 | 2,149 | 11,336 |
mgm4528623 | Cropland biome | 238,739 | 871 | 275 | 658 | 376 | 1,860 |
mgm4537095 | Mediterranean forests, woodlands, shrub | 360,982 | 1,653 | 541 | 824 | 415 | 2,629 |
mgm4559623 | Aquatic biome | 120,211 | 771 | 155 | 401 | 291 | 1,285 |
mgm4571849 | Aquatic biome | 4,464,190 | 16,310 | 3,234 | 6,262 | 4,120 | 24,525 |
mgm4571867 | Aquatic biome | 2,316,070 | 11,920 | 2,351 | 4,675 | 2,793 | 18,385 |
Metagenome ID | Metagenome source | EC 1 | EC 2 | EC 3 | EC 4 | EC 5 | EC 6 | Total |
---|---|---|---|---|---|---|---|---|
mgm4440324 | Marine biome | 52 | 52 | 7 | 20 | 2 | 20 | 153 |
mgm4440329 | Hypersaline | 118 | 66 | 26 | 20 | 5 | 43 | 278 |
mgm4441050 | Hypersaline | 25 | 5 | 1 | 5 | 0 | 8 | 44 |
mgm4441102 | Hydrothermal vent | 2,047 | 1,179 | 281 | 866 | 114 | 744 | 5,231 |
mgm4443684 | Freshwater | 1,271 | 577 | 383 | 485 | 75 | 603 | 3,394 |
mgm4448052 | Aquatic biome | 1,542 | 831 | 726 | 544 | 40 | 718 | 4,401 |
mgm4449252 | Grassland | 590 | 222 | 285 | 118 | 32 | 186 | 1,433 |
mgm4460449 | Hot spring | 3,992 | 2,876 | 878 | 1,619 | 214 | 2,393 | 11,972 |
mgm4466309 | Coral reef | 1,075 | 629 | 240 | 426 | 53 | 499 | 2,922 |
mgm4467029 | Large lake | 1,569 | 928 | 1,373 | 636 | 61 | 445 | 5,012 |
mgm4477803 | Lake | 8,874 | 3,901 | 1,479 | 2,595 | 401 | 2,715 | 19,965 |
mgm4478241 | Extreme aquatic habitat (Drilling) | 1,732 | 849 | 195 | 372 | 10 | 870 | 4,028 |
mgm4479942 | Village biome | 671 | 238 | 194 | 132 | 29 | 191 | 1,455 |
mgm4487639 | Forest biome | 694 | 316 | 94 | 169 | 25 | 157 | 1,455 |
mgm4494621 | Activated sludge | 12,812 | 5,781 | 1,027 | 4,268 | 236 | 4,758 | 28,882 |
mgm4516289 | Aquatic biome | 1,096 | 585 | 492 | 404 | 62 | 446 | 3,085 |
mgm4523306 | Anthropogenic terrestrial biome | 6 | 3 | 12 | 2 | 0 | 1 | 24 |
mgm4527699 | Cropland biome | 3,322 | 1,869 | 3,383 | 1,437 | 475 | 850 | 11,336 |
mgm4528623 | Cropland biome | 627 | 292 | 457 | 194 | 59 | 231 | 1,860 |
mgm4537095 | Mediterranean forests, woodlands, shrub | 843 | 906 | 37 | 191 | 10 | 642 | 2,629 |
mgm4559623 | Aquatic biome | 470 | 248 | 210 | 140 | 35 | 182 | 1,285 |
mgm4571849 | Aquatic biome | 8,303 | 5,317 | 5,008 | 2,039 | 1,072 | 2,786 | 24,525 |
mgm4571867 | Aquatic biome | 5,667 | 3,932 | 4,404 | 1,871 | 907 | 1,604 | 18,385 |
From all considered metagenomes, around 0.5%–1.5% of total ORFs were identified as biofuel enzymes, which indicates that these enzymes are prevalent in natural environments. Cropland, forest and other biomes with degrading biomass (such as the grassland and village biomes) were found to contain a higher proportion (1.8%) of enzymes under the ‘Bioalcohol’ category. This implies the presence of a large number of alcohol-producing (fermentative) enzymes in the microbiomes of these environments (
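The proportion of ORFs identified as biofuel enzymes in a metagenome follows directly from the tabulated counts. The sketch below recomputes it for a few rows copied from the table of application categories; the helper name and the selection of rows are choices made for this example only.

```python
# (source, total ORFs, total biofuel enzyme hits), copied from the
# per-metagenome table; only a few rows are included for illustration.
ROWS = {
    "mgm4440324": ("Marine biome", 36701, 153),
    "mgm4449252": ("Grassland", 78039, 1433),
    "mgm4479942": ("Village biome", 83867, 1455),
    "mgm4527699": ("Cropland biome", 755188, 11336),
}


def percent_biofuel(total_orfs, biofuel_hits):
    """Percentage of ORFs in a metagenome predicted as biofuel enzymes."""
    return 100.0 * biofuel_hits / total_orfs


if __name__ == "__main__":
    for mg_id, (source, orfs, hits) in ROWS.items():
        print(f"{mg_id}\t{source}\t{percent_biofuel(orfs, hits):.2f}%")
```

The same calculation can be applied to a single category column (for example, the alcohol production counts) to compare biomes for one application category.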
The Primary database, consisting of 8,236 sequences of biofuel enzymes, and the 153,754 sequence homologues of biofuel enzymes (metalogs) mined from the metagenomes were combined to form the BioFuelDB database, which was incorporated into the web server.
The ‘Explore’ page of the web server allows the user to search the enzymes in the BioFuelDB database by enzyme name, Enzyme Commission (EC) number, systematic name or KEGG Reaction ID(s). The user can also browse the database by application category, i.e., ‘Alcohol Production’, ‘Biodiesel Production’, ‘Fuel Cell’ or ‘Alternate Biofuels’, or list all enzymes irrespective of application category. Selecting an enzyme name takes the user to a new page that displays information about the enzyme, such as its systematic name, EC number, common name(s), application category, the chemical reaction it catalyzes, KEGG reaction ID(s), the substrates and products of the reaction, the biological pathways in which the enzyme is involved and its KEGG Orthology. Furthermore, the page provides the UniProt sequences of the enzyme in FASTA format as well as the PubMed references in which the use of the enzyme in the given application category was demonstrated.
The ‘Prediction’ page of the BioFuelDB web server is designed to identify novel homologues of the enzymes available in the BioFuelDB database. This page employs a hybrid tool ‘Benz’ consisting of ‘Biofuel-PfamDB’ and ‘RapsearchDB’ databases at the backend for the prediction of novel homologs of biofuel enzymes which can be used for the production of biofuels. The user can either provide raw FASTA sequence(s) of proteins or ORFs in the input box or upload a query FASTA file through the ‘Upload’ interface. The query can be made either against all enzymes in the database, enzymes from any one application category, or against any one selected enzyme. For RAPSearch, the default e-value for the search is E-6, and for HMM it is E-21.
The output is reported in tab-separated format with nine columns namely: ‘Query’, ‘HMM Hit’ (matching hit from the HMMER profile), ‘HMM
The ‘MetaLog’ page provides links to download the metagenomic homologues (length > 50 amino acids) of the 131 biofuel enzymes present in the Primary database, as predicted from 23 varied metagenomes using the Benz tool.
The main motivation for this work was the unavailability of any specialized database which provides comprehensive information on enzymes involved in different types of biofuel production. The mining of literature revealed that only a limited number of enzymes involved in biofuel production are currently known, and from only a limited number of genomes. Therefore, as the first step, we constructed the ‘BioFuelDB’ knowledgebase of all enzymes involved in biofuel production from the available literature. However, the small repertoire of these enzymes restricts the selection of enzyme variants which can perform the desired reaction under industrial conditions that may not be optimal for the given enzyme, hence leading to a decrease in efficiency (
In the present scenario, metagenomic data generated from different environments, comprising sequences from culturable and unculturable microbial genomes, can be mined to expand the repertoire of biofuel enzymes by revealing novel biofuel enzymes as well as functional variants of the existing enzymes. In this study, the identification of 153,754 enzymes from 23 metagenomes indicates the possibility of finding such enzymes by exploiting the data from several hundreds of metagenomes. Furthermore, the metagenomes are so rich in microbial diversity and functional genes that a novel variant of a given enzyme is almost certain to be identified (
To our knowledge, BioFuelDB is the first comprehensive dataset of biofuel enzymes. We anticipate that it will act as a comprehensive resource of biofuel enzymes and will assist researchers in exploring novel variants of biofuel enzymes from different environments. The efficiency of the novel variants can only be ascertained through laboratory experiments; however, the high quality of the initial Primary database and the stringent search criteria of the Benz tool mean that the predicted enzyme sequences can be used as leads for experimental validation. The database and tools are available freely at the website
The authors declare there are no competing interests.
The following information was supplied regarding data availability:
The data related to this work is available at