PeerJ:Bioinformatics
https://peerj.com/articles/index.atom?journal=peerj&subject=540
Bioinformatics articles published in PeerJ

Integrating single-cell and bulk sequencing data to identify glycosylation-based genes in non-alcoholic fatty liver disease-associated hepatocellular carcinoma
https://peerj.com/articles/17002
2024-03-18 2024-03-18
Zhijia Zhou, Yanan Gao, Longxin Deng, Xiaole Lu, Yancheng Lai, Jieke Wu, Shaodong Chen, Chengzhong Li, Huiqing Liang
Background
The incidence of non-alcoholic fatty liver disease (NAFLD)-associated hepatocellular carcinoma (HCC) has been increasing. However, the role of glycosylation, an important modification that alters cellular differentiation and immune regulation, in the progression of NAFLD to HCC has rarely been studied.
Methods
We used the NAFLD-HCC single-cell dataset to identify variation in the expression of glycosylation patterns between different cells and used the HCC bulk dataset to establish a link between these variations and the prognosis of HCC patients. Machine learning algorithms were then used to identify glycosylation-related signatures with prognostic significance and to construct a model for predicting the prognosis of HCC patients. The model was further validated in high-fat diet-induced mice and clinical cohorts.
Results
The NAFLD-HCC Glycogene Risk Model (NHGRM) signature included the following genes: SPP1, SOCS2, SAPCD2, S100A9, RAMP3, and CSAD. Higher NHGRM scores were associated with a poorer prognosis and with stronger immune-related features, immune cell infiltration, and immunity scores. Animal experiments and external and clinical cohorts confirmed the expression of these genes.
Conclusion
The genetic signature we identified may serve as a potential indicator of survival in patients with NAFLD-HCC and provide new perspectives for elucidating the role of glycosylation-related signatures in this pathologic process.

Deciphering the genomes of motility-deficient mutants of Vibrio alginolyticus 138-2
https://peerj.com/articles/17126
2024-03-18 2024-03-18
Kazuma Uesaka, Keita Inaba, Noriko Nishioka, Seiji Kojima, Michio Homma, Kunio Ihara
The motility of Vibrio species plays a pivotal role in their survival and adaptation to diverse environments and is closely associated with pathogenicity in both humans and aquatic animals. Numerous mutant strains of Vibrio alginolyticus have been generated by UV or EMS mutagenesis to probe flagellar motility with molecular genetic approaches. Identifying these mutations promises to yield valuable insights into motility at the level of protein structural physiology. In this study, we determined the complete genomic structure of four reference laboratory strains of V. alginolyticus: a precursor strain, V. alginolyticus 138-2; two strains with defects in the lateral flagellum (VIO5 and YM4); and one strain with defects in the polar flagellum (YM19). We then identified the specific mutation sites in the 18 motility-deficient strains related to the polar flagellum (which fall into three categories: flagellar-deficient, multi-flagellar, and chemotaxis-deficient strains) by whole-genome sequencing and mapping to the complete genome of the parental strain, VIO5 or YM4. The mutant strains had an average of 20.6 (±12.7) mutations, most of which were randomly distributed throughout the genome. However, two or more different mutations in six flagellar-related genes were detected in 18 mutants specifically selected as chemotaxis-deficient mutants. Genomic analysis of a large number of mutant strains is a very effective tool for comprehensively identifying genes associated with specific phenotypes by forward genetics.

Clinical significance of small nuclear ribonucleoprotein U1 subunit 70 in patients with hepatocellular carcinoma
https://peerj.com/articles/16876
2024-03-15 2024-03-15
Dong Jiang, Xia-Ling Zhu, Yan An, Yi-ran Li
Background & Aims
Small nuclear ribonucleoprotein U1 subunit 70 (SNRNP70), one of the components of the U1 small nuclear ribonucleoprotein (snRNP), has rarely been reported in cancers. This study aims to evaluate the potential application of SNRNP70 in the clinical practice of hepatocellular carcinoma (HCC).
Methods
Based on the TCGA database and a cohort of HCC patients, we investigated the expression patterns and prognostic value of SNRNP70 in HCC. Then, the combination of SNRNP70 and alpha-fetoprotein (AFP) was analyzed in 278 HCC cases. Next, western blotting and immunohistochemistry were used to detect the expression of SNRNP70 in the nucleus and cytoplasm. Finally, Cell Counting Kit-8 (CCK-8) and scratch wound healing assays were used to determine the effect of SNRNP70 on the proliferation and migration of HCC cells.
Results
SNRNP70 was highly expressed in HCC. Its expression increased during the progression of HCC and was positively associated with infiltrating immune cells. Higher SNRNP70 expression indicated a poor outcome in HCC patients. In addition, the nuclear SNRNP70/AFP combination could be a prognostic biomarker for overall survival and recurrence. Cell experiments confirmed that knockdown of SNRNP70 inhibited the proliferation and migration of HCC cells.
Conclusion
SNRNP70 may be a new biomarker for HCC progression, diagnosis, and prognosis. SNRNP70 combined with serum AFP may indicate the prognosis and recurrence status of HCC patients after surgery.

Exploring the antioxidant potential of chalcogen-indolizines throughout in vitro assays
https://peerj.com/articles/17074
2024-03-15 2024-03-15
Cleisson Schossler Garcia, Marcia Juciele da Rocha, Marcelo Heinemann Presa, Camila Simões Pires, Evelyn Mianes Besckow, Filipe Penteado, Caroline Signorini Gomes, Eder João Lenardão, Cristiani Folharini Bortolatto, César Augusto Brüning
Reactive oxygen species (ROS) and reactive nitrogen species (RNS) are highly reactive molecules produced naturally by the body and by external factors. When these species are generated in excessive amounts, they can lead to oxidative stress, which in turn can cause cellular and tissue damage. This damage is known to contribute to the aging process and is associated with age-related conditions, including cardiovascular and neurodegenerative diseases. In recent years, there has been increased interest in developing compounds with antioxidant potential to assist in the treatment of disorders related to oxidative stress. Compounds containing sulfur (S) and/or selenium (Se) have been considered promising because of the role of these elements in the biosynthesis of antioxidant enzymes and essential proteins with physiological functions. In this context, studies involving heterocyclic nuclei have increased significantly, notably those on the indolizine nucleus, since compounds containing this nucleus have demonstrated considerable pharmacological properties. Thus, the objective of this research was to evaluate the in vitro antioxidant activity of eight S- and Se-derivatives containing an indolizine nucleus and different substituents. The in vitro assays 1,1-diphenyl-2-picryl-hydrazyl (DPPH) scavenger activity, ferric ion (Fe3+) reducing antioxidant power (FRAP), thiobarbituric acid reactive species (TBARS), and protein carbonylation (PC) were used to assess the antioxidant profile of the compounds. Our findings demonstrated that all the compounds showed FRAP activity and reduced the levels of TBARS and PC in mouse brain homogenates. Some compounds were also capable of acting as DPPH scavengers. In conclusion, the present study demonstrated that eight novel organochalcogen compounds exhibit antioxidant activity.

The impact of FASTQ and alignment read order on structural variant calling from long-read sequencing data
https://peerj.com/articles/17101
2024-03-15 2024-03-15
Kyle J. Lesack, James D. Wasmuth
Background
Structural variant (SV) calling from DNA sequencing data has been challenging due to several factors, including the ambiguity of short-read alignments, multiple complex SVs in the same genomic region, and the lack of “truth” datasets for benchmarking. Additionally, caller choice, parameter settings, and alignment method are known to affect SV calling. However, the impact of FASTQ read order on SV calling has not been explored for long-read data.
Results
Here, we used PacBio DNA sequencing data from 15 Caenorhabditis elegans strains and four Arabidopsis thaliana ecotypes to evaluate the sensitivity of different SV callers to FASTQ read order. Comparisons of variant call format files generated from the original and permuted FASTQ files demonstrated that the order of input data affected the SVs predicted by each caller. In particular, pbsv was highly sensitive to the order of the input data, especially at the highest depths, where over 70% of the SV calls generated from pairs of differently ordered FASTQ files were in disagreement. These results demonstrate that read order sensitivity is a complex, multifactorial process, as the differences observed both within and between species varied considerably according to the specific combination of aligner, SV caller, and sequencing depth. In addition to the SV callers being sensitive to the input data order, the SAMtools alignment sorting algorithm was identified as a source of variability following read order randomization.
Conclusion
The results of this study highlight the sensitivity of SV calling to the order of reads encoded in FASTQ files, which has not been recognized in long-read approaches. These findings have implications for the replication of SV studies and the development of consistent SV calling protocols. Our study suggests that researchers should pay attention to the input order sensitivity of read alignment sorting methods when analyzing long-read sequencing data for SV calling, as mitigating this source of variability could facilitate future replication work. These results also raise important questions about the relationship between SV caller read order sensitivity and tool performance. Therefore, tool developers should also consider input order sensitivity as a potential source of variability during the development and benchmarking of new and improved methods for SV calling.
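
As a rough illustration of the read-order experiment summarized above, the following Python sketch permutes the records of a FASTQ file with a fixed seed and compares two sets of SV calls. It is only a minimal sketch under stated assumptions: it assumes strict four-line FASTQ records, and the function names and the disagreement measure (symmetric difference over union) are hypothetical choices for illustration, not the procedure or metric used by the authors.

```python
import random
from typing import Iterable, Iterator, List, Set, Tuple

# One FASTQ record: (header, sequence, separator, qualities).
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from strict 4-line FASTQ input (assumption: no wrapped lines)."""
    buf: List[str] = []
    for line in lines:
        buf.append(line.rstrip("\n"))
        if len(buf) == 4:
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []
    if buf:
        raise ValueError("truncated FASTQ: line count is not a multiple of 4")


def shuffle_fastq(records: Iterable[FastqRecord], seed: int) -> List[FastqRecord]:
    """Return the same records in a seeded pseudo-random order."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def call_disagreement(calls_a: Set[str], calls_b: Set[str]) -> float:
    """Hypothetical measure: fraction of all calls not shared by both sets."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return len(calls_a ^ calls_b) / len(union)
```

In practice each permuted FASTQ would be written back to disk, aligned, sorted, and passed to an SV caller; the resulting call sets (for example, keyed by chromosome, position, and SV type) could then be compared with `call_disagreement`.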

EPI-SF: essential protein identification in protein interaction networks using sequence features
https://peerj.com/articles/17010
2024-03-13 2024-03-13
Sovan Saha, Piyali Chatterjee, Subhadip Basu, Mita Nasipuri
Proteins are considered indispensable for an organism’s viability, reproductive capabilities, and other fundamental physiological functions. Identifying essential proteins with conventional biological assays is time-consuming, labor-intensive, and expensive. Computational methods are therefore widely regarded as the fastest and most effective way to identify essential proteins. Although deep learning (DL) is a popular choice in machine learning (ML) applications, it is not recommended for this sequence-feature-based work because high-quality training sets of positive and negative samples are scarce. Some recent DL studies do address limited data availability, and exploring them is left to future work. Conventional ML techniques are thus used in this work because of their superior performance compared to DL methodologies. Accordingly, a technique called EPI-SF is proposed here, which employs ML to identify essential proteins within the protein-protein interaction network (PPIN). The protein sequence is the primary determinant of protein structure and function, so relevant sequence features are first extracted from the proteins within the PPIN. These features are then used as input for various ML models, including an XGBoost classifier, an AdaBoost classifier, logistic regression (LR), a support vector machine (SVM), a decision tree (DT), a random forest (RF), and naïve Bayes (NB), with the objective of detecting the essential proteins within the PPIN. The primary investigation compared the performance of these ML models on the yeast PPIN. Among them, the RF model performed best, with precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It also outperformed other state-of-the-art methods based on traditional centrality measures, such as betweenness centrality (BC) and closeness centrality (CC), as well as deep learning methods such as DeepEP, as emphasized in the results section. Given its favorable performance, EPI-SF is then used to predict novel essential proteins in the human PPIN. Because viruses tend to selectively target essential proteins involved in disease transmission within the human PPIN, the probable involvement of these proteins in COVID-19 and other related severe diseases is also investigated.
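
To make the idea of sequence-derived features and the reported metrics concrete, here is a small Python sketch. It is an assumption-laden illustration: amino acid composition is used only as one plausible example of a sequence feature (the abstract does not list the features EPI-SF extracts), and the helper names are hypothetical. The second function simply checks that the reported F1-score is consistent with the reported precision and recall, using the standard harmonic-mean definition.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid (illustrative feature only)."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Consistency check against the rounded values reported for the RF model.
reported_f1 = round(f1_score(0.703, 0.720), 3)
```

Because the published precision and recall are themselves rounded to three decimals, this check can only confirm agreement at that precision.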

Characterization of PYL gene family and identification of HaPYL genes response to drought and salt stress in sunflower
https://peerj.com/articles/16831
2024-03-07 2024-03-07
Zhaoping Wang, Jiayan Zhou, Jian Zou, Jun Yang, Weiying Chen
In the context of global climate change, drought and soil salinity are some of the most devastating abiotic stresses affecting agriculture today. PYL proteins are essential components of abscisic acid (ABA) signaling and play critical roles in responding to abiotic stressors, including drought and salt stress. Although PYL genes have been studied in many species, their roles in responding to abiotic stress are still unclear in the sunflower. In this study, 19 HaPYL genes, distributed on 15 of 17 chromosomes, were identified in the sunflower. Fragment duplication is the main cause of the expansion of PYL genes in the sunflower genome. Based on phylogenetic analysis, HaPYL genes were divided into three subfamilies. Members of the same subfamily share similar protein motifs and gene exon-intron structures, except for the second subfamily. Tissue expression patterns suggested that HaPYLs serve different functions when responding to developmental and environmental signals in the sunflower. Exogenous ABA treatment showed that most HaPYLs respond to an increase in the ABA level. Among these HaPYLs, HaPYL2a, HaPYL4d, HaPYL4g, HaPYL8a, HaPYL8b, HaPYL8c, HaPYL9b, and HaPYL9c were up-regulated by both PEG6000 treatment and NaCl treatment. This indicates that they may play a role in resisting drought and salt stress in the sunflower by mediating ABA signaling. Our findings provide some clues for further exploring the functions of PYL genes in the sunflower, especially with regard to drought and salt stress resistance.

Scalable neighbour search and alignment with uvaia
https://peerj.com/articles/16890
2024-03-06 2024-03-06
Leonardo de Oliveira Martins, Alison E. Mather, Andrew J. Page
Despite millions of SARS-CoV-2 genomes being sequenced and shared globally, manipulating such data sets is still challenging, especially selecting sequences for focused phylogenetic analysis. We present a novel method, uvaia, which is based on partial and exact sequence similarity for quickly extracting database sequences similar to query sequences of interest. Many SARS-CoV-2 phylogenetic analyses rely on very low numbers of ambiguous sites as a measure of quality, since ambiguous sites do not contribute to single nucleotide polymorphism (SNP) differences. Uvaia overcomes this limitation by using measures of sequence similarity that consider partially ambiguous sites, allowing more ambiguous sequences to be included in the analysis if needed. Such a fine-grained definition of similarity not only allows for better phylogenetic analyses, but could also lead to improved classification and biogeographical inferences. Uvaia works natively with compressed files, can use multiple cores, and uses memory efficiently, so it can analyse large data sets on a standard desktop.
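
The idea of a similarity measure that takes partially ambiguous sites into account can be sketched in Python as follows. This is not uvaia's implementation or its actual distance definitions; it is a minimal illustration that assumes pre-aligned sequences of equal length, uses the standard IUPAC nucleotide codes, and treats any unrecognized character (including gaps) as fully ambiguous. The function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Standard IUPAC nucleotide codes mapped to the bases they may represent.
IUPAC: Dict[str, FrozenSet[str]] = {
    "A": frozenset("A"), "C": frozenset("C"), "G": frozenset("G"), "T": frozenset("T"),
    "R": frozenset("AG"), "Y": frozenset("CT"), "S": frozenset("CG"), "W": frozenset("AT"),
    "K": frozenset("GT"), "M": frozenset("AC"), "B": frozenset("CGT"), "D": frozenset("AGT"),
    "H": frozenset("ACT"), "V": frozenset("ACG"), "N": frozenset("ACGT"),
}
ANY_BASE = frozenset("ACGT")  # assumption: unknown symbols and gaps match anything


def count_incompatible_sites(seq_a: str, seq_b: str) -> int:
    """Count aligned sites whose possible bases do not overlap at all."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if not (IUPAC.get(a, ANY_BASE) & IUPAC.get(b, ANY_BASE)):
            mismatches += 1
    return mismatches


def nearest_neighbours(query: str, database: Dict[str, str], k: int) -> List[Tuple[str, int]]:
    """Return the k database entries with the fewest incompatible sites."""
    scored = [(name, count_incompatible_sites(query, seq)) for name, seq in database.items()]
    scored.sort(key=lambda item: (item[1], item[0]))  # ties broken by name
    return scored[:k]
```

For example, a query site coded `R` (A or G) is compatible with a database site `G`, so it does not count as a difference, whereas a plain SNP count would have to either discard the ambiguous site or the whole sequence.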

Unsupervised AI reveals insect species-specific genome signatures
https://peerj.com/articles/17025
2024-03-06 2024-03-06
Yui Sawada, Ryuhei Minei, Hiromasa Tabata, Toshimichi Ikemura, Kennosuke Wada, Yoshiko Wada, Hiroshi Nagata, Yuki Iwasaki
Insects are a highly diverse group and possess a wide variety of traits, including the presence or absence of wings and metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects spanning a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) by applying a batch-learning self-organizing map (BLSOM) to the oligonucleotide compositions of their genomic fragments (100-kb or 1-Mb sequences). BLSOM is an unsupervised machine learning algorithm that can extract species-specific characteristics of oligonucleotide composition (genome signatures). The genome signature is of particular interest with respect to the mechanisms and biological significance underlying species-specific differences, and it can serve as a powerful probe for exploring the roles of genome sequences other than protein-coding ones. Because BLSOM is an unsupervised clustering method, sequences were clustered on oligonucleotide composition alone, without information about the species from which each fragment was derived. Consequently, both interspecies and intraspecies separation can be achieved. We identified specific genomic regions whose oligonucleotide compositions differ from the typical sequences of each insect genome, e.g., Mb-level structures in the grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Humans have recently become, in effect, a “model organism” for which a large amount of information has been accumulated using a variety of cutting-edge, high-throughput technologies. It is therefore reasonable to use the abundant information from humans to study insect lineages. Mb-length regions with distinct oligonucleotide compositions have also been observed previously in the human genome. These regions were enriched in transcription factor binding motifs (TFBSs) and were hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) of insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.
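As a rough illustration of the input that a method such as BLSOM clusters, the sketch below cuts a sequence into fixed-size fragments and converts each fragment into a vector of oligonucleotide (k-mer) frequencies. It is not the authors' implementation and does not perform the self-organizing map step. The function names, the default k = 4, and the handling of ambiguous bases are assumptions made for illustration; the abstract specifies only the fragment sizes (100 kb or 1 Mb).

```python
from collections import Counter
from itertools import product


def fragments(seq, size):
    """Yield consecutive, non-overlapping windows of exactly `size` bases.

    A trailing remainder shorter than `size` is dropped (an assumption).
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


def kmer_composition(seq, k=4):
    """Return relative frequencies of all 4**k k-mers in a fixed order.

    The order is that of itertools.product("ACGT", repeat=k).
    k-mers containing any character other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        # No unambiguous k-mer in this fragment.
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def composition_matrix(seq, size, k=4):
    """One composition vector per fragment; rows are fragments."""
    return [kmer_composition(frag, k) for frag in fragments(seq, size)]
```

Each row of `composition_matrix(genome, 100_000)` would then be one input vector for an unsupervised clustering method. No species label enters the computation, which matches the abstract's description of clustering on composition alone.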
Benchmarking a targeted 16S ribosomal RNA gene enrichment approach to reconstruct ancient microbial communities
https://peerj.com/articles/16770
2024-03-01 2024-03-01
Raphael Eisenhofer, Sterling Wright, Laura Weyrich
The taxonomic characterization of ancient microbiomes is a key step in the rapidly growing field of paleomicrobiology. While PCR amplification of the 16S ribosomal RNA (rRNA) gene is a widely used technique in modern microbiota studies, this method has systematic biases when applied to ancient microbial DNA. Shotgun metagenomic sequencing has proven to be the most effective method for reconstructing taxonomic profiles of ancient dental calculus samples. Nevertheless, shotgun sequencing approaches come with inherent limitations that could be addressed through hybridization enrichment capture. When employed together, shotgun sequencing and hybridization capture have the potential to enhance the characterization of ancient microbial communities. Here, we develop, test, and apply a hybridization enrichment capture technique to selectively target 16S rRNA gene fragments from libraries of ancient dental calculus samples generated with shotgun techniques. Simulated hybridization enrichment capture data sets indicated that taxonomic identification of fragmented and damaged 16S rRNA gene sequences was feasible. Applying this enrichment approach to 15 previously published ancient calculus samples, we observed a 334-fold increase in ancient 16S rRNA gene fragments in the enriched samples compared with unenriched libraries. Our results suggest that 16S hybridization capture is less prone to the effects of background contamination than 16S rRNA amplification, yielding a higher percentage of on-target recovery. While our enrichment technique detected low-abundance and rare taxa within a given sample, these assignments may not achieve the same level of specificity as those achieved by unenriched methods.