PeerJ Preprints: Bioinformatics
https://peerj.com/preprints/index.atom?journal=peerj&subject=540
Bioinformatics articles published in PeerJ Preprints

Data-driven classification of the certainty of scholarly assertions
https://peerj.com/preprints/27829
2019-12-20
Mario Prieto, Helena Deus, Anita De Waard, Erik Schultes, Beatriz García-Jiménez, Mark D Wilkinson
The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly-annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as a part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation - a Nanopublication - where certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.
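The cross-validation accuracy quoted above can be illustrated with a minimal sketch of k-fold evaluation. The abstract does not describe the model, the features, or the number of folds, so everything here is an assumption: `k_fold_indices`, `cross_val_accuracy`, and `majority_baseline` are hypothetical names, and the baseline is only a stand-in for the authors' machine learning model.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size (assumes k <= n_items)."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(texts, labels, train_fn, k=5):
    """Mean held-out accuracy over k folds.

    train_fn(train_texts, train_labels) must return a predict(text) callable.
    """
    scores = []
    for held_out in k_fold_indices(len(texts), k):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        predict = train_fn([texts[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def majority_baseline(train_texts, train_labels):
    """Hypothetical stand-in model: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda text: most_common
```

Any classifier that maps an assertion to one of the three certainty categories could be passed as `train_fn`; the majority baseline is useful mainly as a floor against which a reported accuracy can be compared.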

An introduction to phylosymbiosis
https://peerj.com/preprints/27879
2019-12-11
Shen Jean Lim, Seth R Bordenstein
Phylosymbiosis was recently formulated to support a hypothesis-driven framework for the characterization of a new, cross-system trend in host-associated microbiomes. Defining phylosymbiosis as “microbial community relationships that recapitulate the phylogeny of their host”, we review the relevant literature and data in the last decade, emphasizing frequently used methods and regular patterns observed in analyses. Quantitative support for phylosymbiosis is provided by statistical methods evaluating higher microbiome variation between host species than within host species, topological similarities between the host phylogeny and microbiome dendrogram, and a positive association between host genetic relationships and microbiome beta diversity. Significant degrees of phylosymbiosis are prevalent, but not universal, in microbiomes of plants and animals from terrestrial and aquatic habitats. Consistent with natural selection shaping phylosymbiosis, microbiome transplant experiments demonstrate reduced host performance and/or fitness upon host-microbiome mismatches. Hybridization can also disrupt phylosymbiotic microbiomes and cause hybrid pathologies. The pervasiveness of phylosymbiosis carries several important implications for advancing knowledge of eco-evolutionary processes that impact host-microbiome interactions and future applications of precision microbiology. Important future steps will be to examine phylosymbiosis beyond bacterial communities, apply evolutionary modeling for an increasingly sophisticated understanding of phylosymbiosis, and unravel the host and microbial mechanisms that contribute to the pattern. This review serves as a gateway to experimental, conceptual, and quantitative themes of phylosymbiosis and outlines opportunities ripe for investigations from a diversity of disciplines.
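The association between host genetic relationships and microbiome beta diversity mentioned above is often assessed with a matrix-correlation permutation test. The sketch below is a Mantel-style test in plain Python; the abstract does not name a specific method, so the choice of test, the one-sided p-value, and the names `upper_triangle`, `pearson`, and `mantel_test` are assumptions for illustration.

```python
import math
import random


def upper_triangle(matrix, order=None):
    """Flatten the upper triangle of a square distance matrix, optionally relabelling rows/columns."""
    n = len(matrix)
    order = list(range(n)) if order is None else order
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mantel_test(host_dist, microbiome_dist, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value) for a positive association."""
    rng = random.Random(seed)
    host_vec = upper_triangle(host_dist)
    observed = pearson(host_vec, upper_triangle(microbiome_dist))
    n = len(host_dist)
    at_least = 1  # the observed statistic counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # relabel the microbiome samples at random
        r = pearson(host_vec, upper_triangle(microbiome_dist, order))
        if r >= observed - 1e-12:
            at_least += 1
    return observed, at_least / (permutations + 1)
```

Permuting sample labels, rather than individual distances, keeps each permuted matrix a valid distance matrix, which is why the test shuffles row and column order together.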

ViSiElse: An innovative R-package to visualize raw behavioral data over time
https://peerj.com/preprints/27665
2019-11-25
Elodie M Garnier, Nastasia Fouret, Médéric Descoins
The scientific community encourages the use of raw data graphs to improve the reliability and transparency of the results presented in articles. However, the current methods used to visualize raw data are limited to one or two numerical variables per graph and/or small sample sizes. In the behavioral sciences, numerous variables must be plotted together in order to gain insight into the behavior in question. In this paper, we present ViSiElse, an R-package offering a new approach in the visualization of raw data. ViSiElse was developed with the open-source software R to visualize behavioral observations over time based on raw time data extracted from visually recorded sessions of experimental observations. ViSiElse gives a global overview of a process by creating a visualization of the timestamps for multiple actions and all participants into a single graph; individual or group behavior can then be easily assessed. Additional features allow users to further inspect their data by including summary statistics and time constraints.

The investigation of 2D monolayers as potential chelation agents in Alzheimer’s disease
https://peerj.com/preprints/27942
2019-11-20
Neha Pavuluru, Xuan Luo
In this study, we conducted Density Functional Theory calculations comparing the binding energy of the copper-Amyloid-beta complex to the binding energies of potential chelation materials. We used the first-coordination sphere of the truncated high-pH Amyloid-beta protein subject to computational limits. Binding energy and charge transfer calculations were evaluated for copper’s interaction with potential chelators: monolayer boron nitride, monolayer molybdenum disulfide, and monolayer silicene. Silicene produced the highest binding energies to copper, and the evidence of charge transfer between copper and the monolayer indicates that a strong ionic bond is present. Although our three monolayers did not directly present chelation potential, the absolute differences between the binding energies of the silicene binding sites and the Amyloid-beta binding site were minimal, suggesting that further research in silicene chelators may be useful for therapy in Alzheimer’s disease.

Digestiflow: from BCL to FASTQ with ease
https://peerj.com/preprints/27717
2019-11-11
Manuel Holtgrewe, Mikko Nieminen, Clemens Messerschmidt, Dieter Beule
Management of raw sequencing data and its preprocessing (conversion into sequences and demultiplexing) remains a challenging topic for groups running sequencing devices. They face many challenges in such efforts, and solutions range from manual management of spreadsheets to very complex and customized LIMS systems handling much more than just raw sequencing data. In this manuscript, we describe the software package DigestiFlow, which focuses on the management of Illumina flow cell sample sheets and raw data. It allows for automated extraction of information from flow cell data and management of sample sheets. Furthermore, it allows for the automated and reproducible conversion of Illumina base calls to sequences and the demultiplexing thereof using bcl2fastq and Picard Tools, followed by quality control report generation.

ImageJ and 3D Slicer: open source 2/3D morphometric software
https://peerj.com/preprints/27998
2019-10-02
Fiona Pye, Nussaȉbah B Raja, Bryan Shirley, Ádám T Kocsis, Niklas Hohmann, Duncan J E Murdock, Emilia Jarochowska
In a world where an increasing number of resources are hidden behind paywalls and monthly subscriptions, it is becoming crucial for the scientific community to invest energy into freely available, community-maintained systems. Open-source software projects offer a solution, with freely available code which users can utilise and modify, under an open source licence. In addition to software accessibility and methodological repeatability, this also enables and encourages the development of new tools.
As palaeontology moves towards data-driven methodologies, it is becoming more important to acquire and provide high-quality data through reproducible systematic procedures. Within the field of morphometrics, it is vital to adopt digital methods that help mitigate human bias in data collection. In addition, mathematically founded approaches can reduce the subjective decisions which plague classical data. This can be further developed through automation, which increases the efficiency of data collection and analysis.
With these concepts in mind, we introduce two open-source shape analysis software packages that arose from projects within the medical imaging field. These are ImageJ, an image processing program with batch processing features, and 3D Slicer, which focuses on 3D informatics and visualisation. They are easily extensible using common programming languages, with 3D Slicer containing an internal Python interactor, and ImageJ allowing the incorporation of several programming languages within its interface alongside its own simplified macro language. Additional features created by other users are readily available, on GitHub or through the software itself.
In the examples presented, an ImageJ plugin “FossilJ” has been developed which provides semi-automated morphometric bivalve data collection. 3D Slicer is used with the extension SPHARM-PDM, applied to synchrotron scans of coniform conodonts for comparative morphometrics, for which small assistant tools have been created in Python.

Functional characterization of a new maize heat shock transcription factor gene ZmHsf01 playing important roles in thermotolerance
https://peerj.com/preprints/27987
2019-09-26
Huaning Zhang, Guoliang Li, Yuanyuan Zhang, Yujie Zhang, Hongbo Shao, Dong Hu, Xiulin Guo
Background. The yield of maize is seriously affected by heat waves. Plant heat shock transcription factors (Hsfs) play a key regulatory role in the heat shock signal transduction pathway. Method. In this study, a new heat shock transcription factor gene, ZmHsf01 (accession number: MK888854), was cloned from young maize leaves using a homologous cloning method. The transcript levels of ZmHsf01 were measured by qRT-PCR in different tissues and under heat shock, abscisic acid (ABA) and hydrogen peroxide (H2O2) treatments. Transgenic yeast and Arabidopsis were used to study the gene function of ZmHsf01. Result. The coding sequence (CDS) of ZmHsf01 was 1176 bp and encoded a protein of 391 amino acids. Homology analysis showed that ZmHsf01 and SbHsfA2d had the highest protein sequence identity. Subcellular localization experiments demonstrated that ZmHsf01 is localized to the nucleus. ZmHsf01 was expressed in many maize tissues and was up-regulated by heat stress. ZmHsf01 was up-regulated in roots and down-regulated in leaves by ABA and H2O2 treatments. In yeast, ZmHsf01-overexpressing cells showed increased thermotolerance. In Arabidopsis seedlings, ZmHsf01 complemented the thermotolerance defects of the athsfa2 mutant, and ZmHsf01-overexpressing lines presented enhanced basal and acquired thermotolerance. Compared to wild type (WT) seedlings, ZmHsf01-overexpressing lines showed increased chlorophyll content after heat stress. Heat shock protein genes were up-regulated more strongly in ZmHsf01-overexpressing Arabidopsis seedlings than in WT. These results suggest that ZmHsf01 plays a vital role in plant response to heat stress.

Large-scale unsupervised clustering of Orca vocalizations: a model for describing orca communication systems
https://peerj.com/preprints/27979
2019-09-24
Marion Poupard, Paul Best, Jan Schlüter, Helena Symonds, Paul Spong, Hervé Glotin
Killer whales (Orcinus orca) can produce 3 types of signals: clicks, whistles and vocalizations. This study focuses on Orca vocalizations from northern Vancouver Island (Hanson Island), where the NGO Orcalab developed a multi-hydrophone recording station to study Orcas. The acoustic station is composed of 5 hydrophones and extends over 50 km² of ocean. Since 2015 we have been continuously streaming the hydrophone signals to our laboratory in Toulon, France, yielding nearly 50 TB of synchronous multichannel recordings. In previous work, we trained a Convolutional Neural Network (CNN) to detect Orca vocalizations, using transfer learning from a bird activity dataset. Here, for each detected vocalization, we estimate the pitch contour (fundamental frequency). Finally, we cluster vocalizations by features describing the pitch contour. While preliminary, our results demonstrate a possible route towards automatic Orca call type classification. Furthermore, they can be linked to the presence of particular Orca pods in the area according to the classification of their call types. A large-scale call type classification would allow new insights into the phonotactics and ethoacoustics of endangered Orca populations in the face of increasing anthropic pressure.
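The final step described above, clustering vocalizations by features of their pitch contours, can be sketched as follows. The abstract specifies neither the features nor the clustering algorithm, so the three summary features and the use of k-means are assumptions, and `contour_features`, `squared_distance`, and `k_means` are hypothetical names.

```python
def contour_features(f0_hz):
    """Summarize a pitch contour (F0 values in Hz) as (mean, range, net slope per frame)."""
    n = len(f0_hz)
    mean = sum(f0_hz) / n
    span = max(f0_hz) - min(f0_hz)
    slope = (f0_hz[-1] - f0_hz[0]) / (n - 1) if n > 1 else 0.0
    return (mean, span, slope)


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Plain k-means with a deterministic start: the first k points are the initial centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels
```

The features are left unscaled here, so the mean frequency dominates the distance; in practice one would probably standardize each feature before clustering.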

A guide to carrying out a phylogenomic target sequence capture project
https://peerj.com/preprints/27968
2019-09-18
Tobias Andermann, Maria Fernanda Torres Jimenez, Pável Matos-Maraví, Romina Batista, José L Blanco-Pastor, A. Lovisa S Gustafsson, Logan Kistler, Isabel M Liberal, Bengt Oxelman, Christine D Bacon, Alexandre Antonelli
High-throughput DNA sequencing techniques enable time- and cost-effective sequencing of large portions of the genome. Instead of sequencing and annotating whole genomes, many phylogenetic studies focus sequencing efforts on large sets of pre-selected loci, which further reduces costs and bioinformatic challenges while increasing sequencing depth. One common approach that enriches loci before sequencing is often referred to as target sequence capture. This technique has been shown to be applicable to phylogenetic studies of greatly varying evolutionary depth and has proven to produce powerful, large multi-locus DNA sequence datasets of selected loci, suitable for phylogenetic analyses. However, target capture requires careful theoretical and practical considerations, which will greatly affect the success of the experiment. Here we provide an easy-to-follow flowchart for adequately designing phylogenomic target capture experiments, and we discuss necessary considerations and decisions from the first steps in the lab to the final bioinformatic processing of the sequence data. We particularly discuss issues and challenges related to the taxonomic scope, sample quality, and available genomic resources of target capture projects and how these issues affect all steps from bait design to the bioinformatic processing of the data. Altogether, this review outlines a roadmap for future target capture experiments and is intended to assist researchers in making informed decisions when designing and carrying out successful phylogenetic target capture studies.

Phylogenomic analysis of 589 metagenome-assembled genomes encompassing all major prokaryotic lineages from the gut of higher termites
https://peerj.com/preprints/27929
2019-08-30
Vincent Hervé, Pengfei Liu, Carsten Dietrich, David Sillam-Dussès, Petr Stiblik, Jan Šobotník, Andreas Brune
“Higher” termites have been able to colonize all tropical and subtropical regions because of their ability to digest lignocellulose with the aid of their prokaryotic gut microbiota. Over the last decade, numerous studies based on 16S rRNA gene amplicon libraries have largely described both the taxonomy and structure of the prokaryotic communities associated with termite guts. Host diet and microenvironmental conditions have emerged as the main factors structuring the microbial assemblages in the different gut compartments. Additionally, these molecular inventories have revealed the existence of termite-specific clusters that indicate coevolutionary processes in numerous prokaryotic lineages. However, for lack of representative isolates, the functional role of most lineages remains unclear. We reconstructed 589 metagenome-assembled genomes (MAGs) from the different gut compartments of eight higher termite species that encompass 17 prokaryotic phyla. By iteratively building genome trees for each clade, we significantly improved the initial automated assignment, frequently up to the genus level. We recovered MAGs from most of the termite-specific clusters in the radiation of, e.g., Planctomycetes, Fibrobacteres, Bacteroidetes, Euryarchaeota, Bathyarchaeota, Spirochaetes, Saccharibacteria, and Firmicutes, which to date contained only few or no representative genomes. Moreover, the MAGs included abundant members of the termite gut microbiota. This dataset represents the largest genomic resource for arthropod-associated microorganisms available to date and contributes substantially to populating the tree of life. More importantly, it provides a backbone for studying the metabolic potential of the termite gut microbiota, including the key members involved in carbon and nitrogen biogeochemical cycles, and important clues that may help cultivating representatives of these understudied clades.