PeerJ Preprints: Genomics

https://peerj.com/preprints/index.atom?journal=peerj&subject=1730
Genomics articles published in PeerJ Preprints

A roadmap for gene functional characterisation in wheat
https://peerj.com/preprints/26877
2019-12-18
Nikolai M Adamski, Philippa Borrill, Jemima Brinton, Sophie Harrington, Clemence Marchal, Alison R Bentley, Wiliam D Bovill, Luigi Cattivelli, James Cockram, Bruno Contreras-Moreira, Brett Ford, Sreya Ghosh, Wendy Harwood, Keywan Hassani-Pak, Sadiye Hayta, Lee T Hickey, Kostya Kanyuka, Julie King, Marco Maccaferri, Guy Naamati, Curtis J Pozniak, Ricardo H Ramirez-Gonzalez, Carolina Sansaloni, Ben Trevaskis, Luzie U Wingen, Brande BH Wulff, Cristobal Uauy

To adapt to the challenges of climate change and the growing world population, it is vital to increase global crop production. Understanding the function of genes within staple crops will accelerate crop improvement by allowing targeted breeding approaches. Despite the importance of wheat, which provides 20% of the calories consumed by humankind, a lack of genomic information and resources has hindered the functional characterisation of genes in this species. The recent release of a high-quality reference sequence for wheat underpins a suite of genetic and genomic resources that support basic research and breeding. These include accurate gene model annotations, gene expression atlases and gene networks that provide background information about putative gene function. In parallel, sequenced mutation populations, improved transformation protocols and structured natural populations provide rapid methods to study gene function directly. We highlight a case study exemplifying how to integrate these resources to study gene function in wheat and thereby accelerate improvement in this important crop. We hope that this review provides a helpful guide for plant scientists, especially those expanding into wheat research for the first time, to capitalise on the discoveries made in Arabidopsis and other plants. This will accelerate the improvement of wheat, a complex polyploid crop of vital importance for food and nutrition security.

Testing asymptomatic individuals for unsuspected conditions is not new to the medical and public health communities, and protocols to develop screening tests are well established. However, the application of screening principles to inherited diseases presents unique challenges. Unlike the conditions targeted by most screening tests, most rare inherited diseases have an unknown natural history and prevalence in an unselected population. It is difficult or impossible to obtain a “truth set” cohort for clinical validation studies. As a result, it is not possible to accurately calculate clinical positive and negative predictive values for “likely pathogenic” genetic variants, which are commonly returned in genetic screening assays. In addition, many of the genetic conditions included in screening panels do not have clinical confirmatory tests. According to the World Health Organization screening principles, all of these elements are typically required to justify the development of a screening test. Nevertheless, as the cost of DNA sequencing continues to fall, more individuals are opting to undergo genomic testing in the absence of a clinical indication. Despite the challenges, reasonable estimates can be deduced and used to inform test design strategies. Here, we review test design principles and apply them to genetic screening.
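
To illustrate why prevalence matters so much for the predictive values mentioned above, here is a minimal sketch that applies Bayes' rule to a binary screening test. The function name `predictive_values` and the sensitivity, specificity, and prevalence figures are hypothetical illustrations and are not taken from the preprint.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule.

    All three arguments are probabilities in the open interval (0, 1).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical assay performance; only the prevalence is varied.
    for prevalence in (1e-2, 1e-3, 1e-4):
        ppv, npv = predictive_values(0.99, 0.999, prevalence)
        print(f"prevalence={prevalence:g}  PPV={ppv:.3f}  NPV={npv:.6f}")
```

With the assumed assay performance held fixed, the positive predictive value falls as the condition becomes rarer, which is one reason an unknown prevalence makes these values hard to state for rare inherited diseases.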

High-throughput DNA sequencing techniques enable time- and cost-effective sequencing of large portions of the genome. Instead of sequencing and annotating whole genomes, many phylogenetic studies focus sequencing efforts on large sets of pre-selected loci, which further reduces costs and bioinformatic challenges while increasing sequencing depth. One common approach that enriches loci before sequencing is often referred to as target sequence capture. This technique has been shown to be applicable to phylogenetic studies of greatly varying evolutionary depth and has proven to produce powerful, large multi-locus DNA sequence datasets of selected loci, suitable for phylogenetic analyses. However, target capture requires careful theoretical and practical considerations, which will greatly affect the success of the experiment. Here we provide an easy-to-follow flowchart for adequately designing phylogenomic target capture experiments, and we discuss necessary considerations and decisions from the first steps in the lab to the final bioinformatic processing of the sequence data. We particularly discuss issues and challenges related to the taxonomic scope, sample quality, and available genomic resources of target capture projects and how these issues affect all steps from bait design to the bioinformatic processing of the data. Altogether, this review outlines a roadmap for future target capture experiments and is intended to assist researchers with making informed decisions for designing and carrying out successful phylogenetic target capture studies.

As the diploid progenitor of common wheat, Aegilops tauschii Cosson (DD, 2n = 2x = 14), which is naturally distributed across central Eurasia from northern Syria and Turkey to western China, is regarded as a potential genetic resource for improving common wheat. In this work, the chloroplast genomes of seventeen Ae. tauschii accessions ranged from 135,551 to 136,009 bp in length and showed the typical quadripartite structure of angiosperms. A total of 127 functional genes were identified, including 78 protein-coding genes, 4 rRNAs, 26 tRNAs, and 19 duplicated genes. The overall genomic structure, including gene number and gene order, was well conserved, with identical IR/SC boundary regions; the few variations detected were predominantly in non-coding regions (intergenic spacer regions). IR expansion and contraction, with identical structure among the 17 Ae. tauschii accessions, did not influence chloroplast genome length. Four cpDNA markers (rpl32-trnL-UAG, ccsA-ndhD, rbcL-psaI and rps18-rpl20) showed high nucleotide polymorphism and may be used to study the inter- and intra-specific genetic structure and diversity of Ae. tauschii. The ndhF gene in accession AY46 showed the highest ω value, which might be related to adaptation to a high-altitude environment during the evolution of this accession. The phylogenetic relationships reconstructed from the complete genome sequences strongly support the view that Ae. tauschii in the Yellow River region might have originated directly from Central Asia rather than from Xinjiang. The spreading route of Ae. tauschii revealed in this work reflects, from one point of view, the frequent cultural exchange along the Silk Road. We confirmed that, at the chloroplast genome level, Ae. tauschii derived from monophyletic speciation rather than hybrid speciation.

“Higher” termites have been able to colonize all tropical and subtropical regions because of their ability to digest lignocellulose with the aid of their prokaryotic gut microbiota. Over the last decade, numerous studies based on 16S rRNA gene amplicon libraries have largely described both the taxonomy and structure of the prokaryotic communities associated with termite guts. Host diet and microenvironmental conditions have emerged as the main factors structuring the microbial assemblages in the different gut compartments. Additionally, these molecular inventories have revealed the existence of termite-specific clusters that indicate coevolutionary processes in numerous prokaryotic lineages. However, for lack of representative isolates, the functional role of most lineages remains unclear. We reconstructed 589 metagenome-assembled genomes (MAGs), encompassing 17 prokaryotic phyla, from the different gut compartments of eight higher termite species. By iteratively building genome trees for each clade, we significantly improved the initial automated assignment, frequently up to the genus level. We recovered MAGs from most of the termite-specific clusters in the radiation of, e.g., Planctomycetes, Fibrobacteres, Bacteroidetes, Euryarchaeota, Bathyarchaeota, Spirochaetes, Saccharibacteria, and Firmicutes, which to date contained few or no representative genomes. Moreover, the MAGs included abundant members of the termite gut microbiota. This dataset represents the largest genomic resource for arthropod-associated microorganisms available to date and contributes substantially to populating the tree of life. More importantly, it provides a backbone for studying the metabolic potential of the termite gut microbiota, including the key members involved in carbon and nitrogen biogeochemical cycles, and important clues that may help in cultivating representatives of these understudied clades.

This study examines the biodiversity and properties of bovine leukemia virus in Western Siberia. The researchers focused on exploring the polymorphism of the env gene and, in doing so, discovered the new genotypes Ia and Ib, which differ from genotype I. The restriction enzyme HaeIII cuts the nucleotide sequence of the env gene into fragments with lengths of 316-27-95-5 bp (genotype I), 31-285-27-95-5 bp (genotype Ia), and 31-85-200-27-100 bp (genotype Ib). Genotype Ib was found in 2.57±0.55% of samples (20 out of 779), a frequency that does not differ significantly from 1% (χ2=2.46). The other genotypes were observed in Siberian cattle as wild-type genotypes (their frequency varied from 17.84 to 32.73%). This paper also explores the effect of the env gene of bovine leukemia virus on the hematological parameters of infected animals. The maximum viral load was observed in animals with viral genotypes II and IV (1000 to 1400 viral particles per 1000 healthy cells), and the minimum viral load was observed in animals with genotype Ib (700 to 900 viral particles per 1000 healthy cells). Several hypotheses on the origin of the different genotypes in Siberia are discussed. The probability of the direct introduction of genotype II from South America to Siberia is extremely small; it is more likely that the strain originated independently in an autonomous population, with its distribution also occurring independently. A new variety of genotype I (Ib) was found, which may be either a newly arisen variant or a relict strain.
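
The fragment patterns above come from restriction digestion, which can be sketched in a few lines. HaeIII recognises GGCC and cuts between the G and the C, leaving blunt ends. The helper `fragment_lengths` and the toy sequence are hypothetical illustrations; no env gene sequence from the study is used here.

```python
def fragment_lengths(sequence, site="GGCC", cut_offset=2):
    """Return the lengths of the fragments of a linear DNA molecule.

    The molecule is cut `cut_offset` bases into every occurrence of
    `site` (the defaults model HaeIII, which cuts GG^CC).
    """
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Toy sequence with two HaeIII sites (not real env gene data).
    toy = "A" * 5 + "GGCC" + "T" * 10 + "GGCC" + "A" * 3
    print(fragment_lengths(toy))

    # Consistency check on the reported patterns: each one should add up
    # to the same total length of the digested region.
    reported = {
        "I": (316, 27, 95, 5),
        "Ia": (31, 285, 27, 95, 5),
        "Ib": (31, 85, 200, 27, 100),
    }
    for genotype, pattern in reported.items():
        print(genotype, sum(pattern))
```

All three reported patterns sum to 443 bp, consistent with the same region being digested in each genotype.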

Staphylococcus epidermidis is a human commensal and pathogen distributed worldwide. In this work, we surveyed multi-resistant S. epidermidis strains over eight years at a children's health-care unit in Mexico City. Multidrug-resistant S. epidermidis were present in all years of the study. The resistance observed included resistance to methicillin, beta-lactams, fluoroquinolones, and macrolides. To understand the genetic basis of antibiotic resistance and its association with virulence and gene exchange, we sequenced the genomes of 17 S. epidermidis isolates. Whole-genome nucleotide identities between all pairs of S. epidermidis strains were about 97% to 99%. We inferred a clonal structure and eight Multilocus Sequence Types (MLSTs) in the sequenced S. epidermidis collection. The virulence profile includes genes involved in biofilm formation and phenol-soluble modulins (PSMs). However, half of the S. epidermidis isolates analyzed lacked the ica operon for biofilm formation. These are likely commensal S. epidermidis strains that are nonetheless multi-antibiotic resistant. The uneven distribution of insertion sequences, phages, and CRISPR-Cas phage immunity systems suggests frequent horizontal gene transfer. Recombination between S. epidermidis strains was more frequent than mutation and affected the whole genome. Therefore, multidrug resistance, independently of the pathogenic traits, might explain the persistence of specific highly adapted S. epidermidis clonal lineages in nosocomial settings.

The FASTA file format is a common file type for distributing proteome information, especially proteomes obtained from UniProt. While MATLAB can read FASTA files automatically using the built-in function fastaread, important information such as protein name and organism name remains enmeshed in a character array. This makes it difficult to extract protein names automatically from a FASTA proteome file in order to build a database with fields comprising the protein name and its amino acid sequence. The objective of this work was to develop MATLAB software that automatically extracts protein name and amino acid sequence information from a FASTA proteome file and assigns them to a new database comprising fields such as protein name, amino acid sequence, number of amino acid residues, molecular weight of the protein and a nucleotide sequence encoding the protein. The number of amino acid residues was obtained by applying the MATLAB built-in function length to the amino acid sequence of each protein. The final two fields were provided by the MATLAB built-in functions molweight and aa2nt, respectively. The molecular weight of proteins is useful for a variety of applications, while the nucleotide sequence is essential for gene synthesis applications in molecular cloning. Finally, the MATLAB software is also equipped with an error check function to help detect letters in the amino acid sequence that are not among the 20 natural amino acids. Sequences with such letters would constitute invalid inputs to molweight and aa2nt, and would not be processed. Collectively, given that important information such as protein name is enmeshed in a character array in a FASTA proteome file, this work sets out to develop MATLAB software that automatically extracts protein name and amino acid sequence information and assigns them to a new protein database. Using built-in functions, the number of amino acid residues, molecular weight and nucleotide sequence of each protein were calculated, thereby yielding a new protein database with improved functionalities that could support a variety of biology workflows ranging from sequence alignment to molecular cloning.
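
The extraction step described above can be sketched as follows. This is a Python illustration and not the authors' MATLAB software. It assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=...`, and the names `ProteinRecord`, `make_record` and `parse_fasta`, as well as the two example entries, are made up for this example. The molecular weight and back-translation steps (molweight and aa2nt in MATLAB) are deliberately left out.

```python
import re
from dataclasses import dataclass

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Assumed UniProt-style header layout:
# >db|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=... GN=... PE=... SV=...
HEADER_PATTERN = re.compile(
    r"^>(\w+)\|([^|]+)\|(\S+)\s+(.*?)\s+OS=(.*?)(?=\s+[A-Z]{2}=|$)"
)


@dataclass
class ProteinRecord:
    name: str
    organism: str
    sequence: str
    length: int
    invalid_letters: list  # letters outside the 20 standard amino acids


def make_record(header, sequence):
    """Build one record; fall back to the raw header if it does not match."""
    match = HEADER_PATTERN.match(header)
    if match:
        name, organism = match.group(4), match.group(5)
    else:
        name, organism = header[1:].strip(), ""
    sequence = sequence.upper()
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    return ProteinRecord(name, organism, sequence, len(sequence), invalid)


def parse_fasta(text):
    """Split FASTA text into ProteinRecord objects, joining wrapped lines."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(make_record(header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(make_record(header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up entries that only mimic the UniProt header layout.
    example = (
        ">sp|P00000|DEMO_HUMAN Example protein OS=Homo sapiens OX=9606 GN=DEMO PE=1 SV=1\n"
        "MKTAYIAK\nQRQISFVK\n"
        ">tr|Q00000|DEMO2_MOUSE Another example OS=Mus musculus OX=10090 PE=4 SV=1\n"
        "MKXB\n"
    )
    for record in parse_fasta(example):
        print(record)
```

Records with a non-empty `invalid_letters` list correspond to the sequences that the error check described above would exclude from further processing.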

While the number of studies using Mendelian randomization (MR) methods has grown exponentially in the last decade, the quality of reporting of these studies has often been poor. Similar to other reporting guidelines such as CONSORT (Consolidated Standards of Reporting Trials) for randomised trials and STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) for observational studies in epidemiology, the STROBE-MR working group aims to provide guidance to authors on how to improve reporting of MR studies and to help readers, reviewers, and journal editors evaluate the quality of the presented evidence. Empirical evidence indicates that many reports of MR studies do not clearly state or examine the various assumptions of MR methods and report insufficient details on the data sources, which makes it hard to evaluate the quality and reliability of the results. The STROBE-MR guidance covers both one-sample and two-sample MR studies. At present, the draft checklist consists of 20 items, organized into the title and abstract, introduction, methods, results and discussion sections of articles. As these guidelines aim to reach the entire MR community, we would like to give everyone the opportunity to contribute their comments. The following draft of the STROBE-MR checklist is open for public discussion and all feedback will be taken into account during its next revision. For feedback, please use the comment section below this post on PeerJ Preprints. We hope the final guidelines will serve the entire community and contribute to improving the reporting of MR studies in the future.

Segregation distortion (SD) is a phenomenon common among stable or segregating populations, and the mechanism behind it still puzzles many researchers. F2:3 generations developed from wild cotton species of the D genome were used to investigate the possible plant transcription factors within the segregation distortion regions (SDRs). We constructed a consensus map from two maps in the D genome: map A (Gossypium klotzschianum and Gossypium davidsonii) and map B (Gossypium thurberi and Gossypium trilobum). Each of the two maps was developed from 188 F2:3 populations, and a total of 1492 markers were linked to the 13 linkage groups. The consensus linkage map size was 1467.445 cM with an average marker distance of 1.0370 cM. Chr02 had the highest percentage of segregation distortion with 58.621%, followed by Chr07 with 47.887%. A total of 6,038 genes were mined within the SDRs of Chr02 and Chr07, with 2,308 genes in Chr02 and 3,730 genes in Chr07. We obtained a total of 1,117 domains within the SDRs, of which 622 were shared between the two chromosomes. The first 9 domains all belonged to the plant resistance genes (R genes), and the largest domain was PF00069, with a total of 188 genes. A total of 287 miRNAs were found to target the various genes, such as gr-miR398, gra-miR5207, miR164a, miR164b and miR164c, among others, which have been found to target top-ranked stress-responsive transcription factors such as NAC genes. Moreover, the genes were found to be regulated by various stress-responsive cis-regulatory elements. RNA profiling showed that higher numbers of genes were highly upregulated under abiotic stress and at different fiber development stages. The results show that the SDRs could be playing an important role in the evolution of significant genes in plants.
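
The abstract above does not state how distorted markers were identified. A common approach, assumed here purely for illustration, is a chi-square goodness-of-fit test of the observed genotype counts against the 1:2:1 ratio expected for a codominant marker in an F2-derived population. The function name and the genotype counts below are hypothetical.

```python
import math


def segregation_chi_square(n_aa, n_ab, n_bb, expected_ratio=(1, 2, 1)):
    """Chi-square test of three genotype classes against an expected ratio.

    Returns (chi2, p_value). With three classes there are 2 degrees of
    freedom, for which the chi-square survival function is exp(-chi2 / 2).
    """
    observed = (n_aa, n_ab, n_bb)
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    chi2 = sum((obs - exp) ** 2 / exp for obs, exp in zip(observed, expected))
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts for two markers scored on 188 individuals.
    markers = {"marker_1": (47, 94, 47), "marker_2": (70, 90, 28)}
    for marker, counts in markers.items():
        chi2, p_value = segregation_chi_square(*counts)
        distorted = p_value < 0.05
        print(f"{marker}: chi2={chi2:.3f}  p={p_value:.5f}  distorted={distorted}")
```

Under this assumption, a marker is flagged as distorted when the statistic exceeds 5.991, the critical value for 2 degrees of freedom at the 0.05 level; in practice a multiple-testing correction across markers would usually be considered as well.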