PeerJ Preprints: Data Mining and Machine Learning

https://peerj.com/preprints/index.atom?journal=peerj&subject=9500
Data Mining and Machine Learning articles published in PeerJ Preprints

Data-driven classification of the certainty of scholarly assertions
https://peerj.com/preprints/27829
2019-12-20
Mario Prieto, Helena Deus, Anita De Waard, Erik Schultes, Beatriz García-Jiménez, Mark D Wilkinson

The grammatical structures scholars use to express their assertions are intended to convey various degrees of certainty or speculation. Prior studies have suggested a variety of categorization systems for scholarly certainty; however, these have not been objectively tested for their validity, particularly with respect to representing the interpretation by the reader, rather than the intention of the author. In this study, we use a series of questionnaires to determine how researchers classify various scholarly assertions, using three distinct certainty classification systems. We find that there are three distinct categories of certainty along a spectrum from high to low. We show that these categories can be detected in an automated manner, using a machine learning model, with a cross-validation accuracy of 89.2% relative to an author-annotated corpus, and 82.2% accuracy against a publicly-annotated corpus. This finding provides an opportunity for contextual metadata related to certainty to be captured as a part of text-mining pipelines, which currently miss these subtle linguistic cues. We provide an exemplar machine-accessible representation - a Nanopublication - where certainty category is embedded as metadata in a formal, ontology-based manner within text-mined scholarly assertions.

Killer whales (Orcinus orca) can produce 3 types of signals: clicks, whistles and vocalizations. This study focuses on Orca vocalizations from northern Vancouver Island (Hanson Island) where the NGO Orcalab developed a multi-hydrophone recording station to study Orcas. The acoustic station is composed of 5 hydrophones and extends over 50 km² of ocean. Since 2015 we have been continuously streaming the hydrophone signals to our laboratory in Toulon, France, yielding nearly 50 TB of synchronous multichannel recordings. In previous work, we trained a Convolutional Neural Network (CNN) to detect Orca vocalizations, using transfer learning from a bird activity dataset. Here, for each detected vocalization, we estimate the pitch contour (fundamental frequency). Finally, we cluster vocalizations by features describing the pitch contour. While preliminary, our results demonstrate a possible route towards automatic Orca call type classification. Furthermore, they can be linked to the presence of particular Orca pods in the area according to the classification of their call types. A large-scale call type classification would allow new insights into the phonotactics and ethoacoustics of endangered Orca populations in the face of increasing anthropic pressure.

Natural history collections (NHCs) are the foundation of historical baselines for assessing anthropogenic impacts on biodiversity. Along these lines, the online mobilization of specimens via digitization (the conversion of specimen data into accessible digital content) has greatly expanded the use of NHCs across a diversity of disciplines. We broaden the current vision of digitization (Digitization 1.0), whereby specimens are digitized within NHCs, to include new approaches that rely on digitized products rather than the physical specimen (Digitization 2.0). Digitization 2.0 builds upon the data, workflows, and infrastructure produced by Digitization 1.0 to create digital-only workflows that facilitate digitization, curation, and data linkages, thus returning value to physical specimens by creating new layers of annotation, empowering a global community, and developing automated approaches to advance biodiversity discovery and conservation. These efforts will transform large-scale biodiversity assessments to address fundamental questions, including those pertaining to critical modern issues of global change.

As interest in genetic resequencing increases, so does the need for effective mathematical, computational, and statistical approaches. One of the difficult problems in genome annotation is determining the precise positions of transcription start sites (TSS). In this paper we present TransPrise, an efficient deep learning tool for predicting the positions of eukaryotic transcription start sites. TransPrise offers significant improvement over existing promoter-prediction methods. To illustrate this, we compared the predictions of TransPrise with those of the TSSPlant approach for the well-annotated genome of Oryza sativa. Using a computer equipped with a graphics processing unit, the run time of TransPrise is 250 minutes on a genome 374 Mb long. We provide the full basis for the comparison and encourage users to freely access a set of our computational tools to facilitate and streamline their own analyses. The ready-to-use Docker image with all necessary packages, models, and code, as well as the source code of the TransPrise algorithm, are available at http://compubioverne.group/. The source code is ready to use and customizable to predict TSS in any eukaryotic organism.

There is growing interest within regulatory agencies and toxicological research communities to develop, test, and apply new approaches, such as toxicogenomics, to more efficiently evaluate chemical hazards. Given the complexity of analyzing thousands of genes simultaneously, there is a need to identify reduced gene sets. Though several gene sets have been defined for toxicological applications, few of these were purposefully derived using toxicogenomics data. Here, we developed and applied a systematic approach to identify 1000 genes (called Toxicogenomics-1000 or T1000) highly responsive to chemical exposures. First, a co-expression network of 11,210 genes was built by leveraging microarray data from the Open TG-GATEs program. This network was then re-weighted based on prior knowledge of the genes' biological (KEGG, MSigDB) and toxicological (CTD) relevance. Finally, weighted correlation network analysis was applied to identify 258 gene clusters. T1000 was defined by selecting genes from each cluster that were most associated with outcome measures. For model evaluation, we compared the performance of T1000 to that of other gene sets (L1000, S1500, genes selected by Limma, and a random set) using two external datasets. Additionally, a smaller (T384) and a larger version (T1500) of T1000 were used for dose-response modeling to test the effect of gene set size. Our findings demonstrated that the T1000 gene set is predictive of apical outcomes across a range of conditions (e.g., in vitro and in vivo, dose-response, multiple species, tissues, and chemicals), and generally performs as well as, or better than, other gene sets available.

Background. MiRNAs regulate cellular processes by acting on specific target genes. Hundreds of miRNAs and their target genes have been identified, as have many miRNA-disease associations. Cellular processes, including those related to disease, proceed through multiple interactions and are often organized into pathways among genes and gene products. Large databases on protein-protein interactions (PPIs) are available. Here, we have integrated the information mentioned above to build a web service platform, miRNA Disease Regulatory Network, or miRDRN, for users to construct disease and tissue-specific miRNA-protein regulatory networks. Methods. Data on human protein interaction, disease-associated miRNA, tumor-associated gene, miRNA targeted gene, molecular interaction and reaction network or pathway, gene ontology, gene annotation and gene product information, and gene expression were collected from publicly available databases and integrated. A complete set of regulatory sub-pathways (RSPs) having the form (M, T, G1, G2) was built from the integrated data and stored in the database part of miRDRN, where M is a disease-associated miRNA, T is its regulatory target gene, G1 is a gene/protein interacting with T, and G2 is a gene/protein interacting with G1. Each sequence (T, G1, G2) was assigned a p-value weighted by the participation of the three genes in molecular interactions and reaction pathways. Results. A web service platform, miRDRN (http://mirdrn.ncu.edu.tw/mirdrn/), was built to allow users to retrieve a disease and tissue-specific subset of RSPs, from which a miRNA regulatory network is constructed.
miRDRN is a database that currently contains 6,973,875 p-valued sub-pathways associated with 119 diseases in 78 tissue types, built from 207 disease-associated miRNAs regulating 389 genes, and a web tool that facilitates the construction and visualization of disease and tissue-specific miRNA-protein regulatory networks, for exploring single diseases or the comorbidity of disease-pairs. As demonstrations, miRDRN was applied: to explore the single disease colorectal cancer (CRC), in which 26 novel potential CRC target genes were identified; to study the comorbidity of the disease-pair Alzheimer's disease-Type 2 diabetes (AD-T2D), in which 18 novel potential comorbid genes were identified; and to explore possible causes that may shed light on recent failures of late-phase trials of anti-AD BACE1 inhibitor drugs, in which genes downstream of BACE1 whose suppression may affect signal transduction were identified.
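
The (M, T, G1, G2) construction described above can be illustrated with a minimal Python sketch. It is not the miRDRN implementation: the function name `build_rsps`, the input layout, the rule for skipping paths that loop back, and all identifiers in the toy data are assumptions made for illustration, and the p-value weighting step is omitted because the abstract does not specify it.

```python
from collections import defaultdict


def build_rsps(mirna_targets, ppi_edges):
    """Enumerate sub-pathways of the form (M, T, G1, G2).

    mirna_targets: dict mapping a miRNA name to an iterable of target genes.
    ppi_edges: iterable of undirected (gene_a, gene_b) interactions.
    """
    # Build an undirected adjacency map from the interaction list.
    neighbours = defaultdict(set)
    for a, b in ppi_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    rsps = []
    for mirna, targets in sorted(mirna_targets.items()):
        for t in sorted(targets):
            for g1 in sorted(neighbours.get(t, ())):
                if g1 == t:
                    continue  # assumption: ignore self-interactions of T
                for g2 in sorted(neighbours.get(g1, ())):
                    # Assumption: skip paths that return to T or repeat G1.
                    if g2 != t and g2 != g1:
                        rsps.append((mirna, t, g1, g2))
    return rsps


# Toy data with made-up identifiers, not taken from any real database.
toy_targets = {"miR-X": ["T1"]}
toy_ppi = [("T1", "A"), ("A", "B"), ("A", "C"), ("T1", "D")]
toy_rsps = build_rsps(toy_targets, toy_ppi)
for rsp in toy_rsps:
    print(rsp)
```

In this toy network only A has interaction partners other than T1, so every enumerated sub-pathway passes through A; D contributes none because its only partner is T1 itself.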

The sirtuin family contains seven proteins with functions in multiple diseases of aging, which makes them an attractive subject for the development of therapies for age-related diseases and anti-aging treatments. The primary objective of the protein-interaction network analysis presented here is to identify the signaling pathways and protein nodes driving the functions of the sirtuins. For this purpose, protein-protein interaction data that fulfilled the quality threshold and included at least one member of the sirtuin family were collected from the available public databases. The databases provided 66 interactions validated by several experiments, which were further processed by bioinformatic tools connected to integrated genomic, proteomic, and pharmacologic data. The interactions were analyzed by pathway enrichment, gene function prediction analysis, and protein node prioritization using the Cytoscape applications GeneMania and Cytohubba. After extension, the constructed sirtuin protein interaction network (SPIN) contained 98 protein nodes. TGFβ, PTK2, CARM1, and Notch signaling, together with the pathways regulating androgen and estrogen levels, scored significantly in the pathway enrichment analysis of SPIN. The enriched signaling pathways, which mediate the pleiotropic effects of the sirtuin family, probably play roles in several age-related diseases. The Cytohubba application highlighted HDAC1, EP300, SMAD4, MYC, SIN3A, RBBP4, HDAC, SIN3B, RBBP7 and SMAD3 as the high-priority protein nodes driving the molecular functions of SPIN. The presented protein interaction study provides new understanding of sirtuin functions in longevity and diseases of aging, including cancer, neurodegenerative and metabolic disorders.
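
As a rough illustration of what protein node prioritization means, the sketch below ranks nodes of a small interaction network by degree (the number of distinct interaction partners). This is only one simple ranking criterion; the abstract does not state which scoring method was used, and the function name `rank_by_degree` and the node labels P1 to P5 are hypothetical.

```python
from collections import Counter


def rank_by_degree(edges, top_n=3):
    """Return the top_n (node, degree) pairs of an undirected network."""
    # Deduplicate edges regardless of direction and drop self-loops.
    unique = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    # Sort by descending degree, then by name for a deterministic order.
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy network with placeholder node names (not real proteins).
toy_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P3", "P1"), ("P4", "P5"),
]
top_nodes = rank_by_degree(toy_edges)
print(top_nodes)
```

Here P1 ranks first with three distinct partners; the duplicate edge between P3 and P1 is counted only once.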

Background. Physical activity (PA) is increasingly being recognized as a major factor related to the development or prevention of many diseases, as an intervention to cure or delay disease and for patient assessment in diagnostics, as a clinical outcome measure or clinical trial endpoint. Thus, wearable sensors and signal algorithms to monitor PA in the free-living environment (real-world) are becoming popular in medicine and clinical research. This is especially true for walking speed, a parameter of PA behaviour with increasing evidence to serve as a patient outcome and clinical trial endpoint in many diseases. The development and validation of sensor signal algorithms for PA classification, in particular walking, and for deriving specific PA parameters, such as real-world walking speed, depend on the availability of large reference data sets with ground truth values. In this study a novel, reliable, scalable (high throughput), user-friendly device and method to generate such ground truth data for real-world walking speed, other physical activity types and further gait-related parameters in a real-world environment is described and validated. Methods. A surveyor’s wheel was instrumented with a rotating 3D accelerometer (actibelt). A signal processing algorithm is described to derive distance and speed values. In addition, a high-resolution camera was attached via an active gimbal to video record context and detail. Validation was performed in the following main parts: 1) walking distance measurement is compared to the wheel’s built-in mechanical counter, 2) walking speed measurement is analysed on a treadmill at various speed settings, 3) speed measurement accuracy is analysed by an independent certified calibration laboratory (accredited by DAkkS) applying standardised test procedures. Results. The mean relative error for distance measurements between our method and the built-in counter was 0.12%.
Comparison of the speed values algorithmically extracted from accelerometry data and true treadmill speed revealed a mean adjusted absolute error of 0.01 m/s (relative error: 0.71%). The calibration laboratory found a mean relative error between values algorithmically extracted from accelerometry data and the laboratory gold standard of 0.36% (0.17-0.64 min/max), which is below the resolution of the laboratory. An official certificate was issued. Discussion. Error values were an order of magnitude smaller than any clinically important difference for walking speed. Conclusion. Besides its high accuracy, the presented method can be deployed in a real-world setting and can be integrated into the digital data flow.
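
One way an accelerometer rotating with a wheel could yield distance and speed is sketched below. It is not the algorithm of the study, which the abstract does not detail. The sketch assumes that two sensor axes lie in the wheel plane, that gravity dominates the measured signal (centripetal acceleration is ignored), and that the wheel circumference is known; the function names and the synthetic data are hypothetical.

```python
import math


def wheel_distance_and_speed(ax, ay, sample_rate_hz, circumference_m):
    """Return (distance_m, mean_speed_m_per_s) from in-plane gravity samples."""
    total_angle = 0.0
    prev = math.atan2(ay[0], ax[0])
    for x, y in zip(ax[1:], ay[1:]):
        cur = math.atan2(y, x)
        delta = cur - prev
        # Unwrap jumps across the +/-pi boundary.
        if delta > math.pi:
            delta -= 2 * math.pi
        elif delta < -math.pi:
            delta += 2 * math.pi
        total_angle += delta
        prev = cur
    # Each full turn of the wheel advances it by one circumference.
    distance = abs(total_angle) / (2 * math.pi) * circumference_m
    duration_s = (len(ax) - 1) / sample_rate_hz
    return distance, distance / duration_s


def relative_error_percent(measured, reference):
    """Relative error of a measurement against a reference, in percent."""
    return abs(measured - reference) / reference * 100.0


# Synthetic signal: a 1.0 m wheel turning at 1.2 rev/s for 10 s at 100 Hz.
rate_hz, seconds, rev_per_s = 100, 10, 1.2
angles = [2 * math.pi * rev_per_s * i / rate_hz
          for i in range(rate_hz * seconds + 1)]
gx = [math.cos(a) for a in angles]
gy = [math.sin(a) for a in angles]
dist_m, speed_m_s = wheel_distance_and_speed(gx, gy, rate_hz, 1.0)
print(round(dist_m, 3), round(speed_m_s, 3))
```

The helper `relative_error_percent` mirrors the kind of relative error reported in the Results: a measured value is compared with a reference such as the mechanical counter or the treadmill setting.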

The protein-protein interaction network of seven pleiotropic proteins (PIN7) contains proteins with multiple functions in aging and age-related diseases (TPPII, CDK2, MYBBP1A, p53, SIRT6, SIRT7, and BSG). In the present work, pathway enrichment, gene function prediction and protein node prioritization analyses were applied to examine the main molecular mechanisms driving PIN7 and the extended network. The seven proteins of PIN7 were used as input for the analysis by GeneMania, a Cytoscape application, which constructs the protein interaction network. The software also extends it using interactions retrieved from databases of experimental and predicted protein-protein and genetic interactions. The analysis identified the p53 signaling pathway as the most dominant mediator of PIN7. The extended PIN7 was also analyzed by the Cytohubba application, which showed that the top-ranked protein nodes belong to the group of histone acetyltransferases and histone deacetylases. These enzymes are involved in the reverse epigenetic regulation mechanisms linked to the regulation of the PTK2, NFκB, and p53 signaling interaction subnetworks of the extended PIN7. The analysis emphasized the role of PTK2 signaling, which functions upstream of the p53 signaling pathway and whose interaction network includes all members of the sirtuin family. Further, the analysis suggested the involvement of molecular mechanisms related to metastatic cancer (prostate cancer, small cell lung cancer), hemostasis, the regulation of the thyroid hormones and the cell cycle G1/S checkpoint. The additional data-mining analysis showed that the small protein interaction network MYBBP1A-p53-TPPII-SIRT6-CD147 controls the Warburg effect and that MYBBP1A-p53-TPPII-SIRT7-BSG influences mTOR signaling and autophagy. Further investigation of the detailed mechanisms of these interaction networks would be beneficial for the development of novel treatments for aging and age-related diseases.

The technology of docking molecules in silico has evolved significantly in recent years and has become a crucial component of the drug discovery process, which includes virtual screening, lead optimization, and side-effect prediction. To date, over 43,000 abstracts/papers have been published on docking, highlighting the importance of this computational approach in the context of drug development. Considering the large number of genomic and proteomic consortia active in the public domain, docking can exploit this data on a correspondingly ‘large scale’ to address a variety of research questions. Over 160 robust and accurate molecular docking tools based on different algorithms have been made available to users across the world. Further, 109 scoring functions have been reported in the literature to date. Despite these advancements, several bottlenecks remain at the implementation stage. These issues include choosing the right docking algorithm, selecting a binding site in target proteins, performance of the given docking tool, integration of molecular dynamics information, ligand-induced conformational changes, use of solvent molecules, choice of docking pose, and choice of databases. Further, experimental studies have not always been used to validate the docking results. In this review, basic features and key concepts of docking are highlighted, with particular emphasis on its applications such as drug repositioning and prediction of side effects. The use of docking in conjunction with wet-lab experiments and epitope predictions is also summarized. Attempts have been made to systematically address the above-mentioned challenges using expert-curation and text mining strategies. Our work shows the use of machine-assisted literature mining to process and analyze huge amounts of available information in a short time frame.
With this work, we also propose to build a platform that combines human expertise (deep curation) and machine learning in a collaborative way and thus helps to solve ambitious problems (i.e. building fast, efficient docking systems by combining the best tools, or performing large-scale docking at the human proteome level).