PeerJ Preprints: Data Science
https://peerj.com/preprints/index.atom?journal=peerj&subject=9600
Data Science articles published in PeerJ Preprints

ViSiElse: An innovative R-package to visualize raw behavioral data over time
https://peerj.com/preprints/27665 (2019-11-25)
Elodie M Garnier, Nastasia Fouret, Médéric Descoins
The scientific community encourages the use of raw data graphs to improve the reliability and transparency of the results presented in articles. However, current methods used to visualize raw data are limited to one or two numerical variables per graph and/or small sample sizes. In the behavioral sciences, numerous variables must be plotted together in order to gain insight into the behavior in question. In this paper, we present ViSiElse, an R-package offering a new approach to the visualization of raw data. ViSiElse was developed with the open-source software R to visualize behavioral observations over time, based on raw time data extracted from visually recorded sessions of experimental observations. ViSiElse gives a global overview of a process by combining the timestamps for multiple actions and all participants into a single graph; individual or group behavior can then be easily assessed. Additional features allow users to further inspect their data by including summary statistics and time constraints.
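As a rough illustration of the idea behind such a graph (not ViSiElse's actual API, which is an R package; the action names and timestamps below are invented), the data a raw-behavioral-data visualization consumes is a participants-by-actions table of timestamps, with per-action summary statistics overlaid:

```python
# Sketch of the data layout a ViSiElse-style plot consumes (hypothetical
# example data, not the package's interface): one row per participant,
# one column per timed action, values are timestamps in seconds.
from statistics import median

actions = ["pick_up_mask", "check_airway", "start_ventilation"]
timestamps = {
    "P1": [12, 45, 80],
    "P2": [15, 50, 95],
    "P3": [10, 40, 75],
}

def action_summaries(timestamps, actions):
    """Per-action min/median/max, the kind of summary overlaid on the graph."""
    by_action = {}
    for i, action in enumerate(actions):
        values = [row[i] for row in timestamps.values()]
        by_action[action] = {"min": min(values), "median": median(values), "max": max(values)}
    return by_action

summaries = action_summaries(timestamps, actions)
print(summaries["check_airway"])  # {'min': 40, 'median': 45, 'max': 50}
```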
Software and resources for experiments and data analysis of MEG and EEG data
https://peerj.com/preprints/27988 (2019-10-21)
Lau M. Andersen
Data from magnetoencephalography (MEG) and electroencephalography (EEG) are extremely rich and multifaceted. For example, in a standard MEG recording with 306 sensors and a sampling rate of 1,000 Hz, 306,000 data points are sampled every second. Answering the scientific question that was the ultimate reason for acquiring the data thus necessitates efficient data handling. Luckily, several software packages have been developed for handling MEG and/or EEG data. To name some of the most popular: MNE-Python, FieldTrip, Brainstorm, EEGLAB and SPM. These are all available under free and open-source licences, meaning that they can be run, shared and modified by anyone. Commercial software released under proprietary licences includes BESA and CURRY. It is important to be aware that clinical diagnosis of, for example, epilepsy requires certified software; FieldTrip, MNE-Python, Brainstorm, EEGLAB and SPM cannot be used for that purpose. In this chapter, the emphasis will be on MNE-Python and FieldTrip. This will allow users of both Python and MATLAB (or, alternatively, GNU Octave) to code along as the chapter unfolds. As a general remark, all that MNE-Python can do, FieldTrip can do and vice versa, though with some small differences. A full analysis going from raw data to a source reconstruction will be presented, illustrated with both code and figures, with the aim of providing newcomers to the field with a stepping stone towards doing their own analyses of their own datasets.
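As a minimal sketch of one core step such pipelines perform (a simulated toy signal, not MNE-Python's or FieldTrip's actual API): cutting a continuous recording into epochs around event samples and averaging them into an evoked response.

```python
# Epoch-and-average: the basic operation behind an evoked response.
# All numbers are simulated for illustration.

def epoch_and_average(signal, events, pre, post):
    """Average signal segments [event - pre, event + post) across events."""
    epochs = [signal[e - pre : e + post] for e in events
              if e - pre >= 0 and e + post <= len(signal)]
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]

# A toy 'recording': zeros with a deflection one sample after each event.
signal = [0.0] * 30
for event in (10, 20):
    signal[event + 1] = 1.0

evoked = epoch_and_average(signal, events=[10, 20], pre=2, post=4)
print(evoked)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Averaging across repeated events suppresses noise that is not time-locked to the stimulus, which is why this step appears early in most MEG/EEG analyses.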
Sequencing data discovery with MetaSeek
https://peerj.com/preprints/27804 (2019-06-16)
Adrienne Hoarfrost, Nick Brown, C. Titus Brown, Carol Arnosti
Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user’s particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database, and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share, and download matching sequencing metadata.
The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via the API to programmatically search, filter, and download all metadata. MetaSeek source code, metadata scrapers, and documentation are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/. Additional guides and tutorials are available in that repository and on the MetaSeek website. MetaSeek is distributed under an MIT license.
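As a hedged sketch of the kind of field-based filtering the MetaSeek API enables (the field names and records below are invented for illustration; consult the project documentation for the real interface):

```python
# Toy metadata records mimicking sequencing-dataset metadata
# (IDs, fields and values are hypothetical).
datasets = [
    {"id": "SRS001", "library_strategy": "AMPLICON", "env_biome": "marine"},
    {"id": "SRS002", "library_strategy": "WGS", "env_biome": "marine"},
    {"id": "SRS003", "library_strategy": "WGS", "env_biome": "soil"},
]

def filter_datasets(datasets, **criteria):
    """Return datasets whose metadata match every given field=value pair."""
    return [d for d in datasets
            if all(d.get(field) == value for field, value in criteria.items())]

marine_wgs = filter_datasets(datasets, library_strategy="WGS", env_biome="marine")
print([d["id"] for d in marine_wgs])  # ['SRS002']
```

The value of a structured, standard-compliant database is precisely that such exact-match filters become possible; free-text, user-provided metadata would not support them reliably.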
Novel citation-based search method for scientific literature: a validation study
https://peerj.com/preprints/27646 (2019-04-11)
Cecile Janssens, Marta Gwinn, J. Elaine Brockman, Kimberley Powell, Michael Goodman
Objective: We recently developed CoCites, a citation-based search method designed to be more efficient than traditional keyword-based methods. The method begins with the identification of one or more highly relevant publications (query articles) and consists of two searches: the co-citation search, which ranks publications on their co-citation frequency with the query articles, and the citation search, which ranks publications on the frequency of all citations that cite or are cited by the query articles. Materials and Methods: We aimed to reproduce the literature searches of published systematic reviews and meta-analyses (n=250) and assess whether CoCites retrieves all eligible articles while screening fewer titles. Results: CoCites retrieved a median of 75% of the articles that were included in the original reviews. The percentage of retrieved articles was higher (88%) when the query articles were cited more frequently and when they had more overlap in their citations. Applying CoCites to only the highest-cited article yielded similar results. The co-citation and citation searches combined were more efficient when the review authors had screened more than 500 titles, but not when they had screened fewer. Discussion: CoCites uses the expert knowledge of authors to rank related articles. The method does not depend on keyword selection and requires no special expertise to build search queries. The method is transparent and reproducible. Conclusion: CoCites is an efficient and accurate method for finding relevant related articles.
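The co-citation search described above can be sketched as follows (toy citation data, not the CoCites implementation): count how often each candidate article appears in the same reference lists as the query article.

```python
# Co-citation counting over hypothetical reference lists:
# two articles are co-cited when one paper cites both of them.
from collections import Counter

# citing paper -> articles in its reference list (invented IDs)
references = {
    "paper1": {"Q", "A", "B"},
    "paper2": {"Q", "A"},
    "paper3": {"Q", "B", "C"},
    "paper4": {"A", "C"},
}

def co_citation_counts(references, query):
    """Count, for each other article, how many papers cite it together with `query`."""
    counts = Counter()
    for refs in references.values():
        if query in refs:
            counts.update(refs - {query})
    return counts

counts = co_citation_counts(references, "Q")
print(counts["A"], counts["B"], counts["C"])  # 2 2 1
```

Articles with the highest counts are returned first, which is how the method ranks related work without any keyword query.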
A citizen science approach to evaluating US cities for biotic homogenization
https://peerj.com/preprints/27472 (2019-03-19)
Misha Leong, Michelle D Trautwein
Cities around the world have converged on structural and environmental characteristics that exert similar eco-evolutionary pressures on local communities. However, how urban biodiversity responds to urban intensification remains poorly understood because of the challenges in capturing the diversity of a range of taxa within and across multiple cities experiencing different types of urbanization. Here we utilize a growing resource: citizen science data. We analyzed 66,209 observations representing 5,209 species generated by the City Nature Challenge project on the iNaturalist platform, in conjunction with remote sensing (NLCD2011) environmental data, to test for urban biotic homogenization at increasing levels of urban intensity across 14 metropolitan cities in the United States. Based on community composition analyses, we found that while similarities occur to an extent, urban biodiversity often remains much more a reflection of the taxa living locally in a region. At the same time, communities found in high-intensity development were less explained by regional context than communities from other land cover types were. We also found that the most commonly observed species are often shared between cities and are non-endemic and/or have distributions facilitated by humans. This study highlights the value of citizen science data in answering questions in urban ecology.
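A community-composition comparison of the kind used to test for homogenization can be illustrated with a toy example (invented species lists, not the study's data): the Jaccard index measures how much two cities' observed communities overlap.

```python
# Jaccard similarity between two species communities:
# 1.0 means identical species lists, 0.0 means no species in common.

def jaccard(a, b):
    """Share of species found in both communities out of all species seen."""
    return len(a & b) / len(a | b)

# Hypothetical observed-species sets for two cities.
city_a = {"rock pigeon", "house sparrow", "eastern gray squirrel", "monarch"}
city_b = {"rock pigeon", "house sparrow", "western fence lizard", "monarch"}

print(jaccard(city_a, city_b))  # 0.6
```

Biotic homogenization would show up as between-city similarities like this one increasing with urban intensity.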
Method to collect ground truth data for walking speed in real-world environments: description and validation
https://peerj.com/preprints/27558 (2019-02-28)
Gerhard Aigner, Bernd Grimm, Christian Lederer, Martin Daumer
Background. Physical activity (PA) is increasingly being recognized as a major factor in the development or prevention of many diseases, as an intervention to cure or delay disease, and for patient assessment in diagnostics, as a clinical outcome measure or clinical trial endpoint. Thus, wearable sensors and signal algorithms to monitor PA in the free-living (real-world) environment are becoming popular in medicine and clinical research. This is especially true for walking speed, a parameter of PA behaviour with increasing evidence supporting its use as a patient outcome and clinical trial endpoint in many diseases. The development and validation of sensor signal algorithms for PA classification, in particular walking, and for deriving specific PA parameters, such as real-world walking speed, depend on the availability of large reference data sets with ground truth values. In this study, a novel, reliable, scalable (high-throughput), user-friendly device and method to generate such ground truth data for real-world walking speed, other physical activity types and further gait-related parameters in a real-world environment is described and validated.
Methods. A surveyor's wheel was instrumented with a rotating 3D accelerometer (actibelt). A signal processing algorithm is described to derive distance and speed values. In addition, a high-resolution camera was attached via an active gimbal to video-record context and detail. Validation was performed in three main parts: 1) walking distance measurement was compared to the wheel's built-in mechanical counter, 2) walking speed measurement was analysed on a treadmill at various speed settings, and 3) speed measurement accuracy was analysed by an independent certified calibration laboratory (accredited by DAkkS) applying standardised test procedures.
Results. The mean relative error for distance measurements between our method and the built-in counter was 0.12%. Comparison of the speed values algorithmically extracted from accelerometry data and true treadmill speed revealed a mean adjusted absolute error of 0.01 m/s (relative error: 0.71%). The calibration laboratory found a mean relative error between values algorithmically extracted from accelerometry data and the laboratory gold standard of 0.36% (min/max: 0.17–0.64%), which is below the resolution of the laboratory. An official certificate was issued.
Discussion. Error values were an order of magnitude smaller than any clinically important difference for walking speed.
Conclusion. Besides its high accuracy, the presented method can be deployed in a real-world setting and can be integrated into the digital data flow.
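A back-of-the-envelope sketch of the wheel-based ground truth and the error metric used in the validation (all numbers below are invented; the actibelt signal-processing algorithm itself is not shown):

```python
# Distance from counted wheel revolutions, speed from distance over time,
# and the relative error used to compare methods. Wheel size is hypothetical.
import math

WHEEL_DIAMETER_M = 0.318  # assumed surveyor's wheel (~1 m circumference)

def distance_m(revolutions):
    """Distance rolled = revolutions x wheel circumference."""
    return revolutions * math.pi * WHEEL_DIAMETER_M

def speed_m_per_s(revolutions, duration_s):
    return distance_m(revolutions) / duration_s

def relative_error(measured, reference):
    """|measured - reference| / reference, as reported in the validation."""
    return abs(measured - reference) / reference

walked = distance_m(100)          # ~99.9 m for 100 revolutions
speed = speed_m_per_s(100, 72.0)  # ~1.39 m/s, a brisk walking speed
print(round(relative_error(walked, 100.0), 4))
```

The reported errors (0.12% for distance, 0.36% for speed) are of exactly this relative form, computed against the mechanical counter and the calibration laboratory's reference.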
Quality assured sampling by engaged citizen scientists supports state agency coastal water quality monitoring programs
https://peerj.com/preprints/27548 (2019-02-22)
Kim Falinski, Tova Callender, Emily Fielding, Robin Newbold, Dana Reed, James Strickland, Alana Yurkanin, Hudson Slay, Myron Honda
Pacific island coral reef ecosystems are particularly threatened by anthropogenic stresses we can manage in the context of global threats we cannot control. State agencies are challenged to sample coastal waters at the spatial and temporal resolution needed to make decisions about improving watershed management. The acquisition of environmental data by committed non-profit organizations and trained community members represents a major opportunity to support agency monitoring programs and to complement field campaigns in the study of watershed dynamics. When data collection protocols match state agency protocols and are supported by sufficient documentation, there is an opportunity to create regulatory-quality data that can inform management. We describe the formation of the first volunteer group in Hawaii to establish a quality assured water quality sampling program matching the Hawaii Department of Health's protocols. Hui O Ka Wai Ola, a partnership between three non-profit organizations on Maui, Hawaii, has trained 40 volunteers to use methods that directly match the state program. The group has collected more than 900 discrete samples at 48 sites, providing the most comprehensive picture of water quality on Maui to date, motivating community activism and catalyzing large-scale restoration efforts in the adjoining watersheds. Results highlight coastal areas that have poor water quality, delineate a baseline from which to compare future restoration projects, and emphasize parts of the sampling protocol that might be improved for more reliable data.
The molecular mechanisms associated with PIN7, a protein-protein interaction network of seven pleiotropic proteins
https://peerj.com/preprints/27547 (2019-02-20)
Jarmila Nahálková
The protein-protein interaction network of seven pleiotropic proteins (PIN7) contains proteins with multiple functions in aging and age-related diseases (TPPII, CDK2, MYBBP1A, p53, SIRT6, SIRT7, and BSG). In the present work, pathway enrichment, gene function prediction and protein node prioritization analyses were applied to examine the main molecular mechanisms driving PIN7 and the extended network. The seven proteins of PIN7 were used as input for analysis by GeneMANIA, a Cytoscape application that constructs the protein interaction network. The software also extends the network using interactions retrieved from databases of experimental and predicted protein-protein and genetic interactions. The analysis identified the p53 signaling pathway as the most dominant mediator of PIN7. The extended PIN7 was also analyzed with the cytoHubba application, which showed that the top-ranked protein nodes belong to the group of histone acetyltransferases and histone deacetylases. These enzymes are involved in reversible epigenetic regulation mechanisms linked to the regulation of the PTK2, NFκB, and p53 signaling interaction subnetworks of the extended PIN7. The analysis emphasized the role of PTK2 signaling, which functions upstream of the p53 signaling pathway and whose interaction network includes all members of the sirtuin family. Further, the analysis suggested the involvement of molecular mechanisms related to metastatic cancer (prostate cancer, small cell lung cancer), hemostasis, the regulation of thyroid hormones and the cell cycle G1/S checkpoint. Additional data-mining analysis showed that the small protein interaction network MYBBP1A-p53-TPPII-SIRT6-CD147 controls the Warburg effect, and that MYBBP1A-p53-TPPII-SIRT7-BSG influences mTOR signaling and autophagy. Further investigation of the detailed mechanisms of these interaction networks would be beneficial for the development of novel treatments for aging and age-related diseases.
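Degree-based hub ranking, one of the simplest node-prioritization methods offered by tools such as cytoHubba, can be sketched on a toy network over the PIN7 members (the edges below are illustrative only, not the study's network):

```python
# Rank nodes of a protein interaction network by degree (edge count).
# Gene symbols name PIN7 members; the edge list itself is invented.

edges = [
    ("TP53", "CDK2"), ("TP53", "SIRT6"), ("TP53", "SIRT7"),
    ("TP53", "MYBBP1A"), ("CDK2", "SIRT6"), ("SIRT7", "MYBBP1A"),
    ("TPP2", "TP53"), ("BSG", "TP53"),
]

def degree_ranking(edges):
    """Count each node's edges and sort nodes from most to least connected."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(degree.items(), key=lambda kv: kv[1], reverse=True)

ranking = degree_ranking(edges)
print(ranking[0])  # ('TP53', 6)
```

Degree is only one of several hub metrics (cytoHubba also offers, e.g., betweenness-based scores); the principle of prioritizing highly connected nodes is the same.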
QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science
https://peerj.com/preprints/27295 (2018-12-03)
Evan Bolyen, Jai Ram Rideout, Matthew R Dillon, Nicholas A Bokulich, Christian Abnet, Gabriel A Al-Ghalith, Harriet Alexander, Eric J Alm, Manimozhiyan Arumugam, Francesco Asnicar, Yang Bai, Jordan E Bisanz, Kyle Bittinger, Asker Brejnrod, Colin J Brislawn, C Titus Brown, Benjamin J Callahan, Andrés Mauricio Caraballo-Rodríguez, John Chase, Emily Cope, Ricardo Da Silva, Pieter C Dorrestein, Gavin M Douglas, Daniel M Durall, Claire Duvallet, Christian F Edwardson, Madeleine Ernst, Mehrbod Estaki, Jennifer Fouquier, Julia M Gauglitz, Deanna L Gibson, Antonio Gonzalez, Kestrel Gorlick, Jiarong Guo, Benjamin Hillmann, Susan Holmes, Hannes Holste, Curtis Huttenhower, Gavin Huttley, Stefan Janssen, Alan K Jarmusch, Lingjing Jiang, Benjamin Kaehler, Kyo Bin Kang, Christopher R Keefe, Paul Keim, Scott T Kelley, Dan Knights, Irina Koester, Tomasz Kosciolek, Jorden Kreps, Morgan GI Langille, Joslynn Lee, Ruth Ley, Yong-Xin Liu, Erikka Loftfield, Catherine Lozupone, Massoud Maher, Clarisse Marotz, Bryan D Martin, Daniel McDonald, Lauren J McIver, Alexey V Melnik, Jessica L Metcalf, Sydney C Morgan, Jamie Morton, Ahmad Turan Naimey, Jose A Navas-Molina, Louis Felix Nothias, Stephanie B Orchanian, Talima Pearson, Samuel L Peoples, Daniel Petras, Mary Lai Preuss, Elmar Pruesse, Lasse Buur Rasmussen, Adam Rivers, Michael S Robeson II, Patrick Rosenthal, Nicola Segata, Michael Shaffer, Arron Shiffer, Rashmi Sinha, Se Jin Song, John R Spear, Austin D Swafford, Luke R Thompson, Pedro J Torres, Pauline Trinh, Anupriya Tripathi, Peter J Turnbaugh, Sabah Ul-Hasan, Justin JJ van der Hooft, Fernando Vargas, Yoshiki Vázquez-Baeza, Emily Vogtmann, Max von Hippel, William Walters, Yunhu Wan, Mingxun Wang, Jonathan Warren, Kyle C Weber, Chase HD Williamson, Amy D Willis, Zhenjiang Zech Xu, Jesse R Zaneveld, Yilong Zhang, Qiyun Zhu, Rob Knight, J Gregory Caporaso
We present QIIME 2, an open-source microbiome data science platform accessible to users spanning the microbiome research ecosystem, from scientists and engineers to clinicians and policy makers. QIIME 2 provides new features that will drive the next generation of microbiome research. These include interactive spatial and temporal analysis and visualization tools, support for metabolomics and shotgun metagenomics analysis, and automated data provenance tracking to ensure reproducible, transparent microbiome data science.
Reusing microarray clinical data from a complex disease with bioinformatics tool
https://peerj.com/preprints/27398 (2018-11-30)
Eugenio Del Prete, Angelo Facchiano, Pietro Liò
Clinical bioinformatics, translational bioinformatics and personalised medicine are connected by the common task of analysing and integrating clinical data and results in order to find important biomarkers related to pathologies and to facilitate their prediction, diagnosis and treatment. New technologies provide the possibility of having more and more clinical data available in online databases. These data can be reused to study complex diseases from novel points of view. This work shows how this is possible using online microarray data from coeliac disease and some of its comorbidities, combining both the data and the results. The main goal is the extraction of common evidence among the selected pathologies, from genes to different kinds of functional annotation, showing which biological processes are most involved in these autoimmune disorders and quantifying the similarity between coeliac disease and its comorbidities. The pipeline is developed in the R language and is semi-automated. Methodologically, the advantage of this work is the possibility of performing the entire analysis starting from a different pathology; clinically, scientists can use data already published to highlight old and new evidence, with the possibility of improving knowledge of a complex disease as new microarray data become available.
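Quantifying similarity between pathologies from their differentially expressed gene sets can be sketched with an overlap coefficient (the gene lists below are invented for illustration, not results from the study):

```python
# Overlap coefficient between two gene sets: |A n B| / min(|A|, |B|).
# It reaches 1.0 when the smaller set is fully contained in the larger.

def overlap_coefficient(a, b):
    return len(a & b) / min(len(a), len(b))

# Hypothetical differentially-expressed gene sets for two pathologies.
de_genes = {
    "coeliac disease": {"IL21", "IL2", "CTLA4", "SH2B3"},
    "type 1 diabetes": {"IL2", "CTLA4", "PTPN22", "INS"},
}

score = overlap_coefficient(de_genes["coeliac disease"],
                            de_genes["type 1 diabetes"])
print(score)  # 0.5
```

Computed pairwise across a disease and its comorbidities, such scores give a simple quantitative ranking of which comorbidities share the most molecular evidence with the index disease.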