PeerJ Preprints: Statistics

ViSiElse: An innovative R-package to visualize raw behavioral data over time

2019-11-25

The scientific community encourages the use of raw data graphs to improve the reliability and transparency of the results presented in articles. However, the current methods used to visualize raw data are limited to one or two numerical variables per graph and/or small sample sizes. In the behavioral sciences, numerous variables must be plotted together in order to gain insight into the behavior in question. In this paper, we present ViSiElse, an R-package offering a new approach in the visualization of raw data. ViSiElse was developed with the open-source software R to visualize behavioral observations over time based on raw time data extracted from visually recorded sessions of experimental observations. ViSiElse gives a global overview of a process by creating a visualization of the timestamps for multiple actions and all participants into a single graph; individual or group behavior can then be easily assessed. Additional features allow users to further inspect their data by including summary statistics and time constraints.

Integrated modeling of phylogenies, species traits, and environmental gradients to better predict biogeographic distributions

2019-09-30

There is an acknowledged need to combine species distribution and macro-ecological models with phylogenetic information, particularly when biogeographic research incorporates multiple species, explores phenotypic traits, or is spatially dynamic. Our aim is to present a new approach to multi-species joint modeling that applies spatially explicit phylogenetic regression to simultaneously predict species occurrence probability and the geographic distribution of interspecific continuous morphological traits. We developed a multi-tiered Bayesian geostatistical model that incorporates a species phylogeny, morphometric traits, and environmental variables to jointly estimate traits and geographic distributions for six species of South American leaf-eared mice (genus: Phyllotis). Covariates are included with the model to control for genetic relatedness, specimen age, specimen sex, and repeated measures errors. To help gauge model performance, we compared our approach to predictions made using several other species distribution modeling applications. Our integrated modeling framework demonstrated improved accuracy over alternative species distribution modeling techniques as judged by model sensitivity, specificity, and the true skill statistic. The inclusion of trait-based covariates and model terms to account for genetic relatedness, repeated measures, and spatial error were determined important as judged by credible intervals and parsimony metrics. Species distribution models and trait-based approaches that do not account for spatial dependencies, phylogenetic relationships, or repeated measures sampling errors may produce parameter estimates with smaller uncertainty than is warranted and produce predictions with significant error. Our study offers tools to address spatially and phylogenetically structured species data and presents an approach to integrating biological comparative methods in biogeographic research.

Bayesian meta-analysis of studies with rare events: Do the choice of prior distributions and the exclusion of studies without events in both arms matter?

2019-05-15

Randomized controlled trials (RCTs) analyzing serious adverse events often observe low incidence and might even observe zero events in either or both of the treatment and control arms. In the meta-analysis of RCTs of adverse events, it is unclear whether trials with zero events in both arms provide any information for the summary risk ratio (RR) or odds ratio (OR). Studies with zero events in both arms are usually excluded in both frequentist and Bayesian meta-analysis. We used a fully probabilistic approach—a Bayesian framework—for the meta-analysis of studies with rare events, and systematically assessed whether exclusion of studies with no events in both arms produced different results compared to keeping all studies in the meta-analysis. We did this by conducting a simulation study in which we assessed the bias in the point estimate of the log(OR) and the coverage of the 95% posterior interval for the log(OR) for different analytical decisions and choices in fixed effect and random effects meta-analysis. We used simulated data generated from a known fixed effect or random effects data scenario (each scenario with a 1000 meta-analysis data-set). We found that the uniform and Jeffrey’s prior on the baseline risk in the control group leads to biased results and a reduced coverage, and that setting the prior distribution on the log(odds) scale worked better. We also found nearly identical results regardless of whether studies with no events in both arms were excluded or not.

Environment regime influence on Chlorophyll-a abundance and distribution in North Indian Ocean

2019-05-06

North Indian Ocean region around India and Sri Lanka is a complex and rich coastal ecosystem undergoing various seasonal and inter-annual changes and various pressures. Hence the objective of this study was to assess the scales of coupling between chlorophyll-a concentration (chl-a) and the influencing variables and explore the nature of the spatiotemporal variability of them. The seasonal and annual variations of chl-a along the Bay of Bengal (BoB), Arabian sea (AS) and ocean region around Sri Lanka in relation to the physical and chemical oceanographic variables were analyzed using satellite observations covering the period of 2002-2018. The effects of diffuse attenuation coefficient, photosynthetically available radiation (PAR), sea surface temperature (SST), Wind speed, Eastward wind component, Nitrate, Black carbon column mass density, Sea Salt Surface Mass Concentration, Open water net downward longwave flux, Surface emissivity were considered on a monthly time scale. Wavelet analysis and the Boosted Regression Trees (BRT) were used as the main analysis and modeling methods. The peaks of chl-a, diffuse attenuation coefficient, and nitrate were observed in September. In wind speed and eastward wind it was July and in black carbon column mass density, and PAR in March. In Sea Salt Surface Mass Concentration, Open water net downward longwave flux, Surface emissivity, Diffuse attenuation coefficient for downwelling irradiance, and SST mean maximums were found in June, February, November, September, April respectively. In BRT model the estimated cross validation (cv) deviance, standard error (se), training data correlation, cv correlation, and D2 were 0.003, 0.002, 0.932, 0.949, and 0.846 respectively. According to the results, diffuse attenuation coefficient (90%), eastward wind component (3.7%) and nitrate (3%) were the most positively correlated variables with Chl-a occurrence. SST evidenced an inverse relationship with Chl-a. According to the model built <42 Einsteinm-2day-1 PAR, <0.986 surface emissivity, <70 Wm-2 open water net downward long wave flux, 28.2 -28.5 0C SST , 2 ms-1 Wind speed, 5 ms-1 - 6 ms-1 eastward wind, 4.8 x10-8 -7x10-8 kgm-3 sea salt surface mass concentration, and 0.1-0.5micromoleL-1 nitrate are favourable for the optimum level of phytoplankton occurrence. Since BRT deals robustly with non-linear relationships of the environmental variables it can be used in further studies of ecological modeling.

Detection of pelagic habitats and abundance of skipjack tuna in relation to the environment in the Indian Ocean around Sri Lanka

2019-05-06

Using remote sensing data of sea surface temperature (SST), chlorophyll-a (Chl-a) together with catch data, the pelagic hotspots of Skipjack tuna (SKPJ) were identified. MODIS/Aqua satellite data and the fish catch data were obtained during 2002-2016 period. Empirical cumulative distribution frequency (ECDF) model of satellite-based oceanographic data in relation to skipjack fishing was used for the initial statistical analysis and the results showed that key pelagic habitat corresponded mainly with the 0.4 – 0.7 mg m-3 Chl-a concentration. Chl-a represents the phytoplankton that attracts the food items of SKPJ like zooplankton and nekton The favorable SST range for SKPJ is 26 - 27 0C which provides suitable thermocline and an optimum level of upwelling to circulate nutrients needed for the primary production. The high total catches and CPUEs were found within the months of September to December and the optimum levels of Chl-a, SST also were observed in similar months. Hence, the South-West monsoon season was identified as the best and peak season of SKPJ fisheries. SST and Chl-a are important indicators to detect the habitats of SKPJ and the maps prepared can be used in the future to cost-effectively and efficiently identify and demarcate the biological conservation regions or fisheries zones of SKPJ. According to GAM the 0.3 - 0.6 mg m-3 Chl-a, 28 - 28.5 0C SST in Western and 0.25 - 0.3 mg m-3 Chl-a and 28.5 - 28.80C SST in Eastern were found as highly correlated predictor variables value ranges with SKPJ abundance. The deviances explained in above areas in GAM were 90.8% and 61.4% respectively. The GAM was considered as a robustly dealing method with nonlinear relationships and it can be used to model the fish catch abundance with influencing variables significantly since it could predict the CPUE values greater than 90% similarly to nominal CPUEs in both subregions of the study area.

Abandoning statistical significance is both sensible and practical

2019-04-16

To the Editor of JAMA Dr Ioannidis writes against our proposals to abandon statistical significance in scientific reasoning and publication, as endorsed in the editorial of a recent special issue of an American Statistical Association journal devoted to moving to a “post p<0.05 world.” We appreciate that he echoes our calls for “embracing uncertainty, avoiding hyped claims…and recognizing ‘statistical significance’ is often poorly understood.” We also welcome his agreement that the “interpretation of any result is far more complicated than just significance testing” and that “clinical, monetary, and other considerations may often have more importance than statistical findings.” Nonetheless, we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful. First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance. Second, nonsignificant results are often wrongly treated as zero. Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability. Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it. We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field. Valid synthesis requires accounting for all relevant evidence—not just the subset that attained statistical significance. Thus, researchers should report more, not less, providing estimates and uncertainty statements for all quantities, justifying any exceptions, and considering ways the results are wrong. Publication criteria should be based on evaluating study design, data quality, and scientific content—not statistical significance. Decisions are seldom necessary in scientific reporting. However, when they are required (as in clinical practice), they should be made based on the costs, benefits, and likelihoods of all possible outcomes, not via arbitrary cutoffs applied to statistical summaries such as p-values which capture little of this picture. The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

motoRneuron: an open-source R toolbox for time-domain motor unit analyses

2019-03-11

Motor unit synchronization is the tendency of motor neurons and their associated muscle fibers to discharge near-simultaneously. It has been theorized as a control mechanism for force generation by common excitatory inputs to these motor neurons. Magnitude of synchronization is calculated from peaks in cross-correlation histograms between motor unit discharge trains. However, there are many different methods for detecting these peaks and even more indices for calculating synchronization from them. Methodology is typically laboratory-specific and requires expensive software, like Matlab or LabView. This lack of standardization makes it difficult to draw definitive conclusions about motor unit synchronization. To combat this, we have developed a freely available, open-source toolbox, “motoRneuron”, for the R programming language. This toolbox contains functions for calculating time domain synchronization using different methods found in the literature. Our objective is to detail the program’s functionality and provide a clear use-case for implementation. The programs primary function “mu_synch” automatically performs the cross-correlation analysis based on user input. Automated peak detection methods such as the cumulative sum method and the z-score method, as well as subjective, visual analysis are available. Users can also define other parameters like the number of recurrence intervals to be used and histogram bin size. The function outputs six common synchronization indices, the common input strength (CIS), k’, k’-1, E, S, and Synch Index. This toolbox allows for better standardization of techniques and for more comprehensive data mining in the motor control community.

FORESEE: a tool for the systematic comparison of translational drug response modeling pipelines

2019-02-11

Translational models that utilize omics data generated in in vitro studies to predict the drug efficacy of anti-cancer compounds in patients are highly distinct, which complicates the benchmarking process for new computational approaches. In reaction to this, we introduce the uniFied translatiOnal dRug rESponsE prEdiction platform FORESEE, an open-source R-package. FORESEE not only provides a uniform data format for public cell line and patient data sets, but also establishes a standardized environment for drug response prediction pipelines, incorporating various state-of-the-art preprocessing methods, model training algorithms and validation techniques. The modular implementation of individual elements of the pipeline facilitates a straightforward development of combinatorial models, which can be used to re-evaluate and improve already existing pipelines as well as to develop new ones. Availability and Implementation: FORESEE is licensed under GNU General Public License v3.0 and available at https://github.com/JRC-COMBINE/FORESEE . Supplementary Information: Supplementary Files 1 and 2 provide detailed descriptions of the pipeline and the data preparation process, while Supplementary File 3 presents basic use cases of the package. Contact: schuppert@combine.rwth-aachen.de

Twenty steps towards an adequate inferential interpretation of p-values

2019-02-11

We suggest twenty immediately actionable steps to reduce widespread inferential errors related to “statistical significance testing.” Our propositions refer first to the theoretical preconditions for using p-values. They furthermore include wording guidelines as well as structural and operative advice on how to present results, especially in research based on multiple regression analysis, the working horse of empirical economists. Our propositions aim at fostering the logical consistency of inferential arguments by avoiding false categorical reasoning. They are not aimed at dispensing with p-values or completely replacing frequentist approaches by Bayesian statistics.

MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies

2019-02-06

We previously reported MetaBAT, an automated metagenome binning software tool to reconstruct single genomes from microbial communities for subsequent analyses of uncultivated microbial species. MetaBAT has become one of the most popular binning tools largely due to its computational efficiency and ease of use, especially in binning experiments with a large number of samples and a large assembly. MetaBAT requires users to choose parameters to fine-tune its sensitivity and specificity. If those parameters are not chosen properly, binning accuracy can suffer, especially on assemblies of poor quality. Here we developed MetaBAT 2 to overcome this problem. MetaBAT 2 uses a new adaptive binning algorithm to eliminate manual parameter tuning. We also performed extensive software engineering optimization to increase both computational and memory efficiency. Comparing MetaBAT 2 to alternative software tools on over 100 real world metagenome assemblies shows superior accuracy and computing speed. Binning a typical metagenome assembly takes only a few minutes on a single commodity workstation. We therefore recommend the community adopts MetaBAT 2 for their metagenome binning experiments. MetaBAT 2 is open source software and available at https://bitbucket.org/berkeleylab/metabat.