PeerJ Preprints: Statistics
Feed: https://peerj.com/preprints/index.atom?journal=peerj&subject=7900
Statistics articles published in PeerJ Preprints

Protocol: The effect of whole-grain dietary intake on non-communicable diseases: A systematic review, multivariate meta-analysis and dose-response of prospective cohorts, cross-sectional, case-control and intervention studies
https://peerj.com/preprints/26710 | Published: 2018-03-15
Authors: Wasim A Iqbal, Gavin B Stewart, Abigail J Smith, Chris J Seal
The proposed protocol is for a systematic review and meta-analysis of the effects of whole grains (WG) on non-communicable diseases (NCDs) such as type 2 diabetes, cardiovascular disease, hypertension and obesity. The primary objective is to explore the mechanisms by which WG intake affects multiple biomarkers of NCDs, such as fasting glucose and fasting insulin, among others. The secondary objective is to examine the dose-response relationships underlying these mechanisms. The protocol outlines the motivation and scope of the review, and its methodology, including risk-of-bias assessment, statistical analysis, screening, and study criteria.

GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data
https://peerj.com/preprints/3417 | Published: 2018-03-07
Authors: Li Chen, James Reeve, Lujun Zhang, Shengbing Huang, Xuefeng Wang, Jun Chen
Normalization is the first critical step in microbiome sequencing data analysis, used to account for variable library sizes. Current RNA-Seq-based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of such data, which contain a vast number of zeros due to the physical absence or under-sampling of microbes. Normalization methods that specifically address this zero inflation remain largely undeveloped. Here we propose GMPR, a simple but effective normalization method for zero-inflated sequencing data such as microbiome data. Simulation studies and analyses of real datasets demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.
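The abstract does not spell out the algorithm, but the name GMPR abbreviates "geometric mean of pairwise ratios": each sample's size factor is the geometric mean, over all other samples, of the median count ratio computed on taxa observed in both samples of a pair, so zeros never enter a ratio. A rough Python sketch of that idea (the function name and the toy count matrix are illustrative, not from the paper):

```python
import numpy as np

def gmpr_size_factors(counts):
    """Sketch of the GMPR idea: for each sample, take the geometric mean of
    the medians of pairwise count ratios, using only taxa that are nonzero
    in both samples of a pair. counts: (n_samples, n_taxa) array."""
    n = counts.shape[0]
    factors = np.zeros(n)
    for i in range(n):
        medians = []
        for j in range(n):
            if i == j:
                continue
            # restrict to taxa observed in both samples (avoids zero ratios)
            shared = (counts[i] > 0) & (counts[j] > 0)
            if shared.any():
                medians.append(np.median(counts[i, shared] / counts[j, shared]))
        factors[i] = np.exp(np.mean(np.log(medians)))  # geometric mean
    return factors

counts = np.array([[10, 0, 4, 6],
                   [20, 2, 8, 0],
                   [ 5, 1, 2, 3]], dtype=float)
print(gmpr_size_factors(counts))
```

Dividing each sample's counts by its size factor puts samples on a comparable scale without letting zero-inflated taxa distort the estimate, which is the robustness property the abstract claims.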

Brain network connectivity underlying decisions between the "lesser of two evils"
https://peerj.com/preprints/3340 | Published: 2018-03-01
Authors: Colleen Mills-Finnerty, Catherine Hanson, Stephen J Hanson
In daily life we are often forced to choose between the “lesser of two evils,” yet there remains limited understanding of how the brain encodes choices between aversive stimuli, particularly choices involving hypothetical futures. We tested how choice framing affects brain activity and network connectivity by having participants make choices about individualized, aversive, hypothetical stimuli (e.g. illnesses, car accidents) under approach and avoidance frames (“which would you rather have/avoid”) during fMRI scanning. We tested whether limbic and frontal regions show patterns of signal intensity and network connectivity that differ by frame, and compared this to responses to similar choices involving appetitive preferences (e.g. hobbies, vacation destinations). We predicted that regions such as the insula, amygdala, and striatum would respond differently to approach vs. avoidance frames during aversive hypothetical choices. We identified activations for both choice frames in areas broadly associated with decision making, including the putamen, insula, and anterior cingulate, as well as deactivations in areas shown to be sensitive to valence, including the amygdala, insula, prefrontal cortex, and hippocampus. Connectivity between brain regions differed by choice frame, with greater connectivity among deactivated regions, including the amygdala, insula, and ventromedial prefrontal cortex, during avoidance frames than during approach frames. These differences suggest that approach and avoidance frames lead to different behavioral and brain network responses when deciding which of two evils is the lesser.

Twenty steps towards an adequate inferential interpretation of p-values
https://peerj.com/preprints/3482 | Published: 2018-02-09
Authors: Norbert Hirschauer, Sven Grüner, Oliver Mußhoff, Claudia Becker
We suggest twenty immediately actionable steps to reduce widespread inferential errors related to “statistical significance testing.” Our propositions first address the theoretical preconditions for using p-values. They further include wording guidelines as well as structural and operative advice on how to present results, especially in multiple regression analyses. Our propositions aim at fostering the logical consistency of inferential arguments by avoiding false categorical reasoning. They are not aimed at dispensing with p-values or at completely replacing frequentist approaches with Bayesian statistics.
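As one illustration of presentation advice in this spirit (this sketch is not one of the authors' twenty steps), a multiple regression report can lead with point estimates, standard errors, and interval estimates rather than a bare "significant / not significant" verdict. A minimal Python sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 120
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.0 * x2 + rng.normal(size=n)  # x2 has no true effect

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])            # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Report estimates with uncertainty instead of a categorical verdict
# (1.96 is the approximate normal 95% quantile).
for name, b, s in zip(["intercept", "x1", "x2"], beta, se):
    print(f"{name}: {b:.2f} (SE {s:.2f}, "
          f"95% CI [{b - 1.96 * s:.2f}, {b + 1.96 * s:.2f}])")
```

Reporting the full estimate-plus-interval triple keeps effect sizes and uncertainty visible, which is the kind of wording discipline the paper argues for.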

Quantitative color profiling of images in a comparative framework using the R package colordistance
https://peerj.com/preprints/26487 | Published: 2018-02-05
Authors: Hannah Weller, Mark Westneat
Color is a central aspect of biology, with important impacts on ecology and evolution. Organismal color may be adaptive or incidental, seasonal or permanent, species- or population-specific, or modified for breeding, defense or camouflage. Thus, measuring and comparing color among organisms provides important biological insights. However, color comparison is limited by color categorization methods, with few universal tools available for quantitative color profiling and comparison. We present a package of R tools for processing images of organisms (or other objects) to quantify color profiles, gather color trait data, and compare color palettes in a reproducible way. The package treats image pixels as 3D coordinates in “color space”, producing a multidimensional color histogram for each image. Pairwise distances between histograms are computed using earth mover's distance or a combination of more conventional distance metrics. The user sets parameters for generating color histograms, and comparative color profile analysis is performed through pairwise comparisons to produce a color distance matrix for a set of images. The toolkit provided in the colordistance R package supports statistical analyses of quantitative color variation in organisms. We illustrate the use of colordistance with three biological examples: hybrid coloration in butterflyfishes; mimicry in wing coloration in Heliconius butterflies; and background matching in camouflaging flounder fish. The tools presented for quantitative color analysis may be applied to a broad range of questions in biology and other disciplines.
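colordistance itself is an R package; purely as a language-neutral illustration of the pipeline the abstract describes (bin pixels into a 3D color-space histogram, then compare histograms with a distance metric), here is a Python sketch using a chi-squared histogram distance, one of the "more conventional" metrics. Function names and the toy pixel sets are illustrative, not the package's API:

```python
import numpy as np

def color_histogram(pixels, bins=2):
    """Bin RGB pixels (values in [0, 1]) into a bins x bins x bins histogram
    and normalize so the bin proportions sum to 1."""
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=[(0, 1)] * 3)
    return hist.ravel() / hist.sum()

def chi_square_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# Two simulated "images": one with reddish pixels, one with bluish pixels
rng = np.random.default_rng(0)
reddish = rng.uniform([0.6, 0.0, 0.0], [1.0, 0.3, 0.3], size=(500, 3))
bluish = rng.uniform([0.0, 0.0, 0.6], [0.3, 0.3, 1.0], size=(500, 3))

h_red, h_blue = color_histogram(reddish), color_histogram(bluish)
print(chi_square_distance(h_red, h_red))   # identical palettes: distance 0
print(chi_square_distance(h_red, h_blue))  # distinct palettes: larger distance
```

Repeating the pairwise comparison over a set of images yields the color distance matrix the abstract describes, which can then feed clustering or other statistical analyses.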

Sci-Hub provides access to nearly all scholarly literature
https://peerj.com/preprints/3100 | Published: 2018-02-02
Authors: Daniel S Himmelstein, Ariel R Romero, Jacob G Levernier, Thomas A Munro, Stephen R McLaughlin, Bastian Greshake Tzovaras, Casey S Greene
The website Sci-Hub enables users to download PDF versions of scholarly articles, including many articles that are paywalled at their journal’s site. Sci-Hub has grown rapidly since its creation in 2011, but the extent of its coverage was unclear. Here we report that, as of March 2017, Sci-Hub’s database contains 68.9% of the 81.6 million scholarly articles registered with Crossref and 85.2% of articles published in toll access journals. We find that coverage varies by discipline and publisher and that Sci-Hub preferentially covers popular, paywalled content. For toll access articles, green open access via licit services is quite limited, while Sci-Hub provides greater coverage than a major research university. Our interactive browser at https://greenelab.github.io/scihub allows users to explore these findings in more detail. For the first time, nearly all scholarly literature is available gratis to anyone with an Internet connection, suggesting the toll access business model will become unsustainable.

Data collection for seed system network analysis
https://peerj.com/preprints/2806 | Published: 2018-01-31
Authors: Christopher E Buddenhagen, Kelsey F Andersen, James C Fulton, Karen A Garrett
We present survey questions useful for describing agricultural seed systems. The questions are designed so that they can be used for standardized comparisons among seed systems, addressing both networks for seed movement and networks for the communication of information related to variety selection and integrated pest management. This approach provides information that can be used in multilayer network analyses of how information influences seed system success. Also provided are example data sheets with field descriptors that should allow straightforward statistical analysis after data collection.

A brief introduction to mixed effects modelling and multi-model inference in ecology
https://peerj.com/preprints/3113 | Published: 2018-01-10
Authors: Xavier A Harrison, Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N Fisher, Cecily Goodwin, Beth Robinson, David J Hodgson, Richard Inger
The use of linear mixed effects models (LMMs) is increasingly common in the analysis of biological data. Whilst LMMs offer a flexible approach to modelling a broad range of data types, ecological data are often complex and require complex model structures, and the fitting and interpretation of such models is not always straightforward. The ability to achieve robust biological inference requires that practitioners know how and when to apply these tools. Here, we provide a general overview of current methods for the application of LMMs to biological data, and highlight the typical pitfalls that can be encountered in the statistical modelling process. We tackle several issues relating to the use of information theory and multi-model inference in ecology, and demonstrate the tendency for data dredging to lead to greatly inflated Type I error rate (false positives) and impaired inference. We offer practical solutions and direct the reader to key references that provide further technical detail for those seeking a deeper understanding. This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.
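The data-dredging inflation the authors demonstrate can be reproduced in a few lines: screen many candidate predictors against a response that contains no signal at all, and roughly a fraction alpha of them will come out "significant" by chance. A hedged Python sketch (simple correlation tests stand in for the paper's mixed-model setting; numbers are arbitrary):

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(42)
n, n_predictors, alpha = 200, 100, 0.05

y = rng.normal(size=n)                    # response with NO real signal
X = rng.normal(size=(n, n_predictors))    # 100 unrelated candidate predictors

false_positives = 0
for j in range(n_predictors):
    r = np.corrcoef(X[:, j], y)[0, 1]
    z = np.arctanh(r) * sqrt(n - 3)       # Fisher z-transform, ~N(0, 1) under the null
    p = erfc(abs(z) / sqrt(2))            # two-sided p-value
    false_positives += p < alpha

# Screening 100 null predictors at alpha = 0.05 yields ~5 spurious
# "significant" hits; reporting only those winners is data dredging.
print(false_positives)
```

Selecting model terms by repeated significance screening and then reporting only the survivors is exactly the practice that inflates the Type I error rate beyond the nominal alpha.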

A more intuitive interpretation of the area under the ROC curve
https://peerj.com/preprints/3468 | Published: 2017-12-15
Authors: A. Cecile J.W. Janssens
The area under the receiver operating characteristic (ROC) curve (AUC) is commonly used for assessing the discriminative ability of prediction models, even though the measure is criticized for being clinically irrelevant and lacking an intuitive interpretation. Most of the criticism traces back to the fact that the ROC curve was introduced as a description of the discriminative ability of a binary classifier across all its possible thresholds. Yet this is not the curve's only interpretation. Every tutorial explains how the coordinates of the ROC curve are obtained from the risk distributions of diseased and non-diseased individuals, but it is not widely appreciated that the ROC plot is simply another way of presenting these risk distributions. This alternative perspective on the ROC plot invalidates most limitations of the AUC and attributes others to the underlying risk distributions. The separation of these distributions, represented by the area between the curves (AUC), is a more straightforward and intuitive measure of the discriminative ability of prediction models.
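The risk-distribution reading has a well-known concrete form: the AUC equals the probability that a randomly chosen diseased individual receives a higher predicted risk than a randomly chosen non-diseased one (the Mann-Whitney interpretation). A minimal Python sketch with simulated risk distributions (the Beta parameters are arbitrary illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated predicted risks for non-diseased and diseased individuals
risk_neg = rng.beta(2, 5, size=1000)   # risks concentrated low
risk_pos = rng.beta(5, 2, size=1000)   # risks concentrated high

# AUC as the probability that a random diseased individual has a higher
# predicted risk than a random non-diseased one (ties count half).
greater = (risk_pos[:, None] > risk_neg[None, :]).mean()
ties = (risk_pos[:, None] == risk_neg[None, :]).mean()
auc = greater + 0.5 * ties
print(round(auc, 3))
```

No thresholds appear anywhere in this computation: the AUC here is read directly off the two risk distributions, which is the separation-of-distributions perspective the abstract advocates.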

Average deviation for measuring variation in data in small samples (n < 5)
https://peerj.com/preprints/3460 | Published: 2017-12-11
Authors: Wenfa Ng
Good control of experimental variability is critical to experimental success, and many methods are available for quantifying variation in data. Popular methods for measuring variability typically rely on a statistical distribution such as the standard normal distribution, but such distributions are designed for large samples (n > 30). Experiments, however, typically generate fewer than 5 replicates (n < 5). Thus, the key requirement for using the standard normal distribution is not satisfied, which creates the need for alternative ways of quantifying variation in small samples. This abstract describes a new statistic, average deviation, that aims to quantify the variation among repeated measurements of a variable. By averaging the differences between each measurement and the mean, average deviation provides a better representation of the variation in data around the mean, while capturing the impact of significant deviations from the mean by individual measurements. However, dividing the summed deviations by the sample size means that the presence of an outlier measurement may not be fully reflected in the calculated average deviation. Thus, the new statistic is best used with a small sample size of fewer than 5, which reduces the extent to which an outlier's influence on the average deviation is diluted. In summary, for small samples, average deviation represents the deviation between each measurement and the mean better than statistical distribution-based approaches such as the standard error. However, the desire not to dilute the impact of outlier measurements means that the new statistic is only suitable for sample sizes of fewer than 5.
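The abstract gives no explicit formula; assuming the differences are taken in absolute value (signed deviations around the mean sum to zero by construction), the statistic reduces to the following sketch (names and the toy replicates are illustrative):

```python
def average_deviation(values):
    """Average absolute difference between each measurement and the mean
    (absolute values assumed, since signed deviations sum to zero)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

replicates = [9.8, 10.1, 10.0, 12.5]  # n < 5, one high outlier
# mean = 10.6; deviations 0.8, 0.5, 0.6, 1.9 average to 0.95
print(average_deviation(replicates))
```

Under this reading the statistic is the classical mean absolute deviation; the abstract's argument is that, at n < 5, dividing by a small n keeps the outlier's contribution (1.9 here) visible in the result rather than washing it out.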