PeerJ Preprints: Statistics
https://peerj.com/preprints/index.atom?journal=peerj&subject=7900
Statistics articles published in PeerJ Preprints

How to conduct meta-analysis: a basic tutorial
https://peerj.com/preprints/2978 | 2017-05-15 | Arindam Basu
Meta-analysis refers to the process of integrating the results of many studies to arrive at evidence synthesis. Meta-analysis is similar to systematic review; however, in addition to the narrative summary conducted in a systematic review, in a meta-analysis the analysts also numerically pool the results of the studies to arrive at a summary estimate. In this paper, we discuss the key steps of conducting a meta-analysis. We demonstrate these steps based on a published systematic review and meta-analysis of the effectiveness of salt-restricted diets on blood pressure control. This paper is a basic introduction to the process of meta-analysis. In subsequent papers in this series, we will show how to conduct meta-analyses of diagnostic and screening studies and network meta-analyses.
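The numerical pooling step the abstract describes can be sketched with a fixed-effect, inverse-variance weighted summary estimate. The effect sizes and variances below are hypothetical illustrations, not values from the cited salt-restriction review.

```python
def pool_fixed_effect(effects, variances):
    """Fixed-effect meta-analysis: inverse-variance weighted mean
    and the variance of that pooled estimate."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_w
    return pooled, 1.0 / total_w

# Hypothetical mean blood-pressure reductions (mmHg) from three trials
effects = [-4.0, -2.5, -3.2]
variances = [0.8, 1.2, 0.5]
summary, summary_var = pool_fixed_effect(effects, variances)
print(round(summary, 3), round(summary_var, 3))
```

More precise studies (smaller variances) pull the summary toward their estimates; a random-effects model would additionally widen the weights by a between-study variance term.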
Evaluation of a multi-institution mastoidectomy performance instrument
https://peerj.com/preprints/2954 | 2017-04-28 | Thomas Kerwin, Brad Hittle, Don Stredney, Paul De Boeck, Gregory Wiet
Objective: The objective of this work is to obtain validity evidence for an evaluation instrument used to assess the performance level of a mastoidectomy. The instrument has been previously described and was formulated by a multi-institutional consortium.
Design: Mastoidectomies were performed on a virtual temporal bone system and then rated by experts using a previously described 15-element task-based checklist. Based on the results, a second, similar checklist was created and a second round of rating was performed.
Setting: Twelve otolaryngological surgical training programs in the United States. Participants: 66 individuals with a variety of temporal bone dissection experience, from medical students to attending physicians. Raters were attending surgeons from 12 different institutions.
Results: Intraclass correlation (ICC) scores varied greatly between items in the checklist, with some being low and some being high. Percentage agreement scores were similar to those of previous rating instruments. There is strong evidence that a high score on the task-based checklist is necessary for a rater to consider a mastoidectomy to be performed at the level of an expert, but a high score is not a sufficient condition.
Conclusions: Rewording of the instrument items to focus on safety does not result in increased reliability of the instrument. The strong result of the Necessary Condition Analysis suggests that going beyond simple correlation measures can give extra insight into grading results. Additionally, we suggest using a multiple point scale instead of a binary pass/fail question combined with descriptive mastery levels.
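The necessary-but-not-sufficient finding can be made concrete with a small check: every performance judged "expert" has a high checklist score, yet a high score alone does not guarantee an "expert" judgement. The score/rating pairs and the threshold below are hypothetical, not the study's data.

```python
# (checklist score out of 15, rated as expert-level?) -- hypothetical data
data = [
    (15, True), (14, True), (14, False), (13, False),
    (10, False), (8, False), (15, False),
]

THRESHOLD = 13  # assumed cutoff for a "high" score

# Necessary: every expert-rated performance scores at or above threshold
necessary = all(score >= THRESHOLD for score, expert in data if expert)
# Sufficient: every high-scoring performance is rated expert
sufficient = all(expert for score, expert in data if score >= THRESHOLD)

print(necessary, sufficient)
```

With these toy data the check prints `True False`: the high score is a necessary condition for an expert rating but not a sufficient one, which is the asymmetry Necessary Condition Analysis captures and plain correlation misses.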
Factors influencing healthcare provider respondent fatigue answering a globally administered in-app survey
https://peerj.com/preprints/2939 | 2017-04-21 | Vikas N O'Reilly-Shah
Background: Respondent fatigue, also known as survey fatigue, is a common problem in the collection of survey data. Factors that are known to influence respondent fatigue include survey length, survey topic, question complexity, and open-ended question type. There is a great deal of interest in understanding the drivers of physician survey responsiveness due to the value of information received from these practitioners. With the recent explosion of mobile smartphone technology, it has been possible to obtain survey data from users of mobile applications (apps) on a question-by-question basis. We obtained basic demographic survey data as well as survey data related to an anesthesiology-specific drug called sugammadex and leveraged nonresponse rates to examine factors that influenced respondent fatigue.
Methods: Primary data were collected between December 2015 and February 2017. Surveys and in-app analytics were collected from global users of a mobile anesthesia calculator app. Key independent variables were user country, healthcare provider role, rating of importance of the app to personal practice, length of time in practice, and frequency of app use. Key dependent variable was the metric of respondent fatigue.
Results: Provider role and World Bank country income level were predictive of the rate of respondent fatigue for this in-app survey. Importance of the app to the provider and length of time in practice were moderately associated with fatigue. Frequency of app use was not associated with fatigue. This study focused on a survey with a topic closely related to the subject area of the app; respondent fatigue rates will likely change dramatically if the topic does not align closely.
Discussion: Although apps may serve as powerful platforms for data collection, response rates to in-app surveys may differ on the basis of important respondent characteristics. Studies should be carefully designed to mitigate fatigue and powered with an understanding of which respondent characteristics are associated with higher rates of respondent fatigue.
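One simple way to operationalize a question-by-question fatigue metric like the one leveraged here is the fraction of users still answering at each question position, and the incremental dropout between consecutive questions. The counts below are hypothetical.

```python
# Hypothetical number of respondents answering each successive question
answered = [1000, 930, 870, 790, 700, 640]

# Fraction of the initial cohort still responding at each question
response_rate = [n / answered[0] for n in answered]

# Incremental dropout between consecutive questions (a fatigue signal)
dropout_per_question = [1 - b / a for a, b in zip(answered, answered[1:])]

print([round(r, 2) for r in response_rate])
print([round(d, 3) for d in dropout_per_question])
```

Comparing these dropout curves across subgroups (provider role, country income level) is one way the associations reported above could be examined.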
Predicting trophic discrimination factor using Bayesian inference and phylogenetic, ecological and physiological data. DEsIR: Discrimination Estimation in R.
https://peerj.com/preprints/1950 | 2017-04-17 | Kevin Healy, Seán B.A. Kelly, Thomas Guillerme, Richard Inger, Stuart Bearhop, Andrew L. Jackson
1. Stable isotope analysis is a widely used tool for the reconstruction and interpretation of animal diets and trophic relationships. Analytical tools have improved the robustness of inferring the relative contribution of different prey sources to an animal’s diet by accounting for many of the sources of variation in isotopic data. One major source of uncertainty is the Trophic Discrimination Factor (TDF), the change in isotopic signatures between consumers’ tissues and their food sources. This parameter can have a profound impact on model predictions, but it is often not feasible to estimate a species’ TDF value, so researchers use aggregated or taxon-level estimates, an assumption that in turn has major implications for the interpretation of subsequent analyses.
2. We collected extensive carbon (δ13C) and nitrogen (δ15N) TDF data on mammals and birds from published literature. We then used a Bayesian linear modelling approach to determine if, and to what extent, variation in TDF values can be attributed to a species’ ecology, physiology, phylogenetic relationships and experimental variation. Finally, we developed a Bayesian imputation approach to estimate unknown TDF values and compared the accuracy of this tool using a series of cross-validation tests.
3. Our results show that, for birds and mammals, TDF values are influenced by phylogeny, tissue type sampled, diet of consumer, isotopic signature of food source, and the error associated with the measurement of TDF within a species. Furthermore, our cross-validation tests determined that our tool can (i) produce accurate estimates of TDF values with a mean distance of 0.2 ‰ from observed TDF values, and (ii) provide an estimate of the precision associated with these estimates, with species presence in the data allowing for a reduced level of uncertainty.
4. By incorporating various sources of variation and reflecting the levels of uncertainty associated with TDF estimates, our novel tool will contribute to more accurate and honest reconstructions and interpretations of animal diets and trophic interactions. This tool can be extended readily to include other taxa and sources of variation as data become available. To facilitate this, we provide a step-by-step guide and code for this tool: Discrimination Estimation in R (DEsIR).
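The cross-validation idea behind point 3 can be sketched with a deliberately simple stand-in imputer: predict each species' "unknown" TDF from the mean of the remaining species, and report the mean absolute distance to the observed value (the paper reports about 0.2 ‰ for its Bayesian tool). The TDF values below are hypothetical, and this leave-one-out mean is only an illustration, not the authors' hierarchical model.

```python
# Hypothetical observed delta-15N TDF values for six species (per mil)
tdf = [3.1, 3.4, 2.9, 3.3, 3.0, 3.2]

def loo_errors(values):
    """Leave-one-out cross-validation: impute each value as the mean
    of the others and record the absolute prediction error."""
    errors = []
    for i, observed in enumerate(values):
        others = values[:i] + values[i + 1:]
        predicted = sum(others) / len(others)
        errors.append(abs(predicted - observed))
    return errors

mean_distance = sum(loo_errors(tdf)) / len(tdf)
print(round(mean_distance, 3))
```

A Bayesian imputer additionally returns a full posterior for each missing value, which is what supplies the precision estimates mentioned in (ii).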
The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research
https://peerj.com/preprints/2921 | 2017-04-13 | Valentin Amrhein, Fränzi Korner-Nievergelt, Tobias Roth
The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p≤0.05) itself is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds.
Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
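The replication numbers in the abstract follow from simple probability, assuming each study is an independent significance test with the stated power and a true effect exists:

```python
# At 40% power, the chance that two independent studies are BOTH
# significant: 0.4 * 0.4 = 0.16, i.e. roughly "one in six" studies
# significantly replicates another significant study.
power = 0.40
p_both_significant = power * power
print(round(p_both_significant, 2))

# At 80% power, the chance the two studies CONFLICT in terms of
# significance (one significant, one not): 2 * 0.8 * 0.2 = 0.32,
# roughly one third of the cases.
power = 0.80
p_conflicting = 2 * power * (1 - power)
print(round(p_conflicting, 2))
```

The arithmetic makes the abstract's point directly: even with a true effect and decent power, significance-based "replication failure" is expected to be common.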
Optimizing pedigrees: Using a biasing system to determine likely inheritance systems
https://peerj.com/preprints/2871 | 2017-04-09 | Justin Ang
Pedigrees, though straightforward and versatile, lack the ability to tell us information about many individuals. Though numerical systems have been developed, there is currently no system to quantify the probability of a pedigree following certain inheritance systems. My system aims to fill that gap by creating a flexible numerical system and testing it for variance. First, my system attempts to fit an inheritance system to known pedigree data. Then, it calculates the difference between the calculated values and the known pedigree data. It aggregates these values, then uses a chi-squared analysis to determine the likelihood of the inheritance system in question. This is done for many different systems, until we have a general idea of which systems are probable and which are not.
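The chi-squared step can be illustrated by comparing observed affected/unaffected counts in a hypothetical pedigree against the 1:3 ratio expected when both parents are carriers under autosomal recessive inheritance. The counts below are invented for illustration, not taken from the paper.

```python
# Hypothetical pedigree: offspring of two presumed carrier parents
observed = [6, 14]  # [affected, unaffected]

# Autosomal recessive with two carrier parents predicts 1/4 affected
expected = [sum(observed) * 0.25, sum(observed) * 0.75]

# Pearson's chi-squared statistic: sum of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 3))
```

With 1 degree of freedom, a statistic this small (about 0.267, well below the 5% critical value of 3.84) means the pedigree is consistent with the autosomal recessive model; repeating the calculation for other inheritance systems ranks them by fit, as the abstract describes.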
Probability that p-value provides misleading evidence cannot be controlled by sample size
https://peerj.com/preprints/2869 | 2017-03-13 | Marian Grendar, George G. Judge
A measure of statistical evidence should permit determination of the sample size so that the probability M of obtaining (strong) misleading evidence can be held as low as desired. On this desideratum the p-value fails completely, as it leads either to an arbitrary sample size if M ≥ 0.01, or to no sample size at all if M < 0.01.
Unlike the p-value, the ratio of likelihoods, the ratio of posteriors, as well as the Bayes Factor, permit controlling the probability of misleading evidence by the sample size.
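The likelihood-ratio side of this claim can be sketched for two simple hypotheses, N(0, 1) versus N(delta, 1), with n i.i.d. observations: the probability that the likelihood ratio strongly favors the wrong hypothesis shrinks as n grows (and is universally bounded by 1/k). The parameter values below are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

def prob_misleading(n, delta=0.5, k=8.0):
    """P(likelihood ratio favoring N(delta,1) exceeds k when N(0,1)
    is true), for n i.i.d. observations. LR >= k happens exactly when
    the sample mean crosses ln(k)/(n*delta) + delta/2."""
    z_star = math.log(k) / (delta * math.sqrt(n)) + delta * math.sqrt(n) / 2
    return 1 - NormalDist().cdf(z_star)

for n in (10, 40, 100):
    print(n, round(prob_misleading(n), 4))
```

The probability of strong misleading evidence decreases monotonically with n, so a target M can be met by choosing the sample size: the controllability that the authors argue the p-value lacks.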
Ranking of critical species to preserve the functionality of mutualistic networks using the k-core decomposition
https://peerj.com/preprints/2855 | 2017-03-07 | Javier García-Algarra, Juan Manuel Pastor, José María Iriondo, Javier Galeano
Mutualistic communities play an important role in biodiversity preservation. They are modeled as bipartite networks, and measurements of centrality and degree help to rank species by their relative importance for network robustness. Identifying the most endangered species, or those more prone to trigger extinction cascades, is essential to define conservation policies. In this work, we explain how a classical graph analysis tool, the k-core decomposition, provides new ranking magnitudes that reach outstanding performance for these purposes.
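The k-core decomposition itself is the classical "peeling" algorithm: repeatedly remove nodes of degree below k; a node's core number is the largest k for which it survives. The sketch below works on a toy undirected graph (not a real mutualistic network) and is a generic implementation, not the authors' code.

```python
def core_numbers(adj):
    """k-core decomposition by iterative peeling.
    adj: dict node -> set of neighbours (undirected graph).
    Returns dict node -> core number."""
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        peel = [v for v in remaining if degree[v] < k]
        if not peel:
            k += 1          # current remaining set is the (k-1)-core
            continue
        for v in peel:      # cascading removals stay at the same k
            core[v] = k - 1
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return core

# Toy graph: triangle a-b-c, pendant d on a, isolated node e
adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c'}, 'c': {'a', 'b'},
       'd': {'a'}, 'e': set()}
core = core_numbers(adj)
print(core)
```

Nodes in the innermost (highest-k) core are the structurally most embedded ones, which is the property the paper exploits to rank species whose loss would most damage the network.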
Linearization improves the repeatability of quantitative Dynamic Contrast-Enhanced MRI
https://peerj.com/preprints/2824 | 2017-02-21 | Kyle M. Jones, Marisa H. Borders, Kimberly A. Fitzpatrick, Mark D. Pagel, Julio Cárdenas-Rodríguez
We studied the effect of linearization on the repeatability of the Tofts and reference region models (RRM) for Dynamic Contrast-Enhanced MRI (DCE MRI). We compared the repeatabilities of these two linearized models, the standard non-linear versions, and semi-quantitative methods of analysis. Simulated and experimental DCE MRI data from 12 rats with a flank tumor of C6 glioma acquired over three consecutive days were analyzed using four quantitative and semi-quantitative DCE MRI metrics. The quantitative methods used were: 1) Linear Tofts model (LTM), 2) Non-linear Tofts model (NTM), 3) Linear RRM (LRRM), and 4) Non-linear RRM (NRRM). The following semi-quantitative metrics were used: 1) Maximum enhancement ratio (MER), 2) time to peak (TTP), 3) initial area under the curve (iauc64), and 4) slope. LTM and NTM were used to estimate Ktrans, while LRRM and NRRM were used to estimate Ktrans relative to muscle (RKtrans). Repeatability was assessed by calculating the within-subject coefficient of variation (wSCV) and the percent intra-subject variation (iSV) determined with the Gage repeatability and reproducibility (R&R) analysis. The iSV for RKtrans using LRRM was two-fold lower than with NRRM under all simulated and experimental conditions. A similar trend was observed for the Tofts model, where LTM was at least 50% more repeatable than NTM under all experimental and simulated conditions. The semi-quantitative metrics iauc64 and MER were as reproducible as Ktrans and RKtrans estimated by LTM and LRRM respectively. The iSV for iauc64 and MER were significantly lower than the iSV for slope and TTP. In simulations and experimental results, linearization improves the repeatability of quantitative DCE MRI by at least 30%, making it as repeatable as semi-quantitative metrics.
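The core linearization idea can be sketched with the linear form of the Tofts model, Ct(t) = Ktrans·∫Cp dτ − kep·∫Ct dτ, which turns a non-linear curve fit into ordinary least squares on cumulative integrals. The sketch below uses a synthetic mono-exponential plasma input and exact Tofts tissue curves (an assumed setup, not the paper's rat data) and checks that the linear fit recovers the parameters.

```python
import numpy as np

t = np.linspace(0, 5, 200)              # time in minutes
A, a = 5.0, 0.5
cp = A * np.exp(-a * t)                 # assumed plasma input function

# Exact Tofts solution of dCt/dt = Ktrans*Cp - kep*Ct for this Cp
ktrans_true, kep_true = 0.25, 1.2
ct = ktrans_true * A * (np.exp(-a * t) - np.exp(-kep_true * t)) / (kep_true - a)

def cumtrapz(y, x):
    """Cumulative trapezoidal integral, starting at 0."""
    dx = np.diff(x)
    return np.concatenate(([0.0], np.cumsum(dx * (y[1:] + y[:-1]) / 2)))

# Linear system: ct = ktrans * int(cp) - kep * int(ct)
X = np.column_stack([cumtrapz(cp, t), -cumtrapz(ct, t)])
ktrans_fit, kep_fit = np.linalg.lstsq(X, ct, rcond=None)[0]
print(round(ktrans_fit, 3), round(kep_fit, 3))
```

Because the linear fit has a closed-form least-squares solution, it avoids the initialization sensitivity and local minima of iterative non-linear fitting, which is one plausible mechanism for the repeatability gain the study reports.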
Within outlying mean indexes: refining the OMI analysis for the realized niche decomposition
https://peerj.com/preprints/2810 | 2017-02-15 | Stéphane Karasiewicz, Sylvain Dolédec, Sébastien Lefebvre
The ecological niche concept has seen a revival of interest under climate change, especially to study its impact on niche shift and/or conservatism. Here, we propose the Within Outlying Mean Indexes (WitOMI), which refine the Outlying Mean Index (OMI) analysis by using its properties in combination with the K-select analysis species marginality decomposition. The purpose is to decompose the ecological niche into subniches associated with the experimental design, i.e. taking into account temporal or spatial subsets. WitOMI emphasize the habitat conditions that contribute 1) to the definition of species’ niches using all available conditions and, at the same time, 2) to the delineation of species’ subniches according to given subsets of dates or sites. This latter aspect allows addressing niche dynamics by highlighting the influence of atypical habitat conditions on species at a given time or space. 3) The biological constraint exerted on the species subniche then becomes observable within the Euclidean space as the difference between the potential subniche and the realized subniche. We illustrate the decomposition of published OMI analyses, using spatial and temporal examples. The species assemblage’s subniches are comparable along the same environmental gradient, producing a more accurate and precise description of the assemblage niche distribution under climate change.
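The marginality idea underlying OMI (and hence WitOMI) can be sketched numerically: a species' niche position is the deviation of its mean used conditions from the average conditions available across all sites. The environmental values below are hypothetical, and this is only the marginality-vector step, not the full ordination.

```python
# Hypothetical (temperature, pH) conditions at four available sites
available = [(10.0, 7.8), (12.0, 8.0), (14.0, 8.1), (16.0, 8.3)]
# Sites where the species actually occurs (its realized conditions)
used = [(10.0, 7.8), (12.0, 8.0)]

def mean(points):
    """Coordinate-wise mean of a list of tuples."""
    return [sum(coord) / len(points) for coord in zip(*points)]

origin = mean(available)      # average available conditions
niche_position = mean(used)   # species' mean used conditions

# Marginality vector: how far the species sits from average conditions
marginality = [u - o for u, o in zip(niche_position, origin)]
print([round(m, 2) for m in marginality])
```

WitOMI repeat this comparison within subsets of dates or sites, so each subniche's marginality is measured against both the overall origin and the subset's own average conditions.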