PeerJ Preprints: Statistics
https://peerj.com/preprints/index.atom?journal=peerj&subject=7900
Statistics articles published in PeerJ Preprints

How to share data for collaboration
https://peerj.com/preprints/3139 (2017-08-11)
Shannon E Ellis, Jeffrey T Leek
Within the statistics community, a number of guiding principles for sharing data have emerged; however, these principles are not always made clear to the collaborators generating the data. To bridge this divide, we have established a set of guidelines for sharing data. In these, we highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of conveying to the statistician all essential experimental information and any pre-processing steps already carried out. With these guidelines, we hope to avoid errors and delays in data analysis.
Positive association of supersaturation effects in the human airways with influenza activity in subtropical climate: influenza seasons in Okinawa (2007-2012) – New method for analyzing and forecasting (first preliminary results)
https://peerj.com/preprints/3132 (2017-08-09)
Aleksandr N Ishmatov
There are many theories of influenza seasonality for different climatic zones, but none of them provides a clear explanation, especially for tropical and subtropical climates.
Here we analyze, for the first time, the association between seasonal influenza activity in Okinawa (a subtropical zone) and the probability of supersaturation occurring in the human airways when environmental air is inhaled under specific weather conditions.
We show that supersaturation effects in the human airways may be associated with the main peaks of influenza activity in Okinawa over the observation period from January 2007 to December 2012, including the 2009 pandemic.
Ours is the first observation to show in practice that the effect of supersaturation in the airways can be used to understand and forecast influenza activity in subtropical and tropical zones: supersaturation may pose an additional risk of acidification of the epithelial lining fluid in local areas of the respiratory tract, and an additional risk of deposition of infectious agents from inhaled air in the upper airways.
Sicegar: R package for sigmoidal and double-sigmoidal curve fitting
https://peerj.com/preprints/3116 (2017-07-31)
M. Umut Caglar, Ashley I. Teufel, Claus O. Wilke
Sigmoidal and double-sigmoidal dynamics are commonly observed in many areas of biology. Here we present sicegar, an R package for the automated fitting and classification of sigmoidal and double-sigmoidal data. The package categorizes data into one of three categories, "no signal", "sigmoidal", or "double sigmoidal", by rigorously fitting a series of mathematical models to the data. The data are labeled as "ambiguous" if neither the sigmoidal nor the double-sigmoidal model fits the data well. In addition to performing the classification, the package also reports a wealth of fit metrics as well as biologically meaningful parameters describing the sigmoidal or double-sigmoidal curves. In extensive simulations, we find that the package performs well, can recover the original dynamics even under fairly high noise levels, and will typically classify curves as "ambiguous" rather than misclassifying them. The package is available on CRAN and comes with extensive documentation and usage examples.
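The fit-and-compare idea behind this kind of classification can be sketched generically. The snippet below is an illustration in Python with scipy, not sicegar's actual interface; the logistic model, starting values, and residual comparison are assumptions made for the sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, ymax, k, t_mid):
    """Three-parameter logistic: plateau ymax, steepness k, midpoint t_mid."""
    return ymax / (1.0 + np.exp(-k * (t - t_mid)))

# noisy synthetic data with a true midpoint at t = 5
rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 50)
y = sigmoid(t, 2.0, 1.5, 5.0) + 0.05 * rng.standard_normal(t.size)

# fit the sigmoidal model, then compare its residual sum of squares
# against a flat "no signal" model (the mean of the data)
popt, _ = curve_fit(sigmoid, t, y, p0=[y.max(), 1.0, t.mean()])
sse_sigmoid = float(np.sum((y - sigmoid(t, *popt)) ** 2))
sse_flat = float(np.sum((y - y.mean()) ** 2))
# a much smaller sse_sigmoid is evidence for the "sigmoidal" category
```

A full implementation, as the abstract describes, would also fit a double-sigmoidal model and fall back to "ambiguous" when no model fits well.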
Best practice in mixed effects modelling and multi-model inference in ecology
https://peerj.com/preprints/3113 (2017-07-27)
Xavier A Harrison, Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N Fisher, Cecily Goodwin, Beth Robinson, David J Hodgson, Richard Inger
The use of linear mixed effects models (LMMs) is increasingly common in the analysis of biological data. Whilst LMMs offer a flexible approach to modelling a broad range of data types, ecological data are often complex and require complex model structures, and the fitting and interpretation of such models is not always straightforward. The ability to achieve robust biological inference requires that practitioners know how and when to apply these tools. Here, we provide a general overview of current methods for the application of LMMs to biological data, and highlight the typical pitfalls that can be encountered in the statistical modelling process. We tackle several issues relating to the use of information theory and multi-model inference in ecology, and demonstrate the tendency for data dredging to lead to greatly inflated Type I error rates (false positives) and impaired inference. We offer practical solutions and direct the reader to key references that provide further technical detail for those seeking a deeper understanding. This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.
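The inflation caused by data dredging is easy to reproduce numerically: screen many candidate predictors that are all pure noise and report any "significant" one, and the chance of at least one false positive far exceeds the nominal 5%. A minimal numpy sketch (a z-approximation on correlations; the sample sizes and test counts here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_tests, n_sims, z_crit = 100, 20, 2000, 1.96  # 1.96: two-sided 5% z cutoff

hits = 0
for _ in range(n_sims):
    x = rng.standard_normal((n, n_tests))  # 20 candidate predictors, all noise
    y = rng.standard_normal(n)             # outcome unrelated to every predictor
    xc = x - x.mean(axis=0)                # center before correlating
    yc = y - y.mean()
    r = (xc * yc[:, None]).mean(axis=0) / (x.std(axis=0) * y.std())
    z = r * np.sqrt(n)                     # approx. N(0, 1) under the null
    if np.any(np.abs(z) > z_crit):         # "dredging": keep any significant test
        hits += 1

rate = hits / n_sims  # family-wise false-positive rate, roughly 1 - 0.95**20 = 0.64
```

With 20 independent screens at a nominal 5% level, the chance of at least one spurious "finding" is about 64%, which is the inflation the abstract warns about.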
Sci-Hub provides access to nearly all scholarly literature
https://peerj.com/preprints/3100 (2017-07-20)
Daniel S Himmelstein, Ariel R Romero, Stephen R McLaughlin, Bastian Greshake Tzovaras, Casey S Greene
The website Sci-Hub provides access to scholarly literature via full text PDF downloads. The site enables users to access articles that would otherwise be paywalled. Since its creation in 2011, Sci-Hub has grown rapidly in popularity. However, until now, the extent of Sci-Hub's coverage was unclear. As of March 2017, we find that Sci-Hub's database contains 68.9% of all 81.6 million scholarly articles, which rises to 85.2% for those published in closed access journals. Furthermore, Sci-Hub contains 77.0% of the 5.2 million articles published by inactive journals. Coverage varies by discipline, with 92.8% coverage of articles in chemistry journals compared to 76.3% for computer science. Coverage also varies by publisher, with the coverage of the largest publisher, Elsevier, at 97.3%. Our interactive browser at https://greenelab.github.io/scihub allows users to explore these findings in more detail. Finally, we estimate that over a six-month period in 2015–2016, Sci-Hub provided access for 99.3% of valid incoming requests. Hence, the scope of this resource suggests the subscription publishing model is becoming unsustainable. For the first time, the overwhelming majority of scholarly literature is available gratis to anyone with an Internet connection.
Examining publication bias – A simulation-based evaluation of statistical tests on publication bias
https://peerj.com/preprints/3059 (2017-06-29)
Andreas Schneck
Background
Publication bias is a form of scientific misconduct. It threatens the validity of research results and the credibility of science. Although several tests for publication bias exist, no in-depth evaluations are available to suggest which test to use for a specific research problem.
Methods
In the study at hand, four tests for publication bias were evaluated in a Monte Carlo simulation: Egger's test (the funnel-plot asymmetry test, FAT), p-uniform, the test of excess significance (TES), and the caliper test. Two different types of publication bias, as well as their degree (0%, 50%, 100%), were simulated. The type of publication bias was defined either as file-drawer, meaning the repeated analysis of new datasets, or as p-hacking, meaning the inclusion of covariates in order to obtain a significant result. In addition, the underlying effect (β = 0, 0.5, 1, 1.5), effect heterogeneity, the number of observations in the simulated primary studies (N = 100, 500), and the number of primary studies available to the publication bias tests (K = 100, 1000) were varied.
Results
All of the tests evaluated were able to identify publication bias in both the file-drawer and the p-hacking condition. The false positive rates were unbiased, with the exception of those of the 15%- and 20%-caliper tests. The FAT had the largest statistical power in the file-drawer conditions, whereas under p-hacking the TES was slightly better, except under effect heterogeneity. The caliper test, however, was inferior to the other tests under effect homogeneity and had decent statistical power only in conditions with 1000 primary studies.
Discussion
The FAT is recommended as a test for publication bias in standard meta-analyses with no or only small effect heterogeneity. If no clear direction of publication bias is suspected, the TES is the first alternative to the FAT. The 5%-caliper test is recommended under conditions of effect heterogeneity, which may be found when publication bias is examined in a discipline-wide setting where primary studies cover different research problems.
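Egger's regression (the FAT evaluated in this study) is simple to write down: regress each study's standardized effect on its precision and test whether the intercept departs from zero. The sketch below is a generic least-squares illustration, not the author's simulation code; the example data are hand-made:

```python
import numpy as np

def egger_fat(effects, ses):
    """Funnel-plot asymmetry test (FAT): regress the standardized effect
    z_i = effect_i / se_i on precision 1 / se_i via ordinary least squares.
    Without small-study effects the intercept should be near zero; a large
    intercept relative to its standard error signals possible publication bias."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    z = effects / ses
    prec = 1.0 / ses
    X = np.column_stack([np.ones_like(prec), prec])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ coef
    s2 = resid @ resid / (len(z) - 2)                 # residual variance
    se_intercept = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return coef[0], coef[0] / se_intercept            # intercept and its t value

# hand-made example: four studies whose standardized effects sit near the
# line 1.1 - 0.04 * precision, i.e. a clearly nonzero intercept
ses = 1.0 / np.array([1.0, 2.0, 3.0, 4.0])
effects = np.array([1.1, 0.9, 1.1, 0.9]) * ses
intercept, t_stat = egger_fat(effects, ses)
```

Intuitively, selective publication leaves small (high standard error) studies only when their effects are inflated, which tilts the funnel plot and pushes the intercept away from zero.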
Genetic diversity and population structure in an invasive pantropical earthworm along an altitudinal gradient
https://peerj.com/preprints/3024 (2017-06-23)
Diana Ortíz-Gamino, Luis Cunha, Esperanza Martínez-Romero, Norma Flores-Estévez, Ángel I Ortíz-Ceballos
Population genetic analyses of populations of the invasive pantropical earthworm P. corethrurus will contribute significantly to a better understanding of the ecology, and especially the reproductive system, of this species. Using 34 polymorphic ISSR markers, genetic diversity and population structure were assessed for four populations of P. corethrurus along an altitudinal gradient ranging from sea level up to ~1667 m. The nuclear markers distinguished two genetic clusters, probably corresponding to two distinct genetic lineages, herein defined as A and B. Clones were detected in one population (Actopan, at 480 masl), and their number was lower than expected for a parthenogenetic species. Nevertheless, low levels of genetic diversity and a high number of intermediate genotypes were detected among the studied P. corethrurus populations, with no apparent population structure related to the distinct geographic regions, which may indicate that human-mediated transfer is prevalent, in particular for the lower-altitude regions. Hybridisation between the two genetic clusters was tested, and 11 MLGs were identified as later-generation hybrids (B1 introgression), mainly associated with the three lower-altitude regions. Still, most of the individuals seem to belong to lineage A, and only five individuals seem to belong exclusively to lineage B. Interestingly, these parental individuals were found only at the highest-altitude site, Naolinco (1566-1667 masl), which also showed the highest values of genotypic richness. During a biological invasion, multiple introductions of different genetic lineages can provide opportunities for admixture among genetically distinct clusters. The signatures of admixture among P. corethrurus populations along the altitudinal gradient in Mexico may have contributed to invasion success by directly increasing fitness. ISSR markers proved useful for the study of genetic variation in the invasive pantropical earthworm P. corethrurus.
A mathematical theory of knowledge, science, bias and pseudoscience
https://peerj.com/preprints/1968 (2017-06-22)
Daniele Fanelli
This essay proposes mathematical answers to meta-scientific questions including "how much knowledge is produced by research?", "how rapidly is a field making progress?", "what is the expected reproducibility of a result?", "what do we mean by soft science?", "what demarcates a pseudoscience?", and many others. From two simple postulates - 1) information is finite; 2) knowledge is information compression - we derive a function \(K(y;x\tau)=\frac{T(y)-T(y|x\tau)}{T(y)+T(x)+T(\tau)}\), in which the total information \(T()\) contained in an explanandum \(y\) is lossless or lossy compressed via an explanans composed of an information input \(x\) and a "theory" component \(\tau\). The latter is a factor that conditions the relationship between \(y\) and \(x\), with an information "cost" equivalent to the description length of the relationship itself. This function is proposed as a simple and universal tool to understand and analyse knowledge dynamics, scientific or otherwise. Soft sciences are shown to be simply fields that yield relatively low K values. Bias turns out to be information that is concealed in methodological choices, thereby reducing K. Disciplines typically classified as pseudosciences are suggested to be sciences that suffer from extreme bias: their informational input is greater than their output, yielding \(K(y;x\tau) < 0\). The essay derives numerous general results, some of which may be counter-intuitive. For example, it suggests that reproducibility failures are inevitable, and that the value of publishing negative results may vary across fields and within a field over time. There may therefore be conditions in which the costs of reproducible research practices such as publishing negative results and sharing data outweigh the benefits.
The theory makes several testable predictions concerning science and cognition in general, and it may have numerous applications that future research could develop, test and implement to foster progress on all frontiers of knowledge.
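The demarcation claim in the abstract follows in one step from the sign of the K function (restated here with the essay's own symbols; the denominator, a sum of information quantities, is positive):

\[
K(y;x\tau)=\frac{T(y)-T(y|x\tau)}{T(y)+T(x)+T(\tau)}
\quad\Longrightarrow\quad
K(y;x\tau)<0 \iff T(y|x\tau)>T(y),
\]

that is, conditioning on the explanans makes the explanandum harder to describe, so no compression, and hence no knowledge on the essay's definition, is produced.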
Statistical infarction: A postmortem of the Cornell Food and Brand Lab pizza publications
https://peerj.com/preprints/3025 (2017-06-14)
Jordan Anaya, Tim van der Zee, Nick Brown
We previously reported over 150 inconsistencies in a series of four articles (the "pizza papers") from the Cornell Food and Brand Lab that described a study of eating habits at an all-you-can-eat pizza buffet. The lab's initial response led us to investigate more of their work, and our investigation has now identified issues with at least 45 publications from this lab. Perhaps because of the growing media attention, Cornell and the lab have released a statement concerning the pizza papers, which included a response to the inconsistencies, along with data and code. Many of the inconsistencies were identified with the new technique of granularity testing, and this case has the highest density of granularity inconsistencies that we know of. This is also the first time a data set has been made public after granularity concerns were raised, making it a highly suitable case study for showing the accuracy and potential of this technique. It is also important that a third party audit the lab's response, given the continuing investigation of misconduct and presumably future reports and data releases. Our careful inspection of the data set suggests no evidence of fabrication, but we found the lab's report confusing, incomplete, and error prone. In addition, we found the number of missing, unusual, and logically impossible responses in the data set highly concerning. Unfortunately, given the unsound theory, poor methodology, questionable data, and countless errors, we find it remarkable that these four papers were published and recommend retraction of all four papers.
The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research
https://peerj.com/preprints/2921 (2017-06-14)
Valentin Amrhein, Fränzi Korner-Nievergelt, Tobias Roth
The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p≤0.05) itself is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. 
Larger p-values, too, offer some evidence against the null hypothesis; they cannot be interpreted as supporting it, and concluding that 'there is no effect' is a fallacy. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should be abandoned completely. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
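The "one third of the cases" figure is plain arithmetic. If a true effect exists and each of two independent studies detects it with power 0.80, then:

```python
power = 0.80  # probability each study is significant, given a true effect

# exactly one of the two studies significant: a "conflicting" pair
p_conflict = 2 * power * (1 - power)   # 0.32, i.e. about one third of cases

# both studies significant: the pair that appears to "replicate"
p_both = power ** 2                    # 0.64
```

So even at a power most fields rarely reach, roughly one replication attempt in three will appear to contradict the original under threshold thinking.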