Scatterplots are the most common way for statisticians, scientists, and the public to visually detect relationships between measured variables. At the same time, and despite widely publicized controversy, p-values remain the most commonly used measure of statistical significance for numeric data.

Over the last two decades there has been a dramatic increase in the amount and variety of data available to scientists, physicians, and business leaders in nearly every area of application. Statistical literacy is now critical for anyone consuming data analysis reports, including scientific papers and newspaper reports.

Despite the critical importance of statistics and data analysis in modern life, we have relatively little empirical evidence about how statistical tools work in the hands of typical analysts and consumers. The most well-studied statistical tool is the visual display of quantitative information. Previous studies have shown that humans have difficulty interpreting linear measures of correlation.

Here we perform a large-scale study of the ability of average data analysts to detect statistically significant relationships from scatterplots. Our study compares two of the most common data analysis tasks: making scatterplots and calculating p-values.

Data analysts frequently use exploratory scatterplots for model selection and building. Selecting which variables to include in a model can be viewed as visual hypothesis testing where the test statistic is the plot and the measure of significance is human judgement. However, it is not well known how accurately humans can visually classify significance when looking at graphs of raw data. This classification task depends on both understanding what combinations of sample size and effect size constitute significant relationships, and being able to visually distinguish these effect sizes. We performed a set of experiments to (1) estimate the baseline accuracy with which subjects could visually determine if two variables showed a statistically significant relationship; (2) test whether accuracy in visually classifying significance was changed by the number of data points in the plot or the way the plot was presented; and (3) test whether accuracy in visually classifying significance improved with practice. Our intuition is that potential improvements with practice would be better explained by an improved cognitive understanding of statistical significance, rather than an improved perceptive ability to distinguish effect sizes.
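To make concrete how significance depends jointly on sample size and effect size, the two-sided p-value for a sample Pearson correlation can be computed from the correlation r and the number of points n alone. A minimal sketch in Python (the function name is ours, for illustration only):

```python
import math

from scipy import stats

def pearson_pvalue(r, n):
    """Two-sided p-value for a sample Pearson correlation r from n points."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same apparent effect size can be significant or not,
# depending on how many points are in the plot:
pearson_pvalue(0.2, 35)   # ~0.25: not significant at the 0.05 level
pearson_pvalue(0.2, 200)  # ~0.005: significant at the 0.05 level
```

This is exactly the trade-off a viewer must internalize: a cloud of 35 points with correlation 0.2 is consistent with noise, while the same correlation across 200 points is not.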

Our study was conducted within the infrastructure of a statistics massive open online course (MOOC). While MOOCs have previously been used to study the courses themselves, our study uses a MOOC as a platform for answering a more general scientific question.

Each student who participated in the survey was shown a set of bi-variate scatterplots (examples are shown in the accompanying figures). Each plot fell into one of seven presentation categories:

Reference | 100 data points
Smaller n | 35 data points
Larger n | 200 data points
Best-fit line | 100 data points, with a best-fit line added
Lowess | 100 data points, with a smooth lowess curve added (using the R “lowess” function)
Axis Scale | 100 data points, with the axis range increased to 1.5 standard deviations beyond the range of the data
Axis Label | 100 data points, with fictional variable names used as axis labels

For each plot, students were asked to visually determine whether the bi-variate relationship shown was statistically significant at the 0.05 level.
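A hypothetical sketch (not the study's actual code) of how such plot data and their ground-truth significance labels can be simulated; note that the label is a property of the drawn sample itself, not just of the generating model:

```python
import numpy as np
from scipy import stats

def make_stimulus(n, slope, rng):
    """Simulate one scatterplot's data plus its ground-truth label:
    is the plotted sample itself significant at the 0.05 level?"""
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)   # slope=0 gives a null relationship
    _, p_value = stats.pearsonr(x, y)
    return x, y, p_value < 0.05

rng = np.random.default_rng(0)
x, y, significant = make_stimulus(n=100, slope=1.0, rng=rng)
```

A subject's answer for a plot is then scored against the `significant` flag computed from the very data they were shown.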

To analyze responses, we created separate models for the probability of correctly visually classifying significance in: (1) graphs that showed two variables with a statistically significant relationship, and (2) graphs that showed two variables with no statistically significant relationship.

Point estimates and confidence intervals for classification accuracy were computed for each presentation style.
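Accuracy estimates of this kind are binomial proportions, so one standard choice of interval is the Wilson score interval. A sketch with made-up counts (140 correct of 200 is purely illustrative, not a result from the study):

```python
import math

def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a classification accuracy rate."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical: 140 correct classifications out of 200 plots
lo, hi = wilson_interval(140, 200)
```

Unlike the naive Wald interval, the Wilson interval behaves sensibly for accuracy rates near 0 or 1, which matters when some presentation categories are classified almost perfectly.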

We found that, overall, subjects tended to be conservative in their classifications of significance; this conservatism was evident even in the reference category (100 data points).

When comparing the reference plot category of 100 data points to the other plot categories, we observed a counter-balancing trend: presentation changes that improved accuracy on truly significant relationships tended to reduce accuracy on truly non-significant relationships, and vice versa.

The exception to this counter-balancing trend came in the “Larger n” category (200 data points).

To test if accuracy in visually classifying significance improved with practice, we selected only the students who submitted the quiz multiple times (101 students) and compared accuracy rates between these students’ first and second attempts. Of these students, 92% completed their first attempt of the survey, and 99% completed their second attempt of the survey. Because these students self-selected to take the survey twice, they may not form a representative sample of the broader population. However, they may still be representative of motivated students who wish to improve their statistical skills.
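One simple way to compare accuracy rates between a first and second attempt is a pooled two-proportion z-test. A sketch with hypothetical counts (both the function and the numbers are ours, not the study's):

```python
import math

from scipy import stats

def two_prop_ztest(correct1, n1, correct2, n2):
    """Two-sided pooled z-test for a change in accuracy between attempts."""
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * stats.norm.sf(abs(z))

# Hypothetical counts: 60% accuracy on attempt 1, 75% on attempt 2
p = two_prop_ztest(120, 200, 150, 200)   # well below 0.05
```

With self-selected repeat takers, any such comparison estimates improvement in this motivated subgroup, not in the full population, as noted above.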

Each plot shows point estimates and confidence intervals for accuracy rates of human visual classifications of statistical significance on the first and second attempts of the survey, shown separately for truly significant and truly non-significant underlying relationships.

We found that, for the “Reference”, “Best Fit”, and “Smaller n” categories, accuracy improved between the first and second attempts.

Our research focuses on the question of how accurately statistical significance can be visually perceived in scatterplots of raw data. This work is a logical extension of previous studies on the visual perception of correlation in raw data scatterplots.

Our results also suggest that, on average, readers can improve their ability to visually perceive statistical significance through practice. Our intuition is that this improvement is better explained by an improved understanding of what effect sizes constitute significant relationships, rather than an improved ability to visually distinguish these effect sizes. It would follow that the apparently poor baseline accuracy in visually detecting significance is largely due to a false intuition for what constitutes a significant relationship. A broad movement toward practicing the task of visually classifying significance could improve this intuition, and improve the efficiency and clarity of communication in science.

This work is also relevant to the debate over the misuse of exploratory data analysis (EDA). It has been argued that when EDA and formal hypothesis testing are applied to the same dataset, the “data snooping” committed through the EDA process can inflate the Type I error rates of the formal hypothesis tests.
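This inflation is easy to demonstrate by simulation. The sketch below (our illustration, not the study's code) picks the most correlated of ten pure-noise predictors — a stand-in for eyeballing many scatterplots — and then formally tests the winner on the same data:

```python
import numpy as np
from scipy import stats

def snooped_false_positive_rate(n=50, n_candidates=10, n_sims=500, seed=0):
    """Monte Carlo estimate of the Type I error rate when the most
    correlated of several pure-noise predictors is selected via EDA
    and then formally tested against the same data."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        y = rng.normal(size=n)
        X = rng.normal(size=(n_candidates, n))
        # "EDA" step: pick the candidate most correlated with y
        r = [abs(stats.pearsonr(x, y)[0]) for x in X]
        best = X[int(np.argmax(r))]
        # Formal test on the same data the selection used
        if stats.pearsonr(best, y)[1] < 0.05:
            hits += 1
    return hits / n_sims

rate = snooped_false_positive_rate()   # typically around 0.4, far above 0.05
```

Because ten independent null tests are screened, the chance the winner clears the 0.05 threshold is roughly 1 − 0.95^10 ≈ 0.40, eight times the nominal error rate.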

Data analysis involves the application of statistical methods. Our study highlights that even when the theoretical properties of a statistic are well understood, its actual behavior in the hands of data analysts may not be known. This underscores the need to place the practice of data analysis on a firm scientific footing through experimentation. We call this idea evidence-based data analysis, as it closely parallels evidence-based medicine, the term for scientifically studying the impact of clinical practice. Evidence-based data analysis studies the practical efficacy of a broad range of statistical methods when used, sometimes imperfectly, by analysts with different levels of statistical training. Further research in evidence-based data analysis may be one way to reduce the well-documented problems with reproducibility and replicability of complicated data analyses.

The authors declare there are no competing interests.

The following information was supplied relating to ethical approvals (i.e., approving body and any reference numbers):

Johns Hopkins Bloomberg School of Public Health IRB Approval number: IRB00005072, 45 CFR 46.101(b)(4).

The following information was supplied regarding the deposition of related data: