Peer Review #2 of "Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career (v0.2)"

Anonymous student evaluations of teaching (SETs) are used by colleges and universities to measure teaching effectiveness and to make decisions about faculty hiring, firing, re-appointment, promotion, tenure, and merit pay. Although numerous studies have found that SETs correlate with various teaching effectiveness irrelevant factors (TEIFs) such as subject, class size, and grading standards, it has been argued that such correlations are small and do not undermine the validity of SETs as measures of professors' teaching effectiveness. However, previous research has generally used inappropriate parametric statistics and effect sizes to examine and to evaluate the significance of TEIFs on personnel decisions. Accordingly, we examined the influence of quantitative vs. non-quantitative courses on SET ratings and SET-based personnel decisions using 14,872 publicly posted class evaluations, where each evaluation represents a summary of SET ratings provided by individual students responding in each class. In total, 325,538 individual student evaluations from a US mid-size university contributed to these class evaluations. The results demonstrate that class subject (math vs. English) is strongly associated with SET ratings, has a substantial impact on professors being labeled satisfactory vs. unsatisfactory and excellent vs. non-excellent, and that the impact varies substantially depending on the criteria used to classify professors as satisfactory vs. unsatisfactory. Professors teaching quantitative courses are far more likely not to receive tenure, promotion, and/or merit pay when their performance is evaluated against common standards.
Authors: Bob Uttl, Department of Psychology, Mount Royal University, Calgary, Alberta, Canada; Dylan Smibert, Department of Psychology, Saint Mary's University, Halifax, Nova Scotia, Canada
Corresponding author: Bob Uttl, Address: Department of Psychology, Mount Royal University, 4825 Mount Royal Gate SW, Calgary, AB, Canada; Phone: +1-403-440-8539; Email: uttlbob@gmail.com


Introduction
Anonymous student evaluations of teaching (SETs) are used by colleges and universities to measure teaching effectiveness and to make decisions about faculty hiring, firing, re-appointment, promotion, tenure, and merit pay. Although SETs are relatively reliable when average ratings across five or more courses (depending on class size) are used, their validity has been questioned. Specifically, numerous studies have found that SETs correlate with various teaching effectiveness irrelevant factors (TEIFs) such as class size (Benton & Cashin, 2012), subject (Benton & Cashin, 2012), and professor hotness/sexiness (Felton, Koper, Mitchell, & Stinson, 2008; Felton, Mitchell, & Stinson, 2004). However, it is often argued that correlations between TEIFs and SETs are small and therefore do not undermine the validity of SETs (Beran & Violato, 2005; Centra, 2009). To illustrate, Beran and Violato (2005) examined correlations between several TEIFs and SETs using over 370,000 individual student ratings. Although they reported d = 0.61 between ratings of courses in natural vs. social science, they further analyzed their data using regression analyses and concluded that course characteristics, including the discipline, were not important. They wrote: "From examining numerous student and course characteristics as possible correlates of student ratings, results from the present study suggest they are not important factors." (p. 599). Similarly, using Educational Testing Service data from 238,471 classes, Centra (2009) found that the natural sciences, mathematics, engineering, and computer science courses were rated about 0.30 standard deviation lower than courses in the humanities (English, history, languages) and concluded that "a third of a standard deviation does not have much practical significance". If so, one may argue, SETs are both reliable and valid, and TEIFs can be ignored by administrators when making judgments about faculty's teaching effectiveness for personnel decisions.
However, SET research has been plagued by several unrecognized methodological shortcomings that render much of the previous research on reliability, validity, and other aspects of SETs invalid and uninterpretable. First, SET rating distributions are typically strongly negatively skewed due to severe ceiling effects, that is, due to a large proportion of students giving professors the highest possible ratings. In turn, it is inappropriate and invalid to describe and analyze these ceiling-limited ratings using parametric statistics that assume a normal distribution of data (i.e., means, SDs, ds, rs, r²; see Uttl (2005) for an extensive discussion of the problems associated with severe ceiling effects, including detection of ceiling effects and consequences of ceiling effects). Yet, all of the studies we have examined to date do precisely that -- use means, SDs, ds, rs, and r² to describe SETs and to investigate associations between SETs and TEIFs.
Second, when making judgments about the practical significance of associations between TEIFs and SETs, researchers typically rely on various parametric effect size indexes such as ds, rs, and r² or proportion of variance explained and, after finding them to be small, conclude that TEIFs are ignorable and do not undermine the validity of SETs. However, it has been argued elsewhere that effect size indexes should be chosen based not only on the statistical properties of data but also on their relationship to practical or clinically significant outcomes (Bond, Wiitala, & Richard, 2003; Deeks, 2002). Given that SETs are used to make primarily binary decisions about whether a professor's teaching effectiveness is "satisfactory" or "unsatisfactory", the most appropriate effect size indexes may be relative risk ratios (RR) or odds ratios (OR) of professors passing the "satisfactory" cut-off as a function of, for example, their teaching quantitative vs. non-quantitative courses, rather than ds, rs, and r² (Deeks, 2002).
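The two binary-outcome indexes named above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the counts are hypothetical and are not taken from the study's data or from Table 2.

```python
def risk_ratio_and_odds_ratio(fail_a, total_a, fail_b, total_b):
    """Return (RR, OR) of failing a cut-off for group A relative to group B.

    RR compares the proportions failing; OR compares the odds of failing.
    """
    risk_a = fail_a / total_a
    risk_b = fail_b / total_b
    odds_a = fail_a / (total_a - fail_a)
    odds_b = fail_b / (total_b - fail_b)
    return risk_a / risk_b, odds_a / odds_b


# Hypothetical counts (assumption, for illustration only):
# 60 of 100 quantitative classes and 30 of 100 non-quantitative classes
# fall below a "satisfactory" cut-off.
rr, odds_ratio = risk_ratio_and_odds_ratio(60, 100, 30, 100)
print(f"relative risk of failing = {rr:.2f}, odds ratio = {odds_ratio:.2f}")
```

Both indexes are expressed directly in terms of the pass/fail outcome, which is why they map more naturally onto personnel decisions than d or r².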
Third, researchers sometimes evaluate the importance of various factors based on correlation and regression analyses of SET ratings given by individual students (individual student SET ratings) rather than on the mean SET ratings given by all responding students in each class (class SET summary ratings). However, the proportion of variance explained by some characteristic in individual student SET ratings is not relevant to the effect the characteristic may have on the class SET summary ratings that are used to make personnel decisions about faculty members. For example, Beran and Violato (2005) drew their conclusion that various student and course characteristics "are not important factors" from regression analyses over individual student SET ratings.
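The unit-of-analysis point can be illustrated with a small sketch that computes the proportion of variance explained by course type twice: once over individual student ratings and once over class means. All ratings below are hypothetical (an assumption made for illustration), and the helper name `variance_explained` is not from the source.

```python
def variance_explained(groups):
    """Between-group sum of squares divided by total sum of squares.

    groups: dict mapping a group label to a list of numeric ratings.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical individual student ratings, one inner list per class.
math_classes = [[2, 3, 4, 4, 5], [3, 3, 4, 5, 5], [1, 4, 4, 5, 5]]
english_classes = [[3, 4, 5, 5, 5], [4, 4, 4, 5, 5], [2, 5, 5, 5, 5]]

# Unit of analysis 1: individual student ratings.
individual_r2 = variance_explained({
    "math": [r for c in math_classes for r in c],
    "english": [r for c in english_classes for r in c],
})

# Unit of analysis 2: class summary (mean) ratings.
class_r2 = variance_explained({
    "math": [sum(c) / len(c) for c in math_classes],
    "english": [sum(c) / len(c) for c in english_classes],
})

print(f"individual-level r^2 = {individual_r2:.3f}")
print(f"class-level r^2 = {class_r2:.3f}")
```

In this toy data the same subject difference accounts for a small share of variance among individual ratings but a large share among class means, because within-class disagreement among students is averaged out of the class summaries.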
Accordingly, we re-examined the influence of one TEIF -- teaching quantitative vs. non-quantitative courses -- on SET ratings and SET-based personnel decisions in a large sample of class summary evaluations from a midsize US university. We had two primary objectives. First, what is the relationship between course subject and SET ratings? Specifically, what is the distribution of SET ratings obtained by Math (and Stats) professors vs. professors in other fields such as English, History, and Psychology? Second, what are the consequences of course subject for judgments about professors' teaching effectiveness? Specifically, what percentage of professors teaching Math vs. professors teaching other subjects pass the satisfactory cut-off determined by the mean SET ratings across all courses or other norm-referenced cut-offs that ignore course subject?
In addition, we also examined how personnel decisions about professors might be affected if criterion-referenced, label-based cut-offs were used instead of norm-referenced cut-offs. In many universities, SET questionnaires use Likert response scales where students indicate their degree of agreement with various statements purportedly measuring teaching effectiveness. Professors' teaching effectiveness is then evaluated against various norm-referenced cut-offs such as the departmental mean, mean minus one standard deviation (e.g., 4.0 on a 5-point scale), or perhaps a cut-off determined by the 20th percentile of all ratings such as 3.5 on a 5-point scale. In other universities, SETs use label-based response scales where students indicate whether a particular aspect of instruction was, for example, "Poor", "Fair", "Good", "Very Good", or "Excellent". Here, if students rate professors as "Poor", then, arguably, to the extent to which SETs measure teaching effectiveness (a contentious issue on its own), a professor's teaching effectiveness is not satisfactory. If students rate a professor as "Fair", the plain meaning of this term is "sufficient but not ample" or "adequate" (Merriam-Webster online dictionary) or "satisfactory". Presumably, if students rate professors as "Good" or higher, professors should be more than "satisfactory" and those rated as "Excellent" are deserving of teaching awards. In contrast to Likert response scales, label-based response scales directly elicit clearly interpretable evaluation judgments from students themselves.

Method
We obtained 14,872 class summary evaluations, with each representing a summary of SET ratings provided by individual students responding in each class in a US midsize university (New York University or NYU). In total, 325,538 individual student SET ratings contributed to the 14,872 class summary evaluations. The unit of analysis used in this study is the class summary evaluation. The class summary evaluations were posted on the university's website (www.nyu.edu), available to the general public (rather than to registered students only), and were downloaded in the first quarter of 2008. Table 1 shows the individual questions on the NYU SET forms used to evaluate teaching effectiveness on a 5-point scale where 1 = Poor and 5 = Excellent. The mean ratings across all nine items and course subject (e.g., English, Math, History) were extracted from the evaluations and used in all analyses. The SET evaluations included responses to other questions, including questions on workload, labs, and course retake, that are not considered in this report. No ethics review was required for this research because all data were available to the general public in the form of archival records. Table 1 also shows the means and standard deviations for individual SET items across all 14,872 courses as well as the mean overall average (i.e., the average calculated for each course across the 9 individual items). Item mean ratings ranged from 3.90 to 4.37 with SDs ranging from 0.52 to 0.63.

Results
The mean overall SET rating was 4.13 with SD = 0.50. Figure 1 shows the smoothed density distributions of overall mean ratings for all courses and for courses in selected subjects -- English, History, Psychology, and Math -- including the means and standard deviations. This figure highlights that (1) distributions of ratings are negatively skewed for most of the selected subjects due to ceiling effects, (2) distributions of ratings differ substantially across disciplines, and (3) mean ratings vary substantially across disciplines and are shifted towards lower values by ratings in the tails of the distributions. The density distributions in Figure 1 were generated using the R function density() with the smoothing kernel set to "gaussian" and the number of equally spaced points at which the density was estimated set to 512 (R Core Team, 2015). Figure 2 shows the density distributions for Math (representing quantitative courses) and English (representing humanities, non-quantitative courses). The thick vertical line indicates one of the often-used norm-referenced standards for effective teaching -- the overall mean rating across all courses. The thinner vertical lines show the overall mean ratings for Math and English, respectively. This figure highlights that although 71% of English courses pass the overall mean as the standard, only 21% of Math courses do so. The vast majority of Math courses (79%) earn their professors an "Unsatisfactory" label in this scenario. Figure 3 shows the same density distributions for Math and English, but the vertical lines indicate criterion-referenced cut-offs for different levels of teaching effectiveness -- Poor, Fair, Good, Very Good, and Excellent -- as determined by students themselves. It can be seen that Math courses are far less likely than English courses to pass the high (Very Good and Excellent) criteria. Figure 4 shows the percentage of courses passing criteria as a function of teaching effectiveness criteria.
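For readers without R at hand, the kind of smoothing described above can be approximated with a direct Gaussian kernel density estimate evaluated at 512 equally spaced points. This is a rough sketch, not the authors' code: the class means are hypothetical, and the rule-of-thumb bandwidth and the three-bandwidth extension of the grid are assumptions, since the text does not state which bandwidth settings were used.

```python
import math
import statistics


def gaussian_kde(values, n_points=512, cut=3.0):
    """Gaussian kernel density estimate on an equally spaced grid.

    Bandwidth: Silverman's rule of thumb (assumed, not stated in the text).
    The grid extends `cut` bandwidths beyond the observed range.
    """
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    bw = 0.9 * spread * n ** -0.2

    lo = min(values) - cut * bw
    hi = max(values) + cut * bw
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]

    norm = 1.0 / (n * bw * math.sqrt(2.0 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
        for x in grid
    ]
    return grid, density


# Hypothetical class summary ratings (illustration only).
class_means = [3.2, 3.6, 3.9, 4.1, 4.2, 4.4, 4.5, 4.6, 4.7, 4.8]
grid, density = gaussian_kde(class_means)

# Trapezoidal check that the estimated density integrates to about 1.
step = grid[1] - grid[0]
area = sum((density[i] + density[i + 1]) / 2.0 * step for i in range(len(grid) - 1))
print(f"{len(grid)} grid points, area under curve = {area:.3f}")
```

Note that a Gaussian kernel places some estimated density above the scale maximum of 5 when many class means sit near the ceiling, so smoothed curves of ceiling-limited ratings should be read with that in mind.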
If the teaching effectiveness criteria are set at 2.5 ("Good"), the vast majority of both Math and English courses pass this bar (96.60 vs. 99.63%, respectively). However, as the criteria are set higher and higher, the gap between Math and English passing rates widens and narrows only at the high criteria end where a few English and no Math courses pass the criteria. Table 2 shows the percentages of course SETs passing and failing different commonly used norm-referenced teaching effectiveness criteria as well as label-based criterion-referenced standards, for Math and English courses. The table includes the relative risk ratios of Math vs. English courses failing the standards. Math courses are far less likely than English courses to pass various standards except the label-based, criterion-referenced "Fair" and "Good" standards.
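The passing-rate curve described above amounts to counting, for each cut-off, the share of class summary ratings that reach it. The sketch below uses hypothetical class means and assumes that "passing" means a rating at or above the cut-off and that label cut-offs sit at the half-points of the 5-point scale (the text gives 2.5 for "Good"; the other values are extrapolated from that).

```python
def pass_rate(class_means, cutoff):
    """Percentage of class summary ratings at or above the cut-off."""
    return 100.0 * sum(m >= cutoff for m in class_means) / len(class_means)


# Hypothetical class summary ratings (illustration only).
math_means = [2.4, 3.1, 3.4, 3.6, 3.8, 3.9, 4.0, 4.2, 4.3, 4.4]
english_means = [3.2, 3.9, 4.1, 4.3, 4.4, 4.5, 4.6, 4.6, 4.7, 4.8]

# Assumed label-based cut-offs on the 1 (Poor) to 5 (Excellent) scale.
label_cutoffs = {"Fair": 1.5, "Good": 2.5, "Very Good": 3.5, "Excellent": 4.5}

for label, cutoff in label_cutoffs.items():
    print(
        f"{label:>9} (>= {cutoff}): "
        f"Math {pass_rate(math_means, cutoff):5.1f}%  "
        f"English {pass_rate(english_means, cutoff):5.1f}%"
    )
```

Evaluating `pass_rate` over a fine grid of cut-offs instead of four labels would trace out a curve of the kind shown in Figure 4.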

Discussion
Our results show that Math classes received much lower average class summary ratings than English, History, Psychology or even all other classes combined, replicating previous findings showing that quantitative vs. non-quantitative classes receive lower SET ratings (Beran & Violato, 2005; Centra, 2009). More importantly, the distributions of SET ratings for quantitative vs. non-quantitative courses are substantially different. Whereas the SET distributions for non-quantitative courses show a typical negative skew and high mean ratings, the SET distributions for quantitative courses are less skewed, nearly normal, and have substantially lower ratings. The passing rates for various common standards for "effective teaching" are substantially lower for professors teaching quantitative vs. non-quantitative courses. Professors teaching quantitative vs. non-quantitative courses are far more likely to fail norm-referenced cut-offs -- 1.88 times more likely to fail the Overall Mean standard, 2.89 times more likely to fail the Overall Mean minus 1 SD standard -- and far more likely to fail criterion-referenced standards -- 1.27 times more likely to fail the "Excellent" standard, 3.17 times more likely to fail the "Very Good" standard, and 6.02 times more likely to fail the "Good" standard. Clearly, professors who teach quantitative vs. non-quantitative classes are not only likely to receive lower SETs but are also at a substantially higher risk of being labeled "unsatisfactory" in teaching, and thus more likely to be fired, not re-appointed, not promoted, not tenured, and denied merit pay.
Regarding norm-referenced vs. criterion-referenced standards, our results show that criterion-referenced standards label fewer professors as unsatisfactory than norm-referenced standards. Table 2 suggests that, in part due to substantially negatively skewed distributions of SET ratings, the norm-referenced Overall Mean standard will result in 43.0% of classes failing to meet the standard, the Overall Mean minus 1 SD standard will result in 15.5% of classes not meeting it, and the Overall Mean minus 2 SD standard will result in 4.3% of classes failing this standard. In contrast, using students' judgments on the anchored scale, 99.3% of courses are considered "Good", "Very Good", or "Excellent" and only 0.7% of courses fail to meet the "Good" standard in students' opinion. In other words, use of norm-referenced standards results in labeling much greater percentages of professors as unsatisfactory than students themselves label as unsatisfactory. Moreover, professors teaching quantitative vs. non-quantitative courses are less likely to pass the standard under both types of standards.
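One way to see how skew affects the norm-referenced failure rates is to set the percentages reported above beside what a normal distribution would produce at the same cut-offs. The sketch below does only that comparison; it takes the three reported percentages from the text and makes no further use of the study's data.

```python
from statistics import NormalDist

# Failure percentages reported in the text for all classes combined.
reported_fail_pct = {
    "Overall Mean": 43.0,
    "Overall Mean - 1 SD": 15.5,
    "Overall Mean - 2 SD": 4.3,
}

# Position of each cut-off in standard deviation units.
z_scores = {
    "Overall Mean": 0.0,
    "Overall Mean - 1 SD": -1.0,
    "Overall Mean - 2 SD": -2.0,
}


def normal_fail_pct(z):
    """Percentage of a normal distribution falling below cut-off z."""
    return 100.0 * NormalDist().cdf(z)


comparison = {
    name: (reported_fail_pct[name], normal_fail_pct(z))
    for name, z in z_scores.items()
}

for name, (reported, expected) in comparison.items():
    print(f"{name:<20} reported {reported:5.1f}%   normal {expected:5.1f}%")
```

Under normality half of all classes would fall below the mean, whereas 43.0% do here, and more classes than a normal curve would predict fall below the mean minus 2 SD, a pattern consistent with a long left tail.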
Why has previous research often concluded that TEIFs, such as the courses one is assigned to teach, do not relate to SETs in any substantive way and are ignorable in evaluating professors for tenure, promotion, and merit pay? There are several methodological explanations. First, SET ratings often have non-normal, negatively skewed distributions due to severe ceiling effects. In turn, ds, rs, and r²-based effect size indexes are attenuated, invalid, and inappropriately suggest that the influence of course subject on SETs is minimal. Second, parametric effect size indexes such as ds, rs, and r² assume normal distributions and are inappropriate for binary "meets standard"/"does not meet standard" decision situations such as tenure, promotion, and merit pay decisions (Deeks, 2002). Third, some researchers used individual student SET ratings rather than class summary evaluations as the unit of analysis. However, using individual student SET ratings as the unit of analysis is inappropriate in this context because summative decisions are made based on class summary evaluations rather than on individual student evaluations.
In terms of inappropriate effect sizes such as d or r², our results are generally larger than those reported by Centra (2009), who used class summary evaluations from numerous institutions, and than those reported by Beran and Violato (2005), who used individual SET ratings from a single university. We found d = 1.29 between Math vs. English SET ratings, Centra (2009) found d = .30, and Beran and Violato (2005) found d = 0.60 between "natural sciences" vs. "social sciences" SET ratings. Our correlational analysis showed r = 0.18 (r² = 0.04) between Math vs. Non-Math and SET ratings, whereas Beran and Violato (2005) found that this and other factors together accounted for less than 1% of the variance (i.e., r² < 0.01).
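For reference, the parametric indexes discussed above can be computed as follows. The sketch shows Cohen's d with a pooled standard deviation and the standard large-sample conversion from d to a point-biserial r, which depends on the proportion of cases in each group. The ratings and the group proportion are hypothetical; in particular, the 2% share used below is an assumption chosen for illustration, not a figure from the study.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * statistics.variance(group_a)
        + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


def d_to_point_biserial_r(d, p):
    """Large-sample conversion of d to r when a proportion p is in one group."""
    return d / math.sqrt(d ** 2 + 1.0 / (p * (1.0 - p)))


# Hypothetical class summary ratings (illustration only).
english = [4.0, 4.2, 4.4, 4.6, 4.8]
math_ratings = [3.4, 3.6, 3.8, 4.0, 4.2]
d = cohens_d(english, math_ratings)
print(f"d = {d:.2f}")

# The same d yields a very different r depending on group proportions.
r_balanced = d_to_point_biserial_r(1.29, 0.5)
r_small_group = d_to_point_biserial_r(1.29, 0.02)  # hypothetical 2% share
print(f"r with equal groups = {r_balanced:.2f}, r with a 2% group = {r_small_group:.2f}")
```

The conversion shows why a large standardized mean difference can coexist with a small r or r²: when one group makes up only a small share of all classes, it contributes little to the total variance no matter how far its mean is shifted.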
However, in contrast to previous research, we examined the impact of the courses one is assigned to teach on the likelihood that one is going to pass the standard and be promoted, tenured, and/or given merit pay, and we found the impact to be substantial. Professors teaching quantitative courses are far less likely to be tenured, promoted, and/or given merit pay when their class summary ratings are evaluated against common standards, that is, when the field one is assigned to teach is disregarded. They are also far less likely to receive teaching awards based on their class summary SET ratings. The impact of using common standards may vary depending on whether a university uses standards based on SET ratings of all professors across the entire university (university-based standards) or standards based on SET ratings of all professors within each department only (department-based standards). If all or nearly all professors within the same department teach quantitative courses (e.g., math and statistics departments), the impact of using common vs. course-type-specific department-based standards to evaluate professors teaching quantitative vs. non-quantitative courses may be minimal. In contrast, if a few professors teach quantitative courses and the majority of professors teach non-quantitative courses within the same department (e.g., psychology, sociology), the impact of using common vs. course-type-specific department-based standards may be as large as or even larger than if university-based standards were used.
Of course, the finding that professors teaching quantitative vs. non-quantitative courses receive lower SET ratings is not evidence, by itself, that SETs are biased, that use of the common standards is inappropriate and discriminatory, and that more frequent denial of tenure, promotion, and/or merit pay to professors teaching quantitative vs. non-quantitative courses is in any way problematic. The lower SET ratings of professors teaching quantitative vs. non-quantitative courses may be due to real differences in teaching, that is, due to professors teaching quantitative courses being less effective teachers than professors teaching non-quantitative courses.
However, lower SET ratings of professors teaching quantitative vs. non-quantitative courses may be due to a number of factors unrelated to professors' teaching effectiveness, for example, students' lack of basic numeracy, students' lack of interest in taking quantitative vs. non-quantitative courses, students' math anxiety, etc. Numerous research studies, task forces, and government-sponsored studies have documented steady declines in numeracy and mathematical knowledge of populations worldwide. For example, Orpwood and Brown (2015) cite the 2013 OECD survey showing that numeracy among Canadians declined over the last decade and that more than half of Canadians now score below the level required to fully participate in a modern society. We (Uttl, White, & Morin, 2013) found that students' interest in taking quantitative courses such as introductory statistics was six standard deviations below their interest in taking non-quantitative courses. Fewer than 10 out of 340 students indicated that they were "very interested" in taking any of the three statistics courses. In contrast, 159 out of 340 were "very interested" in taking the Introduction to the Psychology of Abnormal Behavior course. Moreover, this effect was stronger for women than for men: women's interest in taking quantitative courses relative to their interest in non-quantitative courses was even lower than that of men. This lack of interest in quantitative courses propagates to a lack of student interest in pursuing graduate studies in quantitative methods and a lack of quantitative psychologists to fill all available positions. For example, the American Psychological Association noted that already in the 1990s there were on average 2.5 quantitative psychology positions advertised for every quantitative psychology PhD graduate (APA, 2009).
If SETs are biased or even perceived as biased against professors teaching quantitative courses, we may soon find out that no one will be willing to teach quantitative courses if they are evaluated against the common standard set principally by professors who teach non-quantitative courses.
Thus, the critical question is: Are SETs valid measures of teaching effectiveness, and if so, are they equally valid when used with quantitative vs. non-quantitative courses, or are they biased? Although SETs are widely used to evaluate faculty's teaching effectiveness, their validity has been highly controversial. The strongest evidence for the validity of SETs as a measure of professors' teaching effectiveness came from so-called multi-section studies showing small-to-moderate correlations between class summary SET ratings and class average achievement (Uttl, White, & Gonzalez, 2016). Cohen (1981) conducted the first meta-analysis of multi-section studies, reported that SETs correlate with student learning with r = .43, and concluded "The results of the meta-analysis provide strong support for the validity of student ratings as a measure of teaching effectiveness" (p. 281). Cohen's (1981) findings were confirmed and extended by several subsequent meta-analyses (Uttl et al., 2016). However, our recent re-analyses of the previous meta-analyses of multi-section studies found that their findings were artifacts of small study bias and other methodological issues. Moreover, our up-to-date meta-analysis of 97 multi-section studies revealed no significant correlation between the class summary SET ratings and learning/achievement (Uttl et al., 2016). Thus, the strongest evidence of SET validity -- multi-section studies -- turned out to be evidence of SETs having zero correlation with achievement/learning. Moreover, to our knowledge, no one has examined directly whether SETs are equally valid or biased measures of teaching effectiveness in quantitative vs. non-quantitative courses.
Even the definition of effective teaching implicit in multi-section study designs -- a professor whose students score highest on the common exam administered in several sections of the same course is the most effective teacher -- has been agreed on only for lack of a better definition.
The basic principles of fairness require that the validity of a measure used to make high-stakes personnel decisions be established before the measure is put into widespread use, and that the validity of the measure be established in all the different contexts in which the measure is to be used (AERA, APA, & NCME, 2014; APA, 2004). Given the evidence of zero correlation between SETs and achievement in multi-section studies, SETs should not be used to evaluate faculty's teaching effectiveness. However, if SETs are to be used in high-stakes personnel decisions -- even though students do not learn more from more highly rated professors and even though we do not know what SETs actually measure -- fairness requires that we evaluate a professor teaching a particular subject against other professors teaching the same subject rather than against some common standard. Used this way, SET ratings can at least tell us where a professor stands within the distribution of other professors teaching the same subjects, regardless of what SETs actually measure.

Conclusion
Our results demonstrate that course subject is strongly associated with SET ratings and has a substantial impact on professors being labeled satisfactory/unsatisfactory and excellent/non-excellent. Professors teaching quantitative courses are far more likely not to receive tenure, promotion, and/or merit pay when their performance is evaluated against common standards. Moreover, they are unlikely to receive teaching awards. Whether the effect of some TEIFs is ignorable or unimportant should be evaluated using effect size measures that closely correspond to how SETs are used to make high-stakes personnel decisions, such as passing rates and relative risks of failure, rather than ds or rs. A professor assigned to teaching introductory statistics courses may find little solace in knowing that teaching quantitative vs. non-quantitative courses explains at most 1% of variance in some regression analyses of SET ratings (Beran & Violato, 2005) or that in some experts' opinion d = .30 is ignorable (Centra, 2009) when his or her chances of passing the department's norm-based cut-off for "satisfactory" teaching may be less than half those of his or her colleagues.

Manuscript to be reviewed
PeerJ reviewing PDF | (2015:07:5683:2:1:NEW 4 Apr 2017)

Figure 2. Distributions of overall mean ratings for Math vs. English. The thick vertical line indicates one of the often-used norm-referenced standards for effective teaching -- the overall mean rating across all courses. The thinner vertical lines show the overall mean ratings for Math and English, respectively. Although 71% of English courses pass the overall mean as the standard, only 21% of Math courses do so. The vast majority of Math courses (79%) earn their professors an "Unsatisfactory" label in this scenario.

Figure 3. Distribution of overall mean ratings for Math vs. English with criterion-referenced cut-offs. The vertical lines indicate criterion-referenced cut-offs for different levels of teaching effectiveness -- Poor, Fair, Good, Very Good, and Excellent -- as determined by students themselves. It can be seen that Math courses are far less likely than English courses to pass the high (Very Good and Excellent) criteria.

Figure 4. Percentage of courses passing criteria as a function of teaching effectiveness criteria. If the teaching effectiveness criteria are set at 2.5 ("Good"), the vast majority of both Math and English courses pass this bar (96.60 vs. 99.63%, respectively). However, as the criteria are set higher and higher, the gap between Math and English passing rates widens and narrows only at the high criteria end where a few English and no Math courses pass the criteria.

Table 1. SET Questions, Mean Ratings, and Standard Deviations.