On Reading Entrails and Student Evaluations

In Cracks in the Ivory Tower, we note that empirical work overwhelmingly shows that student evaluations are reliable but invalid measures of teaching effectiveness. Some early results were positive, but as sophistication in using statistical techniques and controlling for confounding variables increased, the published results became ever more negative.

So, why do universities insist on using them? Some hypotheses:

  1. The SET is relatively cheap and easy to implement.
  2. Valid measures of faculty effectiveness are expensive and difficult to implement.
  3. As a corollary of 1 and 2: Since nearly every American university and college already uses SET scores, the architecture for collecting these scores is already in place. But replacing the SET requires finding and implementing a new system.
  4. Some people’s jobs (including those of certain administrators, staff members, and software developers) depend on universities continuing to use these SET scores. So, they will actively lobby to keep using them.
  5. Perhaps students generally do not know the SET is invalid, and so using the SET helps the university trick students into thinking the university cares about and is trying to improve teaching.
  6. Even though the SET is invalid, so long as administrators continue to collectively behave as if it were valid, they can use SET scores to wield power over and control faculty. 

An excerpt from chapter 4 of Cracks in the Ivory Tower:

How Divination Rituals Do and Don’t Work

            The haruspex slices open the bleating lamb’s abdomen. His expert hands remove the liver in one swift motion. The lamb lies still. He throws the pulsing purple mass on the sacred stone and waits, his eyes fixed. A pattern emerges that only his eyes can see. He sighs with relief and proclaims, “The king will survive.”

            The augur turns his eyes to the sky. For twenty minutes, he watches as flocks of birds sweep overhead. He closes his eyes and listens to their songs. A pattern emerges that only his eyes and ears can detect. He sighs and mutters, “The gods do not support our invasion. I fear the battle will be lost.”

            Many ancient cultures (and some current ones) practiced divination rituals. Certain people (witchdoctors, augurs, haruspices, hepatomancers, oracles, or whatnot) would “read” the movement and sounds of birds, patterns in smoke, images in fire or the entrails of sacrificed animals, or messages hidden in the stars. Often such “seers” occupied high positions in society. They served as advisors to kings, queens, and generals, who made important, sometimes life-or-death, decisions on the basis of their visions.

            But here’s the problem: Taken at face value, the nearly universal practice of divination is bullshit. A sheep’s liver cannot tell you whether Julius Caesar is in danger. The flight of birds cannot tell you whether non-existent gods will aid or hinder your battle. The way tea leaves float tells you nothing about whether an illness will pass or whether your child will be born healthy. Eating black and white impepho does not allow you to speak with dead ancestors. Divination does not work, in the sense that the divination practices fail to provide any evidence for the claims the seers later make. The cultures and individual people who practice divination may be sincere (they often genuinely believe that their rituals allow them to predict the future or communicate with gods and spirits), but they are wrong.

            We say “taken at face value” because there is a way in which these divination rituals are not bullshit. They are invalid as ways of generating knowledge. But these rituals also serve secondary social purposes, and may be quite effective at these social purposes.[i]

For instance, shared sacred rituals and shared beliefs in the divine help to bind a community together; they facilitate trust and mutual cooperation.[ii] (The mechanism: That you adopt an “expensive” belief and practice expensive prayer rituals is evidence you’re one of us, you’re committed to the group, and we can trust you.) Sometimes certain people know the rituals are nonsense, but they enable better positioned, more knowledgeable people to influence or placate the superstitious masses. Sometimes, thanks to the placebo effect, fake medicine cures psychosomatic illness. Sometimes the purpose of the rituals is straightforwardly to allow the powerful to control the less powerful. Possessing sacred knowledge means you know more and have higher status than others, which thereby entitles you to additional power.

Finally, sometimes the reason cultures continue such practices is simple inertia. People just do what their ancestors did, because it’s what they grew up with, and so a practice just continues year after year.

Student Evaluations of Faculty Teaching Effectiveness

            This chapter examines how colleges routinely make faculty hiring, retention, and promotion decisions on the basis of what they ought to know are invalid tests.

Most universities and colleges in the United States ask students to complete course evaluations at the end of each semester. They ask students how much they think they’ve learned, how much they studied, whether the instructor seemed well-prepared, and how valuable the class was overall.

            Colleges and universities use these surveys to make decisions about whom to hire, whom to give a raise (and how much), whom to tenure, and whom to promote. Some colleges (especially small liberal arts colleges where faculty focus almost exclusively on teaching) rely heavily on these results. They may even request past teaching evaluations from new job applicants along with the usual materials such as a cover letter and CV, or academic résumé. At some schools, student course evaluations make or break faculty careers. In contrast, tenure-track faculty at research-heavy R1s are evaluated mostly on their research output. But even at R1s, student evaluations affect promotion, tenure, and raise decisions. Further, many R1s employ a large number of permanent/long-term but non-tenure-track teaching faculty (lecturers, teaching professors, professors of the practice, clinical professors, etc.), and they often evaluate such faculty heavily on the basis of such course evaluations.

            It’s clear that nearly all universities and colleges use student course evaluations.[iii] But we could not find good data quantifying just how universities use the data.

That won’t be important for our argument here, though. Our argument is simple: Student course evaluations do not track teacher effectiveness. Thus, the more you rely upon them, the worse your decisions will be. In this chapter, we’ll argue that teaching evaluations are largely invalid. Using student evaluations to hire, promote, tenure, or determine raises for faculty is roughly on par with reading entrails or tea leaves to make such decisions. (Actually, reading tea leaves would be better; it’s equally bullshit but faster and cheaper.)

            We’ll recite the rather damning evidence about course evaluations. However, given how damning the evidence is, and given that the evidence damning evaluations has been accumulating steadily for forty years, one might wonder why universities continue to use student course evaluations. We’ll end by discussing a number of reasons: Using student evaluations gives some people (administrators) power over others. Student evaluations may placate students and make students believe (in some cases, falsely) that they have control over the university and share in university governance. Administrators may use course evaluations because they believe they must produce something that evaluates teaching, and more effective and reliable methods of evaluation are just too expensive and time-consuming. Finally, it may just be that universities continue to use student evaluations because it’s what they’ve been doing for forty years, and change is hard. Student evals are like other divination rituals: they fail to serve their putative information-gathering purpose, but they serve secondary social purposes.


[i] For an account of why weird behaviors were often functional, see Leeson 2017.

[ii] Simler and Hanson 2018; Haidt 2012.

[iii] Emery, Kramer, and Tian 2001, 37.

