Saturday, May 5, 2007

Student Ratings of Teaching, Part 1

Note: If this long post starts to wear you down, buck up - it is followed by a video of the funniest thing it has been my pleasure to see in a long time - 1:00 of hilarious bunny-filled brilliance. (Thanks, Mom, for telling me about this commercial and "thank you, science.")

In the comments to the post on people’s ability to evaluate an instructor’s personality based on viewing 6 seconds of silent video, Tam asks:

"I also wonder what the results are of studies comparing these factors (the ones that influence student evaluations, or just the results of student evaluations themselves) to actual effectiveness in teaching - i.e., how much students learn."

The easy answer is: it depends on whom you ask. The validity of student evaluations of teachers is a matter of great controversy among researchers (to say nothing of the larger general academic public). At its most extreme, the debate divides into two camps. On one side are those who have made research predicated on student evaluations being reasonably valid measures a prominent aspect of their careers, and who are tempted to go on believing this despite whatever contradictory evidence may appear. On the other side are those who believe that student evaluations are of obvious bogosity, and who are tempted to hold researchers in this area to a standard that is perhaps unfairly rigorous (and that they are unlikely to match in their own areas of interest), such that their opponents can never make a good enough case to satisfy them.

Of course, my own tendencies toward critical assessment make me naturally inclined to be skeptical of student evaluations and a half-assed cursory examination of the literature does not make me any less a tentative ally of those in the second camp.

Here I will discuss at length one particular journal article that I liked a lot (written by Olivares, a critic of student evaluations of teaching), more briefly a second article with a useful run-down of some empirical findings (written by Ahmadi and Cotton), and then give my general thoughts on the subject. (Sources at the end of the post as usual.)

The Olivares article begins with a review of the as-near-to-universally-accepted-as-I-can-imagine definitions of validity. It poses the general question, what would it mean to say that student ratings of teaching (SRTs) are valid? At its most basic level, a valid measure is one that measures what it is supposed to measure. There are several types of validity that psychologists talk about:
- Content validity: Does the measure (SRTs) represent all aspects of teacher effectiveness?
- Criterion validity: Is there a meaningful relationship between the measure (SRTs) and some measure of the relevant behavior (such as “student learning”)? Note that this is related to the question that Tam asks.
- Construct validity: Do SRTs measure a trait or characteristic of interest? Does “teacher effectiveness” exist?

Olivares argues that SRTs do not hold up well to an examination of their validity. One major problem is that without a good definition of “teacher effectiveness,” it becomes next to impossible to judge whether SRTs do a good job of measuring it. In the absence of such a definition, the ratings themselves become the de facto operational definition of teaching effectiveness. He quotes another critic who has issues with the most obvious definition of teacher effectiveness – how much students learn: “The best teaching is not that which produces the most learning, since what is learned may be worthless.”

I am inclined to agree that the lack of a well-formulated definition of teacher effectiveness is problematic, and the pervasive use of SRTs as the de facto operational definition can put the field in the uncomfortable position of being caught in a circular reference: What is teacher effectiveness? What this test of teacher effectiveness measures. I feel sure that most researchers who use SRTs as a measure in their work appreciate the fact that teacher effectiveness is a multi-faceted concept, but I know that there is a strong tendency to privilege in your mind whatever aspect of some complex thing you can measure and do something with. In my opinion, psychologists, who generally work in an experimental mode and hence have a bit better control over their datasets, can be less prone to this than other social scientists* (and even doctors perhaps**), but it is a danger in all of these fields. I think it’s entirely appropriate for curmudgeonly critics (and I am obviously a student member of the Curmudgeonly Critics of America) to occasionally remind researchers of this fact.

* As Robert has said, economists are forced to use whatever data they can find and thus use very strange measures indeed, like tractors-per-capita, as proxies for their variables of interest, and when asked to explain what one of these measures means, are inclined to respond, “I don’t know, but it explains 73% of the variance.”

** My mom recently commented to me that she was starting to wonder if her doctor’s insistent focus on her cholesterol level was a true reflection of the importance of that level to her health or was simply an artifact of cholesterol being something that she could measure.

It is interesting to think about the ways that even “how much students learn” fails as a universally acceptable definition for teacher effectiveness. Even assuming there was some way to get a very good measure of this (using some kind of pre-/post- measure of knowledge and controlling for student variables like intelligence, motivation, study habits, etc., that could impact learning), maximizing the sheer amount of learning is not always the sole (or in some cases, primary) goal of teaching. One aspect that I think is important is a teacher’s ability to stimulate interest in and future study/thinking about a subject in students. It is easy to imagine the instructor who by blunt force crams a significant amount of knowledge into students’ heads long enough to take the exam, but whose students come away hating the subject and eager to forget this boring crap as soon as possible. Depending on the situation, the ideal balance between knowledge and interest may shift, but in most cases, I believe you want to do both.

It’s certainly true that learning a great amount of trivial information is less desirable than mastery of the fundamental concepts of a subject. (For instance, an American history student may be able to regurgitate a large number of names, dates, and places without understanding how any of it ties together.)

Also, it’s possible that in specialized situations, teachers would focus on very different things. For example, a teacher working with students disadvantaged by some combination of circumstances (e.g. socio-economic status) and innate abilities (e.g. learning disability), and with a history of low achievement, may emphasize increasing the students’ motivation to learn and feelings of self-efficacy toward learning at the expense of short-term mastery of the subject material. (By this I mean students who have done poorly in classes that progressed at a normal pace might be placed in a more slowly paced course that allows them to realize, hey, with effort, I can learn something; it just takes me longer to do it.) Most, if not all, junior colleges and some universities spread what at Rice and (to my knowledge) most other schools is a two-semester calculus sequence over three semesters, presumably because they recognize that their students are not prepared to take on this material at the faster pace. On the flip side, as Robert pointed out, other institutions use a “weed out” process to separate those who can advance through difficult material very quickly from the masses who cannot. And even I can see the value of this to an elite program for astronauts, specialized doctors, or that kind of thing.

Another huge issue to me, one that is more methodological than theoretical: effective teaching should result in learning that lasts beyond the final exam, and a measurement taken at the end of the course cannot tell you whether it did.

The article lays out four assumptions about student ratings of teachers that Olivares believes are not sufficiently met:
"- rating forms adequately capture the domain of teacher effectiveness across instructional settings, academic disciplines, instructors and course levels and types;
- students know what effective teaching is, hold a common view of teacher effectiveness, and are objective and reliable sources of teacher effectiveness data;
- relatedly, ratings are, for all intents and purposes, unaffected by potential biasing variables; and, collaterally;
- teacher effectiveness is being measured as opposed to, for example, course difficulty or differences in disciplines, student characteristics, grading leniency, teacher expressiveness, teacher popularity or any number of other variables."

Olivares states (in a sentence I enjoy very much), “To think that students, who have no training in evaluation, are not content experts, and possess myriad idiosyncratic tendencies, would not be susceptible to errors in judgment is specious.” I agree that to the degree that the validity of SRTs is dependent on believing otherwise, the SRT project is doomed.

Ahmadi and Cotton, who conclude in their article that “in general, student ratings tend to be statistically reliable, valid, and relatively free from bias or the need for control,” give a run-down of some findings that may be useful and, more to the point, relevant to answering Tam’s question.

First, they report that studies have found correlations between exam grades and SRTs such that the classes giving higher ratings tend to be the ones where students learned more (i.e. scored higher on an [I believe standardized] exam). However, they acknowledge that many variables related to student learning are themselves related to student ability rather than teacher performance. This might imply that the answer to Tam’s question is “Yes, but…”
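
To see why that caveat matters, here is a small simulated example; the numbers and the model are entirely made up by me (not from Ahmadi and Cotton), but they show how a healthy class-level correlation between exam scores and ratings can appear even when teacher effectiveness plays no role at all and student ability is doing all the work.

```python
# Hypothetical simulation (mine, not from the article): student ability drives
# both exam scores and ratings; teacher effectiveness never enters the model.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_students = 200, 30

# Each class draws its students from a pool with a different average ability.
class_ability = rng.normal(0, 1, n_classes)

mean_exam = np.empty(n_classes)
mean_rating = np.empty(n_classes)
for i, mu in enumerate(class_ability):
    ability = rng.normal(mu, 1, n_students)
    exam = 70 + 10 * ability + rng.normal(0, 5, n_students)        # ability -> exam scores
    rating = 3.5 + 0.3 * ability + rng.normal(0, 0.7, n_students)  # ability -> satisfaction
    mean_exam[i] = exam.mean()
    mean_rating[i] = rating.mean()

r = np.corrcoef(mean_exam, mean_rating)[0, 1]
print(f"class-level correlation between exam scores and ratings: {r:.2f}")
# Comes out strongly positive, even though no teacher taught better than any other.
```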

They list the following factors that have been found to not be related to student ratings: instructor research productivity, age, teaching experience, race, and gender (though there can be interaction effects between student race or gender and teacher race or gender, with students giving higher ratings to instructors with similar characteristics); student age, level (e.g. freshman, grad student), GPA, and personality; class size and time of day.

They also list factors that are related: faculty rank and teacher expressiveness; students’ expected grades and motivation; workload and difficulty (perhaps surprisingly and reassuringly, these correlate positively with ratings, with classes perceived as more difficult getting higher ratings); level of course (higher-level courses are rated more positively); and academic field (humanities and arts > social sciences > math and science).

Getting back to our fellow curmudgeonly critic Olivares: he talks about how, when pressed, many supporters of SRTs will fall back on an argument for their utility; he quotes one who wrote, “Student ratings almost certainly contain useful information that is independent of their correlation with student achievement. That is, student ratings provide information on how well students like a course.”

Of course, this is where yet another of my buttons gets pushed: customer satisfaction. So join me later for the continuation of this discussion, which will focus on the theory of customer satisfaction and the practice of its measurement, SRTs as a customer satisfaction measure, comparisons of c-sat in teaching and in a field I know quite a bit about, viewing students as the single relevant consumer group, the implications of SRTs for teacher behavior, and the use of SRTs. By “use” I do not mean questions like whether Likert scales, which count as ordinal-level data under Stevens’ arguably invalid definitions in measurement theory, can appropriately be reported using parametric statistics (an interesting if highly geeky debate, with implications for the calculation of student GPAs as well), but rather the use and misuse of SRTs (understood as a c-sat measure) as an element in making instructor personnel decisions, such as granting tenure.
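
As a taste of that geeky debate, here is a minimal sketch using made-up ratings for two hypothetical course sections: the same 1-5 Likert responses analyzed once with a parametric test that treats the scale as interval data and once with a rank-based test that treats it as ordinal. The data and the choice of tests are mine, purely for illustration.

```python
# Made-up ratings for two hypothetical sections, on a 1-5 Likert scale.
from scipy import stats

section_a = [5, 4, 4, 5, 3, 4, 5, 4, 2, 5, 4, 4]
section_b = [3, 4, 2, 3, 4, 3, 2, 3, 5, 3, 3, 2]

# Parametric: treats the 1-5 responses as interval data and compares means.
t_stat, t_p = stats.ttest_ind(section_a, section_b)

# Ordinal: compares rank distributions without assuming equal spacing
# between scale points.
u_stat, u_p = stats.mannwhitneyu(section_a, section_b, alternative="two-sided")

print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
# The two approaches can disagree near conventional cutoffs, which is the crux
# of the argument over summarizing Likert items (or GPAs) with means at all.
```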

Sources:

Olivares, Orlando J. A Conceptual and Analytic Critique of Student Ratings of Teachers in the USA with Implications for Teacher Effectiveness and Student Learning. Teaching in Higher Education, Vol. 8, No. 2, 2003, pp. 233–245.

Ahmadi, Reza T., and Samuel E. Cotton. Assessing Students’ Ratings of Faculty. Assessment Update, Vol. 10, No. 5, September–October 1998.

3 comments:

Anonymous said...

I realize that the SRTs in question are being done at the college level, so my comment, being about public middle school, won't be exactly as relevant. However, I did want to say that Sally's dad taught middle school for 8 years. If SRTs were done at this grade level, the highest-rated teachers would be the one(s) that the students "like" the best, not the ones they respect or think they have learned the most from. I would hope that college students aren't so shallow, but I'm sure personality and popularity make an impact on an SRT. How do you separate the student who tries to do a fair assessment from the student who doesn't really care and/or uses the assessment to "get back" at a teacher they didn't like for whatever reason?

Tam said...

I appreciated this blog post very much. I do have one question. When you write, "They list the following factors that have been found to not be related," do you really mean they were found not to be related, or that they were simply not found to be related? Or is there a difference between those two statements?

Sally said...

Tam, I should have said there has been no significant relationship found between these factors and SRTs.

I haven't looked at the source material, so it's possible that they simply did not have sufficient power to detect differences that existed (and even if I did, it's not likely that I would be able to tell that from the papers), but I think the point is that they failed to reject the hypothesis that there is no relationship between the factor and SRTs.
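
For what it's worth, here is a rough illustration of that power problem with numbers I made up (a true correlation of 0.2 and a study of 40 classes, neither taken from the papers):

```python
# Made-up numbers: a true correlation of 0.2 between some factor and SRTs,
# studied across only 40 classes. How often does such a study reach p < .05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_r, n, trials = 0.2, 40, 5000

significant = 0
for _ in range(trials):
    x = rng.normal(size=n)
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    if stats.pearsonr(x, y)[1] < 0.05:
        significant += 1

print(f"power ~ {significant / trials:.2f}")
# Roughly 0.2-0.25: most such studies would report "no significant relationship"
# even though a real (if modest) relationship is there.
```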

Mom, I think junior high teachers should feel fortunate not to have SRTs as part of their performance review process.