One of the most important activities at any university is the instruction of students in formal classes. Teaching performance is a key focus of tenure and promotion decisions, and a key criterion in the renewal of contracts for non-tenure-line and adjunct faculty. Moreover, among its peers, Georgetown is unusually serious about valuing good instructional performance.
Measuring the success of instruction is difficult work. Few courses pre-measure the knowledge state of a student, so the marginal increase in understanding or knowledge attributable to the instructor’s class cannot be easily measured. The vast majority of courses are unique compilations of materials and pedagogy. Examinations in two different courses do not lend themselves to easy equivalency; getting 100% correct on one examination may not be equivalent to the same performance on the examination of another course. So grades are usually not viewed as useful performance metrics for instructors.
Some units schedule classroom visits by senior faculty who draft a report on the observed performance of the instructor. These have the value of being based on real instructional behavior. They have the weaknesses of limited exposure to variation in performance over classes and “observer effects” that may induce unusual behavior in the instructor because of the very fact that they’re being observed.
The staple and ubiquitous measure of instruction, used on all campuses I’ve encountered, is the student evaluation. These measurement devices ask the student to evaluate various aspects of the course, the pedagogy, and the instructor. Five-point scales are the common tool, with “5” usually meaning the most positive rating and “1” the lowest.
Over the summer, our new program analytics team in the Office of the Provost assembled five years of student evaluation data, representing over 22,000 different courses taught at Georgetown from 2009 to 2014. A key item I examined was the overall rating of the instructor. Overall, Georgetown faculty attain ratings that hover around 4.5 on average, close to the top rating of 5.
I was less interested in the averages than in what attributes of a course and instructor explained variation in scores. Ideally, the overall rating of the instructor would be a pure measure of how well he/she performed in the class.
We’ve only just begun the analysis. We first asked whether we could find attributes of the course that were correlated with the student scores but should not necessarily affect the performance of a faculty member.
One of the results we found was that small classes generated higher evaluations of the instructor than large classes. We found that courses that had labs or recitation sessions tended to receive lower evaluations. We found that courses with high mean grades usually generated higher evaluations of the instructor. We found that students taking courses as a requirement for their degrees (versus as an elective) tended to give lower evaluations of the instructor.
It’s important not to leap to causal conclusions. It could be that the best instructors are assigned to teach small, elective courses without labs or recitations. It could even be possible that the best instructors tend to give high grades because they so effectively achieve the learning outcomes of the course. We could also phrase it the opposite way; it could be that the worst instructors are assigned to large, required classes with labs or recitations and that they tend to give lower grades. The data themselves don’t provide refutations to those alternative interpretations. It’s worth doing more analyses to check alternative explanations. We’re starting to do that now.
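To make the distinction between correlation and cause concrete, here is a minimal sketch of the kind of calculation involved. The class sizes and ratings are hypothetical values invented for illustration, not Georgetown data, and the `pearson` helper is an assumed name rather than anything the analytics team used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical course-level data, invented for illustration only.
class_sizes = [12, 18, 25, 40, 80, 150]
ratings = [4.8, 4.7, 4.6, 4.5, 4.3, 4.1]

r = pearson(class_sizes, ratings)
print(f"correlation between class size and rating: {r:.2f}")
```

A negative coefficient here would only say that larger classes go along with lower ratings in this sample. It would look the same whether class size lowers ratings or whether stronger instructors are simply assigned to smaller classes.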
Given that student evaluations play an important part in merit review, tenure, and promotion decisions, we want to learn more about the measurement properties of student evaluations.
We’ll seek faculty and student advice on future analyses to get closer to the truth. At the same time, we’ll engage in discussions about whether there are better ways to evaluate instructional performance.
This reply is about a month late, but the following article should be relevant and of interest.
http://www.stat.berkeley.edu/~stark/Preprints/evaluations14.pdf
Julia Lamm, Theology
You wrote: “A key item I examined was the overall rating of the instructor. Overall, Georgetown faculty attain ratings that hover around 4.5 on average, close to the top rating of 5.”
Exactly how was this number, 4.5, arrived at? It is vitally important to know the process, not just the outcome. So I would welcome a specific answer to this question before I will accept the number as being valid.
P.S. What is the Standard Deviation associated with the 4.5 average? Clearly, an average conveys far less information alone than it does when accompanied by the standard deviation. We cannot know how representative the average is unless we have information about the standard deviation.
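A small sketch of why the standard deviation matters alongside the mean. The two sets of ratings below are hypothetical, invented for illustration; both average 4.5 but describe very different classes. The `summarize` helper is an assumed name.

```python
import statistics

def summarize(ratings):
    """Return (mean, sample standard deviation) for a list of 1-5 ratings."""
    return statistics.mean(ratings), statistics.stdev(ratings)

# Hypothetical ratings, invented for illustration; both sets average 4.5.
tight = [4, 5, 4, 5, 4, 5, 4, 5]    # every student is fairly satisfied
spread = [5, 5, 5, 5, 5, 5, 5, 1]   # one student is very dissatisfied

for label, ratings in (("tight", tight), ("spread", spread)):
    mean, sd = summarize(ratings)
    print(f"{label}: mean={mean:.2f}, sd={sd:.2f}")
```

The same average can hide a single strongly negative response, which is why the spread is worth reporting with it.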
As a student, I’ve often found that the numeric values I assign are less indicative of the course’s value than the comments I leave on the form. Many of my peers feel similarly. Perhaps looking at a way to include non-quantitative data in a larger review (as difficult as that is) would help provide better answers to some of these questions.
In addition to making the evaluations mandatory for all students, I think the comments should be mandatory as well. That type of feedback is much more useful than a simple score and any college student should be able to provide some sort of written feedback on their experience.
Michael Donnay’s comment on the meaning and value of qualitative information is right on point–there is clearly a place for the narrative information in evaluating what a professor does with and for students. By now it ought to be obvious that not everything that counts can be counted and not everything that can be counted counts. There is a huge difference between knowing “that” (a numeric score) and knowing “why” (a narrative). That is why these two approaches are complementary to one another. Neither can be used alone satisfactorily and I applaud Michael for bringing this point to the fore. How wonderful that a student would shed light on this important matter!
Good idea to require an evaluation to access your grade. A higher number of responses gives a better sample to look at.
Larger classes generate worse average evaluations for the simple reason that larger classes are a worse form of instruction. There is a fundamental disconnect between instructor evaluations and course evaluations that is more than mere semantics. The data collected here is almost certainly not about the quality of the instructor teaching the course so much as the course itself. Large classes such as the torturous lower-level econ courses are only marginally more educational than simply reading the textbook, or asking professors to email out transcripts of their lectures. When students become invisible in a room with 200+ faces, it should not be surprising that they are less impressed with the instructor and give poorer reviews. From my experience as a student, these huge classes have subpar TAs, often with no teaching experience, and inaccessible professors with short, sporadic, and uncomfortable office hours. Meanwhile, small courses have accessible professors, excellent TAs, and useful classroom instruction. Obviously, always providing small classes is impossible, but the TA/recitation scenario could be improved. The best TAs and adjuncts should be hired for the larger courses to compensate for the less useful class time. Instead of deciding that, since introductory economics or trade courses are ‘easy’, the TAs don’t need to be as thoroughly vetted, those jobs should be more difficult to get and more focused on the teaching and explanatory skills of the TA than merely on his/her grasp of the material in the course. Basically, large classes are getting lower reviews because they provide lower quality instruction during class periods and make little sincere effort to compensate for that outside the classroom.
I was interested in Abigail’s comments related to the poor response rate for student evaluations. In my experience, there have been some semesters in which fewer than 15% of enrolled students have submitted their course evaluation. I know of some top universities that tie posting of course grades to completion and submission of a course evaluation. I believe it would be worthwhile to assess the pros and cons of putting such a system in place at GU.
As this evaluation issue is being examined, I believe it would also be interesting to see what correlation, if any, exists between students’ course grades and the course evaluations that they submit.
I’d suggest a more preliminary step, which is to look at the actual evaluation form itself. To me, it seems like only an evaluation of the professor, even though it is called a “course evaluation”. I would find an evaluation form that comprehensively covers the course much more helpful in improving my course for the future. The current form is suited mostly to evaluating one way of teaching: top-down, professor/lecture style. Also, I’ve used CNDLS to do a mid-semester evaluation; it is helpful.
I like the mid-course correction idea for many reasons: giving feedback to reassess how things are going, empowering students during the course, etc. Good thoughts.
Student evaluations are a bit like capitalism … imperfect, but presumably the best system so far invented. I think Abigail is spot on: partial response rates are a big problem. We are obligated to read them, so why shouldn’t students be obligated to submit them? Successful courses are of course successful partnerships between faculty and students, not autocracy. Other modifications to student evaluations might also help, such as adding evaluations halfway through the semester (some profs already do this on their own). Key questions, such as those related to whether the goals of the course are being met, could then be addressed when there is still time to do something about any deficiencies (the good bits that are working well could also be expanded further), and comparing evaluation statistics generated midway vs. at the end of a semester can be more informative in some cases. Overall, actively engaging students in enhancing a course (during the course, not after) has many benefits.
Cheers
Paul
Thanks for taking up another crucially important issue. Two possibly causal, but sensitive, factors to consider are the relationships between student evaluations and (1) awarded grades and (2) subject matter difficulty.
Another metric to consider, perhaps, is the research productivity or scholarly output of the instructor. Many (but not all) professors who are particularly successful in their research tend to gravitate towards smaller teaching assignments because there are only so many hours in the day (more success in research often, but not always, necessitates more time spent on research). It is a difficult thing to quantify, but the added enthusiasm for a topic that highly successful researchers tend to have may also raise student evaluations.
The quantification of learning through big data is an interesting and quite active area. Some recent readings have pointed me toward the relatively nascent Society for Learning Analytics ( http://solaresearch.org ) and the International Educational Data Mining Society ( http://www.educationaldatamining.org ). Perhaps some initial methods that go beyond the highly subjective student evaluation to the actual measurement of learning could be found there? This is no easy task, but I think it would help both the individual and institutional domain.
Interesting thoughts on your data about evaluation of faculty. You also seem to address some of the possible interpretations of the data in various ways. One of the problems in looking at all this data is that what may be most important, and not very scorable, might be similar to what we are taught early in medicine: you sometimes recognize things simply because of what they are. For example, when you see a grandmother, you recognize a grandmother. Maybe that could be quantified. When you see a great teacher, you recognize that, but it may be hard to put that into an evaluation. We can only try. A similar saying in medicine is that if it looks like a duck and quacks like a duck, it’s probably a duck. So if a great teacher seems to be a great teacher, they probably ARE a great teacher, even though that might be hard to prove or analyze. I guess, though, that we have to try to quantify such things the best we can, all the while realizing that a great teacher is a great teacher and that this can’t always be quantified. Good luck in the quest!
One factor I did not see mentioned was the percentage of students in the class who submit an evaluation, which is rarely 100% and frequently much lower. This creates the risk of major sampling-related biases. Given that evaluations do play an important role in merit review, tenure, and promotion decisions, and given that we as an institution profess to take evaluations, and instruction in general, seriously, I think we should require students to submit evaluations rather than leaving it optional. This is the practice at other institutions (e.g., Stanford).
It’s especially important to evaluate the performance of adjuncts, who are critically important in SFS but who may have little or no teaching experience. STIA tries to have a faculty member visit each adjunct each semester, evaluate their performance, and advise them on how to improve their teaching.
It seems that comparing ratings of the same faculty members (within-faculty-member analysis) across the different classes they teach can provide some information here. If the same faculty members consistently get rated higher for their smaller classes than for their larger ones (required, discussion-sectioned, etc.), the class characteristics are the likely culprit. Anecdotally, this is exactly what is typically seen.
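The within-instructor comparison suggested above could be sketched roughly as follows. The records, the 30-student cutoff, and the function name `within_instructor_gaps` are all assumptions made for illustration; real evaluation data would need its own definitions of "small" and "large".

```python
from collections import defaultdict
from statistics import mean

def within_instructor_gaps(records, size_cutoff=30):
    """For each instructor who taught both small and large classes, return
    mean small-class rating minus mean large-class rating.

    records: iterable of (instructor, class_size, rating) tuples.
    size_cutoff: hypothetical boundary; classes below it count as small.
    """
    by_instructor = defaultdict(lambda: {"small": [], "large": []})
    for instructor, class_size, rating in records:
        bucket = "small" if class_size < size_cutoff else "large"
        by_instructor[instructor][bucket].append(rating)

    gaps = {}
    for instructor, buckets in by_instructor.items():
        # Only instructors seen in both settings allow a like-for-like comparison.
        if buckets["small"] and buckets["large"]:
            gaps[instructor] = mean(buckets["small"]) - mean(buckets["large"])
    return gaps

# Hypothetical records, invented for illustration only.
records = [
    ("A", 15, 4.8), ("A", 120, 4.2),
    ("B", 20, 4.6), ("B", 200, 4.1),
    ("C", 18, 4.7),  # taught only small classes, so excluded from the comparison
]
gaps = within_instructor_gaps(records)
print(gaps)
```

If most gaps are positive, the same people are rated higher in their smaller classes, which points toward class characteristics rather than instructor quality. It still would not separate class size from other features that travel with it, such as required status or recitation sections.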