Classroom Observations: Rating the Raters
This is the second of two blog posts about two new studies from AIR researchers and collaborators on the use of classroom observations for teacher evaluation.
Most press coverage about new teacher evaluation systems focuses on student growth (or value-added) measures based on student test scores. But even in districts that use such measures, a teacher’s performance appraisal still depends largely on classroom observations.
In May, the Brown Center on Education Policy at The Brookings Institution released a report presenting new concerns about the classroom observation component of most teacher evaluation systems. In particular, the authors note, “observations conducted by evaluators from outside the building have higher predictive power for value-added scores in the next year than those done by administrators in the building.” The authors attributed this difference to principals’ bias and suggested that evaluation systems either need “unbiased observers from outside the building as a validity check on principal observations, or … training and reliability checks on the principal or other in-building observers.”
My team's latest study supports these suggestions, though our recommendations rest on different evidence.
My colleagues and I analyzed teacher observation data in a large, suburban school district. Twenty-eight principals and 10 peers (other teachers) observed the same teachers, at the same time, in the same classrooms. The principals conducted observations only in their own schools. The peer observers worked across schools and didn’t know the teachers.
Peer observers received observer training; principals did not. (Most of the principals had been conducting in-class observations for years.) The district wanted to know whether peer observers, on average, were more or less lenient than principals when rating other teachers.
Our study showed the district two interesting results:
1. Principals, on average, were more lenient than peer observers when rating teachers. This might suggest that principals want to keep their teachers happy. Alternatively, principals may have drawn on additional information when rating what they saw, rather than rating teachers solely on the instructional practices they observed.
2. Although principals were more lenient on average, there was a wide range among the 28 principals. The eight most lenient raters were principals, and five of the eight least lenient (or most critical) raters were also principals. This might suggest that principals used the rating levels of the observation rubric less consistently than peer observers did.
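To make the distinction between these two findings concrete, here is a minimal sketch of how a leniency comparison like this can be computed from paired ratings. The data below are invented for illustration only; they are not the study's data, and the column structure is an assumption about how such paired-observation records might look.

```python
# Hypothetical sketch of a rater-leniency comparison.
# The ratings are invented for illustration; they are NOT the study's data.
from statistics import mean

# Each entry: (rater_type, rater_score, co_observer_score) for the same lesson,
# observed at the same time in the same classroom.
paired_ratings = [
    ("principal", 4, 3), ("principal", 3, 3), ("principal", 4, 2),
    ("principal", 2, 3),  # some principals can be stricter than peers
    ("peer", 3, 3), ("peer", 2, 3), ("peer", 3, 4),
]

def leniency(rows, rater_type):
    """Mean difference (rater score minus co-observer score) for one rater type.

    A positive value means that rater type scored the same lessons higher,
    on average, than the observers watching alongside them.
    """
    diffs = [rater - co for kind, rater, co in rows if kind == rater_type]
    return mean(diffs)

print(leniency(paired_ratings, "principal"))  # positive: more lenient on average
print(leniency(paired_ratings, "peer"))
```

Note that an average like this captures only the first finding; the second finding (the wide rater-to-rater spread among principals) would require computing a leniency value per individual rater and examining its range, not just the group mean.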
These findings suggest that, over a year of observations, some teachers may have benefited from being assigned to a school with a more lenient principal, while others may have been disadvantaged by assignment to a school with a less lenient one.
It’s reasonable for principals to rate teacher classroom performance; principals must know what’s going on in their schools’ classrooms. But given the emerging research and the rising stakes around teacher evaluations, it’s also clear that peer evaluators and principals need careful training in advance and a system to check or calibrate their results as they rate teachers through classroom observations.
David Manzeske is a senior researcher at AIR. He specializes in research on educator effectiveness.