Why is it that student surveys are more accurate in determining the quality of teaching than classroom observations? This is a question raised by the study on teacher evaluations released last week by the Bill & Melinda Gates Foundation’s Measures of Effective Teaching project. And the likely answer should give school reformers pause in pushing for the so-called multiple measures approach to teacher performance management.

As I noted in this week’s Dropout Nation Podcast, how to evaluate and manage the performance of teachers has been one of the most contentious discussions in the battle over reforming American public education. School reformers have largely won the battle over the use of value-added analysis of student test data in teacher evaluations, and, thanks in part to the Obama administration’s Race to the Top initiative, 32 states are on the path to using it. But despite the evidence that value-added is the most objective and most accurate tool for evaluating teacher performance, education traditionalists have successfully weakened its use (and convinced many reformers to go along) by advocating for a “multiple measures” approach that prominently features the traditional (and largely subjective and inaccurate) practice of classroom observations.

But the MET report, which touts multiple measures, provides more reasons why such an approach is ineffective. For one, it pointed out that surveys of students were more accurate in evaluating teacher performance than classroom observations. In fact, the student surveys (in this case, the Tripod student perception survey developed by Harvard’s Ronald Ferguson and Cambridge Education) were so accurate in evaluating teacher performance that they were almost as good as value-added analysis of student test score data. For example, the reliability of one classroom evaluation was less than half of a standard deviation in math and a fifth of a standard deviation in reading. On the other hand, the reliability of student surveys was two-thirds of a standard deviation, almost as high as the rather significant seven-tenths of a standard deviation for value-added data, the most accurate of the tools for measuring student and teacher progress.

Meanwhile, the multiple measures approach (in this case, combining value-added data with classroom observations and student surveys) watered down the reliability and usefulness of evaluations. When classroom observations and student surveys are weighted equally alongside test scores, the accuracy of the overall evaluation declines by as much as a twelfth of a standard deviation. Even when value-added data and student surveys are given greater weight, accounting for 72.9 percent and 17.2 percent of the evaluation, respectively, in one model, the classroom observations are of such low quality that they bring down the value of the overall performance review.
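
To make the weighting concrete, here is a minimal sketch in Python of how a composite evaluation score might be put together from the three measures. It is an illustration only, not the MET project's actual method: the function name, the teacher's scores, and the 9.9 percent weight for classroom observations (simply what is left over after 72.9 percent and 17.2 percent) are all assumptions made for the example.

```python
def composite_score(scores, weights):
    """Weighted average of a teacher's standardized scores on each measure.

    Hypothetical helper for illustration; not taken from the MET report.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same measures")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[m] * weights[m] for m in weights) / total


# Made-up standardized scores for one teacher (not data from the study).
teacher = {"value_added": 0.50, "student_survey": 0.40, "observation": -0.20}

# Every measure counts the same.
equal_weights = {"value_added": 1.0, "student_survey": 1.0, "observation": 1.0}

# 72.9 and 17.2 percent come from the report as described above;
# the 9.9 percent for observations is an assumed remainder.
unequal_weights = {"value_added": 0.729, "student_survey": 0.172, "observation": 0.099}

equal = composite_score(teacher, equal_weights)
unequal = composite_score(teacher, unequal_weights)
```

The sketch only shows how much each measure moves the final score under a given set of weights. It says nothing about reliability, which depends on how noisy each measure is from one year or one classroom to the next.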

Certainly, subjective bias and the inability to measure the most important (and unobservable) aspects of teacher quality are among the reasons why classroom observations are far less useful on their own. But, at first glance, it wouldn’t make sense that student surveys do a better job of assessing teacher quality. After all, children aren’t exactly knowledgeable about what is involved in high-quality teaching, and, in theory, shouldn’t be able to get a handle on subject-matter competency. Let’s also be clear that not every student survey is likely to match up to Tripod’s level of accuracy.

Yet kids may be really good at assessing some important things. As Ellen Galinsky noted in Mind in the Making, her 2010 book on child development, babies are incredibly good at determining which adults are helpful in their lives and which adults aren’t worthy of their time. And anyone who has spent time with a preschool-aged niece or nephew has seen how quickly they can catch on to what is happening in the world around them. If this is true for those kids, then school-aged children should have a good idea of what the adults around them do (or don’t do).

It will take more empirical studies to show whether this is true, but your editor surmises that the reason lies with what I call the Familiarity Breeds Hypothesis. The more time you spend with someone, the more you know about their particular personality traits, skills, quirks, and even eating habits. Such familiarity breeds admiration, respect, contempt, or indifference. This familiarity is what leads a couple either to go from simply dating to becoming husband and wife or to break up within a year, and why colleagues can go from simply working with one another to becoming best buddies or to hating each other intensely. It is also why your spouse likely knows you better than you know yourself. And if this is true in relationships outside of education, why wouldn’t it be true when it comes to relationships between teachers and students?

Consider this: Depending on whether they are in elementary, middle, or high school, the average student will spend at least five hours a week with a given teacher, and as much as 35 hours a week in elementary school. Essentially, a student spends more time with his or her teacher than either the instructor’s colleagues or principal do, and definitely more time than even the most impartial evaluator. Given the amount of time students spend with a teacher, they likely have a better grasp of the teacher’s skillset than any of the adults in a school, even if they don’t actually understand such concepts as subject-matter competency and instructional method. They also have a better sense of how much a teacher truly cares about them and is willing to build the kind of long-lasting connections that can improve student success.

This is likely to be especially true for two particular groups of students: at-risk kids struggling with reading and other achievement gaps, and top-performers ahead of the class. After all, both sets of students are looking for help (even when they don’t know how to verbalize it) but in different ways. At-risk students want a caring teacher with strong subject-matter competency who can help them address their struggles in their subjects. Top-performers want an instructor who both cares for them and is more knowledgeable than they are, so that they get the challenge they need to continue their success. How a teacher comes off to both groups of students matters because if she can do well by these students, she will also do well by the rest of the class.

If the struggling student senses that a teacher cares for them and can help them improve, that teacher will gain high levels of esteem. If not, the struggling student won’t have much respect for the teacher and will not rate the teacher highly. This is also likely to be true for a top-performing peer. If the teacher lacks subject-matter competency and, in fact, seems to know less than that particular student, then the kid will tune out the teacher altogether.

While kids have intense, up-close dealings with their teachers, the average evaluator isn’t likely to be so familiar and knowledgeable. In traditional classroom observations, a harried principal will only get to spend one pre-arranged hour in the school year with a teacher (who already knows they will be evaluated and, thus, will have rehearsed for the appearance). Even under the better-developed and more rigorous classroom observations used by MET, the observer only gets to watch what a teacher does four times during a school year. Unless the teacher is completely inept, she can put on a good enough show to fool that observer for those times.

The observers may get to notice how a teacher cares for and empathizes with the kids in her care, especially in how she talks to a student of a different racial or economic background, or even see how a teacher reduces the amount of time she spends talking and increases the time students spend answering questions and otherwise engaging in learning. But they will miss out on some aspects of the instructor’s work because they’re not going to spend enough time in a classroom to get a full picture of what the teacher does. And this is on top of the subjective biases inherent in observations, and the fact that no observation will be able to measure the most important aspect of teacher quality: whether a teacher has improved a child’s progress and whether that child is making gains in knowledge.

Again, the Familiarity Breeds Hypothesis is exactly that: a hypothesis. But if it bears out empirically, then it raises some real questions about how to improve classroom observations and, more importantly, whether American public education should still bother with them at all.

One possible way to improve classroom observations may start with immersing an observer in the classroom of a particular teacher. This means an observer would remain in class for long periods of time, say two months, in order to get a full sense of an instructor’s performance. Why? Because after a while, a teacher would stop putting on a show and actually do what they really do in classrooms because the observer becomes just like everyone else in the class. This, in turn, allows an observer to really see and analyze what a teacher is doing. But such extensive observations may be too costly for districts to implement and, given that many veteran teachers prefer to work solo, aren’t likely to be welcomed. They could also lead to even more subjective biases that would make evaluations less useful in measuring and even improving performance.

In an ideal world, your editor would simply use value-added data as the sole measure in evaluations, and this would be possible if we did a better job of recruiting and training aspiring teachers long before they begin work in classrooms. Better recruiting and training would screen out those who lack subject-matter competency, empathy for children, and entrepreneurial drive. But there would still be a need to measure observable aspects of teacher quality. The better solution may be to move away from classroom observations altogether and take a different approach to multiple measures. Value-added data from standardized tests would be the biggest component, but student surveys would also be included because of their accuracy. Adding results from formative assessments, which can show how a teacher has addressed particular student issues over the course of a school year, would also make sense.

As I noted in this week’s Podcast, evaluations should also take note of how teachers work with parents, especially as Parent Trigger laws, along with the expansion of school choice, help foment the Parent Power movement. And another measure should deal with how teachers, working on their own or in collaboration with their colleagues, work to improve student achievement and develop innovative ways of helping kids succeed. A high-quality teacher should definitely be rewarded for developing a new instructional specialty, the same way doctors are now specialists in particular areas of medicine.

Ultimately, it may be the students, rather than the adults, who point the way towards improving the quality of teachers who work in our schools.