Amanda Ripley has a thoroughly (if inadvertently) depressing story in the new Atlantic about the rise of evaluating teachers by means of multiple-choice surveys given to their students. She says the idea is “revolutionary”:
Research had shown something remarkable: if you asked kids the right questions, they could identify, with uncanny accuracy, their most—and least—effective teachers.
This is probably true, although just how revolutionary it is remains to be seen. These surveys really are a great tool for judging which teachers are the most effective and which are the least, across various axes. But from reading Ripley’s rah-rah article, it seems very much as though they’re being used for precisely the wrong reasons, and barely used at all for the right ones.
It’s impossible to argue with the assertion that the quality of a child’s education rises with the quality of the teacher — just as it’s impossible to argue with the assertion that kids can be very good judges of how good their teachers are. But the important thing here is how these tests are used: are they used by teachers to improve the instruction they’re giving kids, or are they used by managers to come up with yet another key performance indicator to impose on the teachers under their charge?
One way to answer that question is to look at the questions which Ripley isolates as being particularly informative.
Of the 36 items included in the Gates Foundation study, the five that most correlated with student learning were very straightforward:
1. Students in this class treat the teacher with respect.
2. My classmates behave the way my teacher wants them to.
3. Our class stays busy and doesn’t waste time.
4. In this class, we learn a lot almost every day.
5. In this class, we learn to correct our mistakes.
When Ferguson and Kane shared these five statements at conferences, teachers were surprised. They had typically thought it most important to care about kids, but what mattered more, according to the study, was whether teachers had control over the classroom and made it a challenging place to be.
You see what Ripley did there? Measuring how well children are being educated is an astonishingly difficult job. Increasingly, these days, and especially since No Child Left Behind, we’re using test scores as a proxy for quality of education. Everybody agrees that it’s a poor proxy, although there’s disagreement about exactly how poor.
In any case, along comes the Gates Foundation with a 36-question survey, severely chopped from a much longer one developed by Ronald Ferguson. Since there are 36 questions, the survey essentially measures teachers along 36 different axes, all of which are aligned with each other to differing degrees. In and of itself, that’s more useful than just measuring test scores, which are much less teacher-specific and which only provide one axis of educational quality.
But then what do the reformers do? They regress the survey answers against test scores, look at which survey questions align most closely with that test-score axis, and declare that those axes — the ones which test scores, by definition, are already measuring — must be the “most important”. Did you think that caring about kids was of paramount importance? Silly you! It turns out that caring about kids isn’t as correlated with test-score results as, say, whether the class learns to correct its mistakes. And therefore, we shouldn’t be worrying as much about whether teachers care about their kids; we should be worrying more about other things, instead. That’s what the test-score regressions tell us, so it must be true!
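The selection procedure can be sketched in a few lines. This is a toy illustration with made-up data, not the Gates Foundation’s actual methodology: I construct scores so that, by design, they track only one survey question, and then show that ranking questions by their correlation with those scores simply rediscovers whatever the scores already measure.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

random.seed(0)
n_students, n_questions = 200, 36

# Hypothetical data: each student answers 36 survey questions on a
# 1-5 scale, and has a test score that (by construction) depends only
# on question 0, plus noise.
answers = [[random.randint(1, 5) for _ in range(n_questions)]
           for _ in range(n_students)]
scores = [row[0] * 10 + random.gauss(0, 5) for row in answers]

# The procedure Ripley describes: rank the questions by how well they
# correlate with test scores, and keep only the top five as the ones
# that "matter most".
corrs = [pearson([row[q] for row in answers], scores)
         for q in range(n_questions)]
top5 = sorted(range(n_questions), key=lambda q: corrs[q], reverse=True)[:5]

print(top5[0])  # question 0 wins, by construction
```

The regression can only ever “discover” the axis the test scores were already measuring; the other 35 axes of information are thrown away regardless of what they might tell you.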
This reminds me a bit of the way in which investment banks are prone to taking tens of thousands of risk positions, reducing them all to a single value-at-risk number, and then using their VaR way too much, despite the fact that it’s of only limited utility. Except in this case it’s not even all of the survey answers which get used: most of them are simply discarded.
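The VaR analogy is easy to make concrete. Here’s a toy sketch with invented numbers: a “lumpy” book with rare catastrophic losses reports a perfectly modest one-number 95% VaR, because everything past the fifth percentile is simply invisible to the summary.

```python
import random

random.seed(1)

def var_95(pnl):
    """One-number summary: the 95% value-at-risk (5th-percentile loss)."""
    return -sorted(pnl)[int(0.05 * len(pnl))]

# An invented P&L distribution: modest gains 99% of the time, with a
# 1% chance of a catastrophic -500 loss on any given day.
lumpy = [random.gauss(2, 5) if random.random() > 0.01 else -500.0
         for _ in range(10_000)]

# The 1% tail sits beyond the 5th percentile, so the single VaR number
# never sees it...
print(round(var_95(lumpy)))

# ...even though the worst outcomes are an order of magnitude bigger.
print(min(lumpy))
```

Reducing tens of thousands of positions to one percentile is exactly the kind of compression being applied to the surveys — except that with the surveys, most of the underlying numbers aren’t even summarized, just discarded.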
If Ripley and the Gates Foundation wanted to find a new and powerful and rich way of measuring how effective teachers are, they would use all the information at their disposal, and then they would underweight the answers to the questions most correlated with test scores. After all, test scores are already given far too much weight, for lack of other measures to look at. Instead, they do the exact opposite, and use the surveys to double down on the inherently flawed idea that test scores are a good proxy for educational prowess.
Now if that was all they did, it would feel a bit like a wasted opportunity. But it gets so much worse. Check out this theme running through Ripley’s piece:
Should teachers be paid, trained, or dismissed based in part on what children say about them? … This past school year, Memphis became the first school system in the country to tie survey results to teachers’ annual reviews; surveys counted for 5 percent of a teacher’s evaluation. And that proportion may go up in the future… The New Teacher Project, a national nonprofit based in Brooklyn that recruits and trains new teachers, last school year used student surveys to evaluate 460 of its 1,006 teachers… In Georgia, principals will consider student survey responses when they evaluate teachers this school year. In Chicago, starting in the fall of 2013, student survey results will count for 10 percent of a teacher’s evaluation… On average over the past decade, only a third of teachers even clicked on the link sent to their e-mail inboxes to see the results. Presumably, more would click if the results affected their pay… This school year, Washington, D.C., will make the survey available to all principals and teachers who want to use it. Chancellor Kaya Henderson says that next year, the survey may count toward teacher pay and firing decisions.
No! Stop! Do none of these people get it? What everybody wants, here, is better teachers. These surveys could be instrumental in helping to improve teaching. Teachers would be able to see where they score well and where they score badly, and ask themselves how to improve their scores in areas where they are weak. Principals could see which teachers were good on which axes, and set classes up so that students ended up with a balanced range of teachers. And generally, everybody could treat this data as an interesting and very rich way of improving educational outcomes.
Instead, reformers are rushing to use this data as a quantitative performance-review tool, something which can get you a raise or which can even get you fired. And by so doing, they’re turning it from something potentially extremely useful, into a bone of contention between teachers and managers, and a metric to be gamed and maximized. Check out Ripley’s language here:
The variation within the school was staggering—as it is in many places. In the categories of Control and Challenge—the areas that matter most to student learning—Nubia and her classmates gave different teachers wildly different reviews. For Control, which reflects how busy and well-behaved students are in a given classroom, teachers’ scores ranged from 16 to 90 percent favorable; for Challenge, the range stretched from 18 to 88 percent. Some teachers were clearly respected for their ability to explain complex material or keep students on task, while others seemed to be boring their students to death.
The first thing Ripley does here is throw out nearly all of the rich survey data, in her attempt to boil everything down to one or two simple numbers per teacher. She concentrates on things she calls Control and Challenge, and declares in an omniscient tone that these areas “matter most to student learning”. She then gives each of those metrics a neatly rankable percentage, so that any school can point easily to which teachers are the Best in Control (“clearly respected for their ability to keep students on task”), or Worst in Challenge (“boring their students to death”).
You can see how this might not go down very well with teachers, who are meant to be working as a group to broadly educate a cohort of children, but instead are being isolated and compared against each other, with potentially career-ending consequences for those who score low. The minute that the scores start being used in that way, the teachers understand what’s really going on here, and they resent it. What’s more, they do so for good reason: the more that an enormous quantity of complex data is reduced to a couple of performance-review datapoints, the less useful that data becomes.
School reformers in general, it seems to me, tend to be obsessed with the idea of Good Teachers and Bad Teachers, as though the quality of the education a kid gets in any given classroom is somehow both predictable and innate to the teacher. And yes, at the extremes, there are a few great inspirational teachers who we all remember decades later, and a few dreadful ones who had no idea what they were talking about and who had no control of their classes. But frankly, you don’t need student surveys to identify those outliers. And the fact is that schools are much more than just the sum of their parts: that’s one of the reasons that reformers love to talk about excellent principals who can turn schools around.
The trick to improving education is to make schools better, not to find ever-more-cunning ways to reward and punish teachers. Especially when there’s no evidence whatsoever that such reward-and-punishment schemes actually make those teachers better educators, rather than simply resentful. There’s a reason why certain schools develop a reputation for excellence which can last for centuries: there’s something institutional going on, a virtuous circle which lifts up everybody. Making education granular — isolating not only certain teachers but even certain aspects of how those teachers teach — is a classic example of missing the forest for the trees, for no good reason.
I don’t doubt that student surveys could, in theory, be very useful in the large task facing administrators and teachers — how to make schools better and improve the quality of the education they provide. They would show where schools were weak and where they were strong; which teachers have managed to crack certain nuts where the rest of the faculty is having difficulty; that kind of thing. In short, they could be tools for diagnosing and improving the quality of a school’s education as a whole.
But the reformers rush straight past all that, and decide that the first-best use of such data is to use it in performance reviews, and use it to give raises to good teachers and pink slips to bad ones. And, of course, the minute you start doing that, it becomes impossible to use the data for anything else, since the scores then become an end in themselves, rather than a means to an end.
The toothpaste is out of the tube, now: I frankly can’t see how anybody is going to be able to use these surveys effectively at this point, now that they’re associated in teachers’ minds with performance evaluations and disciplinary procedures. This is the bit that reformers seem to have a great deal of difficulty understanding: it’s incredibly difficult to improve the quality of teachers just by promising to pay them more if certain numbers are high, or by threatening to fire them if certain numbers are low. Student surveys, as originally conceived by Ronald Ferguson, could have been a great tool for improving the quality of education. But at this point, I fear, it’s too late.