Sunday, January 20, 2013

Last week's Gates Report

So the Gates Foundation released a report last week about measuring teacher quality.  Their analysis took several unusual steps:

  1. First, they took three sources of data--observational evaluations, value-added metrics applied to prior years, and student evaluations of teachers--and considered which combination of them was more predictive of teacher success.
  2. Second, they measured teacher success using a value-added metric applied to randomized classes.  That is, teachers were compared based on their success (or failure) with students who had been randomly assigned to their classes;  by contrast, many elementary schools exert some effort in putting kids with the "right" teachers.  The Gates study attempted to minimize selection bias in this way.
  3. Third, they considered other criteria for an evaluation system's success besides its predictive power on state tests:  year-to-year stability and student performance on tasks that they euphemistically described as "higher-order" (i.e., tasks that involved actual thought, not just rote facts and skill application).
The findings were pretty striking:
  • The VAM was predictive of students' future success in testing; in fact, of the four weighting systems they studied, the one that was most predictive of students' test scores was the one that weighted VAM most highly.


  • The equal-weights scheme produced a reasonably high correlation with state test gains without sacrificing as much on the higher-order tests or on reliability.
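For anyone who wants to see what "comparing weighting schemes" amounts to mechanically, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the scheme names, the weights, and the synthetic teacher data are invented, not taken from the report, so the correlations it prints say nothing about the report's actual numbers. It only shows the shape of the calculation: build a weighted composite of the three measures for each teacher, then correlate that composite with a later outcome.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def composite(teacher, weights):
    """Weighted sum of a teacher's measures under one weighting scheme."""
    return sum(weights[name] * teacher[name] for name in weights)


# Hypothetical weighting schemes (NOT the weights used in the report).
SCHEMES = {
    "vam_heavy": {"prior_vam": 0.8, "observation": 0.1, "survey": 0.1},
    "equal": {"prior_vam": 1 / 3, "observation": 1 / 3, "survey": 1 / 3},
}

# Synthetic teachers: each measure is a hidden "quality" plus independent noise.
# This is a toy data-generating assumption, not a model of real classrooms.
rng = random.Random(0)
teachers = []
for _ in range(200):
    quality = rng.gauss(0, 1)
    teachers.append({
        "prior_vam": quality + rng.gauss(0, 1),
        "observation": quality + rng.gauss(0, 1),
        "survey": quality + rng.gauss(0, 1),
        "future_gain": quality + rng.gauss(0, 1),
    })

# Correlate each scheme's composite score with the later outcome.
outcome = [t["future_gain"] for t in teachers]
results = {}
for scheme_name, weights in SCHEMES.items():
    scores = [composite(t, weights) for t in teachers]
    results[scheme_name] = pearson(scores, outcome)
    print(f"{scheme_name}: r = {results[scheme_name]:.2f}")
```

Swapping `future_gain` for a different outcome column (say, a higher-order test score, if you had one) is all it would take to ask which weighting predicts that instead.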
So what should we make of this?  Is this a vindication of VAM?  I'm not so sure, for a couple of reasons:

  1. Whether a test is designed to measure growth or not is an important issue that isn't addressed by a simplistic "Look at whether test scores increase or not."  Increasing from a 400 SAT to a 500 SAT is different from increasing from a 500 to a 600, a 600 to a 700, or a 700 to an 800.  So simply pointing to these data and saying "Look, VAMs work!" doesn't address the underlying issue of the test itself:  the reason these VAMs might work is that the underlying tests are better.
  2. There's an odd kind of solipsism to this report, as my friend Sendhil pointed out.  I mean, we're talking about predicting students' gains on tests by using prior students' gains on the same tests. So it shouldn't surprise us that the predictions went pretty well. (Although a graph of VAM scores for the same teachers, same classes, shows that it's not exactly a slam dunk.)
  3. As a classroom teacher, I can attest that there's no such thing as a double-blind study:  the students know whom they're getting, even if they're randomly assigned.  As a longtime teacher in my school with a good reputation, I can ask things of my students that other teachers simply can't ask:  harder projects, more retakes, etc., because students trust me in a way that they might not trust another teacher.  So it's still possible that students who know their teachers are among the good ones (and, as the VAM people keep telling us, "everyone knows" who the good teachers are) therefore do harder work, challenge themselves more, and make more gains, not because of better teaching technique, but because of what they themselves are doing.
  4. Look at those terrible correlations with HOTS (higher-order thinking skills)!  Shouldn't we be trying to figure out what will make students do better on those?