Monday, October 1, 2012

What Could Go Wrong with Value-Added Metrics?

In my last post, I explained what a value-added metric is.  Simply put, a value-added metric combines three things:
  1. Data taken before and after some intervention,
  2. A model that uses pre-intervention data, possibly along with other factors, to predict the post-intervention data, and
  3. An interpretation of any differences between the post-intervention data and the model's predictions.
In the last post, the data were heights of trees; the intervention was a fertilizer treatment, and the model was the linear model based on the data from the unfertilized trees.  In the case where the treated trees grew more than the model predicted, the interpretation is that the fertilizer was effective.  In a value-added metric for teaching, the data are test scores, at the beginning and end of year.  The model predicts end-of-year gains for "typical" students.  The interpretation is typically that the differences between actual and predicted results are a measure of teacher quality.
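
To make the three ingredients concrete, here is a minimal Python sketch of the tree example. The heights are invented for illustration, and the helper names fit_line and value_added are hypothetical; real value-added systems use far more elaborate models than a single straight line.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def value_added(before, after, slope, intercept):
    """Average of (actual - predicted) over the treated group."""
    residuals = [a - (slope * b + intercept) for b, a in zip(before, after)]
    return mean(residuals)

# Made-up tree heights in cm (assumption: not data from the earlier post).
control_before = [100, 120, 140, 160]   # unfertilized trees, start of season
control_after = [110, 132, 154, 176]    # unfertilized trees, end of season
treated_before = [100, 150]             # fertilized trees, start of season
treated_after = [115, 170]              # fertilized trees, end of season

# Ingredient 2: the model is fit on the unfertilized trees only.
slope, intercept = fit_line(control_before, control_after)

# Ingredient 3: the gap between actual and predicted is what gets interpreted.
score = value_added(treated_before, treated_after, slope, intercept)
print(round(score, 2))
```

With these numbers the fertilized trees end up about 5 cm taller than the model predicts. The arithmetic only produces that gap; calling it "the effect of the fertilizer" (or, for test scores, "teacher quality") is the interpretation step.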

There's been lots of misinformation about value-added metrics; before we deal with what's wrong with this scheme, we need to make sure that we're not spouting half-baked criticisms that make us all sound ignorant.

Half-Baked Objection 1:  It's not fair to penalize teachers whose students don't end the year at grade level when those kids start the year behind grade level.
A value-added metric (VAM) doesn't simply score teachers on their students' end-of-year results; it looks for growth from the beginning of the year to the end of the year.  So if a group of students starts 5th grade reading at the 3rd grade level, and finishes the year reading at the 4th grade level, the teacher is supposed to get credit for a year of growth.
Half-Baked Objection 2: Students aren't plants, and teachers aren't fertilizer.
Of course they aren't.  But by itself, this objection says "You can't measure anything." And while bad measurement of teachers hurts the profession, claiming that what we do can't be measured doesn't help either.
*     *     *     *     *
What is it reasonable to expect of a measure of teacher quality?  Let's establish a few criteria:

  1. Longitudinal Consistency: Teachers change over time, but not necessarily that much in any given year.  So unless we have evidence that a teacher is taking substantial steps to improve his or her practice, or strong evidence that something has come unhinged, we would expect teacher scores to stay roughly the same from one year to the next. If teacher scores fluctuate wildly, that casts doubt on whether the score is really measuring something that the teacher is doing.
  2. External Validity: There are research-based strategies for exemplary teaching; that is, people have actually compiled lists that describe what teachers need to do to be effective.  One such model is Charlotte Danielson's Framework for Teaching, but it's not the only one.  Because these strategies are themselves validated by research demonstrating their impact on student learning, we would expect that, in general, teachers who are doing the things on these lists would score highly on the value-added metric, and that teachers who are not doing these things would score poorly.  Of course, there's no canonical list that we need treat as gospel: it's possible that, over time, our views of what constitutes good teaching will evolve, and that this evolution will be informed by results of a metric system.
  3. Fairness: We don't want our measurement system to treat one group of teachers differently from another, and it should be mostly immune to sabotage or "gaming" by malevolent or savvy administrators and teachers.
  4. Appropriate Incentives: Peter Drucker's maxim "What is measured, improves," has a corollary:  make sure you measure the things that you want to improve.  In an era when almost any fact can be Googled, when the phrase "21st Century Skills" has gone from a war cry to a banality, we need to be careful that our metric creates incentives for teachers to teach the skills, concepts, and habits that we want kids to learn.  We also want to ensure that the metric doesn't create perverse incentives for teachers to skip over crucial content, revert to large-scale rote memorization, or avoid teaching certain students.  For example, the current NCLB regime has the well-documented "Bubble Effect":  it's to a teacher's advantage to concentrate on those students who are near the proficiency borderline, to the exclusion of students who are so far from proficiency that a single year's work is unlikely to make the difference.
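
The first criterion, longitudinal consistency, is the easiest of the four to check with numbers: compare each teacher's score in one year with the same teacher's score the next year. Here is a rough Python sketch using a plain Pearson correlation. The scores are invented, and using correlation as the yardstick is an assumption made for illustration, not a description of how any district actually audits its metric.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical value-added scores for the same five teachers, listed in
# the same order in each year (assumption: made-up numbers).
year1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
stable_year2 = [-1.8, -1.1, 0.2, 0.9, 2.1]   # scores barely move
noisy_year2 = [1.0, -2.0, 2.0, -1.0, 0.0]    # scores jump around

r_stable = pearson(year1, stable_year2)
r_noisy = pearson(year1, noisy_year2)
print(round(r_stable, 2), round(r_noisy, 2))
```

A correlation near 1 means teachers who scored well one year mostly scored well the next. A correlation near 0, as in the noisy case, is the "fluctuates wildly" situation that casts doubt on whether the score reflects anything the teacher is doing.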
There are probably lots of other criteria we could use, but this list makes a fair start.  The next question is: how well do current systems measure up?