Angles of Reflection: September 2012

In conversations in and around Chicago these last two strike-filled weeks, one item has captured center stage: the use of value-added metrics in teacher evaluations. As I've been part of these discussions, I've noticed that both opponents and proponents have something in common: they really don't know what a value-added metric is.

You can tell these people by how they argue about the metric. For example, "If the kids start out the year behind, how is it fair to penalize the teacher for the fact that they end the year behind?" (It wouldn't be, but the "added" part of the value-added metric means that the metric is trying to describe change, not just absolute performance at the end of the year.) Or "If the class starts out at 80% and then ends at 85%, the teacher's responsible for the other 5%." (The value-added model doesn't compare average scores directly, which is good, because we would hope that students grow over the year anyway.)

I'm no fan of value-added metrics in teacher evaluations, but we won't get anywhere arguing about them if we don't even know what we're arguing about. So this blog is a sort of crib sheet for teachers and education people who haven't gotten totally immersed in the statistics and psychometrics stuff.

To make this description go, we'll apply it to a situation where a VAM might actually be useful. Say you want to determine whether a particular fertilizer treatment makes plants grow faster. If this were your fourth-grade science fair project, you'd just take two groups of plants, compute the mean height of each group, and then treat one group with fertilizer. At the end of the experiment, you'd compute the mean heights again. If the fertilized plants have a higher mean height, then the fertilizer works. Right?

Wrong. Computing means of groups doesn't tell you much about what happens to the individual plants. For example, in the (totally cooked-up-to-make-this-point) dataset below, at the end of the experiment, two groups of plants have mean heights of 3.89cm (fertilized) and 4.05cm (unfertilized). At the end of the experiment, the fertilized plants have mean height 5.97cm, while the unfertilized plants have mean height 6.05cm. So the fertilized plants have it, by a whisker: unfertilized plants grew an average of 2cm, while fertilized plants grew by 2.08cm. Problem is, in my model, all I did was assume that the below-average-height fertilized plants grew 4 cm, while the above-average-height fertilized plants didn't grow at all. All unfertilized plants grew by 2cm.

We can see these differences clearly in the scatterplots below, comparing fertilized (left) with unfertilized (right). In both graphs, the blue line is y = x, representing no growth at all. Points above the blue line represent plants that grew; points on the line represent plants that didn't grow, and points below the line (of which there aren't any) would represent plants that actually shrank.

In these graphs, the amount of growth is just the vertical distance from a point to the "no growth" blue line. Stats types often graph those distances separately, and give them a special name: "residuals". In a residual plot, the y-value is the difference (actual value - predicted value):

These residual plots make the situation clear: the "fertilizer" leaves half the plants worse off than they would have been without fertilizer. Simple means don't tell this story.

The other problem with comparing means is that data never fall on nice straight lines: as a representation of possible reality, the made-up dataset above is garbage. A much more realistic set of data for the unfertilized plants (not even worrying about the rest of the experiment) might look like this--but again, I made this one up too:

Heights of Unfertilized Plants

Again, the blue line is y = x: if a point is on this line, it represents a plant whose final height and initial heights are the same, i..e, it didn't grow at all. Points above the graph represent plants that grew more; the further the point is from y = x, the more it grew. Not all plants grew the same amount, but there is a clear upward trend, and it seems like plants that started out taller grew more. So we can draw a best-fit line:

Here, the slope of the trendline, 1.27, suggests the average plant grew 27% over the experiment. But not every plant grew exactly 27%. If we look at the residuals, we see some "noise":

The fact that these points are not exactly on the 0 line tells us that there are other factors at work, but the fact that they seem randomly distributed about the line suggests that there's no systematic factor (affecting all the plants) that our model is missing. In fact, if we calculate the mean residual for this dataset, we would get -0.06, suggesting: the average plant is within 0.1 cm of the height predicted by the model. (Whether this difference is significant or not is a whole nother story....)

Applying that same trendline to the data for the fertilized plants yields an "aha!":

In this case, all the fertilized plants lie above the 27% growth trendline, suggesting that they all grew more than 27%. Again, we can look at the residuals:

In this case, the average residual is 1.93, confirming what we see on the plot: the residuals seem clustered about 2cm above the 0 line. This result suggests that there is another factor at work, possibly the fertilizer.

If we add a new trendline to quantify the growth of the fertilized plants, we get a model whose mean residual is nearly 0:

The slope of 1.76 suggests that the typical fertilized plant grew by 76%. So we might say: fertilized plants grew almost 3 times as much, relative to their original sizes, as unfertilized plants!

The quant way of describing this experiment is in three parts. In the first part, we acquired some data on how unfertilized plants grow. In the second part, we used this data to develop a model for plant growth: a typical plant grows about 27%, and, because r² =0.83, we conclude that about 83% of the variation in the final plant heights can be attributed to the starting plant height. The other 17% is the randomness we see in the residual plot. The model doesn't tell us anything about the other 17% of variation, or even about why the plants grow proportional to their original height--it just describes what seems to happen. In the third part, we used the model to develop a conclusion about the fertilized plants: the fertilized plants all grew more than the model predicted, so (we conclude) the fertilizer must be effective.

Because we have all these numbers about effectiveness, we can even crunch them together to get an effectiveness score. We can crunch the growth rates: 1.76-1.27, or 1.76/1.27, or 0.76/0.27. We can use the mean residual of 1.93 for the fertilized plants against the unfertilized model: "The typical fertilized plant grew 1.93 cm more than unfertilized plants." Notice, though, that the "effectiveness score" quickly stops being meaningful outside of the context of how we computed it: it's just a way of summarizing the relationship between one set of data and another set of data. Because every plant grew a different amount, we can't point to a single number as being completely representative of the unfertilized plants, or of the fertilized plants, and so we can't just crunch those not-incredibly-meaningful numbers together to get something that's more meaningful.

There's lots of room to argue about why this is a lousy approach to measuring educational performance, but that's not our goal today. I want to leave with one last set of ideas about modeling.

First, what about the variation in the data--the original dataset, and the 17% "missing" variation we saw in the residuals that we said was due to "other factors"? If I'm trying to test fertilizers, I do my best to either eliminate other factors or spread them out so much that they cancel each other out. In the first approach, I might plant all my plants in the same field, or make sure that they are watered completely evenly. But of course there might be minute but significant variations in soil, sunlight, etc., which I can't physically control to be exactly the same across all plants. I probably won't level a mountain, or tear down my neighbor's barn, to make sure that all the plants get the same amount of light. So if I'm being really tricky, I divide up my growing space into a hundred or more small squares, plant one plant per square, and then randomly determine which squares get fertilizer. While one particular swath of land might get more sunlight, or less standing water, that patch will have both fertilized and unfertilized plants growing on it. Another advantage of randomizing conditions in this way is that I might be able to get valid results across a wider range of conditions: it doesn't do a farmer in a flat field of acidic soil in North Dakota any good to know that my fertilizer works really well on hilly alkaline fields in Arizona.

Of course, that kind of experimental control is really hard to obtain in education. In education, most of the variables are completely outside a researcher's control. Many are clustered: the students who are low-income or high-income come with other baggage that affects their education.

So I might use a third approach: adopt a more complex model that takes more factors into account. My original meta-model was very simple: growth is a function of original height. But if I'm trying to measure the effects of fertilizer on 50-year-old trees, I can't just plant a bunch of trees in small random plots and wait 50 years to start measuring effects. So what I do is I think of every factor that could affect tree height, measure those factors for each tree, and then come up with a more complex model that predicts height based on all these factors: soil acidity, hours of direct sunlight, proximity to sidewalks, whatever. But then when I've finished fertilizing and growing, I can do the same basic procedure I did in the simpler case: apply the model I've developed to the fertilized plants and see whether their growth pattern is substantially different from what the complicated model predicts. In education research, we might have a model for student performance that takes into account class size, school socioeconomic statistics, etc., and then we'd be trying to see how this model predicts student performance for the students we're trying to study (because they have a particular teacher, or are using a particular curriculum, or ...).

The last caveat is that, of course, the correlation we've observed doesn't tell us much about cause. We don't know why plants that start out taller tend to grow more, or why the fertilizer is effective, or even--unless we've carefully controlled or randomized other variables--whether it's the fertilizer that is making the difference. There's a great XKCD that makes this point:

I've tried to write this so that it sounds pretty reasonable. Next time we'll talk about what goes wrong when this approach is applied to measuring teacher quality in education, and the gloves will come off. I promise.

Angles of Reflection

Thursday, September 20, 2012

What the heck is a Value-Added Metric?