As if there weren’t enough already, I found a few more flaws in ‘value-added,’ flaws which, I believe, put it beyond salvation.
‘Value-added’ is, once again, how much better or worse a teacher’s students do on a standardized test compared to what a complicated formula predicted they would get. If the formula predicts teacher A’s students will get a 2.7 and they get a 3.0, then teacher A gets a +.3 as his ‘value-added’ score. If the formula predicts teacher B’s students will get a 3.2 and they get a 3.3, then teacher B gets a +.1 as his ‘value-added’ score. In this scenario, teacher A is more likely to get merit pay while teacher B is more likely to get a pay cut (de-merit pay?) or even get fired.
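To make the arithmetic concrete, here is a minimal sketch in Python. The real complexity lives in the *prediction* formula, which is far more complicated than anything here; this only shows how the final score falls out once a prediction exists.

```python
# A minimal sketch of the final 'value-added' arithmetic. The real formula
# behind the *prediction* is far more complicated; this only shows how the
# score falls out once the prediction exists.

def value_added(predicted: float, actual: float) -> float:
    """'Value-added' = how far the class's actual score beat the predicted one."""
    return round(actual - predicted, 2)

print(value_added(2.7, 3.0))  # teacher A: 0.3
print(value_added(3.2, 3.3))  # teacher B: 0.1
```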
The flaw in this is that even though the teachers are trying to meet different benchmarks, there is an element of ‘luck’ involved. In assigning that .3 as teacher A’s ‘effectiveness,’ there is an assumption that it is fair to compare it to B’s .1 since the formulas have ‘equalized’ everything. But does this mean that if these teachers had switched classes at the beginning of the year, A would have moved those 3.2 kids up to 3.5 and B would have only moved those 2.7 kids up to 2.8? It is unlikely. Maybe B is really great at helping lower-scoring classes, and had he taught those students he would have gotten a +.4 ‘value-added.’ In this way, it is not fair, since a teacher who happens to get an assignment in his ‘comfort zone’ can earn a higher ‘value-added’ than another teacher who might have done even better with the first teacher’s students. Teachers are not ‘value-adding’ widgets who will get the same score regardless of which group of students they teach.
The next issue is whether ‘value-added’ is even appropriate for comparing two teachers for whom the formula predicts the same class scores. So you’ve got teachers C and D. Both teach the same grade and have the ‘same kids,’ meaning their classes have the same pretest scores and the formula predicts they will have the same posttest scores, let’s say 2.8. C’s classes get a 3.3 and D’s classes get a 2.7. Does this mean that in a parallel universe where they swapped classes before the first day of school, C would have gotten the same +.5 gain with D’s classes, and D the same -.1 loss with C’s classes?
As we can’t go into a parallel universe to check this, I did an experiment with the 2007-2008 and 2008-2009 New York City teacher data that was released in the media. I located about 2,000 teachers who, in both the 2007-2008 and 2008-2009 school years, kept the same grade and the same subject, and whose classes in the two consecutive years had starting scores that differed by less than .05 (on the 1 to 4 scale, this means that, at least according to the formula, they taught the ‘same kids’ for two straight years).
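For anyone who wants to try replicating this, the filtering step looks roughly like the sketch below. To be clear, this is an illustration under assumptions: every column name here (teacher_id, year, grade, subject, start_score) is my invention, not the actual schema of the released files.

```python
# Hypothetical sketch of the filtering step. Assumes one row per teacher per
# year, with invented column names; the real released files were messier.
import pandas as pd

df = pd.read_csv("nyc_teacher_data.csv")  # hypothetical combined file

y1 = df[df["year"] == "2007-2008"]
y2 = df[df["year"] == "2008-2009"]

# Pair each teacher's two consecutive years side by side.
pairs = y1.merge(y2, on="teacher_id", suffixes=("_y1", "_y2"))

# Same grade, same subject, and class pretest means within .05 of each other:
# by the formula's own logic, these teachers taught the 'same kids' twice.
same_kids = pairs[
    (pairs["grade_y1"] == pairs["grade_y2"])
    & (pairs["subject_y1"] == pairs["subject_y2"])
    & ((pairs["start_score_y1"] - pairs["start_score_y2"]).abs() < 0.05)
]
print(len(same_kids))  # about 2,000 teachers in the actual data
```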
Only if the percentile ranks of these teachers’ ‘value-added’ stay around the same from one year to the next can we say that ‘value-added’ really is a reliable metric. Not surprisingly, even with every other variable held the same, the ranks barely correlated. (For math people, the r value was .3, but that was to the best-fit line, not the y=x line that true year-to-year consistency would require, so the real agreement is even lower than that.)
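That parenthetical is the part people miss, so it deserves a few lines of code. Pearson’s r measures scatter around the best-fit line, whatever that line happens to be; reliability requires agreement with the y=x line itself. Lin’s concordance correlation coefficient measures exactly that, and it is mathematically never larger in magnitude than r. Continuing from the hypothetical same_kids table above (with invented va_percentile columns for each year’s rank):

```python
# Continuing the sketch: compare year-1 and year-2 percentile ranks.
import numpy as np
from scipy.stats import pearsonr

x = same_kids["va_percentile_y1"].to_numpy(dtype=float)
y = same_kids["va_percentile_y2"].to_numpy(dtype=float)

# Pearson's r: scatter around the best-fit line (about .3 in this data).
r, _ = pearsonr(x, y)

# Lin's concordance correlation coefficient: agreement with y = x itself.
# Any tilt or shift away from the identity line drags it below |r|.
ccc = 2 * np.cov(x, y, bias=True)[0, 1] / (
    x.var() + y.var() + (x.mean() - y.mean()) ** 2
)
print(f"Pearson r = {r:.2f}, agreement with y=x (CCC) = {ccc:.2f}")
```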
This is the best check we can do on whether these numbers actually measure anything. Again, these are the ranks in two consecutive years of all the teachers who in those two years taught the ‘exact same’ group of kids. Since these scores don’t correlate, it is not fair to compare even two teachers who get the ‘same kids.’ It is almost certain that if they had switched their two groups of kids, their ‘value-added’ scores would be different from what they would have been had they not switched.
This analysis is even more significant than the one I did in part 2, where I showed that a teacher who taught two different grades in the same year got wildly different ‘value-added’ scores. I could see how someone who knows nothing about education might argue that this is not all that surprising: the teacher might be quite good at teaching 7th grade but weak as an 8th grade teacher, or maybe the 7th grade group had very different ability levels than the 8th grade group, so since this teacher has strengths and weaknesses, the two different ‘value-added’ scores could be averaged together to get a more accurate picture of this teacher’s ‘effectiveness.’
But this new analysis shows that a teacher teaching the same grade, the same course, with the ‘same students’ does not get consistent results. It is truly like weighing yourself, stepping off the scale, stepping back on one second later, and having your ‘weight’ change by twenty pounds.
Along with a lot of other analyses by people with many more tools in their statistics tool belts than I have, I hope this post adds a new wound to the misuse of standardized test scores in teacher evaluation. Like Jason in the Friday the 13th movies, ‘value-added’ is tough to kill off.