Less technical post about VAM: What ‘value-added’ is and is not

As if there weren’t enough already, I found a few more flaws in ‘value-added,’ flaws which, I believe, put it beyond salvation.

‘Value-Added’ is, once again, how much better or worse a teacher’s students do on a standardized test compared to what a complicated formula predicted the students would get. If the formula predicts teacher A’s students will get a 2.7 and they get a 3.0, then that teacher gets a +.3 as his ‘value-added’ score. For teacher B, the formula predicts they will get a 3.2 and they get a 3.3, so that teacher gets a +.1 as his ‘value added’ score. In this scenario, teacher A will be more likely to get merit pay while teacher B will be more likely to get a pay cut (de-merit pay?) or even get fired.
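For readers who like to see the arithmetic spelled out, here is a minimal sketch of that subtraction. The function name `value_added` is made up for illustration, and the real formula that produces the prediction is far more complicated than anything shown here.

```python
def value_added(predicted, actual):
    """Actual class average minus the average the formula predicted."""
    # Rounded to two places only to hide floating-point noise.
    return round(actual - predicted, 2)

# The two teachers from the example above.
teacher_a = value_added(predicted=2.7, actual=3.0)
teacher_b = value_added(predicted=3.2, actual=3.3)

print("Teacher A:", teacher_a)
print("Teacher B:", teacher_b)
```

Note that B's students finish with the higher score (3.3 versus 3.0), yet A ends up with the higher ‘value-added.’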

The flaw in this is that even though the teachers are trying to meet different benchmarks, there is an element of ‘luck’ involved. In assigning this .3 as teacher A’s ‘effectiveness,’ there is an assumption that it is fair to compare that to B’s .1 score since these formulas have ‘equalized’ everything. But does this mean that if these teachers had switched classes in the beginning of the year that A would have moved those 3.2 kids up to 3.5 and that B would have only moved those 2.7 kids up to 2.8? It is unlikely. Maybe B would be really great at helping the lower classes and had he taught those students he would have gotten a +.4 ‘value-added.’ In this way, it is not fair since a teacher who happens to have an assignment in his ‘comfort zone’ can get a higher ‘value-added’ than another teacher, even though that other teacher may have been even better had he gotten the first teacher’s students. Teachers are not ‘value-adding’ widgets who will get the same score regardless of what group of students they teach.

The next issue is whether ‘value-added,’ then, is appropriate for comparing two teachers for whom the formula predicts the class scores to be the same. So you’ve got teachers C and D. Both teach the same grade and have the ‘same kids,’ meaning that they have the same pretest scores and the formula predicts they will have the same post test scores, let’s say 2.8. So C’s classes get a 3.3 and D’s classes get a 2.7. Does this mean that in a parallel universe where they swapped classes before the first day of school that C would have gotten the same +.5 gain with D’s classes, and D would have gotten the same -.1 loss with C’s classes?

As we can’t go into a parallel universe to check this, I did an experiment with the 2007-2008 and 2008-2009 New York City teacher data that was released in the media. I located about 2,000 teachers who, in both the 2007-2008 and 2008-2009 school years, kept the same grade and the same subject, and whose classes in the two consecutive years had starting scores that differed by less than .05 (on the 1 to 4 scale, this means that, at least according to the formula, they taught the ‘same kids’ for two straight years).
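The matching rule can be sketched in a few lines. This is only an illustration under assumptions: the field names (`grade`, `subject`, `pretest`), the teacher IDs, and the sample records are all invented, and the sketch assumes one entry per teacher per year, which the released data does not guarantee.

```python
def same_assignment(rec1, rec2, tolerance=0.05):
    """True if grade and subject match and starting scores differ by less than tolerance."""
    return (
        rec1["grade"] == rec2["grade"]
        and rec1["subject"] == rec2["subject"]
        and abs(rec1["pretest"] - rec2["pretest"]) < tolerance
    )

# Invented records keyed by an invented teacher ID.
year_0708 = {
    "t1": {"grade": 5, "subject": "math", "pretest": 2.80},
    "t2": {"grade": 7, "subject": "math", "pretest": 3.10},
    "t3": {"grade": 4, "subject": "ela", "pretest": 2.50},
}
year_0809 = {
    "t1": {"grade": 5, "subject": "math", "pretest": 2.83},  # same kind of class
    "t2": {"grade": 8, "subject": "math", "pretest": 3.10},  # changed grade
    "t3": {"grade": 4, "subject": "ela", "pretest": 2.90},   # very different starting scores
}

# Keep only teachers who appear in both years with a matching assignment.
matched = [
    teacher
    for teacher in year_0708
    if teacher in year_0809
    and same_assignment(year_0708[teacher], year_0809[teacher])
]
print(matched)
```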

Only if the percentile ranks of these teachers’ ‘value-added’ scores stay around the same from one year to the next can we say that ‘value-added’ really is a reliable metric. Not surprisingly, even with every other variable being the same, the ranks did not correlate at all. (For math people, the r value was .3, but r measures how well the points fit some line, not necessarily the y=x line, so the agreement between the two years really is even lower than that.)
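To see why r can overstate year-to-year agreement, here is a small sketch with made-up percentile ranks (not the New York City data). The helper names `pearson_r` and `mean_abs_change` are my own. The made-up second-year ranks fall exactly on a line, so r comes out as high as it can be, but the line is not y=x, so most teachers' ranks still move a lot.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def mean_abs_change(xs, ys):
    """Average number of percentile points a rank moved between the two years."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Made-up ranks: year 2 is year 1 squeezed toward the middle.
year1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
year2 = [35 + 0.3 * x for x in year1]

r = pearson_r(year1, year2)
shift = mean_abs_change(year1, year2)
print("r =", round(r, 3))
print("average rank change =", round(shift, 1))
```

In this toy case the points fit a line perfectly, yet the average teacher's rank changes by about 19 percentile points. A reliable metric would need the points to hug the y=x line specifically.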

This is the best we can do to check if these numbers actually measure anything. Again, these are the ranks in two consecutive years of all the teachers who in those two years taught the ‘exact same’ group of kids. As these scores don’t correlate, it means that it is not fair to even compare two teachers who get the ‘same kids.’ It is almost certain that if they had switched their two groups of kids, their ‘value-added’ scores would be different than if they didn’t switch.

This analysis is even more significant than the one I did in part 2, where I showed that a teacher who taught two different grades in the same year got wildly different ‘value-added’ scores. I could see how someone who knows nothing about education could argue that it is not all that surprising that this happened, since the teacher might be quite good at teaching 7th grade but weak as an 8th grade teacher, or that maybe the 7th grade group had very different ability levels than the 8th grade group, so since this teacher has strengths and weaknesses, the two different ‘value-added’ scores can be averaged together to get a more accurate picture of this teacher’s ‘effectiveness.’

But this new analysis shows that a teacher teaching the same grade, the same course, with the ‘same students’ does not get consistent results. It is truly like weighing yourself on a scale, getting off the scale and then one second later getting on the same scale, and having your ‘weight’ change by twenty pounds.

Along with a lot of other analysis by people with a lot more tools than me in their statistics tool belts, I hope this post adds a new wound to the misuse of standardized test scores in teacher evaluation. Like Jason in the Friday The Thirteenth movies, ‘value-added’ is tough to kill off.


Pingback: Testing teacher through “VAM: « Deborah Meier on Education

Another complication is how much a student might be expected to grow. A student scoring a 2 and moving to 2.2 shows 10% growth, while a student starting at a 3 would have to grow .3 to show 10%. Students already functioning at a higher level might show less growth and therefore make their teachers look less effective. A teacher working with LD or special needs students might struggle to get her/his students to gain .1 and would appear less effective compared to teachers of normal students. Electives like music, PE and art are other problem areas. 50% of an art teacher’s evaluation in Florida is supposed to be based on an average of the VA scores for his/her students in reading (??!!)
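The growth arithmetic in this comment can be checked with a short sketch. The function names are invented for illustration, and nothing here is taken from any actual value-added formula.

```python
def percent_growth(start, end):
    """Growth as a percentage of the starting score."""
    return round((end - start) / start * 100, 1)

def points_needed(start, percent):
    """Score points required to grow by the given percentage."""
    return round(start * percent / 100, 2)

low_start = percent_growth(2, 2.2)     # student moving from 2 to 2.2
high_start = percent_growth(3, 3.3)    # student moving from 3 to 3.3
extra_effort = points_needed(3, 10)    # points a 3 must gain for 10% growth

print(low_start, high_start, extra_effort)
```

Both students show 10% growth, but the student starting at 3 has to gain 0.3 points to get there, compared with 0.2 points for the student starting at 2.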

Pingback: Gary Rubinstein Skewers VAM « Diane Ravitch's blog

This is an interesting and important analysis. Would it be accurate to make a slight qualification to your statement that the scores do not correlate? The middle part of the graph is the densest; this suggests to me that teachers whose value-added scores are in the middle range have a greater likelihood of receiving similar scores in consecutive years.

Also, the graph suggests these patterns:

Teachers whose scores approach 0 or 100 are not likely to repeat those scores two years in a row.

Teachers who received 100 (or close) in 2009 are unlikely to have received lower than 30 in 2008.

Teachers who received 0 in 2009 are unlikely to have received higher than 60 in 2008.

All that said, the year-to-year score correspondences are “all over the map,” as you point out. Moreover, the patterns I have identified don’t tell us a whole lot anyway. (If teachers regularly get 0 in one year and 50 the next, with the same grade, subject, and pretest scores, what meaning do these scores have?) I just wanted to ask whether you see the patterns I see in this graph.

Regression to the mean; understand it and you will have your answer.

I’m not convinced that regression toward the mean explains these patterns, but I’m still puzzling over the possibility. Will reply at greater length later if I have more to say.
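For anyone puzzling over the same question, here is a rough simulation of regression toward the mean. Everything in it is an assumption made for illustration: each simulated teacher has a fixed "true" effect, each year's score is that effect plus independent noise, and the noise level is chosen arbitrarily. It is not a model of the New York City data.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Assumed: a stable underlying effect plus large independent yearly noise.
skill = [rng.gauss(0, 1.0) for _ in range(n)]
year1 = [s + rng.gauss(0, 1.5) for s in skill]
year2 = [s + rng.gauss(0, 1.5) for s in skill]

def mean(values):
    return sum(values) / len(values)

# Pick the top and bottom tenth of teachers by their year 1 score.
order = sorted(range(n), key=lambda i: year1[i])
bottom = order[: n // 10]
top = order[-(n // 10):]

top_y1 = mean([year1[i] for i in top])
top_y2 = mean([year2[i] for i in top])
bottom_y1 = mean([year1[i] for i in bottom])
bottom_y2 = mean([year2[i] for i in bottom])

print("top tenth:    year 1", round(top_y1, 2), " year 2", round(top_y2, 2))
print("bottom tenth: year 1", round(bottom_y1, 2), " year 2", round(bottom_y2, 2))
```

Under these assumptions, the teachers with the most extreme year 1 scores land closer to the middle in year 2, even though nothing about them changed. That is consistent with extreme scores rarely repeating, though it does not by itself show that this is what happened in the real data.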

Correction to the third sentence of the last paragraph of the comment: I meant: “If teachers routinely get 50 in one year and 0 in the next…” (though the reverse holds true as well).

Hi Diana,

You are right. There is a small degree of correlation. I actually would have expected it to have more correlation just because of possible biases that make one teacher’s value-added good or bad.

I should also mention (and I’ve written a bit about this before — I’ll look up to see where) that as a teacher I do take it very hard when my students don’t do well on one of my own tests. I’ve given homework and classwork and have assessed them informally by asking questions in class and having students come to the board to explain — the test should be a formality since I should already know that they will do pretty well on it. Now, if someone is absent for a lot of days, I don’t blame myself for that — sometimes value-added accounts for students who have poor attendance. D.C. uses the attendance from the year before so the teacher doesn’t encourage the student to be absent!

Gary

Thanks, Gary, for your reply. Please know that when I challenge your points (as I have done twice by my count, and only slightly), I do so in good spirit, with appreciation. I like being challenged in that way as well.

I, too, take it hard when students don’t do well on my tests. And like you, I take it in stride if they have been absent for many days. Sanity should reign in these cases.

Gary, your contributions to the VAM debate are immeasurable! Thank you so much! The problem is that the totally ideologically driven reformers are not moved by the pragmatic implications of your work, or the work of many others. Or the facts. Or the truth that challenges their claims.

Their mission, in fact, seems to be taking place in that parallel universe that you mentioned. They are not moved by earthly measures and mathematical facts. After all, the work you and I did to point out the utterly embarrassing measures of “success” at Urban Prep, or Miami Central, or all of the schools we studied from the infamous “Miracle Schools” of Newsweek didn’t seem to faze the reformers. They are staying on their message, and touting new examples (just as bogus as others we have exposed) and they have lots of people working full-time helping them.

It is scary, and it is devastating to teachers and students all over this country. How can we wake our country up from this national nightmare? It is, in fact, like the nightmare that children (and adults) sometimes have where they see the horrible event coming, and know what to do to stop it, but can’t find their voice. In our waking nightmare, we have to find a way to break into the mainstream, into the critical mass of people who will stop for a moment to think critically about the “reforms” and realize this is not the “Brave New World” anyone REALLY wants for their children! I think it is time we join with other critics of the crazy parallel universe of the reformers, and develop a message that moves the masses. I welcome your thoughts, Gary.

Really compelling argument! And certainly something that ought to be explained before moving forward with this particular VAM. I’m just curious: it seems like you did by hand something that you could have also controlled for through a multivariable regression. How do you think that might have changed the results? Also, you said you chose about 2000 teachers who fit the aforementioned criteria. 2000 out of how many? If there were more than 2000 teachers you could have chosen from (with same pretest, same subject, etc.), how did you choose these 2000?

There were about 18,000 entries in the 2007-2008 database, representing 12,000 teachers (half of them had two evaluations, either because they taught elementary and had math and verbal or because they taught 2 different grades of the same subject). Of those 18,000 entries, these 1,592 were all of the entries that met the criteria.

Gary, I have no idea what you’re talking about, but I absolutely agree. 🙂

Unfortunately for Diane, I think the force of gravity is too strong for this incredibly dense post to go viral (maybe it can go rolling instead, if on a hill, with a little push) but I think it’s important that these topics are dissected both philosophically and scientifically.

Also this:

http://larrycuban.wordpress.com/2012/09/14/chicago-teachers-strike-performance-evaluation-and-school-reform-jack-schneider-and-ethan-hutt/

Gary – I had posted this as a response to an Education Week article, thought I would copy it here, as I was questioning other aspects of VAM – which I bet you have addressed in other posts.

“VAM scores are statistical models – regression models that ideally should include a good majority of the variables that impact learning. When a regression model does not include important variables then the model is prone to significant error – thus all the complaints that the VAM scores are inconsistent and highly variable.

As an elementary teacher who was once an economist who worked with regression (yes, I did a weird midlife career switch) I view the VAM models I have seen as missing key data that impact a teacher’s ability to teach:

– class make up (one or more disturbed children can have a big impact on all learners)

– support staff (if you are stuck with a horrible aide, a weak teaching team, or even worse, a horrible sped teacher as a partner, your scores will vary)

– if you were moved into teaching a new grade, or a new subject area or if your curriculum materials are poor, or new or untested

None of these critical variables are measured in any of the VAM regression equations I have seen. Not to mention individual student factors that are missing, such as recent family deaths/illnesses/divorces, or whether parents have suddenly decided to medicate their child for behavior modification.

So many missing variables make a regression analysis unreliable. When such a model impacts people’s livelihoods and children’s learning, such unreliability is not acceptable. This is in addition to the important consideration of whether an analysis is measuring what you want it to measure and whether there are undesired consequences from focusing importance on such a measure. VAM is just too weak on so many counts. But we will continue with it and many happy economists will continue having their nice gig selling it to education officials.”

Hi Gary,

I just published a similar analysis of the MET project’s value-added data in this post:

http://mathnerd.teachforus.org/2012/09/20/value-added-part-2-the-met-project-jesse-rothstein/

I think we are both in agreement that the correlations in teacher value-added on state tests between different classes taught by the same teacher are too low for that data to be useful. I am curious: would you support the use of a value-added model that achieved a higher correlation through the inclusion of other measurements of teacher effectiveness? In other words, a teacher’s “value-added” wouldn’t just depend on his or her students’ performance on a state test; it would also depend on other measurements like conceptual tests, student feedback, and quantitatively-scored observations. Hopefully this would be more consistent across classes. If it were, what would you think about using it to evaluate teachers?