In today’s New York Times there was a story about a research study which supposedly proved that students who had teachers with good value-added scores were more successful in life. This inspired me to finish something I have been working on, off and on, for several months: a detailed analysis of the raw data supplied in the most quoted value-added study there is, a paper written in Dallas in 1997. This is the paper which ‘proved’ that students who had three effective teachers in a row got dramatically higher test scores than their unlucky peers who had three ineffective teachers in a row. I’ve written about it previously, much less formally, here and here.
The New York Times story frustrated me since I know that value-added does not correlate with future student income. Value-added does not correlate with teacher quality. Value-added doesn’t correlate with principal evaluations. It doesn’t correlate with anything, including, as I’ll demonstrate in this post, itself.
The way the Dallas study is often described is as follows: There were two groups of third graders. They both had gotten math scores of about 55% at the end of second grade. One group got three effective teachers in a row. The other got three ineffective teachers in a row. After the three years they were retested and the group with the three effective teachers now had an average score of 76% while the poor suckers who had three ineffective teachers were down to 27%.
This study conjures up images of identical twins separated at birth and given different upbringings. But that is not actually what happened. They did not take two groups of kids and do this. Instead, they took 3,000 kids and tested them at the end of second grade. Then the kids went to school for the next three years and got tested again at the end of fifth grade. In between, they were shuffled around and classes were split up, just as happens in school.
Every teacher was rated before the study on a scale from 1 to 5. How they measured that quality isn’t clear, but after three years, in 1996, every kid was given a three-digit group number ranging from 111 to 555. There ended up being one hundred twenty-five groups of about 30 students each. If a student had a 1-rated teacher in 1994, a 2-rated teacher in 1995, and a 4-rated teacher in 1996, they were part of the 124 group. Every student who had a 1, a 2, and then a 4 was treated as a group, even though they mostly had different teachers; all they shared was that they supposedly had that quality of teacher in that order. Note that there is a different group called 241, which contains students who had a 2 in 1994, then a 4 in 1995, then a 1 in 1996.
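To make the grouping scheme concrete, here is a minimal sketch of it in Python. The student names and ratings are invented purely for illustration; the study's actual data processing is not public in this form:

```python
from collections import defaultdict

# Hypothetical illustration: each student's teacher ratings for
# 1994, 1995, and 1996 (names and numbers invented).
ratings = {
    "alice": (1, 2, 4),
    "bob":   (2, 4, 1),
    "carol": (1, 2, 4),
}

groups = defaultdict(list)
for student, (r94, r95, r96) in ratings.items():
    code = f"{r94}{r95}{r96}"  # a 1-, 2-, then 4-rated teacher -> "124"
    groups[code].append(student)

# Order matters: "124" and "241" are different groups.
```

With these made-up students, alice and carol land in group 124 while bob lands in the distinct group 241, even though all three had the same three ratings in some order.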
They did this for ten groups of 3,000 kids and called the groups R4, R5, R6, R7, R8, M4, M5, M6, M7, and M8, where the R stands for reading and the M for math, while the number stands for the grade that group was in in 1996. They then created, for each group, bar graphs similar to the one you see below proving that three effective teachers can close the achievement gap.
Below is one of the graphs from the study. The first three bars on the left represent the starting scores of three different groups of students: the leftmost bar represents what the students who were about to get three ineffective teachers (111) got in 1993; they had a 57% scaled average. The second bar indicates that the students who were about to get three ‘average’ teachers (323) had a starting score of about 56%, and the third bar indicates that the students who were about to get three effective teachers (455) had a starting score of about 55% in 1993. The three bars on the right represent the scores of those same three groups of students in 1996. So the 111 group went down from 57% to 27%, the 323 group went down from 56% to about 54%, and the 455 group went up from 55% to 76%.
When you look at this graph, the first thing that might seem unusual is that the high group is not the 555 group, but the 455 group. Why is that? Well, because the 555 group had a different starting point than the 111 group, so it would not be a valid comparison. None of the twenty graphs has both the 111 and the 555 groups, since those groups never had a close enough starting point. They explain in the paper that this is because:
So they admit that the assignment of students to these teachers was done with bias, which, it seems to me, invalidates the entire study. But I learned that even with this bias, the authors of the report had to further distort their results. When you see those three bars, it seems that those were the only three groups that began with a starting score of around 56%.
Looking at the actual raw data, I learned that there were actually twelve other groups that had a starting score in that range. Since I didn’t have enough room to make 30 bars on my graph, I split my graph up into two. On the top graph, I have the starting scores for all 15 groups that got between 55% and 57% in 1993. The graph on the bottom is the scores for those same 15 groups in 1996. So the 151 bar on the top graph means that the group of students who were about to have an ineffective teacher, followed by an effective teacher, followed by another ineffective teacher (151) started with about a 56% score. Then on the bottom graph, there is a 151 bar which indicates that those same students who had an average of 56% in 1993 had an average of about 41% in 1996. Again, this was not a group of thirty students who stayed together through the same teachers, but the scores of the kids who had any 1 teacher followed by any 5 teacher followed by any 1 teacher.
1993 scores of groups who had scores around 56%
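The selection I made for these graphs can be sketched as follows. The first four score pairs echo bars described above; the 541 group's numbers are invented to stand in for a group whose 1993 score falls outside the band:

```python
# Hypothetical 1993 and 1996 mean scores per group code. The 111, 151,
# 323, and 455 values echo bars discussed in the post; "541" is an
# invented example of a group outside the 55-57% band.
scores_1993 = {"111": 57.0, "151": 56.0, "323": 56.0, "455": 55.0, "541": 62.0}
scores_1996 = {"111": 27.0, "151": 41.0, "323": 54.0, "455": 76.0, "541": 60.0}

# Keep every group whose 1993 mean falls in the 55-57% band, not just a
# hand-picked three, and pair it with its 1996 mean.
band = {code: (scores_1993[code], scores_1996[code])
        for code in scores_1993
        if 55.0 <= scores_1993[code] <= 57.0}
```

The point of the full selection is exactly this: once every group in the band is kept, the 151 group's drop to 41% sits alongside the 455 group's rise to 76%, instead of being left off the chart.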
Now it is not exactly clear how to interpret some of the other combinations. Should it be better to have a 151 combination or a 214 combination? They both add up to 7. I sorted the bars from low sum to high sum and used a complicated tie-break procedure to create this full picture. One funny thing is that the 125 group is actually the second best. They are better than the 525 group even though they had a 1-rated teacher the first year. They are also better than the 521 combination, since apparently the order in which you get the teachers makes a big difference. Just taking the 125, 333, and 424 bars, I could make a graph that seems to show that ‘better’ teachers get worse results.
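The ordering of the bars can be sketched like this. The lexicographic tie-break here is a simple stand-in for the more complicated procedure I actually used:

```python
codes = ["151", "214", "125", "521", "333", "424", "525"]

# Sort by the sum of the three ratings, low to high; ties are broken
# lexicographically here (a stand-in for my actual tie-break procedure).
ordered = sorted(codes, key=lambda c: (sum(int(d) for d in c), c))
```

Both 151 and 214 sum to 7, so they come first under some tie-break; 125 and 521 both sum to 8 and follow, with 333, 424, and 525 after them.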
I could analyze all twenty graphs to find evidence that value-added is, at best, a pseudoscience. Only someone not involved in education could think that the top twenty percent of teachers are these heroes who can perform miracles. I take particular offense to this myth as I consider myself to be quite a good teacher, and I know that I would not be capable of such feats.
Here is another graph from the paper, followed by my more complete graph with analysis:
1993 scores of the 14 groups who had starting scores around 32%
1996 scores of the same 14 groups three years later
When you take just the 112, 233, and 553 bars, it looks like a clear positive linear progression. But when you see all 14 groups that had similar starting points, there are many exceptions to the supposed correlation between teacher ‘quality’ and test gains. For instance, the 515 group did better than the 553 group. The 125 group destroyed the 441, 532, and even the 542 groups. And as bad as the 112 group was, they still beat both the 323 and the 441 groups. If I just isolate the 112, 323, and 441 groups, I can, again, ‘prove’ that ‘better’ teachers get lower scores. Though there is a slight upward trend when you look at all the groups (I’d expect there to be, since there was surely an element of value-added in the initial assessment that sorted the teachers into those ratings), the random ups and downs of these graphs seem to prove, more than anything, how unscientific this study is.
Yet this study continues to be quoted, especially the M5 graph that was my first example. It is in Whitney Tilson’s PowerPoint slides (shown below). It is in The New Teacher Project’s Denver report (shown below the Tilson slide). I don’t think Michelle Rhee has ever given a talk where she has not quoted this study.
It took me longer than I’d like to admit, but I did this type of analysis for all twenty graphs and collected the results here. The original 1997 Dallas paper is here. My extensive spreadsheet is here. I encourage anyone who is interested to download them and see what things you can find in them.