The ‘three great teacher’ study — finally laid to rest

In today’s New York Times there was a story about a research study which supposedly proved that students who had teachers with good value-added scores were more successful in life. This inspired me to complete something I have been working on, off and on, for several months: a detailed analysis of the raw data supplied in the most quoted value-added study there is, a paper written in Dallas in 1997. This is the paper which ‘proved’ that students who had three effective teachers in a row got dramatically higher test scores than their unlucky peers who had three ineffective teachers in a row.  I’ve written about it previously, much less formally, here and here.

The New York Times story frustrated me since I know that value-added does not correlate with future student income. Value-added does not correlate with teacher quality. Value-added doesn’t correlate with principal evaluations. It doesn’t correlate with anything including, as I’ll demonstrate in this post, with itself.

The way the Dallas study is often described is as follows: There were two groups of third graders. They both had gotten math scores of about 55% at the end of second grade. One group got three effective teachers in a row. The other got three ineffective teachers in a row. After the three years they were retested and the group with the three effective teachers now had an average score of 76% while the poor suckers who had three ineffective teachers were down to 27%.

This study conjures up images of identical twins separated at birth and given different upbringings.  But this is not actually what happened.  The researchers did not set up two matched groups of kids and assign them teachers.  Instead, they took 3,000 kids and tested them at the end of 2nd grade.  The kids then went to school for the next three years, shuffled around with classes split up just as happens in any school, and were tested again at the end of fifth grade.

Every teacher was rated before the study on a scale from 1 to 5.  How they measured that quality isn’t clear, but after three years, in 1996, every kid was given a three-digit group number ranging from 111 to 555.  There ended up being one hundred twenty-five groups of about 30 students each.  If a student had a 1-rated teacher in 1994, a 2-rated teacher in 1995, and a 4-rated teacher in 1996, that student was part of the 124 group.  All students who had a 1, then a 2, and then a 4 were treated as one group even though most of them had different teachers; all they shared was that they supposedly had teachers of that quality in that order.  Note that 241 is a different group: students who had a 2 in 1994, then a 4 in 1995, then a 1 in 1996.
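
To make the labeling concrete, here is a minimal Python sketch of how such a group number could be built from three yearly ratings. The name group_label is made up for this illustration and does not come from the paper; the sketch only assumes what is described above, namely that each rating is a whole number from 1 to 5 and that order matters.

```python
from itertools import product

def group_label(ratings):
    """Join three yearly teacher ratings (each 1-5) into a label such as '124'."""
    if len(ratings) != 3 or any(r not in range(1, 6) for r in ratings):
        raise ValueError("expected three ratings, each from 1 to 5")
    return "".join(str(r) for r in ratings)

# Every possible ordered sequence of three ratings gets its own group.
all_groups = [group_label(seq) for seq in product(range(1, 6), repeat=3)]

print(group_label((1, 2, 4)))  # 124
print(group_label((2, 4, 1)))  # 241, a different group from 124
print(len(all_groups))         # 125
```

Because order matters, the count is 5 x 5 x 5 = 125, which matches the number of groups mentioned above.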

They did this for ten groups of 3,000 kids and called the groups R4, R5, R6, R7, R8, M4, M5, M6, M7, M8, where the R stands for reading and the M for math, while the number stands for the grade that group was in in 1996.  They then created, for each group, bar graphs similar to the one you see below, proving that three effective teachers can close the achievement gap.

Below is one of the graphs from the study.  The first three bars on the left represent the 1993 starting scores of three different groups of students.  The leftmost bar shows that the students who were about to get three ineffective teachers (111) had a 57% scaled average.  The second bar shows that the students who were about to get three ‘average’ teachers (323) had a starting score of about 56%, and the third bar shows that the students who were about to get three effective teachers (455) started at about 55%.  The three bars on the right represent the scores of those same three groups of students in 1996.  So the 111 group went down from 57% to 27%, the 323 group went down from 56% to about 54%, and the 455 group went up from 55% to 76%.

When you look at this graph, the first thing that might seem unusual is that the high group is not the 555 group, but the 455 group.  Why is that?  Well, because the 555 group had a different starting point than the 111 group, so it would not be a valid comparison.  None of the twenty graphs have both the 111 and the 555 groups since those groups never had a close enough starting point.  They explain in the paper that this is because:

So they admit that the assignment to these teachers was done with bias which, it seems to me, invalidates the entire study.  But I learned that even with this bias, the authors of the report had to further distort their results.  When you see those three bars it seems that those were the only three groups that began with a starting score of around 56%.

Looking at the actual raw data, I learned that there were actually twelve other groups that had a starting score in that range.  Since I didn’t have enough room to make 30 bars on my graph, I split my graph up into two.  On the top graph, I have the  starting scores for all 15 groups that got between 55% and 57% in 1993.  The graph on the bottom is the scores for those same 15 groups in 1996.  So the 151 bar on the top graph means that the group of students who were about to have an ineffective teacher, followed by an effective teacher, followed by another ineffective teacher (151) started with about a 56% score.  Then on the bottom graph, there is a 151 bar which indicates that those same students who had an average of 56% in 1993 had an average of about 41% in 1996.  Again, this was not a group of thirty students who stayed together through the same teachers, but the scores of the kids who had any 1 teacher followed by any 5 teacher followed by any 1 teacher.
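
The selection step can be sketched in a few lines of Python. This is only an illustration: it uses the handful of approximate percentages quoted in this post rather than the full spreadsheet, and the names scores and groups_in_range are hypothetical.

```python
# Approximate (1993, 1996) average scores in percent, as quoted in this post.
scores = {
    "111": (57, 27),
    "323": (56, 54),
    "455": (55, 76),
    "151": (56, 41),
}

def groups_in_range(scores, low, high):
    """Keep only the groups whose 1993 score lies between low and high, inclusive."""
    return {g: s for g, s in scores.items() if low <= s[0] <= high}

# Print each comparable group with its starting score, ending score, and change.
for group, (start, end) in groups_in_range(scores, 55, 57).items():
    print(group, start, end, end - start)
```

Run against the full data, the same filter is what turns the paper's three bars into fifteen.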

1993 scores of groups who had scores around 56%

1996 scores of the same 15 groups three years later

Now it is not exactly clear how to interpret some of the other combinations.  Should it be better to have a 151 combination or a 214 combination?  They both add up to 7.  I sorted the bars from low sum to high sum and used a complicated tie-break procedure to create this full picture.  One funny thing is that the 125 is actually the second best group.  They are better than the 525 even though they had a 1 teacher the first year.  Also, they are better than the 521 combination since apparently the order that you get the teachers makes a big difference.  Just taking the 125, 333, 424 bars, I could make a graph that seems to show that ‘better’ teachers get worse results.
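
For anyone who wants to reproduce the ordering of the bars, here is a rough Python sketch. It assumes the tie-break is the one I describe in the comments below (add up the squares of the three ratings), and it assumes ties are ordered from the smaller sum of squares to the larger, which is a guess made for this illustration. The name sort_key is made up, and the labels are just groups mentioned in this post.

```python
def sort_key(label):
    """Return (sum of ratings, sum of squared ratings) for a label like '151'."""
    digits = [int(ch) for ch in label]
    return (sum(digits), sum(d * d for d in digits))

groups = ["424", "151", "333", "214", "525", "125", "521"]

# sorted() is stable, so groups with identical keys keep their input order.
ordered = sorted(groups, key=sort_key)
print(ordered)  # ['214', '151', '125', '521', '333', '424', '525']
```

Note that this rule cannot separate groups that use the same three ratings in a different order: 125 and 521 both come out as (8, 30), so their relative position depends only on the order in which they were listed.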

I could analyze all twenty to find evidence that value-added is, at best, a pseudoscience.  Only someone not involved in education could think that the top twenty percent of teachers are these heroes that can perform miracles.  I take particular offense to this myth as I consider myself to be quite a good teacher, and I know that I would not be capable of such feats.

Here is another graph from the paper, followed by my more complete graph with analysis:

1993 scores of the 14 groups who had starting scores around 32%

1996 scores of the same 14 groups three years later

When you take just the 112, 233, and 553 bars it looks like a clear positive linear progression.  But when you see all 14 groups that had similar starting points there are many exceptions to the supposed correlation between teacher ‘quality’ and test gains.  For example, the 515 group did better than the 553 group.  The 125 group destroyed the 441, 532, and even the 542 groups.  And as bad as the 112 group was, they still beat both the 323 and the 441 groups.  If I just isolate the 112, 323, and 441 groups, I can, again, ‘prove’ that the ‘better’ teachers get lower scores.  There is a slight upward trend when you look at all the groups, which I’d expect since there was surely an element of value-added in the initial assessment of the teachers that put them into those groups.  Still, the random ups and downs of these graphs seem to prove, more than anything, how unscientific this study is.

Yet this study continues to be quoted, especially the M5 example that was my first example.  It is in Whitney Tilson’s powerpoint slides (shown below).  It is in The New Teacher Project’s Denver report (shown below the Tilson slide).  I don’t think that Michelle Rhee has ever given a talk where she has not quoted this study.

It took me longer than I’d like to admit, but I did this type of analysis for all twenty graphs and collected the results here.  The original 1997 Dallas paper is here.  My extensive spreadsheet is here.  I encourage anyone who is interested to download them and see what things you can find in them.


18 Responses to The ‘three great teacher’ study — finally laid to rest

  1. Michael Fiorillo says:

    As always, thanks for your important work.

    Value-added, by its very definition, does have a 1:1 correlation to two things: the reduction of children to the status of commodities, and of teachers to factors of production.

  2. Ms. Math says:

    So what does happen if kids get an effective teacher? I believe this, yet I’m still sure I’d pull my kid out of the 111 track.

    I like to think that working on improving teaching is an important goal. Does this analysis mainly say that measuring teachers with test scores isn’t logical?

    Or is it just that other factors, such as culture and the environment in which teachers work, are much more important? That concept doesn’t seem very TFA to me, though I’m skeptical that any TFA teacher has made a dent in the math gap in their time in the corps.

  3. Tom says:

    tl;dr: This is a terrible study. It’s not taken seriously nor cited by any serious economists. On top of that, it’s 15 years old. For you to use this study as a “take-down” of value-added literature is disingenuous at best.

    Likewise, I think it is a real shame that TNTP continues to use this study in their literature.

    That being said, your analysis of it is also seriously flawed. Anyone can see from your graph that there is a clear positive linear relationship between the quality of teacher and average score. OF COURSE you will be able to pick out exceptions, but to say that they invalidate the findings shows a serious lack of statistical understanding. I agree that the authors of this paper make the same mistake.

    To demonstrate, I’ve thrown together a quick linear regression of the score after three years on the sum of teacher quality ratings (e.g., 151 = 7). You can see from the regression results that there is beyond a shadow of a doubt a positive relationship between teacher quality and scores.

    I’ve only given the paper a quick look over, but I also see no evidence of controlling for potential sorting (all the good kids from the 55% class went to the 5 teacher and all the bad kids went to the 1 teachers.)

    My point is not to defend this study, and again it’s a terrible one, but mainly just to show that you are using a 15 year old straw man to reject an entire field of analysis.

    I think you and Diane Ravitch do a real disservice to debate by taking potshots on the internet at VAM without a sophisticated understanding of it. I’m not suggesting that you have no right to question studies, just that when you draw incorrect conclusions the vast majority of your readership takes them at face value without a true understanding. It’s very frustrating.

    • Gary Rubinstein says:


      I’m only challenging this study because the corporate reformers keep using it to defend their agenda. I admit that I’m not a statistician. I’m a math teacher with a better understanding of statistics than most people, and I’m sure that people at StudentsFirst, TNTP, and TFA have even less of an understanding than I do. The reason that this is a good paper to study vs. the more recent ones is that they actually supply a lot of raw data to analyze. They had left off the percentiles except for the few they used for their graphs. I had to learn how to reconstruct the other percentiles from the ones they gave. It took a very long time to do this analysis and though it is not perfect, it definitely shows that those three bars are quite misleading in suggesting a strong linear correlation, while the correlation is actually much weaker. That there is a correlation is not a surprise because of circular reasoning. Teachers who get better VAM get better VAM.

      Now, you seem to have a very good grasp of the issues. I invite you to take all my data and do a better analysis of the flaws in this study. I’d be happy to publish it here. I’m not trying to deceive anyone with my analysis, just trying to open the discussion about how much we should base policies on old and flawed (even from my limited knowledge) studies.


    • Mr. K says:

      It should be noted, Tom, that your back-of-the-envelope analysis ignores the concern that only students who have similar starting scores should be compared to each other. That being said, I appreciate your willingness to take statistics at face value. In my major field (astrophysics), results like yours (a clear linear trend with a low standard error) would likewise be considered useful and not discredited simply because of exceptions to the trend.

      Gary, I took a look at your spreadsheet, but there was nothing there to indicate the “complicated tie-break procedure” that you used to sort the teacher score bins. Without a description of this procedure and the rationale behind it, you are being as disingenuous as the authors. Frankly, I prefer Tom’s method since it eliminates the need for additional assumptions regarding the order of teachers, though it perhaps introduces the single assumption that the order of teachers does not have as big of an effect as having effective teachers, period.

      • Gary Rubinstein says:

        The tie-break isn’t that complicated. I just didn’t want to burden my blog readers with it. All I did was square the numbers and add them together. That way a 355 would be different from a 445 even though they both add up to 13. The 355 becomes 59 and the 445 becomes 57. You can see how I did it in the tabs where I separated the intervals to correspond with what they did in the report.

        I’m not trying to be disingenuous at all. This study is bogus.

    • Tom Graham says:

      To suggest that this value-added study has any validity is nonsense because it does not take into account so many other factors beyond a teacher’s influence and control, i.e., a student’s home environment, IQ, economic status, emotional stability, substance abuse problems, etc.

      These types of studies are flawed because they try to put too much emphasis on a teacher’s impact on a student’s life. There are a lot of other influences that impact their life as well.

      I’m curious about who funded this study, what preconceived notions (or agenda) they had, and how far they’re reaching to justify their own existence. I’m against cookie cutter approaches to education.

  4. squeers says:

    I am at this point very interested in someone (Gary!) looking at the work of the economists who put together yesterday’s data reported on in the Times. Correlating a particular teacher to future income seems like a big, if not huge, stretch. How could the study have normalized all the other potential variables that might have caused income to be reduced? I believe these economists are the same fellows whose studies on the transition to middle school are convincing policy makers in NYC that K-8 is the model of the future. This is already influencing plans for schools in NYC without any real understanding of the social impact of keeping students ready for HS in an elementary school setting. Who are these economists and what is motivating their work?

  5. Prof. Heidi Weiman says:

    Tom: It’s not old for those of us teacher educators who attended conferences in the 90s and 2000s where William Sanders touted his Tennessee Value Added Assessment System and discussed this study at length. It was very alarming to me and buzzed like tinnitus in my ears for all these years.

    Gary: Sanders didn’t provide details about how the study was conducted and, although I searched for info on it many times, I was unable to locate it because I didn’t know it was from Dallas, so thank you very much for your analysis!

    • Gary Rubinstein says:

      The Sanders study was actually a year earlier. This one was by Mendro and Jordan, but their goal was to verify Sanders. Sanders doesn’t give the raw data the way they did.

      • Prof. Heidi Weiman says:

        Thanks for the info, Gary. Sanders’ presentations took a shock doctrine approach and focused more on the negative impact of ineffective teachers than the positive impact of effective teachers, as demonstrated by some rather alarming graphs. That’s really what I was looking for and still can’t find. Have you run across that?

  6. Wait… but wasn’t there that cartoon in Waiting for Superman where the great teachers took the top of the kids’ heads off on the assembly line and dumped in a year and a half of knowledge? You mean that’s not really how it works?!

  7. Monica says:

    Thank you for revisiting a base assumption.
    I’m not a statistician, and I’d be curious to see how the teachers’ ratings were arrived at, but if one takes that as reliable, at a glance on the new bar charts doesn’t there seem to be a correlation that students who had a 5 teacher the year of the follow-up test improved the most? Perhaps because 5 teachers were savvy enough to teach to the test?


  8. Pingback: Friday Ed Bites in a New Semester - the weighted pupil

  9. Liann Sumner says:

    Until the definition and criteria for designating a teacher “effective” are published, the whole study is worthless.

  10. Chad says:


    If you haven’t seen this you should check it out:

  11. I’m reading a review of literature for my course. This was the statement:

    Does evidence suggest that some teachers are significantly more effective than
    others at improving student achievement?

    Yes. Ample evidence indicates that there is wide variation among teachers in their ability to
    produce student learning gains, as measured by standardized achievement tests (Murnane, 1975; Armor, Conry-Oseguera, Cox, King, McDonnell, Pascal, Pauly, & Zallman, 1976; Murnane &
    Phillips, 1981; McLean & Sanders, 1984; Hanushek, 1992; Sanders & Rivers, 1996; Wright,
    Horn, & Sanders, 1997; Jordan, Mendro, & Weerasinghe, 1997; Rivers-Sanders, 1999;
    Aaronson, Barrow, & Sander, 2007; Rockoff, 2004; Nye, Konstantopoulos, & Hedges, 2004;
    Hanushek, Kain, O’Brien, & Rivkin, 2005; Rivkin, Hanushek, & Kain, 2005; Kane, Rockoff, &
    Staiger, 2006). Hanushek (2002), for example, notes that the magnitude of differences among
    teachers is so great that within a single large urban district, “teachers near the top of the quality
    distribution can get an entire year’s worth of additional learning out of their students compared
    to those near the bottom.”

    I saw the study you analyzed in this list and was SO GLAD I had read your blog-I’m going to bring this up in class discussion. There are certainly a whole host of studies for you to debunk if you ever get bored teaching!
