Analyzing Released NYC Value-Added Data Part 4

Value-added has been getting a lot of media attention lately but, unfortunately, most stories are missing the point.  In Gotham Schools I read about a teacher who got a low score because it was based on students who were not assigned to her.  In The New York Times I read about three teachers who were rated in the single digits because they had high-performing students and a few of those students’ scores went down.  In The Washington Post I read about a teacher who was fired for getting low value-added on her IMPACT report, possibly because her students’ pretest scores had been inflated by cheating on the part of the previous year’s teachers.

Each of these stories makes it sound like the flaws in value-added are very fixable: get more accurate student data, make some kind of curve for teachers of high-performing students, tighten test security so cheating can’t affect the next year’s teacher’s score.  But the flaws in value-added go WAY beyond that, which is what I’ve been trying to show in my posts: not just some exceptional scenarios, but how it affects the majority of teachers.

In part 1 I demonstrated that the same teachers generally get wildly different value-added percentile rankings from one year to the next.  In part 2 I showed that many teachers who taught multiple grades got rated ineffective and effective in the same year.  The only thing crazier, I guess, is teachers getting rated effective and ineffective in the same year IN THE SAME CLASS.  Yet this is exactly what happened.

The TDR spreadsheet has a lot of columns.  Looking through them, I found that in addition to getting a value-added percentile rank for the entire class, there were also separate value-added scores for the top 1/3 of each class, the middle 1/3, and the bottom 1/3.  So I made a plot to compare how the 5th grade math teachers did in adding value for the top 1/3 vs. the bottom 1/3.  Again, there was little correlation.

Percentile ranks are often not a good way to compare things.  When you see the different plots I’ve made showing little correlation between things that should relate closely, you might think that because the value-added percentile ranks fluctuate so much from year to year, or from class to class (or now from one portion of a class to another), the value-added scores do too.  Actually, these wild changes in percentile rank are not due to the value-added scores changing by much.  The issue is that all teachers have pretty much the same value-added scores.  They range from about -.5 (the class scores went down by about 10%) to +.5 (the class scores went up by about 10%).  A zero on this scale means that the students stayed in the same relative position as they were in the year before, not that they didn’t learn anything.  Since the scores are so close, a slight change one way or the other can drastically change someone’s percentile rank, and the percentile rank is the number we hear about in the paper when this teacher got a 1 or that teacher got a 7.  This is why all the graphs that compare percentile ranks look so random.
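To make the sensitivity concrete, here is a minimal sketch in Python.  It does not use the real TDR data.  It assumes 1,000 hypothetical teachers whose value-added scores follow a bell curve centered at 0 with a standard deviation of 0.2, which is only a rough stand-in for the tight clustering described above, and it asks how far a teacher’s percentile rank moves when her score changes from 0.0 to 0.1.

```python
from statistics import NormalDist


def percentile_rank(score, scores):
    """Percent of scores strictly below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


# ASSUMPTION: 1,000 hypothetical teachers whose value-added scores follow a
# bell curve centered at 0 with standard deviation 0.2 (not the real TDR data).
N_TEACHERS = 1000
assumed = NormalDist(mu=0.0, sigma=0.2)
scores = [assumed.inv_cdf((i + 0.5) / N_TEACHERS) for i in range(N_TEACHERS)]

# The same teacher, before and after a 0.1 change in value-added score.
rank_before = percentile_rank(0.0, scores)
rank_after = percentile_rank(0.1, scores)

print(f"score 0.0 -> percentile {rank_before:.0f}")
print(f"score 0.1 -> percentile {rank_after:.0f}")
```

Under these made-up assumptions, a change of 0.1 on a scale that only spans about -.6 to +.6 moves the teacher from about the 50th percentile to about the 69th.  The exact size of the jump depends entirely on the assumed spread, so treat it as an illustration rather than a measurement.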

When things are grouped together so closely, it does not make sense to do a percentile ranking.  Instead, there should be some kind of cutoff score that is determined beforehand to be ‘good.’  Suppose I give my class a test and the average is an 85, most of the kids get between a 75 and a 95, and 2 kids get under a 65.  With a cutoff of 65, those two kids fail, and it would have been possible for everyone to pass.  But if I do a percentile ranking, a kid who got a 75 might now be in the bottom 10 percent, since he has the third lowest grade in a class of 34.
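The classroom example can be checked with a few lines of Python.  The individual grades below are invented to match the description (34 students, two under 65, one 75, everyone else between 76 and 95); only the class size, the cutoff, and the 75 come from the example itself.

```python
def percentile_rank(score, scores):
    """Percent of scores strictly below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


# Hypothetical class of 34: two grades under 65, one 75, and 31 grades
# spread evenly from 76 to 95 (invented numbers, for illustration only).
scores = [58, 62, 75] + [76 + (i * 19) // 30 for i in range(31)]

CUTOFF = 65  # passing grade fixed before the test is given

failing = [s for s in scores if s < CUTOFF]
rank_of_75 = percentile_rank(75, scores)

print(f"class size: {len(scores)}")
print(f"failing under the cutoff: {len(failing)}")
print(f"percentile rank of the 75: {rank_of_75:.1f}")
```

With the fixed cutoff only two students fail.  With a percentile ranking, the student with the 75 lands below the 10th percentile even though he passed comfortably.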

Here is a plot showing how a teacher’s pretest scores relate to her value-added score.  Notice that the scores are all pretty much between -.6 and +.6.  Also see that there is no correlation, so the teachers at the ‘failing’ schools (the left side of the plot, which has the lower pretest scores) add as much ‘value’ as the teachers at the ‘good’ schools.  I’ve added color for the five categories: high (top 5%), above average (5% to 25%), average (25% to 75%), below average (75% to 95%), and low (bottom 5%).

I also made a histogram to see how clustered these points really are.  What I learned is that 98% of the teachers in this data set had scores between -.3 and +.6.  If there were to be some kind of ‘cutoff’ for passing, it would probably be around -.6 in which case 99% of teachers would have to be rated effective.
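For anyone who wants to repeat this kind of check, here is a rough sketch of the two calculations involved: the share of scores inside a range, and the share at or above a cutoff, along with a crude text histogram.  The ten scores in `sample_scores` are invented placeholders and not values from the TDR spreadsheet, and the function names are my own.

```python
from collections import Counter


def share_between(scores, low, high):
    """Fraction of scores with low <= score <= high."""
    return sum(1 for s in scores if low <= s <= high) / len(scores)


def share_at_or_above(scores, cutoff):
    """Fraction of scores that meet or exceed a fixed cutoff."""
    return sum(1 for s in scores if s >= cutoff) / len(scores)


def text_histogram(scores):
    """One line per tenth of a point, one '#' per score in that bucket."""
    counts = Counter(round(s * 10) for s in scores)  # bucket to nearest tenth
    lines = []
    for tenth in range(min(counts), max(counts) + 1):
        lines.append(f"{tenth / 10:+.1f} {'#' * counts[tenth]}")
    return "\n".join(lines)


# HYPOTHETICAL value-added scores, invented for illustration only.
sample_scores = [-0.7, -0.4, -0.2, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 0.3]

in_range = share_between(sample_scores, -0.3, 0.6)
passing = share_at_or_above(sample_scores, -0.6)
histogram = text_histogram(sample_scores)

print(histogram)
print(f"share between -.3 and +.6: {in_range:.0%}")
print(f"share at or above -.6: {passing:.0%}")
```

Swapping the placeholder list for the actual value-added column would give the real percentages; the placeholder results say nothing about the real data.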

Statistically, allowing for a margin of error, these 99% of teachers are ‘equal’ in terms of ‘value-added.’  Ironically, a tool that was designed to show how widely teacher quality varies is actually showing that all teachers are about the same.  It is only through the improper use of percentile rankings that the bottom 5%, no matter how close they are to the average, get the awful-seeming single-digit ‘scores’ we see in the paper.

I want to make it clear that I’m not saying that all teachers are equally good.  But when a tool designed to show how vastly different teachers are at improving test scores just ‘proves’ all teachers are equally good, that tool needs to be scrapped.


6 Responses to Analyzing Released NYC Value-Added Data Part 4

  1. meghank says:

    Thank you for doing these posts, Gary.

    Did you know that in Tennessee, they are probably going to publish not the teacher’s value-added score, but the actual evaluation score on a scale from 1 to 5? It will include the principal’s evaluation of the teacher on the massive new observation checklist rubric.

    I don’t understand how publishing performance review information is legal. In Tennessee, a 1992 law prohibits publishing the TVAAS (value-added) ratings.

  2. Fran Chase says:

    I have really enjoyed your posts concerning value added and why it doesn’t work. I agree with your point that all of the examples in the papers make it sound like the problem can be fixed with a few adjustments, and clearly that isn’t true.

    When are you going to do another spreecast? I enjoyed the two that you did already. This would be a great topic to cover because people could ask you questions. I think you would draw a big crowd.

  3. Pingback: Remainders: UFT official pushes back against principals protest | GothamSchools

  4. John Smith says:

    This seems to actually be an argument for the metric. We already knew that the vast majority of teachers are, at a minimum, quite competent. I don’t know if -0.6 is a good cutoff, but if it is, then we should be quite happy that 99% of our teachers are effective.

    I also disagree that you must choose a cutoff beforehand. It is quite common to grade on a curve. In that method, passing or failing would depend on the number of standard deviations one’s score is from the mean.

  5. Pingback: Brett Keller » Blog Archive » Group vs. individual uses of data

  6. CitizensArrest says:

    Hello Gary, I would appreciate your insights and future writing on this aspect of VAM. To the best of my knowledge, and logically since the data does not exist in an intelligible, let alone usable, form, the failure to incorporate data on absenteeism into a VAM-based teacher evaluation system greatly reduces its already questionable accuracy and therefore further invalidates its use for that purpose. Absenteeism data does not track in-school absences such as cutting class or leaving early after having been marked present, which is an additional shortcoming.
    The links in the article get you to the report itself. Thanks for the great work!
