In part 1 I demonstrated there was little correlation between how a teacher was rated in 2009 and how that same teacher was rated in 2010. So what can be more crazy than a teacher being rated highly effective one year and then highly ineffective the next? How about a teacher being rated highly effective and highly ineffective IN THE SAME YEAR?

I will show in this post how exactly that happened for hundreds of teachers in 2010. By looking at the data I noticed that of the 18,000 entries in 2010, about 6,000 were repeated names. This is because there are two ways that one teacher can get multiple value-added ratings for the same year.

The most common way this happens is when the teacher is teaching self-contained elementary in 3rd, 4th, or 5th grade. The students take the state test in math and in language arts and that teacher gets two different effectiveness ratings. So a teacher might, according to the formula, ‘add’ a lot of ‘value’ when it comes to math, but ‘add’ little ‘value’ (or even ‘subtract’ value) when it comes to language arts.

To those who don’t know a lot about education (yes, I’m talking to you ‘reformers’), it might seem reasonable that a teacher can do an excellent job in math and a poor job in language arts, and that it should not be surprising if the two scores for that teacher do not correlate. But those who do know about teaching would expect the amount the students learn in each subject to correlate. Someone who is doing an excellent job teaching math is likely to be doing an excellent job teaching language arts, since both jobs rest on common groundwork that benefits all learning in the class. The teacher has good classroom management. The teacher has helped her students to be self-motivated. The teacher has a relationship with the families. All these things increase the amount of learning of every subject taught. So even if an elementary teacher is a little stronger in one subject than another, it is more about the learning environment that the teacher created than anything else.

Looking through the data I noticed teachers like a 5th grade teacher at P.S. 196, who scored 97 out of 100 in language arts and 2 out of 100 in math. This is with the same students in the same year! How can a teacher be so good and so bad at the same time? Any evaluation system in which this can happen is extremely flawed, of course, but I wanted to explore whether this was a major outlier or something quite common. I ran the numbers and the results shocked me (which is pretty hard to do). Here’s what I learned:

Out of 5,675 elementary school teachers, the average difference between the two scores was a whopping 22 points. One out of six teachers, or approximately 17%, had a difference of 40 or more points. One out of 25 teachers, which was 250 teachers altogether, had a difference of 60 or more points, and, believe it or not, 110 teachers, or about 2% (that’s one out of fifty!) had differences of 70 or more points. At the risk of seeming repetitive, let me repeat that this was the same teacher, the same year, with the same kids. Value-added was more inaccurate than I ever imagined.
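For readers who want to run the same kind of check on the released data, here is a minimal sketch of how gap statistics like these could be computed. The function name `gap_summary` and the sample scores are made up for illustration; this is not the exact procedure used for the numbers above, and it assumes each teacher's two percentile scores have already been paired up.

```python
from statistics import mean

def gap_summary(pairs, thresholds=(40, 60, 70)):
    """Summarize the gap between two percentile scores per teacher.

    pairs: list of (language_arts_score, math_score) tuples, each 0-100.
    Returns the mean absolute gap and, for each threshold, the share of
    teachers whose gap is at least that large.
    """
    gaps = [abs(ela - mth) for ela, mth in pairs]
    shares = {t: sum(g >= t for g in gaps) / len(gaps) for t in thresholds}
    return mean(gaps), shares

# Made-up scores for illustration only; these are NOT from the NYC data.
sample = [(97, 2), (50, 45), (30, 75), (80, 82), (10, 60)]
avg_gap, shares = gap_summary(sample)
print(avg_gap)  # mean absolute gap in points
print(shares)   # share of teachers at or above each gap threshold
```

Using the absolute difference means it does not matter which of the two subjects came out higher, only how far apart the two ratings are.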

I made a scatter plot of the 5,675 teachers. On the x-axis is that teacher’s language arts score for 2010. On the y-axis is that same teacher’s math score for 2010. There is almost no correlation.

For people who know education, this is shocking, but there are people who probably are not convinced by my explanation that these should be more correlated if the formulas truly measured learning. Some might think that this really just means that just like there are people who are better at math than language arts and vice versa, there are teachers who are better at teaching math than language arts and vice versa.

So I ran a different experiment for those who still aren’t convinced. There is another scenario where a teacher got multiple ratings in the same year. This is when a middle school math or language arts teacher teaches multiple grades in the same year. So, for example, there is a teacher at M.S. 35 who taught 6th grade and 7th grade math. As these scores are supposed to measure how well you advanced the kids that were in your class, regardless of their starting point, one would certainly expect a teacher to get approximately the same score on how well they taught 6th grade math and 7th grade math. Maybe you could argue that some teachers are much better at teaching language arts than math, but it would take a lot to try to convince someone that some teachers are much better at teaching 6th grade math than 7th grade math. But when I went to the data report for M.S. 35 I found that while this teacher scored 97 out of 100 for 6th grade math, she only scored a 6 out of 100 for 7th grade math.

Again, I investigated to see if this was just a bizarre outlier. It wasn’t. In fact, the spreads were even worse for teachers teaching one subject to multiple grades than they were for teaching different subjects to the same grade.

Out of 665 teachers who taught two different grade levels of the same subject in 2010, the average difference between the two scores was nearly 30 points. More than one out of four teachers, approximately 28%, had a difference of 40 or more points. Ten percent of the teachers had differences of 60 points or more, and a full five percent had differences of 70 points or more. When I made my scatter plot with one grade on the x-axis and the other grade on the y-axis I found that the correlation coefficient was a minuscule .24.
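A correlation coefficient like this can be checked by hand. The sketch below computes the standard Pearson coefficient in plain Python; the function name `pearson_r` and the two short score lists are invented for illustration and are not the NYC data. It assumes the coefficient reported here is Pearson's r, which the post does not state explicitly.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations for each variable.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up scores for illustration only: one grade on x, the other on y.
grade_a = [97, 40, 65, 20, 80]
grade_b = [6, 55, 70, 35, 60]
print(pearson_r(grade_a, grade_b))
```

One way to read a coefficient of .24: squaring it gives about 0.058, so in a simple linear fit roughly 6% of the variation in one score would be accounted for by the other.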

Rather than report about these obvious ways to check how invalid these metrics are and how shameful it is that these scores have already been used in tenure decisions, or about how a similarly flawed formula will be used in the future to determine who to fire or who to give a bonus to, newspapers are treating these scores like they are meaningful. The New York Post searched for the teacher with the lowest score and wrote an article about ‘the worst teacher in the city’ with her picture attached. The New York Times must have felt they were taking the high road when they did a similar thing but, instead, found the ‘best’ teachers based on these ratings.

I hope that these two experiments I ran, particularly the second one where many teachers got drastically different results teaching different grades of the same subject, will bring to life the realities of these horrible formulas. Though error rates have been reported, the absurdity of these results should help everyone understand that we need to spread the word since calculations like these will soon be used in nearly every state.

I’ve never asked the people who read my blog to do this before since I prefer that it happen spontaneously, but I’d ask for you to spread the word about this post. Tweet it, email it, post it on Facebook. Whatever needs to happen for this to go ‘viral,’ I’d appreciate it. I don’t do this for money or for personal glory. I do it because I can’t stand when people lie and teachers, and yes those teachers’ students, get hurt because of it. I write these posts because I can’t stand by and watch it happen anymore. All you have to do is share it with your friends.

amazing; I have tweeted emailed & Facebooked it. thanks!

I found the link from your first post, and put it up on Facebook. This is an outrage — thank you for doing the statistics. (I am one of the elementary teachers in your plot.)

Yes, the many aspects of classroom environment affect teacher performance, but I wish you or someone would discuss the lack of supports we receive from administrators and the DOE themselves.

Disruptive behavior, usually by only a couple of students, is never dealt with. And some teachers are set up to have less progress with their students because they get the students with the most emotional and scholastic needs. Add to this a cut in services to these kinds of kids and a cut of supplies and learning materials — it’s a wonder we ever teach anything at all.

So very true!

Great work! Thank you for your hard work and dedication!

I am a passionate elementary special education teacher. As a teacher of special education, it is obvious that I am extremely concerned about VAM. Here are a few thoughts from a special education teacher’s point of view.

Developmentally, do we expect our children to grow equally in both reading and math at the same time? It is well documented that when babies/toddlers begin to walk, their speech might decline. When babies/toddlers begin to speak in sentences, other milestones might maintain status quo. How can we expect our children to win both the reading AND the math “Races” in the same year? Olympic sprinters are not expected to win the marathon, also.

Success in reading and/or math in school is a team approach. In my school, I would not, could not (we are celebrating the birthday of Dr. Seuss this week) take credit for a student’s growth without acknowledging the hard work and dedication of the child, family, reading specialist, math specialist, regular education teacher, speech pathologist, OT, lunch server, recess supervisor, secretary, principal, parent volunteers, school custodian, etc. How can 1 teacher be measured for 1 child’s success?

When standardized tests are given in the fall of the school year, how can the current teacher who has worked with the child for approximately 6 weeks take credit for the hard work and dedication of the team who worked with the child the previous year?

I’m a retired teacher living on Vancouver Island, British Columbia. You reached me – so I’m hoping your incredible work will spread far and wide. Thanks for taking the time.

Hi Gary, maybe I’m just looking in the wrong places, but I can’t seem to find a way to download the entire dataset. Could you give the link that you used? (I’m sure others would appreciate the same.) Thanks.

Sure. They are at http://www.ny1.com/content/top_stories/156599/now-available–nyc-teacher-performance-data-released-friday#doereports

Thanks for sharing the analysis, and the location of the data!

This reminds me of a scatter plot my sister used in her masters thesis defense that showed no correlation – she connected the dots to make a picture of a donkey! It got a big laugh. Shared, and will disseminate. Thanks for the hard work on this, it’s so valuable.

Ditto on the link. It’s really frustrating trying to find the data.

And those R^2 values….damn.

here they are http://www.ny1.com/content/top_stories/156599/now-available–nyc-teacher-performance-data-released-friday#doereports

Thanks – both for your analysis, and for sharing the data source!

You failed to run what I think would be the most obvious relationship, that between one year’s scores and the next.

As a test run, I looked at just 4th grade math teachers in the 08-09 and 09-10 years. The correlation coefficient for this relationship turned out to be .44, which, for one variable in the social sciences, would be considered quite high.

What this suggests is that, while one year of results should be taken with a grain of salt, after 3 or 4 years of data these numbers will become quite significant.

While I do agree that no good comes from publishing this data (in particular single year scores), I think you too easily dismiss their usefulness in evaluating teachers over multi-year periods.

oops! I see you did that now in pt. 1. I would still argue that the .35 coefficient you found is high enough to draw conclusions from over a multi-year period.

One other point that I think is getting missed here is the desire of the NY DOE to release these scores. As far as I can tell, they were forced to via the Freedom of Information act.

NYCDOE told reporters to FOIL it (1) after signing an agreement with the UFT that said they would oppose FOIL requests. (2)

1) http://www.cjr.org/cover_story/tested.php?page=all

2) http://www.schoolleadership20.com/profiles/blogs/how-to-demoralize-teachers-by-diane-ravitch

Gary – This is a fantastic analysis. Thank you.

Given the latest deal NYSUT has made around teacher evaluation, we have a lot more of this coming….

http://socialistworker.org/2012/02/28/bitter-fruits-of-race-to-the-top

Great study! You should turn this into a published paper, but as it is, it will be recommended reading for my postgraduate students in higher education. Of course, it is already tweeted. Thanks! (from Sydney, Australia…)

I was persuaded by your arguments, which brought logic and empiricism together in service of truth (and fairness). I admire your work, motivation and passion. Looking forward to the next installment.

Today, I heard the Deputy Chancellor Shael Polakow-Suransky defending the just-released teacher data reports (TDRs) in the weirdest way. Having just issued them, he said the public shouldn’t over-react to teacher grades, because they are only one piece of evidence and besides, THE DATA ARE A FEW YEARS OLD. What!?

So they count but they don’t count, except to create predictable misunderstanding by parents and the public at large–and to leave teachers open to scorn based on knowingly misleading information.

In all the doublespeak it looks like the results are a sham, but nonetheless worthy of release to the press for widespread dissemination.

There was another research piece done on the state’s testing program–the testing engine driving the city’s TDRs. It appeared on the New York City Public School Parents blog right before the teacher grades came out and showed that the results generated over the last five years (covering the years of test data used in the TDRs) were based on exams whose items yielded inherently contradictory (uncorrelated) results. Here’s the link:

http://nycpublicschoolparents.blogspot.com/2012/02/testing-expert-points-out-severe-flaws.html

In effect, an item-level analysis revealed evidence that called into question the meaning and use of the results in general, much less to reach high-stakes decisions.

Your excellent work on the strange test results and the analysis of the test itself–with findings that may explain why the patterns you found defy rational expectation–are the kind of research we need to stop the continuing test insanity that has taken over education.

The harm that is being caused to students and teachers is immeasurable.

This explains why Michelle Rhee, Bill Gates and other reformers were so against publishing this data.

They knew how bad VAM would look when properly and publicly analyzed.

Protecting teachers? Of course not — protecting themselves!

I have mixed feelings about this. I have taught for about 19 years, and I really stink at teaching math above maybe fifth grade. I could get better at it, no doubt, with work and training. But I am certain that if you scored me on both fifth and sixth grade math, and reading (I’m a reading specialist), all at the same time, I would have a tremendous variance in my scoring. And, yes, I’m certified to teach all of them, although not “highly qualified,” in NCLB speak. I’m a very good teacher of reading and writing. I do teach remedial elementary math. I took beginning calc in college – one hundred years ago. Don’t just disregard these stats. We need to be evaluated, just as everyone should be. I am sure the stats are not as well done as they should be. But my ego can take the battering.

I believe the question is this: across all teachers measured, should we expect to find some correlation between their performance at two grade levels on a single subject (if this were a measure of the performance of those teaching the subject)?

I think it’s safe for Gary to assume that we should. All things considered, someone who is a “good” math teacher for 6th grade is probably more likely to be a “good” math teacher for 7th grade than another teacher picked at random from the entire set.

But these findings don’t reflect that, and so they probably don’t reflect teacher performance.

We’re undergoing a revamping of our teacher evaluation system here in Detroit Public Schools (2010 CM) and I’d love to share a pretty revealing e-mail exchange with you. How can I get in touch?

sure. Just email me at garyrubinstein the-at-sign yahoo.com

One problem with the second plot is that it under-plays the problem visually. If you make a box in each corner that is 20% of the scale (so, the zero-zero corner up to 20, the 80-100 and the 0-20 in the lower right corner, etc.) and count the number of people in those areas, you need to ADD the lower right and upper left counts (the two areas where the same person was simultaneously great and horrible at teaching THE SAME SUBJECT), then compare them to each bad-bad and good-good corner, because it shouldn’t matter ON WHICH STUDENT’S test you were bad.

In other words, the density of points in the good-bad and the bad-good corners LOOK like they are half the density in the bad-bad and good-good corners, which would mean that fewer people appear to be simultaneously effective and ineffective in the same subject and so there might be some validity, but when I count the spots (quickly and carelessly) I get something like 35 people in the bad-bad corner, 37 people in the good-good corner, and 33 people in the good-bad corners. Those numbers are essentially the same for a data set like this. You couldn’t seriously say that the test predicts ANYTHING. That’s what a correlation of 0.24 means in practical terms: who ends up in which corner is very close to random.
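The corner count described here can be sketched in a few lines. This is a rough illustration, assuming scores on a 0-100 scale and corner boxes 20 points wide; the function name `corner_counts` and the sample points are made up and are not the plotted data.

```python
def corner_counts(pairs, width=20):
    """Count points in the corner boxes of a 0-100 by 0-100 scatter plot.

    The two off-diagonal corners (good on one axis, bad on the other)
    are added together into a single "mixed" count, since it should not
    matter on which axis the low score fell.
    """
    def low(score):
        return score <= width

    def high(score):
        return score >= 100 - width

    counts = {"bad-bad": 0, "good-good": 0, "mixed": 0}
    for x, y in pairs:
        if low(x) and low(y):
            counts["bad-bad"] += 1
        elif high(x) and high(y):
            counts["good-good"] += 1
        elif (low(x) and high(y)) or (high(x) and low(y)):
            counts["mixed"] += 1
    return counts

# Made-up points for illustration only.
sample = [(5, 10), (90, 95), (97, 6), (3, 88), (50, 50), (15, 85)]
print(corner_counts(sample))
```

If the scores tracked teaching quality closely, the "mixed" count would be far smaller than either of the other two.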

‘ben’, above, has a good point about how the NYCDOE treated FOIL requests after its agreement with the union. Whether or not one agrees that it’s a pressing concern or the primary concern of the union, it’s clear why both sides could agree that NYCDOE should fight FOIL requests in early years before a lot of data were gathered: because it’s sensible to refrain from publicly indicting individual teachers as “bad” if the numbers don’t actually mean anything (yet).

But NYCDOE caved on this eagerly, which means one of two things:

1) They don’t care if good teachers are harassed into early retirement over misleading numbers, or

2) they are hoping that good teachers will be harassed over misleading numbers because that makes it easier to get rid of teachers.

Either way, students lose.

Article VI of the US Constitution specifies that “no religious Test shall ever be required as a Qualification to any Office or public Trust under the United States.” However, given the amount of faith that Administrators and Legislators need to believe in the legitimacy of teacher ratings based on standardized test scores, I assert that hiring and firing teachers based on this data is nothing more than an unconstitutional form of state-sanctioned paganism.

Gary-

What is the correlation coefficient for the first graph in this post? I notice it’s .24 for the second one, but don’t see it for the first one. Thanks!

SB

I believe it was about .35

Man I love when someone kicks ass with hard data. Those graphs are just impossible to ignore. I love how you handheld through each step…”okay, if you don’t believe /that/, then look at /this/.” Well done.

Great visualization of the data, Gary.

Could I have permission to use your scatter plots in a post at slog.thestranger.com, with proper credit and links of course?

Sure. Feel free to use whatever you like as much as you’d like. Gary

Thanks, Gary. We just passed teacher evaluation legislation here in WA state, that requires that “student growth data must be a substantial factor in evaluating summative performance.”

Here’s a link to my post: http://slog.thestranger.com/slog/archives/2012/03/07/teacher-evaluation-formula-fails-to-evaluate-teachers

Those plots look like about what you would expect if the value added model was actually measuring teacher effectiveness, but with a margin of error of something like 50 to 60%. Which wouldn’t mean that it is useless, but it does mean that it is only useful for the extreme cases: If somebody is scoring in the bottom 5%, say, then it is very likely that they are in the bottom 50% of teachers (and similarly for those with very high scores). For anybody in the middle, the scores tell us nothing at all.

Of course, even if you have 95% confidence that somebody is in the bottom 50% of teachers, only a lunatic would suggest that this should be grounds for automatic firing or whatever. Even the best statistical methods should only ever be a tool used by humans as part of the process.

Also, if subject knowledge is so unimportant, then explain to me how division of fractions works ; )

Math is HARD.

It is actually simple. The question is what is 2/3 of 3/4?

Write 2/3 in the numerator. Write 3/4 in the denominator. It is OK to have a fraction in the numerator and to have a fraction in the denominator.

Remember the goal of division of fractions is to get a 1 in the denominator! When 1 is in the denominator you have a “whole” number.

NOW, to get that 1 in the denominator, it is necessary to multiply the 3/4 x 4/3 which gives 12/12 which is equal to 1.

Since the denominator was multiplied by 4/3 the numerator (complains like a sibling that it wants to be treated the same as the denominator) must be multiplied by 4/3 to keep peace in the fraction family.

Now your numerator looks like 2/3 x 4/3 (DO you recognize the invert and multiply “rule” which is the division of fractions mantra?) .

Now 2/3 of 3/4 is 8/9.

Not quite. You’ve taken two thirds and divided it by three quarters, but two thirds of three quarters is found by multiplying the two fractions: 2/3 of 3/4 = 6/12 (or 1/2)

So, your “how to divide by fractions” part is right but your set-up problem isn’t. (I think most of the problems in this area come because people like to use terms like “divide by half” to mean “divide in half” which is really “divide by two”.)
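The two operations being discussed can be checked with exact fractions. This is just a quick illustration using Python's standard `fractions` module; the variable names are arbitrary.

```python
from fractions import Fraction

two_thirds = Fraction(2, 3)
three_quarters = Fraction(3, 4)

# Dividing 2/3 by 3/4 ("invert and multiply": 2/3 x 4/3).
quotient = two_thirds / three_quarters

# Taking 2/3 OF 3/4, which is multiplication.
product = two_thirds * three_quarters

print(quotient)  # 8/9
print(product)   # 1/2
```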

Seems a report like this could also be interpreted as that there are no “good” grammar school teachers, only that there is a basic teaching job and there is no better or worse way to do it that affects student performance consistently. Given that, charter schools are also no better or worse, so the only issue is the cost.

If one didn’t care about seeming sane, was willing to sacrifice kids’ education to save money, and would stoop to any twisting of the available data, no matter how ridiculous, to support that end, then yes, one could interpret it that way.

So what do you do when the teacher has the data to prove improvement and the administrator ignores the results? I exited 52% of my literacy students the first year I taught literacy because they all met the 40% or higher on the CRTs. The administrator I worked for totally ignored those results and put me on remediation the next year.

The district assessment specialist told me he had never seen such improvement.

Eventually I concluded that it was because I was “too tall” to teach elementary students. By the way, I was told by the previous administrator that I was “too tall” to teach elementary school. The current administrator told me I talked over children’s heads when I differentiated during math class observations.

I believe my conclusion was at least humorous while explaining the foibles of those who are administrators.

Your blog has been extremely valuable. I am an EdD candidate & am writing a paper on VAA. Would you happen to know of other links to VAA data which has been made available?

After seeing your posts, I downloaded the data and ran a few checks myself.

I can replicate your work, and added a few new observations about the behavior of the VA metrics.

See:

http://blog.metasd.com/2012/05/more-nyc-teacher-vam-mysteries/

@Douglas:

The LA Times has its own teacher value added ranking data site. You can get individual teacher data by name, but I’m not sure if they’ve released the full results in tabular form.

The Ohio Dept of Ed also has value added data at the school and district level at least.

If I’ve sent this to a different thread I apologize, but I’m curious to hear your thoughts on how absenteeism has been factored into VAM since from the article linked below, there does not seem to be data of sufficient quality (in most places) to do so, and yet the report indicates absenteeism to be a significant factor. There are links to the report and executive summary in the article.

http://www.nytimes.com/2012/05/17/education/up-to-15-percent-of-students-chronically-skip-school-johns-hopkins-finds.html?_r=2&ref=richardperezpena

The technical report says,

“Absences and suspensions

The analysis data set also includes variables indicating the extent of absences and suspensions for the student. These variables are drawn from the student biographical data set in the pretest year. The pretest year is chosen to avoid endogeneity; students who are absent or suspended more often may grow more slowly regardless of who their teacher is, but particularly effective teachers may also cause their students to be less frequently absent or suspended. The absences variable is equal to the log of days absent if the student was absent for at least one day; it equals zero otherwise. The suspensions variable is equal to the log of days suspended if the student was suspended for at least one day; it equals zero otherwise. In both cases, the log is used to keep outliers with many absences or suspensions from being too influential.”

Hard to say whether the data’s any good, or the use of logs makes sense, without seeing the data. Also, I’d assume that there’s an interaction effect – one disruptive student is manageable; ten is hard – but they haven’t considered that.

Actually, I’d go a bit further and say that I think the use of logarithms is suspect.

If a student is suspended 9x, it seems fair to assume that they are 9x as disruptive as a student suspended 1x.

Absenteeism decreases exposure to teaching, with a response something like 1-absent_days/school_days, which is also not logarithmic.

What base would the log be? Base 10 wouldn’t make sense since then the difference between 1 and 10 days would be equal to the difference between 10 and 100 days. Maybe natural log, with a base of approximately 3 (e is about 2.718), so 1 to 3 would be one level, 3 to 9 would be the next, 9 to 27 would be the next, 27 to 81 would be the next.

Seems like it minimizes excessive absences too much. I could argue for the opposite extreme, exponential, or, at minimum, linear.
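To make the discussion of the log transform concrete, here is a small sketch of the absences variable as the quoted passage describes it: the log of days absent when the student missed at least one day, and zero otherwise. The quoted text does not say which base is used, so the natural log here is an assumption, and the function name `log_days` is made up.

```python
import math

def log_days(days):
    """Log-transformed day count: log(days) if days >= 1, else 0.

    Assumption: natural log. The quoted report text does not state the base.
    """
    return math.log(days) if days >= 1 else 0.0

# Compare raw day counts with their transformed values.
for days in (0, 1, 3, 9, 27, 81):
    print(days, round(log_days(days), 2))
```

Two things stand out. Since log(1) is 0 in any base, a student absent one day gets the same value as a student never absent. And switching bases only rescales the variable by a constant, because log_b(x) = ln(x) / ln(b), so the choice of base matters less than the decision to use a log rather than a linear count in the first place.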

I skimmed the full report, examining some sections in more detail. It is not an exaggeration to say that absenteeism is the as yet invisible 800 ton gorilla in the room. Based on the rather astounding level of absences in the populations most affected by poverty, this may well be the biggest factor impairing student learning, no matter how good the teacher, and preventing the achievement gap from being closed. While NYC seems to attempt to track this to some extent, I would like to know whether, in addition to tracking absences, they track classes cut, early dismissals/departures and late arrivals. My sense is that with this being such a pervasive problem that is so badly measured, its severely negative effects on a teacher’s ability to “add value” to a group of students are not taken into account, and the result is disproportionately unfair to teachers serving in schools where poverty is a major issue, poverty being the single biggest factor correlating with absenteeism. In conclusion, the failure to factor this into any value added model due to the data not existing totally invalidates VAM. Rather than post snips I urge all to skim the report. Finding relevant data/charts takes but a few minutes.
