It’s all about the ‘error rate’ — or so even I used to think.

Nearly everything I’ve read that questions the validity of the value-added metric mentions the astronomical ‘error rates.’ When the NYC Teacher Data Reports were first released, the New York Times website ran an article with some numbers that have been frequently quoted:

But citing both the wide margin of error — on average, a teacher’s math score could be 35 percentage points off, or 53 points on the English exam — as well as the limited sample size — some teachers are being judged on as few as 10 students — city education officials said their confidence in the data varied widely from case to case.

Having thought so much about value-added for my analysis, I recently realized that the above critique actually understates the real problem with value-added error rates. Implied is the possibility that if they could just figure out some way to get those error rates down under some acceptable threshold, then the measure would be much more useful. But I plan to show in this post that it would not matter if they got the error rates down to 0%, because the error rates do not actually mean what most people think they mean.

So what is an ‘error rate’? Well, if the temperature outside is 70 degrees and my thermometer says it is 77 degrees, then my thermometer, at that moment, has a 10% error. If I read the thermometer twenty times when the temperature is 70 degrees and get readings as low as 56 degrees and as high as 84 degrees, we can say that my thermometer is not very accurate, since it has a 20% error rate compared to the ‘true’ temperature. This is what we think of when we hear about error rates: how the measurement compares to the ‘true’ number.
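The arithmetic in the thermometer analogy can be sketched in a few lines (all numbers here are the ones from the example above):

```python
# Illustrative sketch of the thermometer analogy above.
# Percent error compares a reading to a known 'true' value.

def percent_error(reading, true_value):
    """Absolute error as a percentage of the true value."""
    return abs(reading - true_value) / true_value * 100

true_temp = 70.0

# A single reading of 77 degrees is off by 10%.
print(round(percent_error(77, true_temp), 1))  # 10.0

# Among twenty noisy readings ranging from 56 to 84 degrees,
# the worst-case readings are each 20% off the true temperature.
print(round(percent_error(56, true_temp), 1))  # 20.0
print(round(percent_error(84, true_temp), 1))  # 20.0
```

The key point, developed below, is that this calculation is only possible because `true_temp` is known from some independent, trusted source.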

In the case of the temperature, the ‘true’ temperature that my readings are compared to is measured by some kind of very expensive and very accurate thermometer. Without that other thermometer that has the ‘true’ temperature, there is no way to measure the accuracy of my thermometer.

So when we hear that the value-added metric has a 35 percent error rate for a particular teacher, and that the teacher scores at the 40th percentile, we think this means that the teacher’s ‘true’ quality is somewhere between the 5th percentile and the 75th percentile. There is no way the teacher’s ‘true’ quality can be lower than the 5th percentile or higher than the 75th percentile — otherwise the error rate for that teacher would not be 35 percent.
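To make that reading of the numbers concrete, here is a minimal sketch of the band a reader would naturally infer from a 40th-percentile score with a 35-point margin (the clipping to the 0–100 percentile scale is my own addition, not part of any official model):

```python
# Sketch of how readers naturally interpret the reported margin of error:
# a reported percentile plus/minus the margin, kept on the 0-100 scale.

def implied_band(percentile, margin):
    """Band a reader would infer, clipped to the 0-100 percentile scale."""
    low = max(0, percentile - margin)
    high = min(100, percentile + margin)
    return low, high

print(implied_band(40, 35))  # (5, 75)
```

The rest of the post argues that this natural interpretation — a band around the teacher’s ‘true’ quality — is not what the reported error rate actually delivers.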

This makes the very reasonable assumption that these ‘error rates’ are defined by how the value-added measure compares to the ‘true’ measure of teacher quality. But since there is no equivalent, in teaching, of the super accurate thermometer that measures the ‘true’ quality, how can value-added possibly be compared to it?

It can’t. The error rates are more meaningless than I had realized. They don’t compare the value-added number to the ‘true’ teacher quality number — they can’t. Instead, all that the error rate measures is how the value-added number for that teacher compares to what it would be if we re-calculated it with about fifty times as much data. That’s it. With more data the error rates go down, so that with fifty years of data the error rate would be pretty close to zero, and then we could say, definitively, that this teacher is in the 40th percentile as a ‘value-adder.’ But that is not the same thing as saying that the teacher is in the 40th percentile in her ‘true’ teacher quality.
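A small simulation illustrates the distinction. This is a sketch of the statistical point, not any official value-added model, and the gap between the ‘true quality’ and the ‘value-added target’ is invented purely for illustration: with more data, the estimate converges to *something* stable, but if that something differs from ‘true’ quality, more data never closes the gap.

```python
# Sketch (invented numbers, not any official model): averaging more noisy
# observations shrinks the error bars around the metric's own target,
# but a gap between that target and 'true' quality never shrinks.
import random

random.seed(0)

true_quality = 50.0          # hypothetical 'true' teacher quality
value_added_target = 40.0    # what the metric actually converges to
noise = 35.0                 # year-to-year noise in the metric

def estimate(n_students):
    """Average n noisy observations of the value-added signal."""
    draws = [random.gauss(value_added_target, noise) for _ in range(n_students)]
    return sum(draws) / n_students

for n in (10, 500, 50_000):
    est = estimate(n)
    print(n, round(est, 1), "gap from 'true' quality:", round(abs(est - true_quality), 1))
```

As `n` grows, the printed estimates settle tightly around 40 — the ‘error rate’ goes to zero — while the gap from the hypothetical ‘true’ quality of 50 remains.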

Now that is not to say that this more accurate ‘value-adder’ percentile would be completely useless — but it still would not deserve to count as a large portion of a teacher’s evaluation.

My point is that most people who hear about these ‘error rates’ assume they are based on comparing the number to the teacher’s ‘true’ quality. Even I’ve written things in the past like “The 30% error rate means that 30% of the time an effective teacher will be rated ineffective by this measure and an ineffective teacher will be rated effective.” Now I realize that this was too generous. It would have been more accurate to write “30% of the time an effective ‘value-adder’ will be rated as an ineffective ‘value-adder’ and vice-versa.” Until the ‘true’ quality of a teacher can be measured accurately with some other method, we’ll never be able to say anything more definitive than that about value-added.

I was going to comment something to this effect on your previous value-added posts. In the language of statistics, it appears that the scores have neither precision nor accuracy, which makes them doubly invalid.

How do we know that the scores are not accurate? Perhaps they converge in probability to the true value, or maybe they are even unbiased estimators of the true value.
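The precision-versus-accuracy distinction raised in these comments can be sketched numerically. In this invented example, a thermometer is precise (its readings cluster tightly) yet inaccurate (systematically 7 degrees high), and no amount of repeated measurement reveals the bias unless the true value is independently known:

```python
# Sketch of precision vs. accuracy (all numbers invented for illustration).
# A biased thermometer reads consistently (precise) but always about
# 7 degrees high (inaccurate); repeated readings alone cannot expose this.
import random

random.seed(42)

TRUE_TEMP = 70.0
BIAS = 7.0

readings = [TRUE_TEMP + BIAS + random.gauss(0, 0.5) for _ in range(1000)]

mean = sum(readings) / len(readings)
spread = (sum((r - mean) ** 2 for r in readings) / len(readings)) ** 0.5

print("spread (precision):", round(spread, 2))         # small: looks reliable
print("bias (accuracy):", round(mean - TRUE_TEMP, 2))  # invisible without TRUE_TEMP
```

This is exactly why consistency of the scores across re-sampling says nothing about whether they are unbiased estimators of ‘true’ quality.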

True. But until we are sure that they are, why would we implement them, especially if, as I suspect, they actually cause teachers to teach worse and students to learn less?

I don’t think that we can be absolutely sure about consistency without, literally, decades’ worth of information. To an outsider, it certainly seems like a reasonable assumption to make in this model.

Your blog is absolutely fantastic. The posts are always so well written and thought provoking.

In the old world of statistics, this was called reliability and validity. I think the corporate types who use these metrics and data have forgotten about them for their own convenience. They do not allow independent review of the data in any meaningful way and run public school systems like corporations. They purposely change the measuring sticks every few years so they can claim distorted success. It is like keeping the stock price artificially inflated. By the time their crimes are realized by the public, it will be too late; the damage will have been done.

I do not think analysis and statistics can solve this problem. You should see these statistics:

http://www.badongo.com/file/26678681

Gary, you and I both found that the ‘temporal stability’ of the NYC value-added scores was extremely low, with r^2 values of about 0.05 to 0.09 (thus, correlation coefficients of about 0.22 to 0.3). That’s what most researchers are finding for this sort of thing in other states as well.
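For anyone checking the conversion in the comment above, the quoted correlation coefficients are simply the square roots of the r^2 values:

```python
# Converting the r^2 values quoted above to correlation coefficients.
import math

for r_squared in (0.05, 0.09):
    print(r_squared, "->", round(math.sqrt(r_squared), 2))
# 0.05 -> 0.22
# 0.09 -> 0.3
```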

But not in Los Angeles.

Apparently, the correlation coefficients there start at 0.6 and go all the way up to 0.96.

I just don’t believe it.

Any thoughts?

These tests are a scam. I feel sorry for NYC teachers right now. It is a rigged game. The media says that education is all dependent on having a good teacher. So what do they do? They increase class sizes, cut the budgets, and then they want to find the “bad” teachers. Guess what: most beginning teachers are bad teachers. It takes around 10 years to really get into the groove. Also, a teacher can only control maybe 10% of how well a student does on a standardized test, if they even take it seriously. Teaching as a career is really going down the drain. We are letting billionaires like Bill Gates, who know nothing about education, design teaching reform because they are rich. Good luck fighting this garbage. We are getting this in Chicago too. Hopefully this whole movement will blow up soon.

Thanks, Steve. Mind if I repost your comments on my blog? You raise some excellent points.

BTW I do see your long comment.

Guy, go ahead and use my post as you see fit. But looking back at what I wrote two years ago, I would clarify this part a bit more:

“Any educator would have had very little difficulty coming up with several internal school factors that influence students and which could be used in a more rigorous study: differences in class size; disparities in facilities allocated to staff members; disparities in resources made available to teachers; school-wide and grade-level intervention programs adopted by faculty/staff; team-teaching efforts between instructors.”

Value added models usually ignore school-level effects, and Buddin’s LA Times study did exactly that. But, what I called school-level effects may not necessarily fall into that category. School-wide reading and intervention programs would certainly fit the bill, but would a team-teaching situation count? Funding disparities between two similar schools would count, but would mismanagement of funds within a school (e.g. disproportionate funding going to some departments or small learning communities within a school) be classified as a school effect? We can come up with a dozen of these questions, I’m sure. But, the people who create value added models gloss over all of this simply because it is HARD to quantify all of these considerations and it would muddy their models even more (and expose them to even more criticism).

Gary, thanks for reposting my entry. I know that it was overly long, but I think it can be useful to a few like-minded people out there.