Last year I spent a lot of time making scatter plots of the released New York City teacher data reports to demonstrate how unreliable value-added measurements are. Over a series of six posts, which you can read here, I showed that the same teacher can get completely different value-added rankings in two consecutive years, in the same year with two different subjects, and even in the same year with the same subject but in two different grades.

Here is an example of such a scatter plot, this one showing the ‘raw’ score (for value-added, this is a number between -1 and +1) for the same teachers in two consecutive years. Notice how it looks like someone fired a blue paintball at the middle of the screen. This is known, mathematically, as a ‘weak correlation.’ If the value-added scores were truly stable from one year to the next, you would see the points cluster along an upward-sloping line from the bottom left to the top right.

There is actually a slight correlation in this blob. I’d expect this, as some of the biases in these kinds of calculations will hurt or help the same teachers in the same schools in two consecutive years. But the correlation is so low that I, and many others who have created similar graphs, concluded that this kind of measurement is far from ready to be used for high-stakes purposes like determining salaries or laying off senior teachers who score below average on this metric.

Teacher evaluations are a hot topic right now, as Race To The Top required ‘winners’ to implement evaluations that incorporate ‘student learning’ as a significant component. Though value-added is not the same thing as student learning, most states have taken it to mean that anyway. In some places, value-added now counts for as much as 50% of a teacher’s evaluation. A big question is what the appropriate weight should be for such an unreliable measure. In D.C. it was originally 50%, but it has been scaled back to 35%. I’d say that it should currently be pretty close to 0%.

Bill Gates has spent $50 million on a three-year project known as the MET (Measures of Effective Teaching) project. They just concluded the study and released a final report, which can be found here. In the final report they conclude that teacher evaluations have an ideal weighting of 33% value-added, 33% principal observations, and 33% student surveys. They justify the 33% value-added because they have analyzed the data and found, contrary to everyone else’s analysis of similar data, that teachers DO have similar value-added scores from one year to the next. To prove their point, they print on page 8 this very compelling set of graphs.

These graphs are scatter plots comparing the ‘predicted achievement’ to the ‘actual achievement’ for the teachers in the study. This ‘predicted achievement’ is, presumably, based on the score that the teacher got the previous year. As these points line up pretty nicely on the slanted line, they conclude that value-added is actually very stable.

Well, there were a lot more than twenty teachers in the study. The reason that there are twenty dots on these graphs is that they averaged the predicted and actual scores within twenty groups of five percentiles each. In doing this, they mask a lot of the variability that happens. They don’t let us see the kind of scatter plot with thousands of points like the one I presented above.

To test how much this averaging masks the unreliability of the metric, I took the exact same data that I used to create the ‘paintball’ graph at the top of this post and averaged it the same way. Here’s what that data looks like when I do that.

Since even a ‘paintball’ produces such a nice line when subjected to this kind of averaging, we can safely assume that the Gates data, if we were to see it in its un-averaged form, would be just as volatile as my first graph.
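A quick simulation makes the point concrete. The numbers below are synthetic, not the actual NYC or MET data, but they show how averaging a weakly correlated blob into twenty five-percentile groups manufactures a crisp line:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic "paintball": year-2 scores share only a weak (r ~ 0.3)
# relationship with year-1 scores.
year1 = rng.normal(0.0, 0.3, n)
year2 = 0.3 * year1 + rng.normal(0.0, 0.3 * np.sqrt(1 - 0.3**2), n)
r_raw = np.corrcoef(year1, year2)[0, 1]

# Average within twenty groups of five percentiles each, as the MET
# graphs do: sort by year-1 score, split into 20 equal groups, average.
order = np.argsort(year1)
groups = np.array_split(order, 20)
mean1 = np.array([year1[g].mean() for g in groups])
mean2 = np.array([year2[g].mean() for g in groups])
r_binned = np.corrcoef(mean1, mean2)[0, 1]

print(f"raw r:    {r_raw:.2f}")    # a weak blob
print(f"binned r: {r_binned:.2f}")  # nearly a straight line
```

The group averages wash out almost all of the within-group noise, so even data with a correlation near 0.3 produces twenty dots that line up almost perfectly.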

It seems like the point of this ‘research’ is to simply ‘prove’ that Gates was right about what he expected to be true. He hired some pretty famous economists, people who certainly know enough about math to know that their conclusions are invalid.

Wow! Great work, Gary.

Excellent post, Gary. I don’t know how you could question the validity of the Gates-supported work in a more coherent fashion.

Now, you should do a comparison between Rhee’s StudentsFirst announcement (re: Louisiana scores highest in education reform, etc.) and your post about how some of NY’s worst-performing schools received praise for having the “best results”.

Do they explain why they used standard deviations for their units on the axes? The entire range shown is between +/- 0.2 SD. In terms of robust change, that’s TERRIBLE. The data points are all clustered around zero actual achievement. It would be easier and as correct to simply predict no change at all.

Pingback: A Beginning List Of The Best Posts On Gates’ Final MET “Effective Teaching” Report | Larry Ferlazzo’s Websites of the Day…

Thank you. I assumed that each point from the MET study graph was actually an average of 5% of the teachers, which doesn’t mean the same thing as the wording in the study – “representing” 5% of the teachers. Very misleading.

Of course the sword cuts both ways. When you plot averages, it greatly reduces the range of scores. An equally invalid conclusion would be that the plot shows that teachers are pretty much all the same so we don’t even need evaluations. Somehow I doubt that suggestion came up in the report.

Fantastic breakdown!

The report also asserts that observation scores show evidence of principals inflating them. The Gates Foundation draws this conclusion based on disparities between observation scores and VAM scores. They assume, of course, that VAM is infallible and sacrosanct.

The TN Dept of Ed–an ed reform-loving group if there ever was one–drew the same conclusion in their review of the TEAM evaluation system they published last summer:

“In many cases, evaluators are telling teachers they exceed expectations in their observation feedback when in fact student outcomes paint a very different picture. This behavior skirts managerial responsibility and ensures that districts fail to align professional development for teachers in a way that focuses on the greatest areas of need” (p. 32).

“This disparity between student results and observations signifies an unequal application of the evaluation system throughout the state” (p. 32).

Page 33 mentions that Tennessee “leads the nation in available data on teacher performance and effectiveness” and that it possesses a “tremendous amount of student outcome data received through TVAAS.”

However, there’s no mention anywhere that the disparity maybe, just maybe, results from a lack of reliability and validity in TVAAS-based data!

Although the report focuses on the disparity between the Level 1 scores, the chart on page 32 clearly indicates disparities at Level 4 and Level 5 as well:

Level 4: TVAAS 11.9%, Observation 53%

Level 5: TVAAS 31.9%, Observation 23.2%

If the issue is actually observation score inflation, why then does the department not cite Level 4, the level with the largest disparity? Or if the source of error is the observation, not the TVAAS, why not argue that more teachers should’ve received a 5 for their observation score?

A VERY important footnote that will be overlooked by the media and policymakers:

“Different student assessments, observation protocols, and student survey instruments would likely yield somewhat different amounts of reliability and accuracy. Moreover, measures used for evaluation may produce different results than seen in the MET project, which attached no stakes to the measures it administered in the classrooms of its volunteer teachers.”

And, perhaps pointing out the obvious.

All of this data is based on multiple-choice tests. Do we really believe that there should be such an emphasis on this? It is putting a lot of faith in just one measure. I realize that Gates is saying test scores, observations, and surveys are best: “multiple measures.”

But there are many things teachers do that you just can’t measure (e.g., teaching and modeling patience, persistence, empathy, civic engagement, communication, etc.). How do we take that into account? I suppose Gates would say the student surveys, but oftentimes students return to me years later and say, “Now I understand what you were trying to teach us.”

But if the ed policy folks believe our nation needs to focus more on test scores, I suppose we’ll adjust. But I don’t think we want our students to be like students from other great test-taking countries like South Korea. We want our students to be Americans, and our ed policies should follow that. I find it odd that high-scoring nations are trying to reform their education systems to look like ours while we’re getting the U.S. to look more like theirs!

And to clarify “Now I understand what you were trying to teach us”: I was referring to the responsibilities I have beyond academic content, like teaching collaboration, communication, etc.

I believe I’m responsible for both academic content learning, and some of the other life skills I mention above. In fact, this is a mandate from my school. Academic content is only two of the six main goals our school is trying to achieve. The other four deal with life skills (that are hard to measure in my opinion).

It seems that hiding data is common among those with political agendas. Let’s see the data that went into the Gates’ graphs and the AGW graphs.

Pingback: Remainders: An ed committee rises out of new Senate rules | GothamSchools

Gary,

Hope you saw this:

http://www.reddit.com/r/IAmA/comments/166yeo/iama_blogger_for_fivethirtyeight_at_the_new_york/

Nate Silver, the statistics guru, minimally chimed in about using test scores to evaluate educators.

Pingback: Remainders: School closure foes borrow from the civil rights era | GothamSchools

Pingback: Review [UPDATED]: “How to Evaluate and Retain Effective Teachers” (League of Women Voters of SC) | the becoming radical

Great work!

I guess you can massage statistics to show whatever you want.

http://ed2worlds.blogspot.com/2013/01/gates-foundation-wastes-more-money.html

This professor has also reached the same conclusion.

Pingback: In which I read research articles with interesting contrasts « Learning to Fold

Great work- thank you for posting!

You’ve committed a classic graphing error in using a scatterplot for dense data. The size of your blue dots hides any information about the relative weight of the fringe dots on the blob of blue. The fact that your five-percentile averages so clearly show a linear correlation demonstrates this. A more appropriate plot would be a heat map.

IMO the scatterplot is just as misleading. Yes, there is some variance that you can’t detect from the 5% increments, but on the scatterplot you can’t see where points overlap. You would need a scatterplot that gets darker where points pile up: a spot with many overlapping dots would be a much darker blue than the outliers. Otherwise you are led to think the variance is greater than it actually is, since the edges of the scatterplot look as crowded as the center, suggesting the variance is evenly distributed from top to bottom (except for the outliers). There is no way to tell from this kind of scatterplot whether the majority of the points actually sits in the middle or not. There are obviously plenty of cases where low-rated teachers score higher the next year, but they are a minority of the total. Does it show that there is still a lot of room for improvement? Certainly.
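The darkness the commenter describes is exactly what a heat map encodes: a count of points per grid cell. A small sketch with made-up data (not the actual scores) shows how much density a flat scatterplot hides:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(0.0, 0.3, n)
y = 0.3 * x + rng.normal(0.0, 0.29, n)

# Count points per cell on a 25x25 grid; these counts are what a heat map
# (or an opacity-tuned scatter) would render as darkness.
counts, _, _ = np.histogram2d(x, y, bins=25)

# A flat scatterplot draws one solid dot whether a cell holds 1 point or
# 100, so the dense center and the sparse fringe look equally crowded.
center = counts[10:15, 10:15].mean()
fringe = (counts.sum() - counts[5:20, 5:20].sum()) / (25 * 25 - 15 * 15)
print(f"mean points per center cell: {center:.0f}")
print(f"mean points per fringe cell: {fringe:.2f}")
```

With matplotlib, the same idea is `hexbin(x, y)` or `scatter(x, y, alpha=0.05)` instead of a solid-dot scatter.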

I’m a bit confused as to why you are using scatterplots with overlapping points. 8 months ago I pointed out that such plots are inappropriate, and used your prior blog graphs on this topic as examples of what not to do.

http://www.chrisstucchio.com/blog/2012/dont_use_scatterplots.html

You commented on my blog post, and I responded to your question as to how to make graphs which don’t obfuscate the data.

So why more bad graphs? I said it 8 months ago and I’ll say it again – don’t use scatterplots. If you insist on scatterplots for dense data sets, at least tune the opacity to make sure the density is displayed.

Hi Chris,

The point here is that clustering makes the correlation seem higher than it is, which I think you’d agree with. It is also true that a scatter plot with a lot of overlapping points could make the correlation seem lower than it is. Unlike the graphs from those previous posts, where all the points were lattice points, in this case there is less overlap, so it is not as bad as those others. Either way, the correlation coefficient doesn’t get fooled by overlap, and the r value was pretty low for this one: I think around 0.3.
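The claim that r doesn’t get fooled by overlap is easy to verify: stacking identical points changes how the picture looks but not the coefficient. A minimal check with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.3 * x + rng.normal(size=500)
r = np.corrcoef(x, y)[0, 1]

# Triple every point: the scatterplot is pixel-for-pixel identical
# (the overlap is invisible), yet Pearson's r is unchanged, because
# duplicating all points leaves every mean and (co)variance ratio intact.
x3, y3 = np.tile(x, 3), np.tile(y, 3)
r3 = np.corrcoef(x3, y3)[0, 1]
print(f"r original: {r:.6f}")
print(f"r tripled:  {r3:.6f}")
```

So overplotting is purely a visualization problem: it can mislead the eye in either direction, but the computed correlation is unaffected.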

There is obviously some correlation or the clustering would not produce such co-linear points.

Gary

As a practitioner who works with regressions a lot, I’d say the correlation in your top scatter chart is massive (especially for a noisy estimate like small-sample multiple-choice tests), and your percentile aggregation makes it even clearer.

The r value of 0.3 is very large; just have a look at the p-stats of the x-variable. I will eat a hat if it is not way above 2 or even 3.

I meant t-stats above 2 or 3 and p-values below 5% or even 1%…
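For reference, in a simple linear regression the slope’s t-statistic follows from the sample correlation r and sample size n via t = r * sqrt(n - 2) / sqrt(1 - r^2). The sample sizes below are illustrative, since the actual n for the NYC data isn’t stated here:

```python
import math

def t_stat(r: float, n: int) -> float:
    """t-statistic of the slope in a simple linear regression,
    derived from the sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Even a "weak" r of 0.3 is overwhelmingly significant once n is in the
# thousands: significance detects that *some* relationship exists, which
# is a different question from whether the relationship is strong.
print(f"t at r=0.3, n=5000: {t_stat(0.3, 5000):.1f}")
print(f"t at r=0.3, n=100:  {t_stat(0.3, 100):.1f}")
```

This is why both sides of the thread can be right: the correlation is statistically unmistakable, yet r = 0.3 still leaves most of the year-to-year variation unexplained.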

Marton, email me at my yahoo account myfirstandlast@yahoo.com and I’ll email you the data for that scatter plot, and you can do some analysis on it. I won’t be surprised if there is more correlation than it seems. It could be explained by biases in the formula. Just because it has some consistency doesn’t mean it actually measures teacher quality. I’d be interested, though, in what would be considered a ‘good’ correlation for this many data points.

This seems to be the crux of the matter. It’s fascinating that the same data visualization can mean such different things to the trained eye vs. the untrained eye (e.g. mine). For what it’s worth Gary, you had me pretty convinced with your series on VAM (regarding both its inaccuracy and its imprecision), so I’d like to see where this conversation goes.

I don’t disagree with the main point. I’m just being a math/visualization geek here. The problem with this graph isn’t lattice points, it’s visual overlap – one of the many pitfalls of scatterplots, unfortunately.

I don’t agree that 0.3 is a bad correlation at all. Given the amount of noise and the small sample size (20-35 kids per sample), I’d be suspicious to see anything > 0.5.

I don’t know if we have enough data for it, but a wonderful thing to see would be repeat data across multiple classes/years for the same teacher. Would be nice to see if individual teacher scores cluster (i.e., are simply drawn from a somewhat noisy distribution) or are actually just wildly varied.

A friend who began teaching in the LA public schools made an interesting point a few months ago about pay for performance and teaching. She was a new teacher so she didn’t have a lot invested in the “system”. What she noted was on standardized tests that many of her students would simply fill in the answer sheet as quickly as possible without looking at the questions simply to be done with the test. This kind of phenomenon has got to play hell with data.

Pingback: The 50 million dollar lie | Gary Rubinstein’s Blog | fozbaca's WordPress

Pingback: “A Danger to National Security: Mainstream Education Reporting” | Diane Ravitch's blog