$50 million. 3 years. No clue.

In my last post, I showed that the final report from the Gates Foundation’s MET project contains a very misleading graph.  Though the implication of this graph (namely, that value-added measures are consistent from one year to the next) was not the only point of this study, I called it THE $50 million lie because it is the thing that will be used by ‘reformers’ for years to come as ‘proof’ that test scores should be a significant factor in teacher evaluations.

Here is the cover of the report.  Note the hip white male teacher and the black student with both a hoodie (worn with the hood up in class!) and glasses (note to the sensitive reader:  I’m not saying that no black kids who wear hoodies also wear glasses, nor that no black kids with glasses also wear hoodies.  Also, I’m aware that anyone can need glasses, which has nothing to do with intellectual ability or academic motivation.  I just found this picture to contain a lot of ‘subtext’).  Note also that the effective ‘teaching’ is actually ‘blended’ learning, as the kid learns at his own pace with the aid of a very old computer with an old-school giant monitor.

There are an infinite number of ways to make charts and graphs of different results obtained.  The choice of which ones to include and which ones to omit can reveal what the authors THINK the data proves.  The authors also describe their conclusions, which are supported by the charts and graphs they chose to make.  As the readers don’t get all the objective raw data, it is easy to be swayed into believing the authors’ interpretations of the numbers.

In this post I intend to show that the few numbers they present in this paper can just as easily be interpreted to contradict the conclusions the authors make.  In this way, it seems to me, this costly undertaking will have served little purpose.  There will be a few things (mostly misleading) that will be quoted by ‘reformers’ and also plenty that I hope, based on this post, can be used by ‘anti-reformers’ when this study is used as evidence by ‘reformers’ in any sort of debate.

One thing that isn’t mentioned in the paper, but is very evident from the graphs I showed in the previous post, is that according to test score gains, the vast majority of teachers are statistically ‘equivalent.’  It seems like 90% of the teachers are within .05 standard deviations of the mean.  They don’t say how much extra learning the .05 is, but they do say that .25 is equivalent to one year.  So I’d say (if I had to, that is; I don’t think ‘learning’ is measured in time units) that .05 would amount to a few weeks of learning.  Where are the mythical teachers who get a ‘year and a half of growth’ and the even rarer ‘effective triplets’ who, if you happen to get three of them in a row, can erase the built-up achievement gap?  An implication of this is that finding the ideal weighting for components of teacher evaluation will not make a significant difference in student achievement.
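The conversion from standard deviations to time can be sketched in a few lines.  The .25-per-year figure is the report’s; the 36-week school year is an assumption added here for illustration and does not come from the report.

```python
# Rough conversion from test-score standard deviations to "weeks of learning".
SD_PER_YEAR = 0.25          # from the report: .25 SD is treated as one year
WEEKS_PER_SCHOOL_YEAR = 36  # assumption for illustration, not from the report


def sd_to_weeks(effect_sd, sd_per_year=SD_PER_YEAR,
                weeks_per_year=WEEKS_PER_SCHOOL_YEAR):
    """Convert an effect size in standard deviations to approximate weeks."""
    return effect_sd / sd_per_year * weeks_per_year


print(round(sd_to_weeks(0.05), 1))  # one fifth of the assumed school year
```

Under these assumptions, .05 standard deviations is a fifth of a year, which is the ‘few weeks’ scale described above rather than anything close to a year and a half.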

The main conclusion of this study is that when it comes to teacher evaluations, the best policy is to base them on ‘multiple measures’ and that observations are not as good a tool for measuring teacher quality as test score gains are.  Though they don’t commit to an exact breakdown of how the measures should be weighted, they suggest that 33% to 50% could be based on standardized test scores while the rest is based on equal parts of classroom observations and student surveys.  This 33% to 50% figure is no different from what the primary author, Thomas Kane, one of the leading figures in value-added, has been saying for years.

Though science does sometimes prove things that are not intuitive, science does depend on accurate premises.  So, in this case, IF the conclusion is that “you can’t believe your eyes” in teacher evaluation (just because you watch a teacher doing a great job, this could be a mirage, since that teacher doesn’t necessarily get the same ‘gains’ as the other teacher that you thought was terrible based on your observation), well, it could also mean that one of the initial premises was incorrect.  To me, the faulty premise behind this counter-intuitive conclusion is value-added itself, which says that teacher quality can be determined by comparing student test scores to what a computer would predict those same students would have gotten with an ‘average’ teacher.  Would we accept it if a new computer programmed to evaluate music told us that The Beatles’ ‘Yesterday’ is a bad song?

One thing that struck me right away with this report is that student surveys, something that realistically will never be a significant part of high-stakes teacher evaluations, are given such a large percentage in each of the three main weightings they consider (these three scenarios are, for test scores-classroom observations-student surveys, 50-25-25, 33-33-33, and 25-50-25).

Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations.  As this is what many districts currently do and since this report is supposed to guide those who are designing new systems, wouldn’t it be scientifically necessary to include the existing system as the ‘control’ group?  As implementing a change is a costly and difficult process, shouldn’t we know what we could expect to gain over the already existing system?

This large inclusion of student surveys in each scenario is not very good scientific practice since it adds an unnecessary (for practical purposes) variable into the mix and makes it that much more difficult to make decisive claims about what the weighted averages mean (perhaps this was something they intended).  This study would have been much more useful if they had focused on value-added and classroom observations.  And by including the student surveys so heavily, they won’t be able to guide policy makers who don’t have student surveys as an option.  If a district wants to do the 33-33-33 model, but student surveys aren’t permissible, then since they can’t just do 33-33 for test scores and observations, they would make it 50-50.  Likewise, if they wanted to do the 50-25-25 split, but couldn’t use student surveys, it would require a 67-33 split to keep test scores double the weight of observations.
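The rescaling described above, dropping surveys while keeping the ratio between test scores and observations, can be sketched as follows.  The function and dictionary names are made up for this illustration; only the three weighting scenarios come from the report.

```python
def drop_and_renormalize(weights, dropped):
    """Remove one component and rescale the rest to sum to 100,
    keeping the remaining components in the same ratio to each other."""
    kept = {name: w for name, w in weights.items() if name != dropped}
    total = sum(kept.values())
    return {name: round(100 * w / total) for name, w in kept.items()}


# The three scenarios from the report: test scores, observations, surveys.
scenarios = {
    "50-25-25": {"test_scores": 50, "observations": 25, "surveys": 25},
    "33-33-33": {"test_scores": 33, "observations": 33, "surveys": 33},
    "25-50-25": {"test_scores": 25, "observations": 50, "surveys": 25},
}

for label, weights in scenarios.items():
    print(label, "->", drop_and_renormalize(weights, "surveys"))
```

Without surveys, the three scenarios collapse to 67-33, 50-50, and 33-67, so a district in that position is really choosing among much heavier test-score weights than the headline numbers suggest.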

For many reasons, I don’t think that student surveys will ever be a part of teacher evaluation.  I know that the proposed surveys are not just the students ranking the teacher on a scale of 1 to 10, which would not be very accurate and might cause teachers to game the system at the expense of student learning (like giving everyone high grades).  Instead they ask a series of enigmatic questions like “Does your teacher spend a lot of time doing test prep?” or “Does your teacher care about how much each student in the class learns?” and the teacher’s quality is pieced together from the answers to these questions.

Now I’m a teacher and I like to think that my students think that I’m doing a good job.  Certainly there are students who like me more than other students do.  But I don’t know if I want to stake my job on how my students interpret what I do in class.  Maybe I see that my class is looking lethargic and I spend a minute telling them I saw a good movie over the weekend, as something to wake them up.  Now they get a survey at the end of the year with the question “Does your teacher often stray from the subject matter?” and though I only did it a bit, and for a specific reason, this could have stuck out in their minds and suddenly I’m losing my job or getting a pay cut.  On ratemyteacher.com, I have an average of 4 out of 5 stars, but I have seen some bizarre comments.  A recent rater gave me two stars and wrote “he seems to be popular…although i’m not sure why. he gives zero notes, and if you are not good at teaching yourself you will be screwed for his tests (which are mostly problems similar to ones we do).”  I have good ones too, like “Really really great teacher. He truly cares about his students and help them do their best. Offers help even if the student doesnt want it haha. Gives fair tests, and is a very encouraging teacher!”  On the other hand, a teacher who retired a few years ago, who I felt was really ‘phoning it in’ at the end, has five stars, including comments like “ideal teacher: never checks hw, doesn’t yell, doesn’t get upset, you can eat in class, you can sleep in class, you can play your PSP or listen to music and he doesn’t notice =)” and “Best teacher ever… He understands his students and handles everything very maturely. You can do ANYTHING in his class. ANYTHING. You can answer 1/8 ?s on the test and he’ll give you a 65. Not too clear but so nice and cool.”  Though I do understand that students observe a teacher for hundreds of hours and therefore do have insight into the teacher’s quality, I’m not convinced that the information we get from these student surveys adds much to what can be obtained by a competent principal.

Some of the results of the study are prominently displayed in figure 4 on page 12 of the report.

It took me a while to process what this graph is supposed to tell us, but I’ll attempt to explain this thoroughly here.  I’m hoping this will help people discuss this study from an informed perspective.

This graph shows the correlation between the middle school language arts teachers’ ratings under four weighting systems when compared to three other measurements.  When two things are highly correlated, they have a ‘correlation coefficient’ close to 1.  When correlation coefficients are low and you make a scatter plot of the data, it looks like a bunch of random points.  Though it depends on the scenario, anything less than .4 is considered to be pretty weakly correlated.
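To get a feel for what these coefficients mean, here is a small simulation sketch.  It uses made-up random data, not anything from the MET study, and the helper names are hypothetical.  It draws pairs of numbers with a chosen true correlation and then computes the sample correlation coefficient and its square, which is the share of the variation in one measure that the other accounts for.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_pairs(rho, n, rng):
    """Draw n (x, y) pairs of standard normals whose true correlation is rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * noise)
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
for rho in (0.9, 0.4, 0.1):
    xs, ys = simulate_pairs(rho, 5000, rng)
    r = pearson_r(xs, ys)
    print(f"true correlation {rho:.1f}: sample r = {r:.2f}, "
          f"variance explained = {r * r:.0%}")
```

Because the explained share is the square of the coefficient, a correlation of .4 accounts for only about 16% of the variation, which is why a scatter plot at that level still looks mostly like a cloud.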

The yellow bars (system 2) are for when value-added is 50%, observations are 25%, and student surveys are 25%.  The green bars (system 3) are for when each is weighted 33%.  The blue bars (system 4) are for when observations are 50% and value-added and student surveys are 25% each.  The red bars (system 1) are for the weighting that would have produced the highest correlation with value-added.  This one is hypothetical and calculated after the fact, to see what weighting would have given the most accurate predictor of state test gains.  In this case it would have been 81% value-added, 2% observations, and 17% student surveys.
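As a sketch of how much the choice of weighting can move a single teacher’s rating, the snippet below applies the four sets of weights to one hypothetical teacher.  The weights are the ones listed above; the teacher’s three component scores are invented, and treating the composite as a simple weighted average of standardized scores is an assumption here, not a description of the report’s exact method.

```python
# Weights (percent) for value-added, observations, student surveys.
SYSTEMS = {
    "system 1 (red)": (81, 2, 17),
    "system 2 (yellow)": (50, 25, 25),
    "system 3 (green)": (33, 33, 33),
    "system 4 (blue)": (25, 50, 25),
}


def composite(scores, weights):
    """Weighted average of component scores; weights are rescaled to sum to 1."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical teacher on a standardized scale where 0 is average:
# weak value-added, strong observations, decent surveys.  Made-up numbers.
teacher = (-0.5, 1.0, 0.5)

for name, weights in SYSTEMS.items():
    print(f"{name}: {composite(teacher, weights):+.2f}")
```

With these invented scores the same teacher lands below average under the mostly value-added weighting and above average under the other three, which is the kind of swing that makes the choice of weights matter for an individual.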

The first bar graph shows that an evaluation that uses nearly all value-added correlates best with state test score gains.  This does look compelling, but this would imply that if the teacher is evaluated with primarily value-added for two straight years, she should get around the same score, which contradicts what we see in the third set of bars, where this system is, by far, the least reliable.  Most interesting, however, is the second set of bars.  The fact that the red bar is the shortest of the four means that of the four systems, the one that is nearly all value-added correlates the least with other tests of higher-order thinking.  Now the other three systems are equally predictive of scores on these more difficult tests, and as all the numbers are under .4, it seems that none of these systems are very predictive of scores on these other tests which, we can assume, are what the common core tests are supposed to be like.

Throughout the report the authors describe the value-added as ‘student learning.’  While these other tests might be more indicative of student learning, it’s hard to say whether either kind of test really measures it.  To me this second set of bars indicates that none of the systems really correlate with these other tests, so if you really think that these other tests measure student learning, you’d want to go with the system that is most reliable and also one that wouldn’t be needlessly expensive.  So from this perspective, I’d say that this report “defeats its own purpose” (‘Raging Bull’ reference!) if the point was to demonstrate that value-added should be a significant part of evaluations.

Now these graphs were all based on the data for middle school language arts.  Presumably these numbers and the conclusions that go along with them can’t be only good for one grade level and subject.  Otherwise we would need a different weighting system for every grade level and for every subject within that grade level.  Well, though they only produced one set of these graphs, they did give data on page 14 for the other three situations:  elementary ELA, elementary math, and middle school math.

Aside from middle school math, the classroom observations are as good or better predictors of scores on higher order tests than value-added.  What a mess.

Even the authors, despite their obvious bias in using the term ‘student learning’ throughout the paper when they mean ‘gains’ compared to computer predictions, aren’t very enthusiastic about their conclusions.  I think a relevant quote is something I found on page three of the supplemental technical report which said:

To guard against over-interpretation, we add two caveats: First, a prediction can be correct on average but still be subject to prediction error. For example, many of the classrooms taught by teachers in the bottom decile in the measures of effectiveness saw large gains in achievement. In fact, some bottom decile teachers saw average student gains larger than those for teachers with higher measures of effectiveness. But there were also teachers in the bottom decile who did worse than the measures predicted they would. Anyone using these measures for high stakes decisions should be cognizant of the possibility of error for individual teachers.

This report will surely be quoted by ‘reformers’ as some kind of scientific proof that value-added has finally been vindicated.  But my examination of the same data (and I look forward to the certain deep analysis that will soon happen by others) tells me that they really didn’t come up with anything we didn’t already know about the problems with these crude metrics.  Generally, these numbers, from my perspective, don’t really reveal any $50 million secret.


19 Responses to $50 million. 3 years. No clue.

  1. Pingback: Review [UPDATED]: “How to Evaluate and Retain Effective Teachers” (League of Women Voters of SC) | the becoming radical

  2. l hodge says:

    I am in no way defending the MET report. I believe it is intentionally misleading and that their own findings actually confirm the unreliability of Value Added measures, and show that observations & student surveys are poor predictors of test score gains.

    You seem to be using the same misleading graph when you write that the vast majority of teachers are statistically equivalent, with almost all teachers within .05 standard deviations (or a few weeks of learning) of average. Are you making this argument? Or just pointing out another possible incorrect interpretation of the misleading graph?

    It is not surprising that value added models do a better job of predicting test score gains than administrators – that is all they are designed to do. Administrators are generally more reliable (consistent) from year to year because they are focusing on more than just test scores. As an extreme example, if every teacher had their mom evaluate their teaching, you would see excellent reliability from year to year, but little ability to predict test score gains. If value added ratings were a “strong” predictor for an individual teacher’s test score gains, then it would also be consistent for individual teachers from year to year. But it isn’t a “strong” predictor; it is just better than administrator evaluations at predicting test score gains.

    • Gary Rubinstein says:

      Good question. I think that principal evaluations probably show more variation in teacher quality, which is more accurate and also ironic since the value-added is supposed to do that, but it doesn’t. Then when these closely rated teachers are ‘ranked’ it becomes that much worse.

    • Steve M says:

      Gary used the first graph in his previous blog post, where he detailed how its method of lumping teachers into 5% groups gives the false impression that the MET’s results have a good correlation.

      He then showed that, by lumping teachers in such a fashion, a set of data with almost no correlation could be made to give such a false impression.

      If anything, Gary could have been more clear in this post while referring to the first graph.

  3. Another point regarding value-added that seems to be missing from the conversation/debate is that these tests, since our country’s move to “standards-based education,” have been designed – although poorly – to measure student achievement on a standardized exam, not teacher “efficacy.”

    It defies any sort of logic to use these tests to also measure the effectiveness of a teacher and is also terribly unethical.

  4. E. Rat says:

    Such is the power of the young white educator that s/he can make new, exciting “blended learning” applications run on computers so old they take floppy discs. That’s pretty impressive. If the nation’s teachers were just more innovative and driven, their districts could simply purchase these expensive programs without worrying about supplying and maintaining the hardware to run them. And clearly, my inability to get such results out of my classroom’s eighteen year old eMac is due to my union contract, personal laziness, and low expectations.

    Your breakdown of the data in the report is great, but I am hung up on the subtext of the cover. It is essentially education reform distilled into a single image.

    • Frederika says:

      Such is the power of the young, white MALE teacher. We women are 29% less powerful. Your reply is laugh out loud hysterical. You are so right–this is the singular image of edreform–from the outside looking in from an uninformed and disconnected perspective.

    • HS Teacher says:

      Your reply is completely correct. Shame on you! You should be able to squeeze Lincoln off a penny, and dance him around your class, while giving a detailed lecture on class mobility. Your union contract is stifling your ambition.

  5. Steve M says:

    Melinda Gates put out an article on Huffington Post the other day, expounding the merits of the MET study. Several times, I attempted to post (normal, neutral) comments after her article pointing readers to Gary’s previous entry. Of course, none were allowed to go through by their moderators.

    No links, no poor language…simply suggestions that people look at Gary’s blog for a good refutation of what Melinda was preaching. It’s pretty disgusting when liberal people and liberal organizations with a modicum of power have an agenda that includes stifling/smothering those who simply question what they do.

    I expect such behavior from the right, but when this comes from the left it sticks in the craw that much more.

  6. Dan McGuire says:

    Gary, I think you’re only touching on the tip of the iceberg when it comes to the problems with value added measurements.

    Nobody’s talking about the assessments being used to get the value added. The false assumption is that all assessments are equally valid which is not even close to true, as anyone who has ever created a 10 point quiz knows.

    There are so many variables that aren’t being included. We can only hope that this worship of a false god will become self evident soon.

  7. l hodge says:

    On second look, the MET plot does not show such a nice relationship – even with the misleading averaging. The 2nd worst actual achievement came from a group that was predicted to be slightly above average. This shows a very poor model when you consider that each point is an average of about 40 teachers.

    This would be like choosing 40 baseball players with batting averages in the middle of the pack and then in the next year, on average, they were in the bottom 10%. That just doesn’t happen. A few will do much worse the next year, and a few will do much better. But, with 40 players, it would be very surprising if those improving didn’t come pretty close to evening out those that did worse.

  8. Great post, Gary. All of this, of course, is based on the premise that hordes of horrible teachers are running around our schools destroying America, while scads of high-quality youngsters are chomping at the bit, waiting to get into schools and take awesome low-paying and low-prestige jobs that they have been denied because of “the blob” (or whatever term the reformyists are using these days).

    In other words: even if everything that the Gates folks say is true, so what? Identifying “bad” teachers isn’t even half of the problem; how are you going to get BETTER teachers than what you have now?

    One other topic worth further exploration: what if the tests, designed by psychometricians to produce standard distributions, are being used to “evaluate” teachers who may well NOT have a standard distribution of quality?


    • Frederika says:

      Now, JJ–you do know the answer to the question “how are you/we going to get BETTER teachers than what you have now?” Improvements to teacher preparation programs on the agenda include a very important component. Guess what: an exit test so that candidates can PROVE their worth and demonstrate the value of the pre-service program itself. Another piece of quantitative data–another deliverable. Talk about added value–it don’t get much better than this.

  9. Pamela Harbin says:

    Pittsburgh Public Schools (one of the recipients of the Gates Grants for Empowering Effective Teachers) has proposed the following formula for teacher evaluation: 50% classroom observation, 30% teacher specific VAM, 5% Building-level VAM, and 15% Student Survey Data (Tripod). Here is a link to the January 3, 2013 presentation to the school board of the proposed teacher evaluation plan to be voted on next week. http://www.pps.k12.pa.us/14311059122535553/lib/14311059122535553/121227_-_EET_EduCom_Final.pdf

    Gary, thanks for providing this information. I will use it wisely to counter the misinformation our district and Gates has been selling to my community.

  10. Pingback: SHORT READS: THURSDAY, JANUARY 17 « The Teachers' Lounge

  11. Cameron says:

    I just attended a webinar by Heather Hill, a professor of education at Harvard who writes great papers.
    Her team has a large study with access to a number of teachers’ VAM scores as well as videos of them teaching multiple lessons. She picked 9 teachers with the lowest VAM scores and 9 with the highest VAM scores. Then she had education experts (PhDs or years of experience or both) watch videos and guess if the teacher was high or low scoring. They got it right about 2/3 of the time. This is not good considering that you’d get it right 1/2 of the time due to random chance. Also note that the point of her research project was to develop a rubric specific to teaching mathematics and the people watching the video were highly trained at using a statistically reliable rubric to evaluate quality of math teaching.

    If this is as good as it gets, I’m not very excited about VAM. Maybe we can just spend the money helping teachers improve by sharing the vast amounts of resources and research about student learning that rarely trickles down to the classroom.

  12. Pingback: Michelle Rhee And The Relentless Marketing Of Education ‘Reform’
