I was inspired to get to the bottom of the New York City school progress report grades after reading this story from the New York Times Schoolbook website about P.S. 84, which was one of the thirty F-rated schools this year despite seeming to be a very good school.

Understanding and checking the 26 calculations that go into the final score (a number from 0 to 100, which then gets translated into a letter grade of A, B, C, D, or F) requires a math major, which, fortunately, I was.

I’ve found, and will attempt to fully explain in this and future posts, three major flaws in the system that make the progress score completely invalid. Two things I hope to accomplish with this are 1) to make the staff, students, and parents of students at P.S. 84 feel better and give them some clear explanations so they know how this happened and how they might (or might not) be able to stop it from happening again, and 2) to let the media know about this invalid metric, which has been, and continues to be, used to shut down ‘failing’ schools to make room for charter schools.

As I’ve made a name for myself debunking ‘miracle’ schools by showing they are not as great as they claim to be, debunking a ‘failing’ school is an unusual role for me. (Is this ‘bunking’ or ‘rebunking’?) Still, I use the same tools: in this case, the actual report cards and also the database, which is available at the DOE website.

**Flaw #1: Assuming that ‘two standard deviations below the mean’ is a lot worse than ‘average’**

The school report card is based on thirteen categories which have a maximum total of 100 points. Then the bottom 3% of schools, regardless of how high their final scores are, are assigned Fs. This 3% is determined before the calculations are done. There will be about 30 Fs. For 2010-2011, the bottom 3% all got under 18 points out of 100. These numbers are so low that no school could argue that it was cheated. I mean, an 18 out of 100? They should be ashamed of themselves, right?

| Name | Total Points Possible |
| --- | --- |
| ELA Progress | 15 |
| ELA Progress bottom 1/3 | 15 |
| Math Progress | 15 |
| Math Progress bottom 1/3 | 15 |
| ELA Percent Proficient | 6.25 |
| ELA Average Score | 6.25 |
| Math Percent Proficient | 6.25 |
| Math Average Score | 6.25 |
| Academic Expectations | 2.5 |
| Communication | 2.5 |
| Engagement | 2.5 |
| Safety and Respect | 2.5 |
| Attendance Rate | 5 |
| Total | 100 |

Five of the 100 points are based on attendance. When I looked at the progress report for P.S. 84, I saw that they had 92.8% attendance. Not bad. But when that score got converted to a number between 0 and 5, I was shocked. 92.8% attendance translated to 18.4% of the total, which was a .92 (that’s point nine two) out of 5. So they got an ‘F’ in attendance.

The reason this score is so low has to do with the fact that the system does not care if the school got some kind of acceptable number or not. The goal is to locate and punish the bottom 3% of schools no matter how good they are so the metric serves to exaggerate percentages that are below average. So a 92.8% attendance becomes an 18.4% score when they are through with it.

Here’s how the score was calculated: First they calculated the average attendance rate for the entire district and also for the 40 ‘peer’ schools related to P.S. 84. ‘Peer’ schools are the ones that supposedly have similar demographics so schools are judged against other schools with similar kids and also against all schools.

For all schools, the average was 93.6% while for the 40 ‘peer’ schools, the average was 94.5%. So the 92.8% is a bit below the average school, and a bit more below the average of their peers. So how does this turn into an 18.4% out of 100% for attendance?

Well, there’s a statistic in math called the ‘standard deviation.’ This is a measure of how spread out the numbers in a data set are. The closer together the numbers are, the smaller the standard deviation. If I have a class and everyone gets a 90 on a test, the average is a 90 while the standard deviation is 0. If there are some 100s and some 80s mixed in with the 90s, the average can still be a 90, but the standard deviation will be higher, maybe a number like 5. So in the second scenario with the standard deviation of 5, what can we say about a score like 85? Well, since it is 5 points below the mean of 90, we say that it is ‘one standard deviation below the mean,’ while 80, since it is 10 points or 2*5 points below, is ‘two standard deviations below the mean.’
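To make that concrete, here is a small sketch in Python. The two classes are made up for illustration (they are not DOE data), and I am assuming the ‘population’ standard deviation, which is what `statistics.pstdev` computes.

```python
from statistics import mean, pstdev

# Two hypothetical classes, both with an average of 90.
everyone_90 = [90] * 8
spread_out = [100, 80, 90, 90, 90, 90, 90, 90]


def sds_from_mean(score, scores):
    """How many standard deviations a score sits above (+) or below (-) the class mean."""
    return (score - mean(scores)) / pstdev(scores)


print(mean(everyone_90), pstdev(everyone_90))  # average 90, standard deviation 0
print(mean(spread_out), pstdev(spread_out))    # average 90, standard deviation 5
print(sds_from_mean(85, spread_out))           # one standard deviation below the mean
print(sds_from_mean(80, spread_out))           # two standard deviations below the mean
```

Note that `sds_from_mean` would divide by zero for the first class, which is the extreme case of a tiny standard deviation: any score other than 90 would be infinitely many standard deviations away.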

The phrase ‘two standard deviations below the mean’ sounds like something that is always bad, but really it is just relative to how big the standard deviation is. If it is a small number, like 1, then ‘two standard deviations below the mean’ is just the same thing as 2 points below the mean, which isn’t all that different from the mean or even from the exalted ‘two standard deviations above the mean.’

In the attendance example, the standard deviation for the peer schools was 1.1 while the standard deviation for all schools was 1.9. So for the peer schools, the 92.8% was about one and a half standard deviations below the mean, while it was a little under half a standard deviation below the mean for all schools. Big deal, right? Well, actually, for the conversion to the five point scale it is. You see, when you are 2 standard deviations below the mean, you get scaled to 0%. 1 standard deviation below the mean is scaled to 25%. A score at the mean is scaled to 50%, 1 standard deviation above the mean is scaled to 75%, and 2 standard deviations above the mean is scaled to 100%. For the peer group, this made the 92.8% become 11.4%, and for all schools it became 39.5%. Then the peer percent is multiplied by 3, added to the other percent, and divided by 4 (the peer comparison is 75% of the score and the other is 25%) to get 18.4%, which is then multiplied by 5 points to get .92.
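For readers who want to check the arithmetic, here is a minimal sketch of the conversion as I have described it. The function names are my own, not the DOE’s, and capping results at 0% and 100% for schools more than two standard deviations from the mean is my assumption.

```python
def scaled_fraction(value, mean, std_dev):
    """Scale a raw value so that -2 SD maps to 0, the mean to 0.5, and +2 SD to 1."""
    z = (value - mean) / std_dev          # standard deviations from the mean
    fraction = (z + 2) / 4                # straight line from -2 SD to +2 SD
    return min(1.0, max(0.0, fraction))   # assumption: cap at 0% and 100%


def category_points(value, peer_mean, peer_sd, city_mean, city_sd, max_points):
    """Blend the peer comparison (75%) with the citywide comparison (25%)."""
    peer = scaled_fraction(value, peer_mean, peer_sd)
    city = scaled_fraction(value, city_mean, city_sd)
    combined = (3 * peer + city) / 4
    return peer, city, combined, combined * max_points


# P.S. 84 attendance: 92.8%, peer mean 94.5 (SD 1.1), city mean 93.6 (SD 1.9), 5 points possible.
peer, city, combined, points = category_points(92.8, 94.5, 1.1, 93.6, 1.9, 5)
print(f"peer {peer:.1%}, city {city:.1%}, combined {combined:.1%}, points {points:.2f}")
# prints: peer 11.4%, city 39.5%, combined 18.4%, points 0.92
```

Plugging in a school sitting exactly at both averages gives 50%, or 2.5 points out of 5, so even a perfectly average school gets only half credit under this scheme.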

So what this type of calculation does is turn anything below average, even if it is just a little below average into something that seems like it is way below average. It then makes it a lot easier to justify the F in attendance. 18.4% sounds a lot worse than 92.8%.

This, believe it or not, is what’s done with all thirteen calculations. None are based on some kind of absolute score that signals that a school met some kind of target. Everything is compared to the average, and the schools that are two standard deviations below, with no consideration of how small those standard deviations might be, are slammed with failing grades in that category.

Another extreme example for P.S. 84 is ‘Academic Expectations’ where the tiny standard deviation of .5 caused them to get just .36 points out of 2.5 possible because the peer average and total school averages were 7.9 and 8.1 respectively, while P.S. 84 had gotten a 7.1. So even though they were very close to the average (and the high scores) on this ambiguous metric based on voluntary parent and teacher surveys, they lost valuable points that could have prevented them from getting that F. Only half the parents responded to the survey. It was one of those 5 point scale surveys and nearly all the parents said they either agreed, or strongly agreed that the school had high academic expectations.

Punishing schools that are ever so slightly below average by turning raw scores that are so obviously close to the mean into scores in the teens or single digits, making them feel like complete failures, is an awful thing to do and also terrible for morale. Imagine if I, as a teacher, scaled my tests this way. A kid who got a 93% gets it turned into an 18% just because everyone did pretty well and the scores were so close together. This is crazy, and, believe it or not, this is only the first and most benign reason that the progress scores are mathematically invalid.

I will examine two other ways in my next two posts coming soon.

Fascinating work. Aside from the moving goalposts of ever-changing “peer schools”, the NYC DOE’s confusing and muddled grading rubric is purposely designed to keep those away who wish to understand it.

It is the new business model, and as it has financially ruined many in our nation, it now aims to bring our children and teachers to ruin.

Again, great stuff and thanks.

Keep it up. Are you using Mathematics or mathematics here? Maybe an integrated application of both?

Definitely capital ‘Mathematics’ — analyzing how people can lie with statistics. It doesn’t have to be abstract and ‘pure’ to count as Mathematics. In ‘math’, they just learn mean, median, and mode.

This is amazing. My jaw is on the ground. Would you do another series when they release scores for high schools?

Gary: this is great, but in the NYT article the principal noted that school is gentrifying quickly and was being compared to other schools w/ middle class pops, while the test scores of older students in 3-5 grades do not yet reflect this changing demographic composition. Are you also going to look at this possible confounding factor?

And why does the city progress report look at overall student composition rather than the students in the testing grades?

Finally, I hope you might look at the year to year variation in test scores gains at the school level, which are highly random. Liebman said the school grades would be based on 3 yrs of test scores, but this never happened. thanks!

The principal is correct. That issue about peer groups is the biggest factor which I’m building up to for part III.

It’s also kind of hard, don’t you think, for the principal to slam the DOE policy of destroying the ‘bottom’ 3% through a ‘sham’ report card grading policy? After all, those who run this sham are the principal’s bosses. It’s more politic to focus on the peer group argument rather than curse the whole darn enterprise, which is exactly what Gary describes above: a demoralizing sham at best, and a way to privatize our public system at worst.

Last year I dug out the system RI used to calculate “persistently low performing,” which has some similarities:

http://www.tuttlesvc.org/2010/08/rides-approved-definition-of.html

It left me wondering whether there was any sort of academic discipline underlying these particular methods… or, any research at all, really. In particular, despite how this is supposed to be influenced by business methods, I can’t imagine an investor using a system like this to evaluate businesses.

Great post!!!! Have you done the same sort of analysis for the value added model? Regarding PS 84’s F, it would seem Ms. Moskowitz (sp?) has her eyes on this building and the DOE is again obliging her. She is still hunting.

Gary,

I hope you can look at this, true for many of the metrics: the “ranges” are based on Years 1 and 2, let’s say. Then the score in Year 3 is applied to it.

I have found a good number of examples where the year 3 score is OUTSIDE the nominal “range.” To me, this is nonsense, and any school getting a zero for falling outside PRIOR years’ ranges, but perhaps not the CURRENT year’s range, would have been wronged. I’d love your take on it.

Michael,

Yes, that’s something for part III. The 2 years of data thing mainly affects the ‘student performance’ grade. First part was mainly about ‘environment’ and second part was mainly about ‘progress.’ Part III will cover the problems with measuring and, more importantly, comparing, performance.

Gary

This is a great post and great work. Keep it up.

I do differ with you on the interpretation of a 92.5% attendance rate. At the elementary level that is not a great attendance rate. If you use the DOE’s progress report data, the average for elementary schools is 93.2% and the median is 93.4%, indicating that 50% of the elementary schools achieve a better attendance rate than 92.5%. Attendance is such an important factor in a child’s success that even a small deviation from the average is important.

That said, I don’t think any school should be penalized for an attendance of 92.5% but certainly this kind of data is useful. It’s an indicator that says this should be investigated – not dismissed as no big deal.

Thank you for all the work that you are doing! Fascinating!

Thanks, Gary, for breaking it down for us, great work!! Reading this report eased my mind a bit. I have been very nervous about the failing grade PS-84 received this year. Certainly no parent wants their child in a failing school. I do believe that PS-84 is a good school, but I have to say that the F grade has me very worried.

I agree that the Progress Report is not a sufficient, or really even good, measure of schools. But I disagree with your critique of the use of Standard Deviations to create a range for measuring how schools perform on the inputs used for the Progress Report. I especially disagree with your statement that this is an example of how statistics lie. Statistics don’t lie – it is the insufficient explanations and/or misuse of statistical concepts that lead people to incorrect conclusions using statistical calculations.

Let me explain…standard deviations are used to describe a distribution. All sets of data have a distribution. And standard deviations are used to determine the range within which a certain percentage of the data points fall. Within 1 standard deviation of a mean, just over 68% of observations will fall. (Of course, with a small sample, it won’t be exact, but the sample size of test scores for NYC is large enough.) Within 2 standard deviations of the mean, just over 95% of all observations (data points) will fall. So, a school that is more than 2 standard deviations below the mean is in the bottom 2.3% of ALL schools. I think by anyone’s standards that is a low performing school on that particular measure.
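For anyone who wants to check those percentages, here is a small sketch. It assumes the data follow a normal (bell curve) distribution, which is what the 68%, 95%, and 2.3% figures depend on; real attendance or survey data may not be shaped that way.

```python
from statistics import NormalDist

# Assumption: a standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist()

within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)
below_minus_2_sd = standard_normal.cdf(-2)

print(f"within 1 SD of the mean: {within_1_sd:.1%}")            # 68.3%
print(f"within 2 SD of the mean: {within_2_sd:.1%}")            # 95.4%
print(f"more than 2 SD below the mean: {below_minus_2_sd:.1%}")  # 2.3%
```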

The reason for standard deviations is exactly the reasons you try to use to refute the use of standard deviations. Sometimes the range is very tight and so a few points off the mean is low – everyone else managed to score even closer to the mean. Sometimes the range is quite large and so it takes a bigger difference from the mean to be in the very high and very low end of performers. Standard Deviation takes into consideration the spread of observations (data points).

Perhaps the better question is, if 92% attendance is close to 2 standard deviations below the mean, then why are we using attendance as a measure? If 92% is low, is attendance really a problem in general?

In my opinion, the biggest issue of the Progress Report is not the formulas the DOE is using to compare schools (they have actually made improvements to these formulas over time) but rather the inputs they are using. As the saying goes, “garbage in, garbage out.” 85% of these grades are based on 2 test scores, Math and ELA. Since when does the result of a single test in 2 subjects measure how well a school is preparing students, or how well a school is educating students? (And for elementary schools, it is a single test taken by only 50% of the students in the school, assuming an equal number of students per grade.)

A secondary issue is the DOE’s definition of a Peer Group. But that is another long explanation.

Hi, I am a member of the Save Legacy Coalition. We are working hard on trying to fight the DOE with numbers vs. numbers. They want to close our school, but we aren’t going without a battle. We could sure use some help in putting a peer index report together so we can tell the media. Please, if you can, email us at saveyourlegacy@gmail.com

Any help would be appreciated