Thursday, August 19, 2010

Using Student Growth Data in Teacher Evaluations

The District wants to include measures of student growth as part of the annual teacher performance evaluation. That seems reasonable. In fact, I don't think there is anyone who would be opposed to a fair and thoughtful use of student outcomes as part of the teacher evaluation.

How will the District measure student growth for the purposes of teacher evaluation? That's not entirely clear, but we do have some hints:

From the letter to teachers dated 8/3/10:
Student learning and growth will be based in both teacher-determined and District-determined data and measures, and will account for the fact that not all children are the same. The District measure will be based on the overall growth of a teacher's students relative to students of similar demographics who have performed like them in previous assessments, and will be calculated as a two-year rolling average on at least two student assessments.
I'm not sure what this means.

A. "the overall growth of a teacher's students" I'm not sure how they will measure "overall" growth for a teacher's students. For elementary students I suppose they mean in various disciplines (reading, writing, math, etc.). Do they mean the same for secondary students? Will the District measure of student growth for math teachers, for example, include the students' growth in other disciplines?

B. "relative to students of similar demographics" So the District's expectations for student growth will vary based on race and gender. What other demographic grouping will they make? Free and Reduced Price Lunch Eligible? English Language Learners? How is this going to work and how will it be expressed? Will the District expect greater student growth from some demographic groups than others? Why don't we have the same high expectations for all of our students? Don't we believe that all students can learn and can reach the same high levels of achievement?

C. "who have performed like them in previous assessments" This appears to suggest that student growth targets will be set by historic outcomes for various demographic groups. What form will that take? What will this mean in concrete terms? Will the teacher get a report that says: "Student A is a fourth grade Latino boy who is not FRE or an ELL. Based on previous assessments for students in this demographic group, we expect this child to advance 0.8 years in math, 0.9 years in reading, and 1.1 years in writing." And the teacher's effectiveness will be measured relative to that expectation of student growth? Do we even have historic outcomes for students? The MAP assessments have only been done for one year. Does one year of data - the initial year - constitute a reliable data set? With just one year of data, can the District speak about "previous assessments"?

D. "will be calculated as a two-year rolling average" What is calculated on a two-year rolling average - the target or the individual student growth? It can't be the student growth, otherwise the teacher's evaluation would be based in large part on the previous year, before the student was in the teacher's class. It must be the District's targets for growth. But the District doesn't have two years of MAP data to average yet, so what will they use the first year?

E. "on at least two student assessments" Oh! So they won't rely on a single test of student growth. They will use the MAP and... well, what else? They don't have anything else that claims to be able to measure student growth. The MSP or HSPE (the tests formerly known as the WASL) don't measure student growth at all; they are criterion-referenced tests. So what else will the District use? Maybe they could use Classroom Based Assessments (CBAs) if they ever fulfill their dream of standardizing them - oops! I mean aligning them. Will this require teachers to do the District-written and approved CBAs? Or do they mean that student growth will be measured using the math MAP and the reading MAP? Or do they mean that student growth will be measured using last year's MAP and this year's MAP?

In the FAQ sheet on SERVE, the District re-states many of the words, but doesn't clarify the meaning very much:
Individual student growth: Individual student growth will include two types of measurements, each based on the extent to which a teacher’s students meet or exceed scores on standards‐based assessments that are typical of their academic peers (students who have performed like them in previous assessments).
• District‐required: A teacher’s score will be based on a two‐year rolling average of the overall growth of his or her students relative to their academic peers in at least two common assessments.
This statement is very similar to the one before, but rather than providing clarification, the subtle, interesting differences muddy the water a bit.

1) "the extent to which a teacher’s students meet or exceed scores". This suggests target scores instead of target growth. Hmmm.

2) "standards‐based assessments" The MAP is not a Standards-based assessment. The only Standards-based assessment is the former WASL, now known as the MSP and the HSPE. Hmmm. These assessments, however, because they are criterion-referenced tests designed to assess the effectiveness of schools and districts, are utterly inappropriate for measuring year-over-year academic progress for individual students.

3) "typical of their academic peers (students who have performed like them in previous assessments)" Now we have a different definition of academic peers - instead of demographics, they are using "students who have performed like them". So now student growth will be relative to students who achieved similar scores last year. The inherent problem here is that, by definition, half of the scores will be below the median. How would this look in real life? Last year Student B scored a 330 on the third grade reading MSP. This year Student B scored a 360 on the fourth grade reading MSP. The average score on the fourth grade MSP for students who scored 330 on it in the third grade was 350, so this is regarded as evidence of the teacher's effectiveness because Student B's score met or exceeded the score typical of Student B's academic peers. Did I read that correctly?
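To make the Student B example concrete, here is a minimal sketch in Python of the comparison as I read it. It is only my guess at the arithmetic, not the District's method. The 330, 350, and 360 come from the example above; every other score is invented, and the names `peer_typical_score` and `met_or_exceeded` are hypothetical.

```python
from statistics import mean

# Hypothetical records: (last year's score, this year's score).
# All numbers are invented for illustration; none are real MSP or MAP results.
district_scores = [
    (330, 340), (330, 350), (330, 360),
    (400, 410), (400, 420),
]

def peer_typical_score(prior_score, records):
    """Average current-year score of students who had the same prior-year score."""
    peers = [current for prior, current in records if prior == prior_score]
    return mean(peers)

def met_or_exceeded(prior_score, current_score, records):
    """True if the student scored at or above the peer-typical score."""
    return current_score >= peer_typical_score(prior_score, records)

typical = peer_typical_score(330, district_scores)
student_b_result = met_or_exceeded(330, 360, district_scores)
print(typical, student_b_result)
```

In this made-up data, one of the three students who scored 330 last year lands below the peer-typical score no matter how well the group as a whole was taught. That is the problem with measuring against a typical score.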

If I understand these statements correctly, and if they mean what I think they mean, then I'm not impressed with this as a measure of teacher effectiveness. It relies on a perfectly dreadful misuse of assessment data. It relies on unfounded beliefs in correlations (if not causation) between student assessment data and teacher effectiveness.

Worse, I don't think the District folks know what they mean by any of this. I don't think they know what assessments they will use or how they will use them. I don't think they have decided on the peer groups or how to determine relative growth for individual students. I, for one, would like to see all of this spelled out and I would like to see a backtest before I agreed to any of it.

They can and should perform a backtest. In fact, I don't see how they propose this thing without having conducted a backtest. The backtest would be an example of how these measures would have appeared for a few teachers last year. Choose a few classrooms from schools across the District with a diversity of students and programs, and see how the District-determined measure of student growth would have worked and been scored for each of them. What would be the District-determined measure of student growth for a third grade Spectrum teacher at Lafayette, a kindergarten teacher at Gatzert, a math teacher at McClure, a Language Arts teacher at the NOVA Project, a high school teacher at Middle College, a Special Education teacher at Lowell, and a fifth grade teacher at View Ridge? Let's see some concrete examples so we have some idea of what is proposed and so the District has some experience doing the calculations.


dan dempsey said...

Charlie said:

"Worse, I don't think the District folks know what they mean by any of this. I don't think they know what assessments they will use or how they will use them."

So what else is new on the voyage through Fantasyland?

This Board will buy anything without checking ... thus the Administration does not need evidence, logic, or a coherent plan.

Bingo!!! sure enough another "Tri-Fecta" no evidence, no logic, no coherence called SERVE.

Where was any of this interest in test analysis when it came to adopting Everyday Math on 5-30-2007?

on Adopting "Discovering Math" for high school on 5-6-2009?

on Approving the NTN contract twice, on 2-3-2010 and 4-7-2010?

All of the test analysis I did of these Board approved selections indicates inferior programs were purchased. So now it is the teachers' responsibility to make the Crap work that Central Admin purchased.

Nobody knows anything about the achievement gap because there is no substantive research done ... for that might produce politically incorrect solutions. Promoting the fraud is more important than producing a solution.

Look at TEAM MGJ leadership and the Board if you want an ongoing example of FRAUD PROMOTION..... Hey the plan has only had two years .. three more before it gets evaluated.

Unfortunately what is in place has hardly anything academically sound to rest on. What it does have is a politically based power agenda... that is very consistent in all the ridiculous non-evidence based Board approvals.

Anonymous said...

The NWEA has data on thousands of students that have taken the MAP test throughout the US. These are the basis of the norms developed and discussed in their 2008 report: "RIT Scale Norms, For Use with Measures of Academic Progress."

Here's my take:

1) When the District says "students who have performed like them in previous assessments", they are probably referencing the norm data from MAP. Based on a student's initial RIT score, NWEA can categorize the student and give the expected growth of that student.

The expected growth rate varies with the grade and "initial status", as they put it. So a low performing 1st grader could have a higher expected fall-spring growth than a higher performing 1st grader, and because the growth levels off at the upper grades, a 6th grader could have a lower expected fall-spring growth than a similarly matched academic peer in 1st grade.

2) When the District says "based on a two-year rolling average" they probably mean fall-spring "value-added" scores from 2009-10 and from 2010-11. By the end of this coming school year they will have two years of data for most SPS students.

3) When the District says "on at least two student assessments" they probably mean both the reading and the math MAP assessments.
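If that reading is right, the calculation might look something like the rough Python sketch below. This is a guess under stated assumptions, not the District's or NWEA's actual formula. The expected-growth numbers are invented and are not taken from the NWEA norms report. For simplicity the table is keyed only by subject and grade, although the real norms also vary with initial status. The names `EXPECTED_GROWTH`, `value_added`, and `two_year_rolling_average` are hypothetical.

```python
from statistics import mean

# Hypothetical expected fall-to-spring RIT growth, keyed by (subject, grade).
# These numbers are invented; they are NOT taken from the NWEA norms report.
EXPECTED_GROWTH = {("reading", 4): 7.0, ("math", 4): 9.0}

def value_added(students, grade):
    """Mean of (actual growth - expected growth) over all students and subjects."""
    diffs = []
    for student in students:
        for subject, (fall, spring) in student.items():
            diffs.append((spring - fall) - EXPECTED_GROWTH[(subject, grade)])
    return mean(diffs)

def two_year_rolling_average(yearly_scores):
    """Average of the two most recent yearly value-added scores."""
    return mean(yearly_scores[-2:])

# Two invented fourth grade classes for the same teacher: (fall, spring) scores.
class_2009_10 = [
    {"reading": (200, 209), "math": (205, 213)},
    {"reading": (190, 196), "math": (195, 206)},
]
class_2010_11 = [
    {"reading": (201, 207), "math": (204, 212)},
    {"reading": (195, 201), "math": (199, 207)},
]

yearly = [value_added(class_2009_10, 4), value_added(class_2010_11, 4)]
teacher_score = two_year_rolling_average(yearly)
print(yearly, teacher_score)
```

With these made-up numbers the teacher is slightly above expectation one year and below it the next, and the rolling average blends the two into a single score.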

I believe it's all about MAP. That's just my take.

-anonymous reader

dan dempsey said...

Note MAP does not really connect with WA State Math Standards... that is what teachers are supposedly teaching to in the SPS.

So SERVE this entire analysis to the BAD EXPENSIVE JOKE category.

TechyMom said...

I'm just guessing here, but...
The MAP is used all over the place, not just in Seattle. They have nationally normed data.

Perhaps they have that broken out by demographics?

Or, if it's based on past performance of the same student, it seems like it should be possible for the MAP people to derive expected progress from their data. For example, how likely is it for a 4th grader who has been between the 45th and 60th percentile on all tests to jump to the 70th? to the 90th? This sort of number crunching should be possible with the national MAP scores data.

Megan Mc said...

Good thinking, Charlie.

Does anyone know if the teachers union is asking these kinds of questions at the negotiation table?

It astounds me that the district is trying to pull this bait and switch at the last minute with a promise of working out the details later. The board needs to manage their super and tell her she is out of line. I know that they do not have any influence over the negotiations but they do have influence over her.

Lori said...

I get that the MAP data are nationally normed, but if the concern is that there is a national crisis in education, then some of the students in that national sample are currently being taught by ineffective teachers (whatever/however that can be defined) and therefore, asking our students to meet or exceed expectations based on that population is clearly not ideal, right?

I go back to my post on another thread: a valid model for all of this has to come from students who are being taught by "an effective teacher" in order to take teacher quality out of the equation entirely as far as the reference population goes. The only way to measure the "value-added" benefit from the teacher is to control for teacher quality from the start and see how other factors affect measurable learning. I do not believe that this has been done.

Like Megan Mc, I want to know if SEA is asking Charlie's questions too and I also want to know if they have the statistical expertise at hand to evaluate the very important nuances inherent in this proposal. There are going to be false positives and false negatives in any system that demands hundreds of independent statistical analyses, such as evaluating each individual teacher in SPS to determine if their VAM is statistically significant or not, even if the model is designed appropriately. Multiple testing in statistics is always a problem. How is SPS correcting for this? If they don't, some unknown but potentially large number of effective teachers are going to be declared ineffective based on systematic errors inherent in using small sample sizes (numbers of students taught) and short follow-up (2 years of data instead of 10).
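To put rough numbers on the multiple-testing worry, here is a small Python sketch. It assumes every teacher is tested independently at the same significance level and, for the sake of argument, that every teacher is actually effective. The figure of 1,000 teachers is a hypothetical round number, not the SPS headcount, and the function names are my own.

```python
def expected_false_flags(n_teachers, alpha):
    """Expected number of effective teachers flagged as ineffective by chance."""
    return n_teachers * alpha

def prob_at_least_one_false_flag(n_teachers, alpha):
    """Chance that at least one effective teacher is flagged (independent tests)."""
    return 1 - (1 - alpha) ** n_teachers

def bonferroni_threshold(n_teachers, alpha):
    """Per-teacher significance level that holds the overall error rate near alpha."""
    return alpha / n_teachers

n_teachers = 1000   # hypothetical count, for illustration only
alpha = 0.05        # conventional significance level, assumed here

print(expected_false_flags(n_teachers, alpha))
print(prob_at_least_one_false_flag(n_teachers, alpha))
print(bonferroni_threshold(n_teachers, alpha))
```

Under those assumptions, roughly one effective teacher in twenty gets flagged by chance unless the threshold is tightened. The Bonferroni adjustment shown is one standard correction, and it makes each individual test much harder to pass.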

I think it's a laudable goal to try to find ways to identify effective teachers, but we aren't there yet. Rushing into this is a huge mistake.

seattle citizen said...

"...expectations for student growth will vary based on race and gender. What other demographic grouping will they make? Free and Reduced Price Lunch Eligible? English Language Learners?"

The whole idea of students being boxed into their demographics for statistical analysis seems preposterous. How many demographics are there? There are thousands of parent sets available. For example, just one set of parents could be recently remarried, recently bankrupted, biological father in jail for physical abuse, step-father unavailable due to shy and undemanding affect around children; the four grandparents of these two "dads" include a Scots/Cambodian, a recent immigrant of Chilean background, an eighth generation blueblood from Boston, and a second-generation immigrant from Somalia via the Kenyan relocation camps.

The mother smokes pot recreationally...indoors...and has a diet to match. The children eat a lot of mac and cheese. She herself is 1/8th Italian, 1/8th Ghanaian, 1/4 Norwegian and half African American (her father was from generational roots down to Memphis, with a liberal connection to post-civil-war families and communities that moved to Chicago and others that stayed in the sharecropper communities well into the 1950s).

The student, in addition to this multiplicity of factors, listens to new age music on an original Walkman, perused architectural digests as a child, is notoriously shy among strangers and refuses to "perform" EVER....

Which box should this child's parent/guardians check when they are given the form upon registering their child as a student?

Which boxes will the district checkoff when using "demographics" and "percentiles" and "stanines" to compare this kid's growth to some "similar" kid from last year?


What about the kids who don't check ANY boxes?

Am I remembering incorrectly or did I recently hear of a case where school officials somewhere were actually checking the boxes for students who hadn't? " 'African American,' " seems to be...., yep and maybe a little Asian-y....

Anonymous said...

For the 2008 NWEA/MAP norm sample, the ethnic categories are listed as Native American/Alaska Native (2%), Asian/Pacific Islander (5%), African American (18%), Hispanic (20%), and European American (55%). The (%) refers to the approximate percentage of the sample size for each ethnic group.

When filtering the data for inclusion in the norm set, one of the criteria was that the student's "ethnic code" was included on the test record.

-anonymous reader

seattle citizen said...

Let's take the Native American code as an example:

in 2008, 2% were listed as "Native American/Alaska Native."

What does this tell us about these children? While we have perhaps been told that children in the past (whose parent/guardians checked that box) performed at X level, it tells us virtually nothing about EACH CHILD. One could be 1/8th (the determining number to claim tribal affiliation, I believe) NA/AN, and 7/8ths something else entirely, maybe a lot of elses entirely. Let's add in reservation vs non-reservation: some big statistical differences in poverty levels, etc. in THOSE numbers...

I mean, WTF? How do these little checkboxes tell us anything, really, about anybody, unless we know ALL the variables?

If someone can explain how we can place a supposed expected targeted learning "amount" on some kid because of a little check box on a form, please let me know now. I've been wondering this for years and it boggles my mind.

WV tells me I'm outalim, so I won't go out on one anymore.

seattle citizen said...

So here's the scenario, given the categories, and given a use of MAP (or some other test) in evaluating teachers:

Admin: "You started the year with 6 African Americans, 4 Asian Americans, one Native American/Alaskan Native, and five European Americans. Of these, five were Special Ed, two were ELL, and an undisclosed number (private, you know) were free and reduced lunch.
Some of these were in two or more of the above categories.
We've run the numbers, and given the performance of students in these exact same categories last year (which is just like this year, demographically: nothing changed in the world), each of the students in each of the categories in your class should have had a particular rise in score. That they didn't, that some were high and some were low, tells us that YOU are doing something strange in your classroom, NOT that the students are, in fact, highly unique, multi-dimensional beings!
We are docking your pay and sending you to Professional Development until you reach a 'standard of excellence' of at least 1.1 on some scale that might vaguely relate to the curriculum, assessments, and students in your classroom.

Now get back to teaching before we can your ass."

Charlie Mas said...

Ha! Good one, seattle citizen!

Actually, I think all of my questions would be answered by something sort of like that, an example of how this statistical measure would have worked out using actual data from last year or could have worked out using some representative hypothetical data.

Charlie Mas said...

If the SEA doesn't ask for something like that, they are fools.

If the District isn't ready with something like that, they are idiots.

Lori said...

The National Academies of Sciences and Engineering issued a report last fall saying that value-added measures (VAMs) are not yet ready for prime time: "At present, the best use of VAM techniques is in closely studied pilot projects..."

Too bad SPS is not proposing a pilot project. I don't know about you, but I trust a panel of expert scientists and mathematicians just a wee bit more than I trust our Superintendent when it comes to understanding and using complex statistical techniques in our schools.

There is a ton of easy-to-read, important information in the full report. You can download it here: