Do Standardized Tests Accurately Reflect Deficiencies in U.S. Education?

The World Cup

In 1993, for the first time in thirty or forty years, Bolivia beat four other South American countries and qualified for the World Cup Soccer Championship. As is the case throughout most of the world, soccer is almost the national religion in Bolivia. We went nuts. At about the same time Luís García Meza, author of a particularly bloody coup d’état back in 1980, was found guilty of genocide and sentenced to thirty years in prison. Although he was in hiding and tried in absentia, it was the first time in history that a South American dictator was legally held responsible for his crimes. All traffic on La Paz’s main avenue stopped as the marching bands of both the Workers Federation and the police (usually the bitterest of enemies) played the national anthem together. Along with everyone else on the street, I found I was crying with joy and pride. The moment was quickly lost in soccer madness. In the end it was sports, not justice, that captured our attention.

The crowning achievement of Bolivian soccer that year was beating Brazil. It had been decades since Brazil’s most recent defeat in World Cup run-offs: not even the superb teams from Argentina or Paraguay had ever been able to beat them. Yet our little country of six million souls, three quarters of whom are native American peasants simply not big enough to compete in professional athletics, had shut out Brazil, with its population of a hundred and forty million soccer gods. La Paz danced and partied all night long. I went to bed to the sound of revelry and bottle rockets, and woke up the next morning to the sound of more revelry and more bottle rockets. The entire country took to singing that old cueca that goes, “¡Viva mi Patria Bolivia: una gran nación, por ella doy mi vida, también el corazón!” No one had any sympathy for the Brazilian team, which clearly only lost the game because they were (as their captain explained) playing in “a altitude” of La Paz.

Why not? The Brazilians played a fine game, 12,000 feet above reproach. They played an hour and a half with only two subs and no oxygen against a team that regularly practiced 2,000 feet higher. Nobody outside Bolivia would have claimed they were actually the weaker team. (As it happened, two weeks later the same two teams met again at sea level and Brazil trounced us seven to zero.) The reason is simple: soccer games do not take the altitude into account. The size of the team, its experience, the quality of its play, and even its wealth, are all factors that impact its success or failure. But in a game, the only factor that distinguishes winning from losing is how many times the ball gets past the goalie and winds up in the net. Without reference to any other factors (except that nobody broke any major rules, of course), we got it past theirs twice, while they never scored. Ergo, we won the game. Ergo, the country partied all night long and I didn’t get any sleep.

And why did Bolivians care so much about a bunch of guys who could get a ball into a net, and relatively so little about a major victory over injustice like the sentencing of García Meza, a man who publicly compared himself to Augusto Pinochet and was responsible for the murders of dozens if not hundreds of people? Nobody could seriously compare the main Bolivian stories of April 1993 and say that soccer was more important in itself, either right then or in the long run. The explanation is that, for whatever reason, human beings (at least those of us in the “Western tradition”) respond much better to easily quantifiable results. There is something satisfying about being able to say that the Bolivians won the game 2-0 that isn’t there when someone else says the Brazilians were more powerful individually and a better organized team. Even with terms like “thirty years in prison” or “the first time in Latin America” the story of García Meza lacks a tangibility we tend to prefer.

Researchers and pollsters find a similar dichotomy. The two main tools they use in surveys are number-based questionnaires (counting similar responses and asking for judgements on a scale) and interviews for “anecdotal” information. Gallup called me once and asked fifty or sixty single-word-answer questions. The pollster had no interest in full responses, and therefore got a very distorted view of my opinions. This is how we learn that 90% of Americans love Newt Gingrich and have never heard of Bill Clinton. A lot depends on how the questions are worded and who is being asked. Anecdotal interviews are deeper in many ways, though not as wide. The quantitative is more factual but less meaningful, while the qualitative speaks the truth but only part of it. At any rate, in our culture a news story reporting exactly what Mrs. Brown down on Maple Street thinks about race issues will, according to recent studies, have only 13% of the persuasive impact of a well-worded statistic.

Tests and Standardized Testing

Generally speaking, testing in academic situations serves two purposes. One is principally educational, that is to say, tests can be used as teaching devices. I have a colleague named Jack who teaches math. On every test he includes at least a couple of questions the likes of which his students have never seen or even thought about. The class has worked to develop the tools needed to approach these problems, but they’ve applied them in different ways. Jack grades the responses on their creativity and the legitimacy of their logic, with little regard for whether the answers are right. He uses the same strategy in classes too, but insists that test-like situations motivate his kids to stretch themselves. The other purpose of tests is simple assessment. In the “banking school” approach to education, teachers make deposits of knowledge directly into their students’ brains. It is then the student’s responsibility to regularly produce an accurate balance sheet.

Most tests in school combine both of these purposes in varying degrees. People who think seriously about education tend to prefer tests that are more like Jack’s and less like a balance sheet. At the very least, they prefer tests that demonstrate the thought process (I mark off very little for “wrong” answers if the method is correct, and give very little credit for “right” answers if my students don’t show their work). On the other hand, tests like the SSATs, PSATs, SATs and ACTs are given strictly for assessment purposes. A lot of criticism is leveled at these tests, and I believe most of it is justified. However, the root cause of much of the dissatisfaction is confusion over their rationale. If the need for pure assessment is accepted, and if the manifest limitations of quantitative measurement are recognized as the price of our cultural need for numbers, the position of these tests becomes relatively unassailable: within their context they do a pretty good job. But do we need them?

Nothing focuses the debate over the value of standardized tests better than their use in comparing the United States against other countries. In the seventies and eighties, when first the Arabs and then the Japanese looked as if they would threaten U.S. economic dominance in the world, politicians started looking for scapegoats. Public education seemed like a good target. It’s close to home, costs a lot, and like the mail, people have complained about it for years. (I’m a fan of the USPS, and wish people would see its occasional screw-ups against the context of what it accomplishes.) Moreover, the primary indices for measuring education (standardized tests like the Scholastic Aptitude Test and the Third International Mathematics and Science Study) seemed to corroborate the putative decline. Defenders of public education were quick to protest, but because these tests supply those seductive and incontrovertible numbers, the tests themselves became part of the controversy.

Arguments Against Standardized Tests

There are some strong arguments against standardized testing, and there are good reasons why test results should not be interpreted as definitive evidence of the deficiency of the U.S. public education system. Most of the good arguments against them revolve around the conflict between Jack’s educational objectives and “assessment needs”. I remember resenting the SATs in high school because I felt they tested only my test-taking ability. I got bored quickly and lost interest. It may be that in other places these things are practiced and even rewarded, but where I went to school there wasn’t much real-life value placed on bubble tests: I put as little energy into them as possible. Who is to say this almost admirable trend hasn’t continued, explaining sinking SAT scores in the last twenty years? On the other hand, I got relatively embarrassing marks on those tests and will carry them to my grave. It never did me any good to be labeled with those numbers.

Standardized tests are also criticized for their lack of relevance. I recently heard a spokesperson for ETS on the radio. She claimed the day that computers will be able to check “open ended questions” is closer than most people imagine. Even so, the fact remains that the scope of these tests is still largely limited to multiple choice questions, and therefore cannot really address higher order thinking. Even where students enter their own replies (as is the case in some math applications) the machines can only check for correct answers; they cannot judge method. Multiple choice tests are less forgiving and less demanding than life itself. It shouldn’t be surprising that they are very poor predictors of real life “success” (whatever that means). It is well documented that SATs predict first year college achievement quite well, and indicate family income almost perfectly. Otherwise they’re a bust. This being the case, why should any standardized test be trusted?

Tests like the TIMSS (Third International Mathematics and Science Study) are more sophisticated than the SATs, and handle most of these charges fairly adroitly. TIMSS supposedly contains a system designed to distinguish test-taking ability from actual aptitude, it includes truly open-ended components to test higher order thinking, and it even puts video cameras into classrooms to judge academic interaction. Teachers are in effect tested along with their students. Even so, there are still legitimate complaints against the TIMSS. The sampling of U.S. students taking the test cuts across our economic spectrum. This is not the case in many of the other countries represented in the test. Certain sectors of our society did exceptionally well on TIMSS; it stands to reason that the elite of other societies would do better than our average. Likewise, students from other countries were often older and had more schooling. Had all things been equal, we might have fared better.

For all it may be cloaked in the language of open questions and videotaped classes, in the end TIMSS is just a standardized assessment device, and it suffers all the usual pitfalls of non-qualitative research. The results go through a further iteration of national averaging, and the “truth” about the individuals who took the tests gets lost forever. TIMSS officials are ready to admit their suspicion that U.S. students generally approach problems with more creativity than those from other nations, even though this does not show in the scoring. It is also probable that some countries produce a steady stream of solid performers, while others have a wide range from bad to decidedly brilliant. TIMSS does not suggest what this might imply regarding the relative standing of different countries, if it’s true. It is the nature of all quantitative assessment that despite the undeniability of numeric outcomes, what the numbers really mean is far from certain.
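A toy illustration may make the point about averaging concrete. The two score lists below are entirely hypothetical, invented for this sketch, and have nothing to do with actual TIMSS data; the names `steady_country` and `wide_country` are likewise made up. One imaginary country produces steady, solid performers and the other ranges from bad to brilliant, yet a national average cannot tell them apart.

```python
from statistics import mean, pstdev

# HYPOTHETICAL scores, invented purely for illustration; not TIMSS results.
steady_country = [48, 49, 50, 50, 51, 52]   # a steady stream of solid performers
wide_country = [20, 30, 45, 55, 70, 80]     # a wide range from bad to brilliant

for name, scores in [("steady", steady_country), ("wide", wide_country)]:
    # The mean is the only figure a national ranking would report;
    # the population standard deviation shows the spread that the mean hides.
    print(name, "mean:", mean(scores), "spread:", round(pstdev(scores), 1))
```

Both lists average exactly 50, so a ranking built on averages would place these two imaginary countries side by side, even though their classrooms would look nothing alike.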

There may be a few more good reasons to doubt the validity of United States TIMSS results, but many of the others sound like a catalogue of ill-conceived excuses. For example, it is argued that we do poorly because:

• local, rather than federal, funding and direction of schools results in inconsistent quality
• U.S. education is “a mile wide and an inch deep”
• we spend more time than other countries on drill, and not enough time on problem-solving methods
• schools in the United States have yet to adopt a consistent and appropriate curriculum
• market-driven textbooks, rather than course-specific materials, are “over-stuffed and undernourished”
• our classrooms are constantly interrupted by non-academic distractions
• many of our children live in poverty, and took the test on an empty stomach

These arguments may explain why we can’t perform better on standardized tests, but they do not disprove the legitimacy of the TIMSS or other international tests. In fact, far from contradicting deficiencies in U.S. education they explain TIMSS results and reemphasize how far we need to go to improve education and cure the social ills of our society.

Finally, it is also true that some of the accusations leveled at TIMSS are complete nonsense. The SATs have been shown to have a racist and classist bias. Historically they were mostly written for white middle class America, and currently they are still largely written by white middle class Americans. Cultural bias is very hard to avoid. Even so, it has been suggested that the TIMSS is culturally biased against North Americans. This is ridiculous. Although many nations contributed to the design and realization of the test, the group was based in Boston, Massachusetts, and dominated by people from this country. If there were any bias it would have gone against the other countries, not us. It is also suggested that the domination of foreign students and recent immigrants in our high school and college level math and science departments has a negative effect on our national average. It’s hard enough to imagine what the relevance is here, let alone how it hurts U.S. scores.

Arguments In Favor of Standardized Tests

Arguments in favor of TIMSS and other standardized tests are neither as numerous nor as desperate as those against them. The authority of the testing experts and the overwhelming power of numbers in our society place the burden of proof on those who would find fault with testing results. Test statistics are consistently quoted by respected politicians and journalists as if they were beyond question. ETS is celebrating its proud 50th anniversary this year. At the TIMSS website, Boston cognoscenti nonchalantly knock down all questions and criticisms of standard tests (and the TIMSS in particular) one by one. And then, more than any other factor, the numeric accuracy of standardized tests makes them practically immune to reproach. Whether this impartiality has any real basis in fact is almost a moot point: real or imagined, the concreteness of numbers catches the imagination of the public. Bolivia? Two zip. García Meza? Who cares?

There are, however, some legitimate arguments to support the conclusion that standardized tests reflect deficiencies in U.S. education. The strongest of these is the corroborating list of excuses above. If a significant number of our kids are prevented from learning in school because they haven’t gotten enough breakfast there is something wrong with our society. If we have to feed our students in order for them to learn, food should be part of the academic program. If the current funding of schools is uneven and prejudicial, it needs to be changed. If our curriculum is weak, our textbooks inadequate, our academics shallow, and our classes constantly interrupted by announcements, there is something seriously deficient about our educational system. Perhaps it’s even time we considered teaching algebra before high school, like the rest of the world: it doesn’t seem to harm them. TIMSS and other standardized tests clearly help bring these problems to light.

As a math teacher in a boarding school with a large population of foreign nationals, I get empirical confirmation of the TIMSS every day. Last year I taught three sections of Algebra II. The two in the morning were mostly Juniors and Seniors from the U.S. In the afternoon I had a class of freshmen, mostly Korean and Taiwanese. Not only could these Asian freshmen run mathematical circles around the North American upperclassmen, they frequently corrected or at least improved upon my math. The rest of the curriculum is much harder for these students, as they battle the language barrier, but judging from their grades they are still consistently among the best students in the school. It isn’t a complicated equation, either. Walk into any dorm any night and you’ll find North Americans sprawled out gabbing or watching TV, while the Koreans are in their rooms, at their desks, studying and doing homework. They get better scores because they work for better scores.

Depending on how it’s looked at, the objectivity of standardized tests can be a valid point in their favor. In soccer games, 99% of what happens on the field is ultimately ignored: only the goals are counted. Likewise, no major test, regardless of how open or politically correct it may be, will ever be able to take all factors into consideration. There will always be subjective grading criteria. The scope of a country’s educational system and the maturity of its students, their experience with similar tests, and even their affluence or class, are all factors that impact national achievement. It may be harder to accept than in a game, but in the end, none of these factors count. Many people attack TIMSS and other achievement tests, but nobody seems interested in questioning the idea of testing itself, just as nobody questions soccer. As long as we accept testing, we’ll have to admit that standardized tests are already as objective and comprehensive as any are ever likely to be.

Conclusion

Americans have a very strange relationship with numbers. A few years ago a couple of isolated terrorist incidents in Europe reduced U.S. tourist traffic there by 20 or 30%, even though the chances of being involved in trouble were about the same as those of being struck by lightning in your basement. Intelligent, well-educated Americans believe that if there’s a 50% chance of rain on Saturday and a 50% chance of rain on Sunday, there’s a 100% chance it will rain over the weekend. People who would be mortified if they accidentally said “funner” are completely unembarrassed about not being able to calculate small change. When we hear the U.S. placed twenty-first out of twenty-one countries on a math test, we’re certain we’re about to lose our place in the global economy and the Free World, which is ridiculous. It is indeed ironic that our national innumeracy contributes to our misunderstanding (and denial) of tests that claim to measure our innumeracy.
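For the record, the weekend forecast works out as follows. This is only a sketch, and it rests on one loud assumption that no forecast actually promises: that Saturday’s weather and Sunday’s are independent of each other. The function name `chance_of_rain_over_weekend` is made up for the illustration.

```python
def chance_of_rain_over_weekend(p_saturday: float, p_sunday: float) -> float:
    """Probability of rain on at least one of the two days.

    ASSUMPTION: the two days are independent, which real weather rarely is.
    """
    # The weekend stays dry only if BOTH days stay dry.
    p_dry_both_days = (1.0 - p_saturday) * (1.0 - p_sunday)
    return 1.0 - p_dry_both_days


print(chance_of_rain_over_weekend(0.5, 0.5))  # 0.75, not 1.0
```

Under that assumption the chance of a wet weekend is 75%, because the probabilities of the two dry days multiply; the percentages themselves cannot simply be added.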

Looking at all the arguments objectively, it’s hard to escape the conclusion that those supporting standardized tests have a slight upper hand. Nevertheless, the debate inevitably boils down to a value judgement between the quantitative and the qualitative, soccer vs. García Meza. Personally, I will always be more interested in García Meza. I want a diagnosis that will help me improve, not a number. On the other hand, I’m afraid I’m a minority party pooper in a stadium of sports fans. My students typically ignore my comments and complain when they can’t find a grade on their work. Standardized tests may not be helpful in the truest sense, and may even distort the truth about the educational system in this country. They do, however, provide us with the tangible figures our culture seems to respond to, and as long as they are triangulated with other, more responsible and qualitative measures, their findings must be treated seriously.

philes/timss.html; written/revised 01 September 2011
copyleft 2011 James Gosselink