It is often desired to extract more information from a test score than is available in a single number. The almost universal response to this desire is to divide the overall test score into subcomponents, or subscores (e.g., math and verbal scores, reading fluency and reading comprehension). We summarize the rules governing the legitimate use of subscores and report how often, in modern practice, it is done correctly. In short, dividing a test into subscores reduces its reliability and, consequently, its validity. Using the military’s ASVAB test as an example, we show that the overall score is the only good predictor of later performance and that the nine subtests are not effective in differentiating types of skills and knowledge.
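The reliability cost of subscores follows directly from test length, as the Spearman-Brown prophecy formula shows. The sketch below uses illustrative numbers (a 100-item test with reliability 0.90 split into four equal subscores), not figures from the chapter:

```python
def spearman_brown(rho: float, k: float) -> float:
    """Reliability of a test k times the length of the original,
    given the original reliability rho (Spearman-Brown prophecy formula)."""
    return k * rho / (1 + (k - 1) * rho)

# Illustrative numbers (assumed, not from the chapter): a 100-item test
# with reliability 0.90 broken into four 25-item subscores (k = 0.25).
full = 0.90
sub = spearman_brown(full, k=25 / 100)
print(f"full-test reliability : {full:.2f}")   # 0.90
print(f"25-item subscore      : {sub:.3f}")    # ~0.692
```

Each quarter-length subscore carries far more measurement error than the total, which is why the total tends to remain the only trustworthy predictor.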
Zombie ideas are awful ideas that ought to be dead, but which keep getting revivified and so are still walking among us. Three prominent zombies which we discuss are:
1. coaching for admissions tests gives a large unfair advantage;
2. admitting strivers (kids from lower SES who score higher than expected) makes things fairer;
3. making tests optional makes things fairer.
Test coaching companies like Princeton Review and Kaplan often claim that they can raise a person’s SAT score by over 100 points. The evidence used to support such claims typically comes from a pre-post design in which the student takes the test, receives coaching, and then takes the test again. In rigorous studies that include a control group of students who simply take the test twice, retest gains of 80–90 points are typical even without coaching. Thus, the gains attributable to coaching are much smaller than claimed. Strivers are students who score higher than expected given their family income. Some have claimed that a striver who scores 1,000 on the SAT is really more like a 1,050 because of the hardship overcome. However, because of regression to the mean, such students typically perform in college more like an SAT score of 950 would predict. Finally, many colleges have chosen to give applicants the option of whether to include SAT or ACT scores in their materials. Unfortunately, the data suggest that this is a bad idea.
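The strivers arithmetic is a textbook shrinkage calculation. In the sketch below, the group mean (900) and test-criterion correlation (0.5) are assumed values, chosen only so the numbers land on the 1,000-observed, 950-predicted example above:

```python
def regressed_prediction(observed: float, mean: float, r: float) -> float:
    """Best linear prediction of later standing: the observed score
    shrunk toward the group mean by the test-criterion correlation r."""
    return mean + r * (observed - mean)

# Assumed illustrative values: with a group mean of 900 and r = 0.5,
# an observed 1,000 predicts like a 950, not the hoped-for 1,050.
print(regressed_prediction(observed=1000, mean=900, r=0.5))  # 950.0
```

The same formula explains why adding points to a striver’s score moves the prediction in exactly the wrong direction.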
A continuing example of testing misuse involves standardized tests administered in K-12 education. The results of such tests have been used not only to evaluate students, but also schools, school personnel (e.g., teachers, principals, superintendents), and programs. We focus on one highly touted methodology, value-added modeling (VAM), which has been advocated as a rigorous scientific solution to what was previously an area rife with subjectivity. Proponents of VAM claim that a better measure of teacher performance is the amount of academic growth students experience after receiving instruction from that teacher. We discuss both the technical and logical flaws of these models. First, claims that changes in student test scores are caused by teachers, administrators, or schools are extremely weak given the complete absence of experimental control. Second, the assumption that achievement tests given at the end of different grades can be placed on a common scale is nothing short of heroic, and it rests on very weak ground. Finally, missing data and small sample sizes make yearly growth estimates anything but reliable or valid. VAM is simply a well-intentioned, very bad idea.
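To make the critique concrete, here is a deliberately bare-bones sketch of what a value-added estimate computes, run on entirely synthetic data (the chapter does not endorse this or any particular VAM specification): a teacher’s “value added” is read off as the average residual from a regression of current-year scores on prior-year scores, a quantity that inherits every scaling, missing-data, and no-randomization problem described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Entirely synthetic data: prior- and current-year scores for the
# students of two hypothetical teachers, one given a built-in +10 effect.
n = 50
prior = rng.normal(500, 50, size=2 * n)
teacher = np.repeat([0, 1], n)
current = (0.8 * prior + 100 + np.where(teacher == 1, 10.0, 0.0)
           + rng.normal(0, 30, size=2 * n))

# The bare-bones VAM: regress current on prior scores, then call each
# teacher's mean residual that teacher's "value added".
slope, intercept = np.polyfit(prior, current, 1)
residuals = current - (slope * prior + intercept)
for t in (0, 1):
    print(f"teacher {t}: estimated value added = {residuals[teacher == t].mean():+.1f}")
```

Even in this idealized toy, the estimates bounce around with the noise term; with real classrooms of 20–30 students, nonrandom assignment, and missing records, the bounce is far worse.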
When decisions are made, there is a cost to making a mistake. This cost is often different for an erroneous positive decision than for an erroneous negative one. Decisions based on test scores are no different. We discuss this issue and provide several evocative examples. For admissions testing, two kinds of errors can be made: accepting a student who should not have been accepted (i.e., one who won’t graduate) and rejecting a student who would have graduated. The latter error usually carries little lasting cost, because the rejected student can simply enroll elsewhere. But the former can waste both time and money. The proper use of test scores reduces both types of errors. For licensing tests, passing someone who should not have passed can have consequences more serious than time and money. An airline pilot’s lack of knowledge and skills can lead to a crash; a doctor’s inadequacies and incompetence can lead to deaths. Using test scores can save lives.
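The asymmetry can be made concrete with a back-of-the-envelope expected-cost calculation; every number below is assumed purely for illustration:

```python
# All numbers are assumed, purely to make the asymmetry concrete.
# A false accept (admitting a student who won't graduate) wastes tuition
# and seat time; a false reject mostly costs a second application,
# since the rejected student can usually enroll elsewhere.
p_false_accept, cost_false_accept = 0.10, 50_000   # rate, $ per error
p_false_reject, cost_false_reject = 0.15, 2_000    # rate, $ per error

expected_cost = (p_false_accept * cost_false_accept
                 + p_false_reject * cost_false_reject)
print(f"expected cost per admissions decision: ${expected_cost:,.0f}")  # $5,300
```

With asymmetric costs like these, the cut score that minimizes expected cost is not the one that minimizes the total number of errors, which is the central point of the chapter.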
Horace Mann can be credited with beginning accountability and high-stakes testing in K-12 education in the 1800s. This was also the beginning of test fraud. Terman later developed the National Intelligence Tests for K-12, followed by the Stanford Achievement Test and the Iowa Tests of Basic Skills. Results of such tests have been used, unwisely, to drive school reform efforts. The National Assessment of Educational Progress (NAEP), the Moynihan and Coleman reports of the 1960s, and A Nation at Risk in the 1980s continued to drive educational reform efforts such as No Child Left Behind, Race to the Top, and the Every Student Succeeds Act. Using test scores to make decisions about hiring and firing teachers and administrators is ill-advised. Reform efforts over the past 60 years have not reduced the achievement gap. K-12 tests reveal societal, not educational, shortcomings.
The use of test scores as evidence to support the claims made for them requires an understanding of causal inference. We provide a careful discussion of the modern theory of causal inference with numerous evocative illustrations, including an admissions policy at the University of Nebraska, the 1854 London cholera epidemic, and the 1960s decline in SAT scores. We show how evidence drawn from test scores is comparable to credible evidence from other widely accepted sources. Rubin’s model for causal inference is explained, and the importance of manipulation, random assignment, potential outcomes, and a control group is emphasized. The Tennessee Class Size Experiment of the 1980s is one of the best examples of how to measure the effects of a cause. Finally, we show how the size of the causal effect of fracking on earthquakes in Oklahoma can be established in an observational study by mirroring the structure of an experiment. Measuring the size of the causal effects of testing and its alternatives requires data and control. Often, the data are kept hidden to avoid ruining the good with the truth.
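For readers who want the notation, Rubin’s model can be stated in one line (standard textbook form; the symbols below are ours, not necessarily the chapter’s):

```latex
% Rubin's potential-outcomes notation (standard textbook form; the
% symbols are ours). Unit i has outcome Y_i(1) if treated and Y_i(0)
% if not; only one of the two is ever observed for any unit.
\tau_i = Y_i(1) - Y_i(0), \qquad
\mathrm{ATE} = \mathbb{E}\!\left[\,Y(1) - Y(0)\,\right]
  \approx \bar{Y}_{\mathrm{treated}} - \bar{Y}_{\mathrm{control}}
  \quad \text{(under random assignment)}
```

Because only one potential outcome is ever observed for any unit, random assignment and a control group are what license replacing the unobservable average with the difference of observed group means.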
Many colleges that required SAT or ACT scores before the pandemic suspended that requirement during it. After the dangers of the pandemic subsided, most have not yet resumed their use. The arguments supporting continued suspension rest primarily on the fact that such tests, like most other tests, show differences among subgroups (e.g., races). We discuss the costs and benefits of no longer using such test scores in admissions decisions. College admissions tests were developed in the 1920s to level the playing field and allow more students to qualify for college. Carl Brigham developed the Scholastic Aptitude Test (SAT) in 1926, and the College Board soon adopted it. In 1959, the American College Test (ACT) was born. Neither test is biased against minorities; rather, they tend to overpredict minority performance in college. Yet, despite persistent group differences, the sentiment is to discontinue use of these tests. Doing so will place more weight on other metrics (e.g., high school GPA) that are less reliable, more subjective, and also prone to group differences. Admitting more students who are less likely to graduate comes with costs.
Although testing has been of remarkable value for millennia, and has improved steadily over the past century, it is now experiencing heightened public dissatisfaction, due in part to concerns about fairness and equity. We discuss some plausible causes of this apparent change in public attitudes. Only about 10% of all colleges and universities now require the ACT or SAT for admission. Fewer states are using tests to measure K-12 student progress or as a requirement for graduation. The major complaint is that tests stand in the way of improvement through inclusion; in reality, testing simply measures the improvement that has occurred as more groups have been included over the years, as the growing ranks of musical virtuosos and the falling world record in the mile illustrate. Admissions testing was first developed to improve the fairness of a system that relied on quotas. Compared with other metrics, tests are the only ones subjected to rigorous evaluation for reliability and validity.
Many occupations (e.g., teacher, pilot, air traffic controller, physician) require applicants to pass a licensing exam, the principal purpose of which is to protect the public from incompetent practitioners. These exams also sometimes show the same sorts of race and sex differences observed in other test scores. Thus, they too are susceptible to equity criticisms. We discuss the implications of eliminating such tests or even just lowering cutoff scores. Medical licensing has existed for over 1,000 years, but the U.S. did not begin licensing physicians until the late 1800s. Early exams were oral, subject to criticism for their lack of objectivity, and resulted in disaster in West Virginia. Ultimately, the National Board of Medical Examiners was formed, and multiple-choice exams replaced essay exams on the United States Medical Licensing Examination (USMLE). To get into medical school, undergraduates must take the Medical College Admission Test (MCAT). Like other tests, the MCAT reveals race and sex differences. The same is true of the tests used to license pilots and air traffic controllers. K-12 teacher licensing formally began with the National Teacher Examination (NTE) in 1940.
How far have we come? What strategies are most likely to help achieve our goals? What evidence must be gathered to go further? We have focused in this book on how tests provide valuable information when making decisions about whom to admit, whom to hire, whom to license, whom to award scholarships, and so on. Given limited resources, efficiency in selection is essential. However, tests used for these purposes also reveal race and sex differences that conflict with society’s desire for fairness. How do we make policies and decisions so as to maximize efficiency while minimizing adverse impact? There is no statistical solution to this problem. We suggest an approach that will get us closer to an acceptable solution than where we currently stand. The first step is to gather all relevant data so that any selection policy can be evaluated with respect to both kinds of errors. Second, make such data publicly available so that all interested parties have access and everything is transparent; as noted previously, such data are often withheld for fear of criticism. Third, establish causal connections between policies and outcomes. Finally, if considerations other than merit are important, those arguments should be made publicly, and modifications examined to measure the impact of policy adjustments.
We trace the origins of testing from its civil service roots in Xia Dynasty China 4,000 years ago, to the Middle East in Biblical times, to the monumental changes in psychometrics in the latter half of the twentieth century. The early twentieth century witnessed the birth of the multiple-choice test and a shift in focus toward measuring cognitive ability rather than knowledge of content, influenced greatly by IQ testing and US Army placement testing. Multiple-choice tests provided an objectivity in scoring that had previously eluded the standard essays used in college entrance exams. The field of testing began to take notice of measurement error and strove to minimize it. Computerized Adaptive Tests (CAT) were developed to measure a person’s ability accurately with the fewest items. The future advancement of testing depends on a continued process of experimentation to determine what improves measurement and what does not.
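The adaptive idea can be sketched in a few lines. The toy below uses a Rasch-like model with invented numbers and a simple staircase update, not the maximum-likelihood scoring operational CATs use: the next item administered is always the unused one whose difficulty is closest to the current ability estimate, which is where the item is most informative.

```python
import math
import random

# A toy adaptive test under a Rasch-like model. Illustrative sketch only:
# real CATs use maximum-likelihood or Bayesian scoring, not this staircase.

def p_correct(theta: float, b: float) -> float:
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

random.seed(1)
true_theta = 0.7                            # the examinee's (unknown) ability
bank = [-2 + 0.2 * i for i in range(21)]    # item difficulties from -2 to +2
theta, step = 0.0, 1.0                      # starting estimate and step size

for _ in range(8):
    b = min(bank, key=lambda d: abs(d - theta))  # closest item: most informative
    bank.remove(b)
    right = random.random() < p_correct(true_theta, b)
    theta += step if right else -step            # move the estimate up or down
    step *= 0.7                                  # shrink steps as evidence accrues
    print(f"item b={b:+.1f}  {'right' if right else 'wrong'}  theta={theta:+.2f}")
```

Matching item difficulty to the running estimate is what lets an adaptive test reach a given precision with far fewer items than a fixed form.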
Armed services tests have existed for centuries. We focus on the US armed services and how their tests have adapted to changing needs and purposes, and to the changing claims made for them. World War I provided the impetus for the first serious military testing program. An all-star group of psychologists convened in Vineland, New Jersey, and quickly constructed Army Alpha, which became a model for later group-administered, objective, multiple-choice tests. Military testing was the first program to move explicitly from very specialized tests for specific purposes to tests of generalized underlying ability. This made such tests suitable for situations not even considered initially. The practice was both widely followed and just as widely disparaged. The AGCT, AFQT, and ASVAB were later versions of this initial test. Army Alpha also influenced the creation of the SAT, ACT, GRE, LSAT, and MCAT. Decisions based on military tests, like those based on all tests, can be controversial. In 1965, Project 100,000 lowered the cut score, with the result that thousands of low-scoring men were drafted, many of whom later died fighting in Vietnam.