What's going on here? PART ONE

Mike Salovesh (t20mxs1@CORN.CSO.NIU.EDU)
Fri, 26 Apr 1996 07:10:11 -0500

I've just had a very strange thing happen on a test I gave to one of my
classes. I think that strange thing may be telling me something about a
culture shift of some sort. I wonder if other folks on this list have
seen anything like it in their classes. (The views of both professors and
students are eagerly solicited!) Or maybe there are other lines of
evidence that might point to the same sort of thing.

The problem, or strange thing, or culture shift, or whatever it is that
I'm picking up, appeared in a bunch of questions that all dealt with
different ways of asking about a related set of definitions and ideas
about kinship, social organization, and all that kind of stuff. The only
way I know to tell you about it is to give one helluva lot of background
here, and present the problem in my next message. Here's the background:

For more years than I want to remember, I have been teaching a freshman-
level General Introduction To All Of Anthropology. Here at NIU, we do
that course in several sections, each taught independently by a faculty
member with help from graduate teaching assistants. As a general rule,
the section I teach enrolls just about as many students as there are
seats in the room assigned to me -- usually, about 120 people.

Ideally, I view exams as part of the learning process. If they really are
to function that way, I believe that students have to know the results as
quickly as possible, meaning at the very next class meeting. There's no
way I could do a proper job of grading 120 essay tests in a turnaround
time of 48 hours. (Grading essay tests is MY responsibility, since I'm
the one who signs the grade sheets eventually. My TA's almost always are
first year grad students, anyhow, and it wouldn't be right to foist the
job off on them.)

By historical accident, I have had special training in constructing
multiple choice tests. I happen to know quite a lot about how to do it
and get some kind of defensible results. So students in my intro course
take multiple choice tests. I try to make sure they're GOOD tests.

SIDE COMMENT: Textbook publishers regularly provide test banks or
teachers' manuals or even computer disks full of multiple-choice questions
as an inducement to get professors to require the purchase of their books
rather than one from some other publisher. From a technical point of
view, every single collection of this sort that I have seen stinks. It is
obvious that whoever writes the questions assembled in any one of these
accursed things knows nothing about designing multiple-choice tests.

Check it out for yourself. Here are two shortcomings that are all but

1) Test items in teachers' manuals frequently take the form "Which of the
following is NOT . . . " That is just plain bad question-writing. It has
been demonstrated again and again that when a multiple choice question
follows that form, the statistical validity of the item is markedly lower
than that of other ways of asking about the same material. Correlation
coefficients are likely to be near zero, or even negative, meaning that
there is little or no relationship between the likelihood that a student
gave the desired response to such a question and the student's standing
relative to the rest of the class as measured by the exam as a whole.
(Let me say that in human talk. A positive correlation coefficient for a
single question means that students who get a high score on the test as a
whole are more likely to get the question right than students who get a
low score on the whole test. The higher the student's score, the more
likely it is that the student got this particular item right.)

2) There's an even more simple-minded tendency in published test banks.
Let me get at it by asking a question: If one of the responses to a test
item is "E: None of the above", how often should that be the desired
response (the "right answer", if you will)? The use of the letter E
implies that there are five possible responses. It shouldn't take a
course in statistics to figure out that "E: None of the above" should be
the right answer about one time in five. If it's right considerably more
often than that, a sophisticated test-taker will tend to get the question
right without even having to read the rest of the responses. The teachers'
manuals and test banks I get from publishers often use "E: None of the
above". It's the desired response anywhere from 80 % of the time up to
all of the time, depending on the source. (Responses in forms resembling
"E: Both A and C, above" are worse; they almost invariably are the desired
response. Here's where saying "desired response" instead of "the right
answer" is absolutely necessary: if that response gets credit, it can only
be because A is a correct answer AND C is a correct answer, and I would
hate to have to try to tell students they got the question wrong by
marking an answer that is perfectly all right by itself.) END OF SIDE COMMENT
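
The one-time-in-five argument above can be checked against any answer key
with a few lines of Python. This is only a rough sketch: the function
names, the assumption that "None of the above" always sits at letter E,
and the "twice the expected rate" warning threshold are all illustrative
choices, not anything taken from a publisher's test bank.

```python
def keyed_rate(answer_key, nota_letter="E"):
    """Share of items whose desired response is the 'None of the above' option.

    answer_key: sequence of keyed letters, one per item.
    nota_letter: letter assumed to hold 'None of the above' on every item.
    """
    keyed = sum(1 for letter in answer_key if letter == nota_letter)
    return keyed / len(answer_key)


def looks_guessable(answer_key, n_options=5, nota_letter="E", slack=2.0):
    """True if 'None of the above' is keyed far more often than 1/n_options.

    slack is an arbitrary multiplier: 2.0 means "more than twice as often
    as an even spread over the options would give".
    """
    expected = 1 / n_options
    return keyed_rate(answer_key, nota_letter) > slack * expected


# A hypothetical ten-item key in which E is the desired response 8 times.
lopsided_key = list("EEEEAEEEEB")
print(keyed_rate(lopsided_key), looks_guessable(lopsided_key))
```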

Our Office of Testing Services runs the Scantron answer sheets from my
tests through a program that does lots more than count right answers.
When I get back the test results, I also get a good statistical item
analysis of the exam, question by question. I use that analysis as a
check on the validity of single items on the test, and as an overall
check on the exam as a whole.

One major function of giving tests is to make distinctions among
students. It is a simple statistical quirk that you get the maximum
dispersion (or the smallest number of occasions where two or more
students get identical scores) if the average score on the test as a
whole is around 50 % right. That holds for individual items within the
test, too. Any question that the entire class gets right, or that the
entire class gets wrong, gives you no information about differences among
students. It's a waste of ink. Within an individual item, any response
that is not taken by at least some students gives you no information
about differences, either. It's also a waste of ink.
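
For anyone who wants to see the arithmetic behind that quirk, here is a
minimal sketch. It assumes each item is scored simply right (1) or wrong
(0); the function names are made up for illustration. The spread of a
right/wrong item with proportion p correct is p * (1 - p), which peaks
at p = 0.5 and drops to zero when everybody is right or everybody is
wrong.

```python
def item_difficulty(responses):
    """Proportion of students who got the item right (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def item_variance(p):
    """Variance of a right/wrong item whose proportion correct is p."""
    return p * (1 - p)


# Print the spread for a range of difficulty values; the largest value
# appears at p = 0.5 and the ends of the range give 0.
for tenths in range(0, 11):
    p = tenths / 10
    print(p, round(item_variance(p), 2))
```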

So I come close to busting a gut to design tests where the average
question has an index of difficulty of 0.5, and the range of that index
is from about 0.2 to about 0.8. I'm very unhappy with any
individual response that doesn't attract at least three or four
people, and if I use that item over again in a later semester, I
change the response that fails that criterion to make it more
attractive yet still undeniably wrong.
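
Those design criteria are easy to turn into a mechanical check. The
sketch below is hypothetical: it assumes five lettered responses per
item, takes "three or four people" to mean a minimum of three, and
invents the name check_item. It is not the program our Office of
Testing Services runs.

```python
from collections import Counter


def check_item(choices, key, options="ABCDE", min_takers=3,
               p_low=0.2, p_high=0.8):
    """Check one item against the difficulty range and the distractor rule.

    choices: the letter each student marked for this item.
    key: the desired response.
    Returns the index of difficulty, whether it falls inside
    [p_low, p_high], and the wrong responses chosen by fewer than
    min_takers students.
    """
    counts = Counter(choices)
    difficulty = counts[key] / len(choices)
    weak = [o for o in options if o != key and counts[o] < min_takers]
    return {
        "difficulty": difficulty,
        "in_range": p_low <= difficulty <= p_high,
        "weak_distractors": weak,
    }


# Made-up tallies for a class of 120: response E attracts only one student.
marks = list("A" * 60 + "B" * 30 + "C" * 20 + "D" * 9 + "E" * 1)
print(check_item(marks, key="A"))
```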

Above all, I want each and every question to show a positive correlation
(or, as the reports I get call it, a discrimination index) with overall
results on the test. To restate, what that means is that the higher the score a
student gets on the test as a whole, the greater the likelihood that
student got this particular item right. The lower the total score, the
greater the likelihood that the student got this question wrong.
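
One common way to compute such an index is the ordinary Pearson
correlation between right/wrong on a single item and the total test
score (sometimes called a point-biserial correlation). Whether that is
exactly the formula behind the reports I get is an assumption; the
sketch below just shows the idea, with made-up scores.

```python
from math import sqrt


def discrimination_index(item_correct, total_scores):
    """Pearson correlation between item correctness (0/1) and total score.

    Returns 0.0 when either list has no variation, since an item that
    everyone gets right (or wrong) cannot distinguish among students.
    """
    n = len(item_correct)
    mean_x = sum(item_correct) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_correct, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_correct)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Four hypothetical students: the two high scorers got the item right.
print(discrimination_index([0, 0, 1, 1], [10, 20, 30, 40]))
# Same scores, but now the two low scorers got it right: negative index.
print(discrimination_index([1, 1, 0, 0], [10, 20, 30, 40]))
```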

Although one or two students manage to get scores that are lower than
chance expectation on one test or another during a semester, that's rare;
given the criteria I shoot for in the test as a whole, I don't ever expect
to see anyone get a perfect score. Certainly nobody ever has in my

With all that I've tried to say about what I do on these tests, it's
obvious that grading is on a curve directly tied to performance on the
test. As a rule of thumb, my cutoff points on individual tests give A's
to about the 90th percentile on up, B's from the 70th percentile to the 90th
percentile, C's to the 25th through 70th percentiles, and something less
to those below the 25th percentile. (F's are limited to those whose
scores are on the other side of a notable discontinuity in the
distribution. There always are a few of these; they never have run above
3 % of the class.)

Grades for the course as a whole are more generous than the scale I use
for individual tests. In the end, I calculate a GPA for the class as a
whole, and aim for it to be somewhere between C and B. That's not
entirely arbitrary; I try to stick close to the average GPA in our College
of Liberal Arts and Sciences. VERY close.

The whole point to all the trouble I take with these tests is that I want
to have some kind of check on how well I'm getting different kinds of
points across to my students. I take a very close look at any question
that doesn't produce the results I tried to design it for, so I can try
to find out why. Maybe the trouble will turn out to be that I didn't
give enough time in class to a particular set of ideas. Maybe the
problem could be that the author of whatever textbook I'm using at the
time simply doesn't make certain specific points clearly. (I might miss
that, because I am supposed to know a fair amount of anthropology
myself. I tend to forget how all this stuff hit me the first time I
heard about it when taking my first anthro courses as a student.)

Sometimes, of course, shortcomings that show up in the analysis of a
question simply mean that I've written a bad question. That's very easy
to do, believe me! In fact, I expect to see serious flaws turn up in any
newly written question when I use it for the first time. That's why I
pay such close attention to the item analyses!

To be pretty sure that the results of a test are going to come close to my
design criteria, I use questions that have a known track record because
I've used them in the past. I simply throw out questions that give
results that are outside the limits I have set. To keep some kind of
control over the test design, I try very hard to limit new questions to
not more than 25 or 30 per cent of the questions on any test. Most of the
questions that got anomalous results on this last go-round were questions
that had been dependable in the past.

OK, that's the background. I'll describe the surprising results of my
last test in my next message.

--mike salovesh, anthropology department <salovesh@niu.edu>
northern illinois university PEACE !