The following explains how to interpret the various exam reports produced by test scoring.
Professors are commonly concerned about the accuracy and fairness of their tests. Establishing the validity of a test is a fairly complicated process, but a simple and practical criterion comes from the test scores themselves. If a test’s total score is used as an anchor, then each item may be judged against this anchor. For example, did students who scored well on the whole test tend to get item 14 right? Did students who fared poorly on the whole exam tend to get item 14 wrong? What is the correlation between responses on item 14 and total test scores? Of course, if the whole test is flawed, these questions cannot give sensible answers. Statistical answers produced by the test scoring process cannot substitute for careful test preparation. But if the test covers its content well, and if its items are carefully written, then the total test score serves as a sturdy anchor.
The Item Analysis contains technical information about the test items and is extremely useful in revising a test. The report begins with a frequency distribution showing how many students achieved each score and the percentile rank associated with each score. At the bottom of the frequency distribution are several summary statistics, including a coefficient termed INTERNAL CONSISTENCY. Internal consistency is an estimate of test homogeneity, and answers the question: “How well do the test items represent a single domain?” If the estimate is high (say, .75 or better), the items seem to be measuring the same property. If the estimate is low, the test lacks homogeneity, and you should wonder whether it makes sense to total the items into a single score. This statistic may not be appropriate for criterion-referenced tests.
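The report does not state which formula produces its internal-consistency figure. One common estimate is Cronbach’s alpha (equivalent to KR-20 when every item is scored right/wrong), sketched below as an illustration under the assumption that each response is coded 0 or 1; the function name and input layout are hypothetical, not the scoring service’s actual code.

```python
# Sketch: Cronbach's alpha, an illustrative internal-consistency estimate.
# For 0/1-scored items this equals KR-20. Assumes at least two items and
# some variation in total scores (total_var > 0).

def cronbach_alpha(responses):
    """responses: one list per student, each a list of 0/1 item scores."""
    k = len(responses[0])                       # number of items
    totals = [sum(student) for student in responses]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([s[i] for s in responses]) for i in range(k)]
    total_var = variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

With perfectly homogeneous data (every student either all right or all wrong), the estimate reaches 1.0; values near the report’s .75 guideline indicate the items largely measure one property.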
The core of the item analysis is presented next. Appendix 1.gif gives an example. All scores in the class are ranked from highest to lowest. The upper 27% and the lower 27% of the scores are set aside; the middle 46% is ignored for the time being.*** These upper and lower groups represent the students who performed well on the test and those who did not. One indication of an item’s worth is whether it can distinguish these two groups. Under the columns headed A, B, C, D, E (or 1, 2, 3, 4, 5 if numeric responses were used), the behavior of the upper and lower groups is compared for each option of an item. The correct answer among A, B, C, D, E is noted with a # sign.
Let’s look at item 1 on this sample test (Appendix 1.gif). Option A is the correct answer and was chosen by 32 of 43 students. Options B, C, and D are called distractors. Option B appears to be a good distractor: a total of 10 students selected it, and 6 of them were in the low group. Option C is not a good distractor because no student selected it, and Option D was selected by only 1 student. The last line under Item 1, labeled “Mean,” gives the average exam score of the students selecting each option. Under the column headers WRONG and RIGHT, the whole class is considered: these columns simply show how many students got the item wrong or right, along with the mean test scores of those two groups.
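The per-option counts read from the report can be sketched as a simple tally. This is a minimal illustration of the idea, not the scoring service’s actual code; the function name and inputs are assumptions.

```python
# Sketch: tally how many students in the upper and lower groups chose
# each option of one item, mirroring the A,B,C,D,E columns in the report.
from collections import Counter

def option_breakdown(upper_responses, lower_responses, options="ABCDE"):
    """Each argument is a list of single-letter responses for one item."""
    up, low = Counter(upper_responses), Counter(lower_responses)
    # Pair each option with its (upper-group, lower-group) counts.
    return {opt: (up[opt], low[opt]) for opt in options}
```

A good distractor, like Option B above, shows up with more selections in the lower group than in the upper group.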
All of these small comparisons of upper and lower groups are combined into a single index in the far right column of the page. The discrimination index*** (titled DISCRIM) is the proportion of the upper group that got the item right, minus the proportion of the lower group. The index ranges from +1.00 (the whole top group got the item right; the whole bottom group got it wrong) to -1.00 (the entire upper group missed the item; the entire bottom group got it right). Obviously, a high discrimination index is a good sign. A negative index usually indicates a poorly constructed or ambiguous item, or it may be a sign that you have marked the wrong answer on your answer key. In general, coefficients of .20 and higher are suitable for achievement tests. Notice that item 1 has only fair discrimination because several people in the low group also selected Option A.
On the right of the page, two other pieces of information surround the discrimination index. The EASINESS figure is the percentage of the whole class that got the item right.**** Easiness figures are challenging to interpret. For example, an item may show that 97% of a class got it right. Do you ascribe that to your stellar teaching, or have you written an exceedingly easy item?
Finally, R=PBISER stands for the Point-BISERial correlation. This is a special case of the correlation coefficient most of us are familiar with: it shows the relationship between getting an item right and the total test score. So, item 1’s point-biserial coefficient of .42 shows a modest connection between performance on this item and performance on the test as a whole. If an item ever gave a point-biserial correlation of 1.00, that item would be equivalent to the entire test!
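The point-biserial is simply the Pearson correlation between a 0/1 item score and the total test score, and one standard closed form is sketched below. The function name and inputs are illustrative assumptions; the report’s own computation may differ in details such as whether the item is excluded from the total.

```python
# Sketch: point-biserial correlation between a right/wrong item (0/1)
# and the total test score, using population standard deviation.
# Assumes at least one student got the item right and totals vary.
import math

def point_biserial(item_right, totals):
    """item_right: list of 0/1 scores; totals: matching total test scores."""
    n = len(totals)
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    p = sum(item_right) / n                   # proportion answering correctly
    q = 1 - p
    mean_right = (sum(t for r, t in zip(item_right, totals) if r)
                  / sum(item_right))          # mean total among the correct
    return (mean_right - mean_t) / sd_t * math.sqrt(p / q)
```

When the item perfectly tracks the total score, the coefficient reaches 1.00, matching the report’s interpretation that such an item would stand in for the whole test.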
*** When two or more scores fall at the 27th or the 73rd percentile, all of the tied scores are dumped into the upper or lower category. Thus, the claim of exactly 27% of the scores is accurate only when there is a single score at those percentile ranks.
**** Please note that the EASINESS index was formerly called the “difficulty” index, and was the inverse of the figure presently used.