The primary goal of any student assessment is to provide valid and reliable evaluations of students’ knowledge and skills, as well as accurate feedback to students about their performance (Al Mahmoud, Elzubeir, Shaban, & Branicki, 2015). We are increasingly dependent on multiple choice questions (MCQs) as the sole tool for assessment because they are a valid, objective and cost-effective tool of assessment.
Examination malpractices are acts that contravene the rules and regulations which govern the conduct of examinations (Ollennu, 2015).
Randomizing test items into different versions that include the same questions can minimize the chance of cheating by students while keeping the level of difficulty of the exam constant across students, since every version contains the same questions (Sue, 2009).
Several studies have suggested that changing the position of an item on an operational exam relative to its position during trial testing development leads to a change in the difficulty of the item (Schroeder, Murphy, & Holme, 2012).
The difficulty index, symbolized as p, can range from 0 (no one selected the keyed option) to 1.00 (everyone selected it). Naturally, overall test scores tend to be higher when the items on a test have higher p values, and vice versa (Di Battista & Kurzawa, 2011).
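As a minimal sketch of this relationship, the Python snippet below computes p for each item as the proportion of examinees who selected the keyed option, and checks that, when every item carries one mark, the mean test score equals the sum of the p values. The answer key and response sheets are hypothetical and are not data from this study; the function names are illustrative only.

```python
def difficulty_index(responses, key):
    """Return p for each item: the share of examinees who chose the keyed option."""
    n_examinees = len(responses)
    return [
        sum(1 for sheet in responses if sheet[i] == key[i]) / n_examinees
        for i in range(len(key))
    ]


def total_scores(responses, key):
    """Return each examinee's score, counting one mark per correct item."""
    return [sum(1 for answer, keyed in zip(sheet, key) if answer == keyed)
            for sheet in responses]


# Hypothetical data: 4 examinees answering 3 items (not from the study).
key = ["B", "D", "A"]
responses = [
    ["B", "D", "A"],
    ["B", "C", "A"],
    ["B", "D", "C"],
    ["B", "C", "C"],
]

p_values = difficulty_index(responses, key)
scores = total_scores(responses, key)
mean_score = sum(scores) / len(scores)

print(p_values)                   # one p value per item
print(mean_score, sum(p_values))  # the two figures agree for one-mark items
```

Because the mean score is the sum of the item p values, a test made of easier items (higher p) necessarily yields a higher average score.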
In our institution (King Khalid University, Faculty of Medicine, Saudi Arabia), test items are randomized into four versions (A, B, C and D) of the same test to prevent cheating; this is a mandatory requirement for every test before approval by the academic office.
Each version contains the same test items in a different order. In version (A), questions are ordered according to the coverage of the course material in class; in version (D), the questions are in the reverse sequence of version (A); and versions (B) and (C) are randomized. We observed that students who took version (A) finished the exam and handed in their papers earlier than students who took the other versions.
This study was conducted to investigate the effect of scrambling the test questions on student performance and difficulty index of each test version. The difficulty index of an item is the proportion of examinees who selected the keyed option.
A prospective, cross-sectional study was carried out in the College of Medicine, King Khalid University, Saudi Arabia. Participants were fifth-year undergraduate medical students who completed their major course in obstetrics and gynecology; the whole course lasts 8 weeks. After course blueprinting, three tests of the single-best-answer type were designed by the teachers who taught the course and were given at weeks 4, 6 and 8 during the second semester of the academic year 2017-2018.
There were four versions of each exam. In version (A), multiple choice questions were ordered according to material coverage in the class. In versions (B) and (C), multiple choice questions were placed in random order, that is, unrelated to the order in which the material was taught in the class. In version (D), multiple choice questions were placed in the reverse order of version (A).
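The four orderings can be sketched in Python as follows. This is only an illustration of the scheme described above, not the tool used to prepare the exams: the function name, the seed parameter and the item labels are assumptions, and the redraw step that keeps the four versions distinct is an added safeguard rather than something reported in the study.

```python
import random


def build_versions(items, seed=0):
    """Build four orderings of the same items: A (teaching order),
    B and C (random shuffles), D (reverse of A)."""
    if len(items) < 3:
        # With fewer than 3 items there are not enough distinct orderings.
        raise ValueError("need at least 3 items to build four distinct versions")
    rng = random.Random(seed)          # seeded so the versions can be reproduced
    version_a = list(items)
    version_d = version_a[::-1]
    taken = [version_a, version_d]
    shuffled = []
    for _ in range(2):
        # Redraw until the shuffle differs from every version built so far.
        while True:
            candidate = rng.sample(version_a, k=len(version_a))
            if candidate not in taken:
                break
        taken.append(candidate)
        shuffled.append(candidate)
    return {"A": version_a, "B": shuffled[0], "C": shuffled[1], "D": version_d}


# Hypothetical item labels, listed in the order the material was taught.
item_ids = [f"Q{i}" for i in range(1, 11)]
versions = build_versions(item_ids, seed=2018)
for name, order in versions.items():
    print(name, order)
```

Every version contains exactly the same items, so only the position of each item changes between versions.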
Ninety-eight (98) undergraduate medical students were randomly assigned to one of the four versions (A, B, C and D) each time they sat for an exam. Post-test item analysis was conducted for each exam, and the average student score and the difficulty index were calculated for each version of the three exams. The marks obtained by the candidates and the difficulty index of each version were entered into the Statistical Package for the Social Sciences (SPSS) version 20, and the marks of candidates in the four versions were compared through analysis of variance (ANOVA). A p-value of < 0.05 was considered statistically significant.
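For readers unfamiliar with the test, the sketch below shows how a one-way ANOVA F statistic is computed from the scores of several groups. It is a generic illustration written in plain Python, not the SPSS procedure used in this study, and the scores are hypothetical. The p-value is then read from the F distribution with the returned degrees of freedom; a statistics package (for example `scipy.stats.f_oneway`, if SciPy is available) reports it directly.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores out of 50 for versions A, B, C and D (not study data).
scores_by_version = {
    "A": [33, 35, 37],
    "B": [34, 36, 38],
    "C": [32, 34, 36],
    "D": [33, 35, 37],
}
f_stat, df_between, df_within = one_way_anova_f(list(scores_by_version.values()))
print(f_stat, df_between, df_within)
```

A small F value, as in this toy example, means the spread between version means is small relative to the spread of scores within each version.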
3. Results & Discussion
The difficulty level of each version in each test was recorded from the post-test item analysis and is reported as the mean difficulty index in Table 1: test 1 (0.69, 0.72, 0.69 and 0.68), test 2 (0.65, 0.66, 0.70 and 0.68) and test 3 (0.78, 0.78, 0.75 and 0.73) for versions A, B, C and D respectively. The versions were compared within each test through analysis of variance (ANOVA). No significant difference was found in the mean difficulty index between versions in any test. As shown in Table 2, the results for test 1 were (F = 0.99, p = 0.49), (F = 1.50, p = 0.16) and (F = 1.46, p = 0.17) for the comparisons of version A with B, C and D respectively, and similarly non-significant results were obtained when comparing versions B, C and D with version A in tests 2 and 3.
Table 3 shows the average student scores for each version of the three tests (tests 1 and 2 out of 50 marks, test 3 out of 60 marks): test 1 (34.9, 36, 34.5 and 34), test 2 (32.1, 33.1, 34.3 and 43.9) and test 3 (46.8, 47, 45.2 and 43.9) for versions A, B, C and D respectively.
Again, no statistically significant differences were found when the mean student scores of version A were compared with those of the other versions (B, C and D) using ANOVA in all three tests, as shown in Table 4: (F = 1.14, p = 0.42), (F = 0.75, p = 0.69) and (F = 1.29, p = 0.34) for test 1; (F = 0.84, p = 0.62), (F = 0.81, p = 0.64) and (F = 0.62, p = 0.79) for test 2; and (F = 0.62, p = 0.79), (F = 0.35, p = 0.95) and (F = 0.83, p = 0.64) for test 3.
Table 1. Descriptive analysis of difficulty index among the three MCQ tests.
N = Number of Test Items. SD = Standard Deviation.
Table 2. ANOVA comparisons of difficulty index among the three MCQ tests.
Level of significance is 5%.
Table 3. Descriptive analysis of the students’ scores among the three MCQ tests.
N = Number of Students per version. SD = Standard Deviation.
Table 4. ANOVA comparisons of students’ scores among the three MCQ tests.
To our knowledge, this is the first study comparing more than one version of scrambled but similar-content MCQ papers in a medical school in Saudi Arabia.
Our study failed to identify any differences between the scores of students taking the version that followed the content coverage sequence (version A) and the scores of students taking the other versions (B, C and D).
Similar to our study, Sue concluded that scrambling multiple-choice questions to reduce the benefits of student cheating during an exam can be done without risk of biasing student performance (Sue, 2009).
Another study, conducted in a medical school in Pakistan, compared more than one sequence of scrambled but similar-content MCQ papers in a high-stakes entrance examination over 3 years from 2008 to 2011. It failed to identify any differences between the scores of students receiving the papers that followed the content coverage sequence and those receiving papers that did not (Khan, Tabasum, Mukhtar, & Iqbal, 2013).
Zaman et al. concluded that item difficulty is not affected by the sequence of items in the test (Zaman, Niwaz, Faize, & Dahar, 2010).
On the other hand, some studies have shown a statistically significant difference in performance when the positions of the items were altered (Ollennu, 2015; Doerner & Calhoun, 2009; Raux, Sangnier, & Ypersele, 2017).
Although English is a foreign language to our students, the sequence of items did not affect their performance. These findings contrast with the results of Soureshjani, who reported that the sequence of items affects foreign language learners’ performance (Soureshjani, 2011).
5. Conclusion & Recommendations
To our knowledge, this is the first study comparing more than one version of scrambled but similar-content MCQ papers in a medical school in Saudi Arabia. Our study found that randomizing test items into versions to prevent cheating does not affect student performance or the difficulty level of the exam. Our institution can continue its regulation of scrambling questions into different versions without hesitation. Further studies are recommended in this field to ensure better assessment of our students.