Year: 2016 | Volume: 4 | Issue: 4 | Page: 270-275
Flawed multiple-choice questions put on the scale: What is their impact on students' achievement in a final undergraduate surgical examination?
Ahmad Abdul Azeem Abdullah Omer, Mohammed Elnibras Abdulrahim, Ibrahim Abdullah Albalawi
Department of Surgery, Faculty of Medicine, University of Tabuk, Tabuk, Saudi Arabia
Date of Web Publication: 12-Oct-2016
Ahmad Abdul Azeem Abdullah Omer
Assistant Professor of General Surgery, Department of Surgery, Faculty of Medicine, University of Tabuk, P.O. Box: 3718, Tabuk 71481
Background: Violations of item-writing guidelines are still frequently encountered in assessments in medical colleges. Flawed multiple-choice (MC) items affect students' performance and distort examination results.
Aims: The aim was to assess the frequency and impact of flawed MC items on students' achievement in our setting.
Settings and Design: This is a quantitative descriptive study conducted at the Faculty of Medicine in the University of Tabuk, Saudi Arabia.
Methods: We evaluated a summative surgical examination of 100 single-correct-answer MC questions administered to 44 final (6th-year) medical students in November 2014. MC items containing one or more violations of item-writing guidelines were classified as flawed; those with no violations were classified as standard. The passing rates and the median scores of high- and low-achieving students were calculated on both the standard and flawed test scales. Item performance parameters (difficulty index, discrimination power and internal consistency reliability [Kuder-Richardson formula 20]) were calculated for standard and flawed items. Descriptive and comparative statistics with the relevant tests of significance were performed using SPSS version 16 (IBM SPSS Inc., Chicago, Illinois).
Results: Thirty-nine flawed items (39%) were identified, containing 49 violations of the item-writing guidelines. The passing rate was 93.2% and 91.8% on the total and standard scales, respectively. Flawed items benefited low-achieving students and disadvantaged high-achieving students. Overall, flawed items were slightly more difficult, less discriminating and less reliable than standard items.
Conclusions: The frequency of flawed items in our examination was high and reflects the need for more training and faculty development programmes.
Keywords: Flawed multiple-choice items, high-achieving students, item analysis, low-achieving students, standard items
How to cite this article:
Omer AA, Abdulrahim ME, Albalawi IA. Flawed multiple-choice questions put on the scale: What is their impact on students' achievement in a final undergraduate surgical examination? J Health Spec 2016;4:270-5
How to cite this URL:
Omer AA, Abdulrahim ME, Albalawi IA. Flawed multiple-choice questions put on the scale: What is their impact on students' achievement in a final undergraduate surgical examination? J Health Spec [serial online] 2016 [cited 2020 Jan 26];4:270-5. Available from: http://www.thejhs.org/text.asp?2016/4/4/270/191908
Introduction
Multiple-choice (MC) questions are extensively used in the assessment of knowledge in medical education. The ability of this type of question to sample widely over a subject, in addition to its objectivity and ease of marking, has contributed to its popularity in the field of assessment. When well constructed, MC questions can test higher cognitive functions and discriminate well between examinees with reasonable validity and reliability. However, poorly constructed MC items may have a negative impact on students' performance in achievement tests. Some reports indicate that poorly crafted MC items are still commonly used in medical colleges. Although MC item-writing guidelines are well developed and shared in the medical education literature, the frequency of flawed MC items is still substantial. The effect of item-writing flaws on students' performance in achievement tests works in two directions, making questions either easier or more difficult to answer. Some flawed items clue test-wise examinees to the correct answer and thereby advantage those students over other examinees. Other flawed items introduce unnecessary difficulty to the question and consequently affect the students' performance on the construct being tested (construct-irrelevant variance). Haladyna and Downing have stated that 'test-wiseness could be taught and that some students could increase their scores after such training'. They also added that MC item faults can be detected by examinees with or without training. Test-wiseness has been referred to as the ability of the student to recognise the correct answer without knowing the question material. Tarrant and Ware reviewed 10 MC examinations used in nursing and found that the percentage of flawed items ranged between 27% and 75%.
They also highlighted that borderline students benefited from flawed items, because a greater proportion of them passed the tests than would have passed had the flawed items been removed. They concluded that flawed items impacted negatively on the high-achieving students. Based on these findings, Tarrant and Ware showed low discrimination power of the tests they assessed, since the marks of the borderline students were artificially inflated while those of high-achieving students were lowered. Downing reviewed a year-one basic science MC test and found that one-third of the items were flawed and that these items were more difficult than the standard items measuring the same content. He also found that flawed items failed about one-quarter more students than standard items did. In another study, Downing evaluated four basic science examinations for the effect of violating item-writing guidelines and found that 36-65% of all questions were flawed. He also pointed out that flawed items were more difficult than standard items and tended to fail more students, and that the reliability of flawed items was higher than that of the standard items. Almuhaidib examined 10 summative undergraduate MC tests and reported that the average frequency of flawed items was 17.64%. She also found that flawed items were easier and less discriminating than standard items and that they tended to benefit low-achieving students and penalise their high-achieving counterparts. Given the patient safety concerns in medicine and the responsibility towards the different stakeholders and the whole community, there is a genuine need to construct good-quality MC items to improve the reliability and validity of our examination results and, consequently, the quality of our graduates.
The frequency, nature and effect of flawed items have not previously been studied in our setting, which is a newly established medical college founded in 2006. We believe that such a study is essential to shed light on the quality of our MC examinations. Solutions and recommendations can then be proposed based on the findings to improve the quality of our examinations and the inferences we draw from their results.
In this study, we evaluated a summative surgical examination administered to the final (6th-year) medical students, aiming to:
- Determine the frequency and type of flawed MC items
- Assess the effect of flawed MC items on the high- and low-achieving students
- Compare item performance parameters (difficulty index, discrimination power and reliability) of flawed and standard MC items to assess the quality of their performance.
Methods
This was a quantitative descriptive study conducted to evaluate the frequency of MC item flaws and their impact on the performance of students in the written part of a final medical surgical examination. The examination was composed of 100 single-best-answer MC questions administered to the final (6th-year) medical students (n = 44) in our college in November 2014. Based on the opinion of subject experts, questions were analysed and categorised into two groups: flawed items, which contained one or more violations of the MC item-writing guidelines published in the literature (Haladyna, Downing and Rodriguez 2010; Haladyna and Downing 1989), and standard items, which did not contain any violation of those guidelines. The college implements a criterion-referenced fixed pass/fail mark in its examinations, which is set at 60%. The pass rate was calculated for the whole class on the total test scale (involving both flawed and standard items) and on the standard scale (involving only the standard items), and the two were compared. The item performance parameters (difficulty index, discrimination power and internal consistency reliability [Kuder-Richardson formula 20, KR20]) were calculated for the flawed and standard items and compared with each other. The median scores of the high- and low-achieving students (11 students in each group) on the total, standard and flawed scales were also calculated and compared to highlight any differences.
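The item parameters named above can be illustrated with a small sketch. It assumes a 0/1 scored response matrix, takes the difficulty index as the proportion of examinees answering an item correctly, and uses the common upper-lower group definition of discrimination power. The study itself used SPSS and does not state its exact formulas, so these definitions, the function names and the toy data are assumptions for illustration only.

```python
def difficulty_index(responses, item):
    """Proportion of examinees who answered the item correctly (0/1 scoring).

    A lower value means a more difficult item.
    """
    return sum(row[item] for row in responses) / len(responses)


def discrimination_power(responses, item, group_size):
    """Upper-lower discrimination index (an assumed, commonly used definition).

    Examinees are ranked by total score; the index is the number of correct
    answers in the top group minus the number in the bottom group, divided
    by the group size.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    upper_correct = sum(row[item] for row in ranked[:group_size])
    lower_correct = sum(row[item] for row in ranked[-group_size:])
    return (upper_correct - lower_correct) / group_size


# Hypothetical toy data: 4 examinees (rows) x 3 items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

p_item0 = difficulty_index(responses, 0)          # 3 of 4 correct
d_item1 = discrimination_power(responses, 1, 2)   # top 2 vs bottom 2
```

With 44 examinees, a `group_size` of 11 would correspond to the high- and low-achieving groups described above.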
Results
Thirty-nine questions (39% of the total test items) were classified as flawed, each containing one or more violations of the conventional MC item-writing guidelines. A total of 49 flaws were identified across the 39 questions. The type and frequency of item flaws identified are shown in [Table 1].
The overall pass rate on the total test scale was 93.2%, whereas it was 91.8% on the standard scale.
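The comparison of pass rates on the two scales can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes 0/1 item scores, treats a percentage score equal to the 60% pass mark as a pass, and uses invented toy data in which two items are labelled flawed.

```python
def percent_scores(responses, items):
    """Each examinee's score on the chosen items, as a percentage of those items."""
    return [100 * sum(row[i] for i in items) / len(items) for row in responses]


def pass_rate(responses, items, pass_mark=60.0):
    """Percentage of examinees scoring at or above the pass mark on the chosen items."""
    scores = percent_scores(responses, items)
    return 100 * sum(score >= pass_mark for score in scores) / len(scores)


# Hypothetical toy data: 4 examinees x 5 items; items 3 and 4 are treated as flawed.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
]
all_items = [0, 1, 2, 3, 4]
standard_items = [0, 1, 2]

total_rate = pass_rate(responses, all_items)          # total test scale
standard_rate = pass_rate(responses, standard_items)  # flawed items removed
```

In this toy example the third examinee passes on the total scale only because of the two flawed items, which is the kind of borderline effect the comparison is meant to expose.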
Table 1: Nature and frequency of flaws encountered among all questions (n = 49)
The median score of the high-achieving students was 85% on the total test scale, and 86.8% and 82.7% on the standard and flawed scales, respectively, after the scores on the standard and flawed scales were rescaled to a common base of 100 to correct for the different numbers of questions in the two categories. The median score of the low-achieving students was 62.6% on the total test scale, and 60.7% and 63.6% on the standard and flawed scales, respectively. These results are summarised in [Table 2].
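The rescaling and the group medians described above can be sketched as below. The sketch assumes that the high- and low-achieving groups are formed by ranking examinees on their total score, which the text does not state explicitly; the function name and the toy data are hypothetical.

```python
from statistics import median


def group_medians(responses, items, group_size):
    """Median percentage score of the top and bottom groups on the chosen items.

    Groups are formed by ranking examinees on their total score over all
    items (an assumption); scores on the chosen items are expressed out of
    100 so that scales with different numbers of items can be compared.
    """
    ranked = sorted(responses, key=sum, reverse=True)

    def percent(row):
        return 100 * sum(row[i] for i in items) / len(items)

    high = median(percent(row) for row in ranked[:group_size])
    low = median(percent(row) for row in ranked[-group_size:])
    return high, low


# Hypothetical toy data: 4 examinees x 6 items; items 4 and 5 are treated as flawed.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
standard_items = [0, 1, 2, 3]
flawed_items = [4, 5]

standard_high, standard_low = group_medians(responses, standard_items, 2)
flawed_high, flawed_low = group_medians(responses, flawed_items, 2)
```

In the study, `group_size` would be 11 and the item lists would hold the 61 standard and 39 flawed items.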
Table 2: Median scores of high-achieving and low-achieving students on different test scales (n = 100)
Flawed items were less discriminating than standard items (0.2 vs. 0.28) and slightly more difficult (0.74 vs. 0.75). However, those differences were not statistically significant, as shown in [Table 3]. The KR20 reliability (internal consistency) of the total test scale was 0.84, that of the standard items was 0.80 and that of the flawed items was 0.55. The reliability of the standard items was therefore greater than that of the flawed items.
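For readers unfamiliar with the coefficient, KR20 can be computed from a 0/1 response matrix as in the sketch below. It uses the standard formula KR20 = k/(k-1) × (1 − Σp·q / variance of total scores) and assumes the population variance of the total scores; statistical packages may use the sample variance instead, which changes the value slightly. The function name and toy data are illustrative and are not taken from the study.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses: one row per examinee, one column per item.
    Uses the population variance of the total scores (an assumption).
    """
    n_examinees = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)

    # Sum of p * q over items, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_examinees
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical toy data: 4 examinees x 3 items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
reliability = kr20(responses)
```

To compare scales as in the text, the same function would be applied separately to the columns of the standard items and to those of the flawed items.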
Table 3: Comparison of averages of difficulty index and discrimination power and reliability of flawed and standard items
Discussion
The percentage of flawed items in this study was relatively high (39%). It falls within the ranges reported by Tarrant and Ware (27-75%) and Downing (33% and 36-65% in two different studies) but is higher than that reported by Almuhaidib (17.64%). This result was not surprising in view of the small number of workshops and training sessions conducted for the staff on MC item-writing guidelines. The finding that 14 of the 32 conventional item-writing guidelines were violated indicates that the test developers had poor knowledge of those guidelines and underlines the need for more training in this field. Pate and Caldwell identified violations of 17 of the 32 item-writing guidelines in their study. They attributed this to lack of training and to the fact that articles on item-writing guidelines are published in the educational literature, outside the mainstream of medical journals. They also found a substantial difference in MC item quality between trained and non-trained individuals. MC item-writing skills can be improved by training and by regular practice in reviewing MC items for flaws, as shown by Fayyaz Khan et al. in 2013. They reported a reduction in the frequency of item-writing flaws from 67% in 2009 to 21% in 2011 following training workshops for the staff. Regular review of MC items for flaws before and after test administration and provision of feedback to item writers were indicated as useful strategies to increase the staff's awareness of MC item-writing guidelines and reduce the frequency of flawed MC questions in examinations. Baig et al. proposed training and encouraging the staff to write MC items that test higher-order cognitive levels as a means of reducing MC item flaws.
In this study, flawed items passed slightly more students than would have passed had the flawed items been removed from the test. This is in accordance with the findings of Tarrant and Ware and of Almuhaidib but contrary to what Downing showed in two separate studies. The small magnitude of this difference (1.4%) may be explained by the small number of students taking the test. This finding indicates that some low-achieving students benefit from flawed items, which is understandable given that some flawed items advantage test-wise students by providing clues to the correct answer, leading to inflation of their results. The finding that low-achieving students performed better on flawed items than on standard ones, as indicated by their median scores on those scales, may further explain the above result. However, this finding is at odds with the almost equal overall difficulty index of flawed and standard items. Again, this might be influenced by the small number of students taking the test.
On the other hand, it appears that high-achieving students were disadvantaged by the flawed items, since their median score was better on the standard scale than on the total and flawed test scales. This finding agrees with what Tarrant and Ware reported in their study. They highlighted that high-achieving students do not tend to use test-wiseness strategies to answer questions and that their performance is therefore negatively affected by some flawed items. This finding may also reflect the unnecessary difficulty sometimes added by flawed MC questions, referred to as 'construct-irrelevant variance'. Such difficulty is unrelated to the construct being tested and distorts the students' performance in such a way that the inferences we can make from the test's results are less valid.
No significant differences were found between standard and flawed items in difficulty index or discrimination power. The figures show that the difficulty index of flawed items was only 0.01 lower than that of standard items, that is, flawed items were marginally more difficult. While this agrees with Tarrant and Ware, who found no substantial differences in the difficulty index of standard and flawed items, Downing showed consistently increased difficulty of flawed items in comparison to standard items in two separate studies. Almuhaidib, on the other hand, pointed out that flawed items were easier than standard items. These mixed results are not surprising given the varying effects of flaws on the difficulty of a question. Some flaws clue test-wise students to the correct answer, making the question easier, while other flaws may confuse students and make the question more difficult to answer. The slightly increased difficulty of flawed items in comparison to standard items observed in this study may be explained by the construct-irrelevant variance introduced by the flawed items. Flawed items were less discriminating than standard items in this study, a finding also reported by Tarrant and Ware and by Almuhaidib. Similarly, Pate and Caldwell showed that flawed MC items negatively affect students' performance without improving the discrimination between high- and low-achievers. The low discrimination power of the flawed items in comparison to the standard items is a logical consequence of flawed items affecting low-achieving students positively and high-achieving students negatively. In that case, the scores of high-achieving students are reduced and those of low-achieving students are artificially inflated, leading to low discrimination power of those questions.
The KR20 measures the internal consistency of responses across a specific number of items. It also indicates how far the different parts of a test are homogeneous and consistently measure one single construct (unidimensionality). Downing, in 2005, attributed the higher reliability of flawed items in his study to the tendency of internal consistency reliability to be affected by random rather than systematic errors of measurement. However, it has been concluded that systematic errors of measurement also influence internal consistency reliability, as random ones do. These systematic errors include test-specific factors such as poor construction of test items. The lower reliability of flawed MC items in comparison to the standard items in this study likely indicates low homogeneity with the standard items and a tendency of flawed items to measure a different construct from the one that standard items aim to assess. This is further explained by the 'construct-irrelevant' factor, which contaminates the flawed items, making them either more or less difficult to answer for reasons unrelated to the construct being tested. In this setting, students' responses and overall performance are influenced by factors that have little to do with their ability in the subject under assessment. The lower scores of high-performing students on the flawed items in comparison to their performance on standard items, and the opposite pattern in low-performing students, support this interpretation. Since internal consistency reliability is based on students' responses and is related to their scores on a particular test, the validity of the test results is also expected to be jeopardised when reliability is low, which is particularly significant in high-stakes examinations like the one in this study.
The intricate relationship between reliability and validity is another important factor that necessitates paying more attention to the issue of poorly constructed MC items and their effects.
The results of this study cannot be generalised, for more than one reason. First, the number of students taking the test was small, which may affect the accuracy and reliability of the results. This small number is dictated by the small student intake of our college at present. Second, only one examination, administered to a single class of students in the faculty, was included in this study. This may have introduced selection bias, limiting the credibility of the results. Nevertheless, this study is the first in our setting, which is a newly established medical college, to evaluate the frequency of flawed items in our examinations and their impact on students' achievement, shedding light on the quality of our summative assessments and opening the door for further work in the field.
Conclusion
The frequency of flawed items was relatively high in this study. While this result pertains only to the examination that was evaluated, there is no logical reason to expect otherwise in other examinations in our setting. This calls for a great deal of attention from the faculty's administration to encourage continuous training and faculty development programmes that spread knowledge of MC item-writing principles among staff members. Flawed MC items negatively affected the performance of high-achieving students and, at the same time, helped low-achieving students to recognise the correct answer, thereby artificially inflating their results and allowing more of them to pass the test than would otherwise have passed. Flawed MC questions were slightly more difficult, less discriminating and less reliable than standard MC items, although the differences in difficulty and discrimination were not statistically significant.
Further similar studies are needed in our setting across different classes and examinations to explore the true picture of the quality of our assessments.
We would like to thank Mr. Abdalla Elkhalifa who provided generous help with the statistical analysis of the results.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Kapur E, Kulenović A. Analysis of difficulty and discrimination indices in one-best answer multiple choice questions of an anatomy paper. Folia Medica 2010;45:14-20.
McCoubrie P. Improving the fairness of multiple-choice questions: A literature review. Med Teach 2004;26:709-12.
Tarrant M, Ware J. A comparison of the psychometric properties of three- and four-option multiple-choice questions in nursing assessments. Nurse Educ Today 2010;30:539-43.
Epstein RM. Assessment in medical education. N Engl J Med 2007;356:387-96.
Hettiaratchi ES. A comparison of student performance in two parallel physiology tests in multiple choice and short answer forms. Med Educ 1978;12:290-6.
Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple choice questions? Research paper. BMC Med Educ 2007;7:49.
Downing SM. Construct-irrelevant variance and flawed test questions: Do multiple-choice item-writing principles make any difference? Acad Med 2002;77(10 Suppl):S103-4.
Downing SM. The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ Theory Pract 2005;10:133-43.
Haladyna T, Downing S, Rodriguez M. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ 2010;15:309-34.
Haladyna T, Downing S. Validity of a taxonomy of multiple-choice item-writing rules. Appl Meas Educ 1989;2:51-72.
Pate A, Caldwell D. Effects of multiple-choice item-writing guideline utilization on item and students performance. Curr Pharm Teach Learn 2013;6:130-4.
Fayyaz Khan H, Farooq Danish K, Saeed Awan A, Anwar M. Identification of technical item flaws leads to improvement of the quality of single best multiple choice questions. Pak J Med Sci 2013;29:715-8.
Almuhaidib N. Types of item-writing flaws in multiple choice question pattern - A comparative study. Umm Al Qura Univ J Educ Psychol Sci 2010;2:10-45.
Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ 2008;42:198-206.
Baig M, Ali SK, Ali S, Huda N. Evaluation of multiple choice and short essay question items in basic medical sciences. Pak J Med Sci 2014;30:3-6.
Schumacker R, Smith E. Reliability a Rasch perspective. Educ Psychol Meas 2007;67:394-409.
Dressel P. Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika 1940;5:305-10.
Leech N, Onwuegbuzie A, O'Conner R. Assessing internal consistency in counseling research. Couns Outcome Res Eval 2011;2:115-25.