Year: 2014 | Volume: 2 | Issue: 3 | Page: 94-99
The perfect MCQ exam
James Ware1, Thuraya Essam Kattan1, Imran Siddiqui1, Ahmed M Mohammed2
1 Department of Medical Education and Postgraduate Studies, The Saudi Commission for Health Specialties, Riyadh, Saudi Arabia
2 Faculty of Medicine, Kuwait University, Kuwait
Date of Web Publication: 31-Jul-2014
Department of Medical Education and Postgraduate Studies, The Saudi Commission for Health Specialties, Riyadh 11614, Saudi Arabia
Aims: The Saudi Commission for Health Specialties' question banks for licensing international medical graduates and certifying Saudi-trained residents are currently being upgraded. The process is briefly explained, with the justification for the developments made to all the banks over the last 3 years. A process of quality assurance has been introduced to ensure the banks are maintained at the highest standards, and procedures fit for purpose have been introduced for the management of test results.
Results: An analysis of 16 undergraduate exams shows a sigmoid relationship between mean test scores (MTS) and the number of functioning distractors. Sixteen computer-based Saudi Licensing Exam (SLE) banks were also analyzed to determine the overall mean MTS, 52.3 ± 7.88, which justifies the use of four-option items and also validates the quality of the exams, with a majority of items falling within the desired range of medium difficulty to obtain the highest reliability.
Conclusion: The Saudi Commission believes that it is achieving increased quality in all aspects of test measurement and hopes that over the next 3 - 5 years it will achieve levels of excellence in all aspects of its licensing and certifying assessments.
Keywords: Quality assurance, test measurement, item performance, high stakes exams
How to cite this article:
Ware J, Kattan TE, Siddiqui I, Mohammed AM. The perfect MCQ exam. J Health Spec 2014;2:94-9
The Perfect Multiple-Choice Exam
There is, of course, no such thing, but it would be nice to think so considering all the effort we put into our exams. For the last 2 years the Saudi Commission for Health Specialties has been upgrading its question banks, starting with the Saudi Licensing Exams, SLE (for overseas health professionals to practice in Saudi Arabia), and more recently the Saudi Board Exam banks. So what makes a question or an exam of good quality?
There are three main criteria:
- The quality of the questions and clarity of the language
- The performance of the questions and candidates
- The consistency of the scoring methods.
Starting with the last point first: When raters are used to score exam papers there will be issues of examiner reliability and consistency of marking. As the Commission uses only multiple-choice questions (MCQs) for its knowledge-based exams, the scoring component is not an issue.
The quality of the questions also relies on three components:
- Consistency of presentation
- Alignment with the course of study
- Absence of item writing flaws. 
Consistency of presentation
The Saudi Commission has had a tradition of using mostly one-best-of-five-options MCQs. What had been lacking was a format and structure applied to all items used in its examinations. A uniform format was agreed upon and presented in a manual published in 2012. Although it has evolved since then, the changes have been small. The item writers' workshops have been reduced from a day and a half to half a day. Presentations that originally ran 40 minutes now run 20. Where originally all the skills that might be needed were presented in a single workshop, there are now three:
- Better item writing,
- The review process,
- Constructing a test blueprint and item classification.
The Commission puts great emphasis on presentation and format, because it believes that when assessment items are presented in a consistent way they are less stressful and more likely to allow the candidate to use all his or her knowledge. To support consistent production, Commission staff has been trained to format all items sent to them. [Figure 1] shows the production process from item construction to finally being banked.
|Figure 1: The flow chart for new item production. The question banking unit (QB-Unit) involved logs all items as they progress through these four steps of quality control|
The materials given to all workshop participants have also evolved to the point where overseas reviewers are trained on the materials provided to them and on the feedback given as they complete training assignments. This step was taken when it became apparent that the full-time staff at the Commission were insufficient to cope with the throughput of items that is happening today. Plans for the future include an automated and secure web portal which will ensure that all items are always submitted securely and in the correct format.
Following a workshop, participants formerly submitted ten items, which then allowed the selection of the best item writers. Today, item writers submit ten new items and, through a process of feedback, those who retain interest are signed on. Steps 4, 5 and 6 in [Figure 1] may be repeated several times. The technical reviewer is a central person in the quality assurance process. The reviewer checks for:
|Figure 2: Question in Fig 5 has been revised to improve clarity and speed of reading, a central goal for Saudi Commission exam items|
- Option length, or what is described as option balance,
- That a clear, closed question is asked,
- That all lab results are given in standard international (SI) units and are correctly tabulated,
- That the presentation of the scenario, if any, is correctly sequenced, like a stripped-down set of clinical notes, and finally,
- Grammatical and syntax errors, which are edited and revised, bearing in mind that MCQs do not always conform to the rules of perfectly written language.
All these revisions may cause loss of focus, meaning or accuracy for the item, which will then need correction at the scientific review stage [Figure 1, step 8].
The format, structure and philosophies guiding the writing of multiple-choice exam items are for computer-based delivery. Today this is true for the SLE exams and in the near future will also be used for Saudi Board exams.
SLE candidates are not usually native English speakers, so language issues are important. Partly so as not to cause unnecessary errors in test measurement due to language challenges, a simple scenario must be less than 70 words, and a complex one with lab data and possibly a radiological report less than 110 words. When the question line and four options are added, a question at this level of exam will have at most 100 - 150 words. All lab data are tabulated below the scenario [Figure 2], and vital signs are also given in a table, omitting none, as these are considered important data and also increase authenticity, and hence face validity. Images are positively encouraged, provided there are no copyright issues. There is evidence that probably over 80% of all diagnoses can be made from a history and physical examination, while radiological images on their own give the correct diagnosis in only 35% of cases. Thus the use of clinical scenarios is considered not only justified but desirable, and preferably with images.
[Figure 3] shows the rate at which words are read to comprehension: the average number of words per item (wpi) was 60.1, read to comprehension in 42 seconds. Therefore, a 100-word question can be read and the correct key marked in one minute. With these data it can be calculated that a 100-item exam can easily be completed in 2 hours, in a relaxed manner, by the average examinee. Saudi Board exams are more complex and would be expected to have a larger wpi.
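As a back-of-envelope check, the reading-rate figures quoted from [Figure 3] can be turned into a time budget for a whole exam. A minimal sketch (mean values from the figure caption; variable names are ours):

```python
# Reading-rate arithmetic from Figure 3 (mean values; names are illustrative).
mean_wpi = 60.1        # mean words per item
mean_seconds = 42.0    # mean seconds to read one item to comprehension

rate = mean_wpi / mean_seconds        # words read per second, ~1.43
words_in_two_hours = rate * 2 * 3600  # ~10,300 words in a 2-hour sitting

# A 100-item exam averaging 100 words per item is ~10,000 words of reading,
# which fits within two hours at the measured reading rate.
```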
|Figure 3: Twenty item time calibration test (n =154), time per item recorded and wpi known. Time versus wpi, r2 = 0.95, mean time = 42 s and mean WPI = 60.1.|
The page is formatted so that there are only 6 - 9 words per line, excluding indefinite articles and two-letter prepositions. This allows the text to be read quickly and, on rereading, information to be found again. The font is sans serif, again for easy reading on a computer screen, and double-page presentation is preferred to screen scrolling. The advent of high-resolution screen technology means reading off the screen has become much less tiring than it was ten years ago.
Intuitively, one would think that the more difficult items would take longer to answer, and this is indeed the case. A study was made of 4,679 items from four SLEs. The items were divided into three groups: Group I, DIFF < 0.45; Group II, DIFF 0.45 - 0.80; and Group III, DIFF > 0.80 (DIFF, the proportion of candidates marking the correct key). The correlation between item DIFF and seconds per item was r = −0.97, time being negatively correlated with difficulty (note that a difficult item has a low DIFF value). Referring to [Table 1] and the eight random items shown there, the correlation for those eight items is again negative, r = −0.80.
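The two statistics used in this analysis can be reproduced directly from response records. A minimal sketch in pure Python (the response data below are illustrative, not taken from the SLE banks):

```python
from statistics import mean

def difficulty(responses, key):
    """DIFF: the proportion of candidates marking the correct key."""
    return sum(1 for r in responses if r == key) / len(responses)

def pearson_r(xs, ys):
    """Pearson correlation coefficient, as used for DIFF vs. time per item."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Illustrative data: harder items (low DIFF) take longer to answer,
# so DIFF and seconds-per-item correlate negatively.
diff_values = [0.30, 0.45, 0.60, 0.75, 0.90]
seconds     = [55,   50,   44,   40,   35]
r = pearson_r(diff_values, seconds)   # strongly negative
```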
|Table 1: Detailed psychometric data for every item used during a twelve month period, example from part of the emergency medicine technicians SLE bank |
Previously, many Commission exams used five options. Whether one uses five, four, or three makes no difference to the performance of the exam or the reliability of the test. This has important implications for item construction because often the fifth option becomes only a filler, easily recognized and dismissed as such. All high stakes computer-based exams should be reviewed once each year and, among other analyses of quality, the performance of all distractors must be determined. A large-scale analysis showed that the mean number of distractors per item selected by more than 5% of candidates was 2.03 ± 0.33. This again supports the number of options being less than five but, for reasons of face validity, more than three. Face validity in this sense is whether the appearance of the item is fit for the purpose intended, in other words for use in a high stakes licensing exam. It is important that the functioning distractor frequency (FDF) is not lower than about 1.5 and, if possible, higher than 2.1, which will usually happen when the item is of medium difficulty (DIFF 0.45 - 0.80).
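A functioning distractor, in the sense used here, is one selected by more than 5% of candidates, so counting them per item is straightforward. A minimal sketch (the option counts are illustrative):

```python
def functioning_distractors(option_counts, key, threshold=0.05):
    """Count distractors chosen by more than `threshold` of candidates.

    option_counts: dict mapping option letter -> number of candidates
    key: the correct option, which is excluded from the count.
    """
    total = sum(option_counts.values())
    return sum(
        1 for opt, n in option_counts.items()
        if opt != key and n / total > threshold
    )

# Illustrative item with 154 candidates: option C draws only 6 (< 5%),
# so it does not function; the item's functioning distractor count is 2.
counts = {"A": 40, "B": 23, "C": 6, "D": 85}
fdf = functioning_distractors(counts, key="D")
```

Averaging this count over all items in a bank gives the FDF figure discussed above.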
There are two methods for validating distractor function: first, when a distractor is selected by more than 5% of all candidates, and secondly when the Hi-Lo discrimination is negative [Table 2]. There were 154 candidates; in example A the item performance is good. The discrimination index (DI) is 0.50 and the DIFF value, 0.55 (85/154), lies within 0.45 - 0.80. Compare this with example B, where the correct option is D and, theoretically anyway, the DIFF value again is good, 0.53. However, the DI is negative for the correct key, -0.36, and all the distractors are positive, while one, option C, has been selected by less than 5% (6/154) of all candidates. Example B is a typical example of what must be reviewed by the post hoc exam committee, which might then choose not to reject it, as 35% of the top-ability candidates have marked the answer key correctly despite the DI being strongly negative.
|Table 2: Two examples of items A and B. Proportion selections for options A, B C and D*, marks the correct item key; Hi-Lo item discrimination index |
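The Hi-Lo index behind [Table 2] compares the top- and bottom-ability groups on each option. A minimal sketch (the group size and counts below are illustrative, not the Table 2 data):

```python
def hi_lo_discrimination(hi_correct, lo_correct, group_size):
    """Hi-Lo discrimination index: the proportion of the top-ability
    group marking an option minus the proportion of the bottom-ability
    group doing so. Applied to the correct key, a healthy item is
    clearly positive; a negative value flags the item for review."""
    return hi_correct / group_size - lo_correct / group_size

# A well-performing item: the top group does much better (DI = 0.50).
di_good = hi_lo_discrimination(40, 15, 50)
# A problem item: the bottom group outperforms the top (negative DI).
di_bad = hi_lo_discrimination(7, 25, 50)
```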
Another property of the number of options per item is its dependence on the mean test score of all the candidates [Figure 4]. The mean test score (MTS) of most SLE exams lies between 41 - 65%, 52.3 ± 7.88 (n = 15 exams), which justifies three distractors, or four options. An exam with five options is 12% longer than one with four.
|Figure 4: Items taken from 16 exams with functioning distractor numbers rounded and the item mean tests scores plotted against the number of functioning distractors|
It is important that the candidate knows where to find the investigation results used in a case-based scenario, and the normal values if lab results are being presented. [Figure 5] shows an item where the lab data are placed in the scenario without any normal values; it can be appreciated that reading such a scenario to comprehension, plus time for reflection to resolve any uncertainties, would take longer than one minute per 100 words. [Figure 2] shows how clarity can be improved by structuring the item logically, presented in the format used by the Saudi Commission. If the quality of the question is low it will introduce an element of construct-irrelevant variance, an obstacle to the measurement purpose of the item and the exam, because the examinee is challenged merely by attempting to understand the wording of the question. The Saudi Commission does not believe that is a fair way to examine.
|Figure 5: An example of an unstructured multiple-choice question with lab data presented in the scenario and without normal values. The question could also benefit from editing|
Alignment with the course of study
Item alignment with the course of study is a necessity for all exams to be valid and fair. However, the SLE does not relate to a specific curriculum; instead it is aligned with a published test blueprint for the specialty concerned and delivered as a specialist exam. The level assumed is one of safe practice by a consultant in that specialty. The Saudi Board exams, on the other hand, are directed to the course of study, which today is being made explicit with the development of the new CanMEDS curricula for every specialty; sixteen specialties have either completed this task or are on the way to doing so. Every board exam will have its own test blueprint.
Item writing flaws
Over half the questions used in high stakes exams are flawed. A flawed item is one that is confusing, ambiguous or unclear, or one written in such a way that the correct key is cued for the test-wise. Powerful evidence has been found elsewhere that removing the flawed items and re-computing the results of an exam makes the exam easier, yet more borderline candidates fail while the top-ability candidates score much higher marks. All Saudi Commission exams are reviewed to eliminate item writing flaws, achieved by either revising or rejecting the item.
High stakes items should always be reviewed by exam specialists and content experts before they can be considered fit for purpose to create a valid exam. But for an exam to be valid it must be reliable. This stands to reason because if only one in three questions actually addresses the subject matter being tested, and the candidates have prepared themselves for a test on what they have studied, the results will be inconsistent and reliability low. On the other hand, if an exam with only a few items is delivered, it too will have poor reliability. The Spearman-Brown formula will help determine the effect of adding extra items on reliability. Reliability is also affected by three other item-related factors:
- Difficulty
- Discrimination
- Item quality.
When item quality is poor, construct-irrelevant variance (CIV) becomes an issue: in other words, some of what one is measuring is not the purpose of the test, for example mastery of the language being used for the examination, usually English.
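The Spearman-Brown prophecy formula mentioned above predicts the reliability of a test lengthened (or shortened) by a given factor, assuming the added items are of comparable quality. A minimal sketch:

```python
def spearman_brown(reliability, length_factor):
    """Spearman-Brown prophecy formula: predicted reliability when a
    test is lengthened by `length_factor` (2 = doubled, 0.5 = halved)
    with items of comparable quality to the originals."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a test whose reliability is 0.70 predicts roughly 0.82;
# halving it drops the predicted reliability below 0.70.
doubled = spearman_brown(0.70, 2)
halved = spearman_brown(0.70, 0.5)
```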
To determine the relationship between DIFF and an item's DI, 16 SLE exams were analyzed, using 9,907 items and 41,183 exam records. Once again three groups were analyzed: Group I, DIFF < 0.45; Group II, DIFF 0.45 - 0.80; and Group III, DIFF > 0.80. Group II, with an MTS of 0.63, had a DI which was 100% higher than Group III and 76% higher than Group I (P < 0.001 and 0.01, respectively). Exams with a high mean item discrimination, DI > 0.30, will also be more reliable. Almost half of Saudi Commission items used in high stakes exams are of medium difficulty, range 41 - 65%, 52.3 ± 7.88 (n = 15).
Finally, all single date and time exams must be subjected to post hoc analysis, which should be completed within a specified time after the test, or in the case of a CBT offered throughout the year, at the end of the year with all the accumulated data.
Single date and time test
A typical example of such an exam is a Saudi Board Exam, Part I or Part II. Following the exam the results are subjected to item analysis, from which the following data are made available:
- A score report of candidates' results
- The exam reliability coefficient (Kuder-Richardson 21) for internal consistency
- The exam MTS, mean DI and point biserial (rPBS)
- For every item a report similar to [Table 1]
- For every item the rPBS
- The option performance, including:
- The proportion of top ability candidates selecting each option
- The proportion of low ability candidates selecting each option
- The MTS for all those selecting each option.
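The Kuder-Richardson 21 coefficient in the report above can be estimated from candidates' total scores alone, without per-item data. A minimal sketch (the scores are illustrative; KR-21 assumes items of roughly equal difficulty):

```python
from statistics import mean, pvariance

def kr21(scores, n_items):
    """Kuder-Richardson 21 internal-consistency coefficient.

    scores : total test score of each candidate
    n_items: number of items in the exam
    Assumes all items are of roughly equal difficulty (the KR-21 assumption).
    """
    k, m, var = n_items, mean(scores), pvariance(scores)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

# Illustrative: eight candidates sitting a 100-item exam.
scores = [60, 55, 70, 45, 80, 65, 50, 75]
reliability = kr21(scores, 100)
```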
From these data, the questions which need further consideration are flagged according to the following two criteria:
- Items with a negative DI and/or rPBS
- DIFF level < 0.30.
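The two flagging criteria can be applied mechanically before the specialty committee meets. A minimal sketch (thresholds taken from the criteria above; the example values are illustrative):

```python
def flag_for_review(di, r_pbs, diff):
    """Flag an item for specialty-committee review: a negative
    discrimination index and/or point-biserial, or a DIFF below 0.30."""
    return di < 0 or r_pbs < 0 or diff < 0.30

# Negative DI on the correct key: flagged despite acceptable DIFF.
flagged_1 = flag_for_review(di=-0.36, r_pbs=0.10, diff=0.53)
# Very difficult item (DIFF < 0.30): flagged.
flagged_2 = flag_for_review(di=0.20, r_pbs=0.15, diff=0.25)
# A well-performing medium-difficulty item is not flagged.
flagged_3 = flag_for_review(di=0.50, r_pbs=0.30, diff=0.55)
```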
The questions with their full psychometric data are discussed at a meeting of the specialty committee. The discussion should reach a consensus concerning the questions' clarity, relevance, and alignment, as well as whether the top-ability candidates demonstrated sufficient success with each item to justify its retention. Where a curriculum is available it may be consulted; otherwise the test blueprint should be available for comparing the item against. It is usual that about 20 - 25% of the items used in such an exam will be reviewed, with about 3 - 6% of the total items finally being deleted before the results are re-computed.
Annual CBT analysis
The Saudi Commission has started a systematic annual analysis of all their SLE CBTs. The first reports received are:
- Summary Reports [Table 3],
- The trend report and
- Detailed psychometric data on every item [Table 1].
Using a combination of the psychometric data and the item itself, flagged items are reviewed by specialty content experts to determine whether the items shall be revised, accepted, or rejected. Thus, in an incremental way, the banks are being upgraded and, with increasing numbers of new items being added, the total size of most banks is increasing.
|Table 3: Page 1 of the Summary Report provided for all CBTs delivered on an annual basis |
Conclusions
The production of high quality exams is a complex and time consuming activity. The Saudi Commission has committed itself to achieving this end and will not compromise on any step in the process. What will inevitably result from all this investment are rising standards of health care and safety standards in the Kingdom, a goal which is shared by more than just those who give of their time.
References
|1.||Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Educ Today 2006;26:662-71. |
|2.||Al-Sultan MA, Ware J, Kattan T, Fawaz T, Mohammed A, Siddiqui I. Item writing manual for multiple-choice questions. Saudi Comm Health Spec 2012. |
|3.||Hampton JR, Harrison MJ, Mitchell JR, Prichard JS, Seymour C. Relative contributions of history-taking, physical examination, and laboratory investigation to diagnosis and management of medical outpatients. Br Med J 1975;2:486-9. |
|4.||Tarrant M, Ware J. A comparison of the psychometric properties of three-and four-option multiple-choice questions in nursing assessments. Nurse Educ Today 2010;30:539-43. |
|5.||Cizek GJ, Day DM. Further investigations of non-functioning options in multiple-choice test items. Educ Psychol Meas 1994;54:861-72. |
|6.||Cizek GJ, Robinson LK, O'Day DM. Non-functioning options: A closer look. Educ Psychol Meas 1998;58:605-11. |
|7.||Aamodt MG, McShane T. A meta-analytic investigation of the effect of various test item characteristics on test scores. Public Pers Manage 1992;21:151. |
|8.||Downing SM. Threats to the validity of locally developed multiple-choice tests in medical education: Construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ 2002;7:235-41. |
|9.||Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ 2008;42:198-206. |
|10.||Spearman C. Correlation calculated from faulty data. Br J Psychol 1910;3:271-95. |
|11.||Brown W. Some experimental results in the correlation of mental abilities. Br J Psychol 1910;3:296-322. |
|12.||Ebel RL. The relation of item discrimination to test reliability. J Educ Meas 1967;4:125-8. |
|13.||Gulliksen H. The relation of item difficulty and inter-item correlation to test variance and reliability. Psychometrika 1945;10:79-91. |