Rationale for Group Adaptive Designs in International Large Scale Assessment

Different test forms (booklets) are commonly used in large scale international assessments such as PIRLS to balance respondent burden and content coverage. Country level group adaptive assessment designs extend this approach through targeted sampling of booklets to provide better coverage of the diverse range of ability distributions encountered in such assessments. This can increase student motivation and reduce item level nonresponse. The PIRLS 2021 approach is designed to minimally change existing procedures and time requirements, while using prior data about country performance to maximize the information obtained from the assessment.

The basic idea behind adaptive assessment is that in order to enable any type of measurement, tasks must not be too easy or too difficult for the target population. If the tasks given to a sample of test takers are too difficult, nobody (or almost nobody) will be able to solve them. Similarly, if the tasks are too easy, everybody will answer them all correctly. In each of these situations all test takers receive the same observed scores, even if they are known to differ with respect to relevant skills. 

For this reason, in educational and psychological measurement we try to craft test questions that match the ability of the targeted population of test takers, and quantify differences among test takers by eliciting responses that differentiate between higher and lower skilled respondents. A series of tasks that matches the skills of test takers will likely result in some correct and some incorrect responses. Mathematically, the variability of such a binary response (choice of the correct versus an incorrect option) is maximized when there is a 50% chance of answering correctly. This 50/50 criterion leads to different requirements for different test takers: more proficient test takers require more challenging questions in order to have (only) a 50% chance of success, while less proficient test takers require easier tasks to arrive at the same 50% chance of a correct response. Achieving this optimal match for all test takers would require adjusting the test difficulty for each individual respondent. However, since this is only possible when the exact difficulty of all items is known (or can be estimated well with little error), many testing programs instead rely on a variation of this individual level adaptivity and adapt their tests to the known, or estimated, average ability levels of pre-defined groups rather than individuals.
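The 50/50 criterion follows directly from the variance of a scored response. As a brief illustration in LaTeX notation (a standard derivation, not specific to any particular assessment), let X be a single response scored 1 (correct) with probability p and 0 otherwise:

    \operatorname{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p)

    \frac{d}{dp}\, p(1 - p) = 1 - 2p = 0
    \quad\Longrightarrow\quad
    p = \tfrac{1}{2}, \qquad \operatorname{Var}(X)_{\max} = \tfrac{1}{4}

The variance, and with it the amount of information a response carries about differences between test takers, is therefore largest when the chance of a correct response is 50%.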

Existing Approaches

Country level adaptive assessment designs target particular booklets to specific populations in order to match the difficulty distribution of the administered booklets to the ability distribution of the population. There are various approaches and assessment designs that adapt the assignment of tests to differences in target populations with respect to the distribution of skills, which may be estimated using previous routing instruments or inferred from variables such as age or educational attainment. The following section describes major approaches for adapting the difficulty of tests to the ability of the test taking populations.

Starting Rules and Discontinue Rules

In intelligence testing of individuals, for adult as well as child and adolescent populations, it is common to design tests that present items in order of increasing difficulty (e.g., the Stanford-Binet Intelligence Scale1). When these tests were first applied to different age groups, it was soon noticed that the first few questions were not much of a challenge for older test takers, who would get the first handful or so of questions right in almost all cases. This led test administrators to skip these first few very easy items because they 'knew' (i.e., inferred from the cases observed so far) that older test takers would get these easy questions right. Along the same lines, it also became apparent that for younger test takers there was a point in these tests where the remaining, harder questions were almost impossible to solve. This in turn led test administrators to stop presenting items that experience had shown to be too difficult.

Many tests of this type have a rule about how many items a test taker must get wrong consecutively before the testing session can be terminated. This number typically ranges from 2-3 items for short forms to 5-6 items for longer IQ tests. It can be shown that the discontinued items (those for which no response was recorded after a pre-determined number of consecutive wrong responses) constitute ignorable missing data2 and that the data on only the items students actually took are sufficient to estimate ability.
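As a concrete illustration of such a rule, the following sketch (in Python, with hypothetical responses and a hypothetical threshold of three consecutive errors; actual tests apply their own published rules) stops administration once a test taker has answered a fixed number of items in a row incorrectly:

    def administer_with_discontinue(responses, max_consecutive_wrong=3):
        """Return the scored responses up to the discontinue point.

        `responses` lists 0/1 scores on items presented in order of increasing
        difficulty; administration stops once `max_consecutive_wrong` items in
        a row have been answered incorrectly.
        """
        administered = []
        consecutive_wrong = 0
        for score in responses:
            administered.append(score)
            consecutive_wrong = consecutive_wrong + 1 if score == 0 else 0
            if consecutive_wrong >= max_consecutive_wrong:
                break  # the remaining, harder items are not administered
        return administered

    # Example: a test taker who misses items 5-7 is never shown items 8-10.
    print(administer_with_discontinue([1, 1, 0, 1, 0, 0, 0, 1, 0, 1]))
    # -> [1, 1, 0, 1, 0, 0, 0]

Only the administered responses enter the scoring, which, as noted above, is sufficient for estimating ability when the resulting missing data are ignorable.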

Multistage Adaptive Testing

Multistage adaptive testing (MST) has been used in large scale international studies for adult populations3 and can be understood as a flexible approach to assign test takers to a fixed number of test forms while aiming for a good, if not perfect, match between respondent ability and test difficulty.4 In multistage adaptive testing, the completely randomized assignment of blocks to test takers (the previous practice in TIMSS, PIRLS, and PISA) is modified to take into account the performance of the test taker on a previous block, as well as the relative difficulty of the blocks contained in the test design.

At the beginning of the assessment, some form of preliminary ability estimate is required for each test taker so that they can be assigned item blocks that match their expected performance. The assignment can be done deterministically, based on fixed cut-off scores, or probabilistically, based on a preliminary estimate of the student's ability distribution. Choosing the next block probabilistically ensures that at least some easy, medium, and hard blocks are likely to still be available to all respondents at subsequent stages of the test. It also allows the assignment probability to be adjusted at each stage so that weak performance on a block is more likely to result in an easy block being assigned next, while it remains possible, with lower probability, to be presented with a medium or even hard block of items. Along the same lines, following strong performance on earlier blocks, the probability of being administered a hard block of items increases, while the probability of easier blocks decreases.5
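A minimal sketch of such probabilistic routing is given below (in Python). The performance cut points, block labels, and assignment probabilities are purely illustrative assumptions, not the values used in any operational multistage design:

    import random

    # Hypothetical probabilities for the next block, indexed by a coarse
    # performance category on the previous block.
    ROUTING_TABLE = {
        "low":    {"easy": 0.70, "medium": 0.25, "hard": 0.05},
        "medium": {"easy": 0.25, "medium": 0.50, "hard": 0.25},
        "high":   {"easy": 0.05, "medium": 0.25, "hard": 0.70},
    }

    def performance_category(number_correct, block_length=12):
        """Classify performance on the previous block using illustrative cut points."""
        proportion = number_correct / block_length
        if proportion < 0.40:
            return "low"
        if proportion < 0.70:
            return "medium"
        return "high"

    def assign_next_block(number_correct, block_length=12, rng=random):
        """Draw the difficulty of the next block from ROUTING_TABLE."""
        probs = ROUTING_TABLE[performance_category(number_correct, block_length)]
        blocks, weights = zip(*probs.items())
        return rng.choices(blocks, weights=weights, k=1)[0]

    # Example: a student with 4 of 12 items correct is most likely, but not
    # certain, to receive an easy block next.
    print(assign_next_block(4))

Because every block difficulty retains a nonzero assignment probability at each stage, no single path through the design is deterministic, which helps control block exposure across the sample.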

The drawback of most multistage adaptive designs is that the initial starting point is either not adaptive because nothing is known about test takers, or it requires an initial routing block that produces a very rough first estimate of proficiency based on a short block of items. This estimate is somewhat error prone, particularly in assessments administered to a wide and diverse set of populations, as it assumes that the item characteristics of the routing items are known without error. An alternative to this approach is to use prior information based on background data such as education and occupation or other socio-economic data.6

Adaptive Longitudinal Designs 

Another example of how tests are adapted to different group level ability distributions is a design used in longitudinal large scale skill surveys.7 These designs use information on how test takers performed in prior assessment cycles to adaptively assign a more difficult test form to students who belong to a high performing group, and an easier test form to students who belong to a low performing group. These assessments are often two years apart,8 so the adaptation in this case relies on information collected years earlier. This approach turns out to be efficient because performance at the group level is a reliable predictor of group performance at the next time point.

Pohl9 describes these designs in more detail and discusses applications in multi-cohort longitudinal studies of student populations. Each assessment cycle determines which test form should be administered to which group based on information from prior data collections. Group membership is based on prior performance, which may itself have been estimated using a harder or easier form. Over assessment cycles this provides a sequence of test forms tailored to decrease the error of measurement in proficiency estimation. It does this by increasing the expected response variance, matching prior performance to test forms that elicit optimal levels of systematic, ability-related response variability in groups of test takers.
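The connection between response variability and measurement error can be made explicit in item response theory. Under a Rasch-type model, used here only as an illustrative assumption (this section does not commit to a specific scaling model), the information an item i with difficulty b_i contributes at ability theta equals the expected response variance, and the standard error of the ability estimate shrinks as the summed information grows:

    P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)},
    \qquad
    I_i(\theta) = P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)

    \operatorname{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{\sum_i I_i(\theta)}},
    \qquad
    I_i(\theta) \text{ is maximal when } P_i(\theta) = 0.5,\ \text{i.e., } b_i = \theta

Assigning forms whose item difficulties lie close to a group's expected ability therefore increases the information collected per response and reduces the error of the resulting proficiency estimates.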

Group Adaptive Assessment in PIRLS 2021

Group adaptive assessment in PIRLS 2021 is implemented by dividing its 18 passages into three levels of passage difficulty – difficult, medium, and easy – and combining these into two levels of booklet difficulty:

  • More difficult booklets (9), composed of difficult passages or a combination of medium and difficult passages
  • Less difficult booklets (9), composed of easy passages or a combination of easy and medium difficulty passages

In this approach, all countries administer all 18 passages, but in varying proportions. Higher performing countries will administer proportionally more of the more difficult booklets while lower performing countries will administer proportionally more of the less difficult booklets. The goal is a better match between assessment difficulty and student achievement in each country. 
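A simple sketch of this kind of country level proportional assignment is shown below (in Python). The 70/30 and 30/70 splits, group labels, and booklet identifiers are hypothetical values chosen only to illustrate the mechanism; they are not the allocation rates specified for PIRLS 2021:

    import random

    # Hypothetical booklet pools: 9 more difficult and 9 less difficult booklets.
    MORE_DIFFICULT = [f"D{i}" for i in range(1, 10)]
    LESS_DIFFICULT = [f"E{i}" for i in range(1, 10)]

    # Hypothetical country level allocation rates (share of more difficult booklets).
    ALLOCATION = {"higher_performing": 0.70, "lower_performing": 0.30}

    def assign_booklet(country_group, rng=random):
        """Randomly assign one booklet, with the difficulty mix set by country group."""
        p_difficult = ALLOCATION[country_group]
        pool = MORE_DIFFICULT if rng.random() < p_difficult else LESS_DIFFICULT
        return rng.choice(pool)  # within the chosen pool, booklets are equally likely

    # Example: students in a higher performing country still receive less
    # difficult booklets, just at a lower rate.
    print([assign_booklet("higher_performing") for _ in range(10)])

Because both pools remain in use in every country, all 18 passages are administered everywhere and the full reading construct is covered, while the mix of booklets shifts toward the expected achievement level of each country.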

The group adaptive design in PIRLS 2021 involves changing from the procedure used in previous PIRLS cycles, where booklets were randomly assigned to students at the same rate in each country, to one where more and less difficult booklets are assigned at different rates in different countries. This change is intended to improve the accuracy of measurement in countries participating in PIRLS and provide some practical and operational advantages. More specifically, the PIRLS group adaptive design provides the following:

  1. Better measurement at all achievement levels by matching booklet difficulty to student ability at the country level
  2. Full coverage of the reading construct, as all countries participate in the same assessment while adaptivity is provided at the population level
  3. Minimal disruption of the PIRLS design as there is no need for a routing block under this approach
  4. Improved student response rates, more student engagement, and less student frustration as passages are better aligned with target populations
  5. Possibility of targeting subpopulations – although the PIRLS 2021 group adaptive design is intended to be implemented at the country level, it also could be implemented within countries that have clearly defined subpopulations that vary in student ability

As outlined in this paper, there are ample examples of group-level adaptive approaches, from simple start/discontinue rules to elaborate longitudinal stage-based assessment designs. All of these rely on group-level adaptivity: they identify groups of test takers to be assigned targeted test forms that are better aligned with expected performance than complete random assignment or the use of only a single form would be.

The PIRLS group-adaptive design should benefit both high and low performing countries, in that students will be administered items that are too difficult or too easy for them at a lower rate than in previous assessments. This improved targeting of the ability distributions will lead to more accurate measurement and will, as an intended side effect, likely also reduce the item level non-response associated with administering overly challenging or overly easy items. Together, this is expected to lead to an overall improved database for reporting and secondary analyses.

References

1
Roid, G., & Barram, R. (2004). Essentials of Stanford-Binet Intelligence Scales (SB5) assessment. Hoboken, NJ: John Wiley & Sons.
2
von Davier, M., Cho, Y., & Pan, T. (2019). Effects of discontinue rules on psychometric properties of test scores. Psychometrika, 84(1), 147–163. https://doi.org/10.1007/s11336-018-09652-3
3
Yamamoto, K., Khorramdel, L., & von Davier, M. (2013). Chapter 17: Scaling PIAAC cognitive data. In Technical report of the Survey of Adult Skills (PIAAC). OECD. Available at: http://www.oecd.org/site/piaac/_Technical%20Report_17OCT13.pdf
4
Yan, D., von Davier, A. A., & Lewis, C. (Eds.). (2014). Computerized multistage testing: Theory and applications. New York, NY: CRC Press.
5
Yamamoto, K., Chen, H., & von Davier, M. (2014). Controlling multistage testing exposure rates in international large-scale assessments. Chapter 19 in D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications. New York, NY: CRC Press.
6
Yamamoto, K., Khorramdel, L., & von Davier, M. (2013). Chapter 18: Scaling outcomes. In Technical report of the Survey of Adult Skills (PIAAC). OECD. Available at: http://www.oecd.org/site/piaac/_Technical%20Report_17OCT13.pdf
7
Rock, D. A. (2017). Modeling change in large-scale longitudinal studies of educational growth: Four decades of contributions to the assessment of educational growth. In R. Bennett & M. von Davier (Eds.), Advancing human assessment (Methodology of Educational Measurement and Assessment). Springer.
8
Rock, D. A. (2007). A note on gain scores and their interpretation in developmental models designed to measure change in the early school years (Research Report No. RR-07-08). Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2007.tb02050.x
9
Pohl, S. (2014). Longitudinal multistage testing. Journal of Educational Measurement. https://doi.org/10.1111/jedm.12028