The 14 key areas of a robust assessment program
For a large-scale assessment of learning outcomes to be high quality, technically sound and useful for educational policy, the following areas should be addressed.
High quality cognitive instruments
The cognitive or ‘test’ instruments are at the heart of the assessment program, intended to measure a learner’s knowledge and skills in a particular learning domain (e.g. a reading literacy or a mathematical literacy test).
The cognitive items should meet the requirements of the assessment framework. Items will be of a high quality if their development is subject to rigorous quality assurance.
Quality assurance in item development involves:
- expert review: Peer-review of the developed items
- cognitive labs: Developers administer items to small groups of individuals under observation to investigate the techniques respondents use to answer the items
- piloting: Items are administered to larger groups of individuals under test conditions to obtain statistical information which can assist in the selection of the final pool of items
Reporting and dissemination
Reporting and dissemination should be guided by a communication strategy that incorporates a range of approaches to cater for a diversity of audiences.
A diversity of audiences includes:
- teachers and school leaders
- curriculum developers
- teacher trainers
- policymakers and policy analysts
Statistical methods appropriate to a complex sample design should be used to ensure accurate and reliable results.
Statistical methods for accurate and reliable results:
- Plausible values should be calculated so that the population estimates are unbiased.
In Item Response Theory (IRT) analysis, each individual’s score on the assessment is used to calculate a point estimate (i.e. a single-number estimate) of that individual’s ability. Estimating the ability of a population from the point estimates of sampled individuals results in bias if the point estimates are not sufficiently precise. This bias can be avoided by calculating an estimated distribution of ability for each individual and then randomly drawing several plausible values from this distribution as estimates of the individual’s ability. The ability of the population represented by the sample is then estimated from the set of all plausible values for the individuals in the sample.
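The effect of plausible values can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not part of the guidance above: true abilities follow a standard normal distribution, each individual's score is the true ability plus normal measurement error, and each individual's ability distribution is therefore normal. The function names and parameter values are hypothetical.

```python
import random
import statistics


def draw_plausible_values(post_mean, post_sd, n_draws, rng):
    """Draw plausible values from one individual's (assumed normal) ability distribution."""
    return [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]


def simulate(n_students=5000, error_var=0.5, n_draws=5, seed=1):
    """Compare the population variance from point estimates and from plausible values."""
    rng = random.Random(seed)
    shrink = 1.0 / (1.0 + error_var)        # weight given to the noisy score
    post_sd = (error_var * shrink) ** 0.5   # spread of each individual's distribution
    point_estimates = []
    pv_sets = [[] for _ in range(n_draws)]
    for _ in range(n_students):
        true_ability = rng.gauss(0.0, 1.0)  # true population variance is 1
        noisy_score = true_ability + rng.gauss(0.0, error_var ** 0.5)
        post_mean = shrink * noisy_score    # the individual's point estimate
        point_estimates.append(post_mean)
        draws = draw_plausible_values(post_mean, post_sd, n_draws, rng)
        for k, value in enumerate(draws):
            pv_sets[k].append(value)
    var_point = statistics.pvariance(point_estimates)
    # Compute the statistic once per set of plausible values, then average.
    var_pv = statistics.mean(statistics.pvariance(s) for s in pv_sets)
    return var_point, var_pv


var_point, var_pv = simulate()
print(f"variance from point estimates:  {var_point:.2f}")
print(f"variance from plausible values: {var_pv:.2f}")
```

Under these assumptions the point estimates understate the spread of ability in the population, while the plausible values recover a variance close to the true value of 1.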
- Sample weights should be applied.
A complex sample design often divides the population of interest as a way of improving precision. This is called stratification, and the different divided groups are called strata (singular: stratum). In order to compare results across different strata, usually the same number of individuals are sampled from each stratum. But there will be different numbers of individuals in the population of interest within each stratum. This means that from one stratum to the next, the probability that one individual from the population of interest is selected to be part of the sample will not be the same. And it means that from one stratum to the next, the number of individuals in the population of interest represented by each sampled individual – known as the individual’s weight – will not be the same. Weights must be assigned to all sampled individuals before stratum-level results can be put together to obtain estimates for the entire population of interest.
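The weighting logic can be sketched as follows. This is an illustration only, assuming two hypothetical strata with made-up population sizes and scores, and a design in which every individual within a stratum has the same selection probability. Operational weights usually also account for school-level selection and non-response.

```python
# Hypothetical strata: population size and the scores of the sampled individuals.
strata = {
    "urban": {"population": 9000, "scores": [520, 540, 500, 560]},
    "rural": {"population": 1000, "scores": [420, 440, 400, 460]},
}


def weighted_mean(strata):
    """Combine stratum samples into one population estimate using sample weights."""
    weighted_sum = 0.0
    total_weight = 0.0
    for info in strata.values():
        # Each sampled individual represents population / sample-size individuals.
        weight = info["population"] / len(info["scores"])
        for score in info["scores"]:
            weighted_sum += weight * score
            total_weight += weight
    return weighted_sum / total_weight


all_scores = [s for info in strata.values() for s in info["scores"]]
unweighted = sum(all_scores) / len(all_scores)
print(f"unweighted mean: {unweighted:.1f}")
print(f"weighted mean:   {weighted_mean(strata):.1f}")
```

Because the urban stratum is nine times larger than the rural one, the weighted mean sits much closer to the urban average than the unweighted mean does.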
- Replication methods should be used to determine sample variance.
When calculating sampling estimates, standard data analysis software packages will assume that the sample was drawn using simple random sampling methods. If the sample is drawn according to a complex sample design involving clustering, stratification and weighting, different analytical methods should be used. One approach with a complex sample design is to use replication. In replication, multiple subsamples are systematically generated, and the estimates from these subsamples are used as the basis for calculating the overall sampling estimates.
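One simple form of replication is the delete-one-cluster jackknife, sketched below. The choice of this particular variant, the record layout (cluster, weight, score) and the data are assumptions made for illustration. Large-scale assessments often use other schemes, such as paired jackknife or balanced repeated replication.

```python
def weighted_mean(records):
    """Weighted mean of scores; each record is (cluster_id, weight, score)."""
    total_weight = sum(w for _, w, _ in records)
    return sum(w * y for _, w, y in records) / total_weight


def jackknife_se(records):
    """Estimate and standard error using delete-one-cluster jackknife replicates."""
    clusters = sorted({c for c, _, _ in records})
    g = len(clusters)
    if g < 2:
        raise ValueError("at least two clusters are needed")
    full_estimate = weighted_mean(records)
    replicate_estimates = []
    for dropped in clusters:
        # Drop one cluster and scale up the weights of the remaining clusters.
        kept = [(c, w * g / (g - 1), y) for c, w, y in records if c != dropped]
        replicate_estimates.append(weighted_mean(kept))
    variance = (g - 1) / g * sum(
        (r - full_estimate) ** 2 for r in replicate_estimates
    )
    return full_estimate, variance ** 0.5


# Hypothetical sample: three schools (clusters) with two students each.
records = [
    ("School A", 50.0, 510.0), ("School A", 50.0, 530.0),
    ("School B", 40.0, 450.0), ("School B", 40.0, 470.0),
    ("School C", 60.0, 560.0), ("School C", 60.0, 580.0),
]
estimate, standard_error = jackknife_se(records)
print(f"mean = {estimate:.1f}, standard error = {standard_error:.1f}")
```

Dropping whole schools rather than individual students is what lets the replicates reflect the clustering in the sample design.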
Special analyses are needed to reveal the influence of contextual factors on student results:
- Multilevel modelling is required if classroom, school or district-level estimates are to be examined.
Multilevel modelling or hierarchical linear modelling of data is appropriate in situations where people or objects are nested within larger groups, for example students are grouped within classes, classes are nested in schools, schools are located in administrative districts or regions and so on.
The aim of multilevel modelling is to overcome the problems associated with single-level procedures, where data at different levels have to be either aggregated (e.g. student-level data have to be averaged to the class or school level) or disaggregated (e.g. school-level data have to be assigned to each individual student) before they can be analysed.
In the aggregation process, information is lost because the variance of the lower level variables, which often represent a considerable amount of the overall variance, is reduced. The disaggregation process leads to a violation of the assumption of the independence of observations because the same value is assigned to all cases at the lower level.
In general, multilevel modelling is employed to:
- improve the estimation of individual effects
- model cross level effects
- partition variance components across levels of analysis in order to apply significance tests more appropriately
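The idea of partitioning variance across levels can be sketched with a simple calculation. This is not a full multilevel model: it assumes a balanced two-level design (the same number of students in every school), uses the one-way analysis of variance estimator, and works on made-up scores. In practice, dedicated multilevel modelling software would be used.

```python
import statistics


def variance_components(groups):
    """Split score variance into between-school and within-school components.

    groups maps a school name to a list of student scores (equal list lengths).
    """
    sizes = {len(scores) for scores in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal group sizes")
    n = sizes.pop()
    k = len(groups)
    if n < 2 or k < 2:
        raise ValueError("need at least two schools with two students each")
    grand_mean = statistics.mean(s for scores in groups.values() for s in scores)
    school_means = {name: statistics.mean(scores) for name, scores in groups.items()}
    # Mean squares between schools and within schools.
    ms_between = n * sum((m - grand_mean) ** 2 for m in school_means.values()) / (k - 1)
    ms_within = sum(
        (s - school_means[name]) ** 2
        for name, scores in groups.items()
        for s in scores
    ) / (k * (n - 1))
    between = max((ms_between - ms_within) / n, 0.0)
    within = ms_within
    # Share of total variance that lies between schools (intraclass correlation).
    return between, within, between / (between + within)


schools = {
    "School A": [400.0, 420.0],
    "School B": [480.0, 500.0],
    "School C": [560.0, 580.0],
}
between, within, share_between = variance_components(schools)
print(f"between = {between:.0f}, within = {within:.0f}, share = {share_between:.2f}")
```

A large between-school share indicates that students within the same school resemble each other, which is the situation in which single-level analyses are most misleading.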
Item Response Theory (IRT) should be used to construct numerical scales to report results for each subject domain.
Item Response Theory (IRT):
Previously, Classical Test Theory (CTT) was used to compare individuals’ results on an assessment. In CTT, the overall scores individuals obtain on the assessment are compared.
Using CTT means that comparisons are only true comparisons if the same assessment items are administered to all assessed individuals in the same order. In other words, in CTT, individuals’ results are item-dependent. This is not practical because:
- the same assessment often cannot be administered multiple times (e.g. year after year) because it is too difficult to keep the items secure
- many items are required to assess the many facets of a broad construct – so many items that they could not all be administered to each individual
In contrast to CTT, IRT analysis yields results for assessed individuals that are item-independent. This means that the results for assessed individuals can be reported on the same numerical scale, and therefore compared, even when they did not complete the same assessment.
IRT allows comparisons of performance over time and across grades.
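The difference between raw scores and IRT scale scores can be sketched with the Rasch model, one of the simplest IRT models. The choice of the Rasch model, the item difficulties and the estimation routine below are assumptions for illustration, and the sketch takes the item difficulties as already known on a common scale.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, difficulties, iterations=50):
    """Maximum likelihood ability estimate from 0/1 responses (Newton-Raphson)."""
    raw_score = sum(responses)
    if raw_score == 0 or raw_score == len(responses):
        raise ValueError("no finite estimate for all-wrong or all-correct responses")
    ability = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(ability, b) for b in difficulties]
        gradient = raw_score - sum(probs)            # observed minus expected score
        information = sum(p * (1.0 - p) for p in probs)
        ability += gradient / information
    return ability


# Two hypothetical test forms with known item difficulties on the same scale.
easy_form = [-2.0, -1.0, 0.0]
hard_form = [0.0, 1.0, 2.0]

# Both individuals answer two of three items correctly (same raw score).
ability_easy = estimate_ability([1, 1, 0], easy_form)
ability_hard = estimate_ability([1, 1, 0], hard_form)
print(f"easy form: {ability_easy:.2f}, hard form: {ability_hard:.2f}")
```

The two individuals have the same raw score, but the one who took the harder form receives a higher ability estimate, because the estimate takes the difficulty of the items into account.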
The numerical scales form the basis of learning metrics:
The figure shows an example learning metric. The key features of this metric are:
- The numerical scale against which numerical scores – known as proficiency scores – are reported
- The proficiency descriptions that give substantive meaning to sub-ranges of proficiency scores
- Benchmarks showing agreed target levels for Grade 3 and Grade 6
- Indicators showing the distribution of proficiency scores for the assessed students and the mean proficiency scores for different subgroups of the assessed students (i.e. girls, boys, rural students, urban students)
Data management should be standardised.
Data management covers data security, data capture, data cleaning and version control.
Data capture and data cleaning:
- Data capture is the process by which the raw assessment data are transferred to an initial database. Some examples of data capture are manual data entry and scanning.
- Data cleaning is the process whereby the data in the initial database are checked for discrepancies, errors and outliers. Data cleaning should include mechanisms that enable discrepant data to be checked at the source.
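A data cleaning check of this kind can be sketched as follows. The record layout, the response codes and the function name are hypothetical. The point of the sketch is that each problem is reported with the student ID and item so that it can be checked at the source.

```python
from collections import Counter


def find_discrepancies(records, valid_codes):
    """Return (student_id, item, problem) tuples to check against the source."""
    problems = []
    # Flag student IDs that were captured more than once.
    id_counts = Counter(record["student_id"] for record in records)
    for student_id, count in id_counts.items():
        if count > 1:
            problems.append((student_id, None, "duplicate student ID"))
    # Flag response codes outside the set allowed by the codebook.
    for record in records:
        for item, code in record["responses"].items():
            if code not in valid_codes:
                problems.append((record["student_id"], item, f"invalid code {code!r}"))
    return problems


# Hypothetical captured data: "9" is assumed to be the code for a missing response.
captured = [
    {"student_id": "S001", "responses": {"Q1": "A", "Q2": "C"}},
    {"student_id": "S002", "responses": {"Q1": "E", "Q2": "B"}},
    {"student_id": "S001", "responses": {"Q1": "B", "Q2": "D"}},
]
issues = find_discrepancies(captured, valid_codes={"A", "B", "C", "D", "9"})
for issue in issues:
    print(issue)
```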
Standardised field operations
Test materials should be standardised.
Independent, trained test administrators should manage assessment coordination and administration to ensure that standardised test conditions are applied.
Aspects of standardised field operations include:
- Assessment coordination: dates, student eligibility guidelines, consent, test materials, student participation and tracking, session reporting, security of test materials
- Assessment administration: timing, instructions, scripts for student queries, guidelines for student absences and follow-up sessions
The sample design should involve scientific sampling methods.
Large-scale assessments of learning outcomes generally:
- develop a comprehensive sampling frame enumerating the target population
- cluster individuals to be assessed in manageable units such as schools or classes
- stratify the sampling frame by areas of research interest such as region, school location (e.g. urban or rural), school authority, school size, gender of students, language of instruction.
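Selecting schools from a stratified sampling frame can be sketched as below. The frame, the stratum labels and the use of simple random sampling within each stratum are assumptions for illustration. Operational designs often select schools with probability proportional to size instead.

```python
import random
from collections import defaultdict


def select_schools(frame, schools_per_stratum, seed=0):
    """Randomly select the same number of schools (clusters) from each stratum."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    by_stratum = defaultdict(list)
    for school in frame:
        by_stratum[school["stratum"]].append(school["school_id"])
    sample = {}
    for stratum, school_ids in sorted(by_stratum.items()):
        n = min(schools_per_stratum, len(school_ids))
        sample[stratum] = sorted(rng.sample(school_ids, n))
    return sample


# Hypothetical sampling frame: 20 urban schools and 8 rural schools.
frame = (
    [{"school_id": f"U{i:02d}", "stratum": "urban"} for i in range(20)]
    + [{"school_id": f"R{i:02d}", "stratum": "rural"} for i in range(8)]
)
selected = select_schools(frame, schools_per_stratum=4)
for stratum, school_ids in selected.items():
    print(stratum, school_ids)
```

Because four schools are drawn from strata of different sizes, the selected schools have different selection probabilities, which is why sample weights are needed at the analysis stage.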
Using scientific sampling methods guarantees an appropriate degree of statistical precision and inferential validity:
- In a sample survey, the population of interest is represented by a sample drawn from that population.
- Data collected from the survey yields statistics that describe certain characteristics of the sample.
- The statistics of the sample are estimates of parameters that describe the same characteristics in the population of interest.
The sample design aims to ensure:
- statistical precision, i.e. that there is acceptable variation between multiple estimates of the same population parameter
- inferential validity, i.e. that each statistic is an acceptable approximation of the corresponding population parameter.
A comprehensive test design should be formulated so that:
- the most efficient sample sizes are used
- any measures of change over time remain stable
- the content of the test is balanced.
The test design covers:
- the test mode (e.g. paper-and-pencil or computer-based)
- the number of items required to measure and report on each subject domain
- the number of test forms required
- the number of items required for linking test forms
- the balance of cognitive assessment and background questionnaire to be administered
- the required number of individuals who must be assessed to meet the level of accuracy specified in the technical standards.
Linking is achieved by constructing the test forms so that, while each form has a different subset of items, some common items are included across the forms. Including these common items means that the performances of individuals who complete different test forms can be compared.
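Linking test forms through common items can be sketched as follows. The approach shown, a simple mean shift based on the difficulty estimates of the common items, is one of several linking methods and is chosen here only for illustration. The item names and difficulty values are made up.

```python
def linking_shift(form_a, form_b):
    """Constant that moves form B difficulties onto the form A scale.

    Each form maps item names to difficulty estimates from a separate calibration.
    """
    common_items = sorted(set(form_a) & set(form_b))
    if not common_items:
        raise ValueError("the forms share no common items, so they cannot be linked")
    differences = [form_a[item] - form_b[item] for item in common_items]
    return sum(differences) / len(differences)


# Hypothetical difficulty estimates; Q2 and Q3 appear on both forms.
form_a = {"Q1": -0.5, "Q2": 0.2, "Q3": 1.0}
form_b = {"Q2": -0.3, "Q3": 0.5, "Q4": 1.2}

shift = linking_shift(form_a, form_b)
# Express every form B item on the form A scale.
form_b_linked = {item: difficulty + shift for item, difficulty in form_b.items()}
print(f"shift = {shift:.2f}")
print(f"Q4 on the form A scale = {form_b_linked['Q4']:.2f}")
```

The more common items the forms share, the more stable the estimated shift, which is why the number of linking items is part of the test design.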
Linguistic quality control
Linguistic quality assurance is necessary to ensure that items administered in multiple languages are psychometrically and linguistically equivalent.
Aspects of linguistic quality assurance include:
- general translation guidelines
- item-specific translation guidelines
Steps in a linguistic quality assurance process may include:
- documented double independent translation
- documented independent reconciliation of the translations
- documented verification by an independent language expert (to ensure that translations have adhered to the guidelines)
High quality contextual instruments
Collecting contextual or ‘background’ data is critical to a robust high-quality assessment program. While information on cognitive learning outcomes is important, information on contextual factors that are associated with achievement (e.g. socio-economic status measures) or considered as important outcomes of education per se (e.g. attitudes and engagement), can add significant value for evidence-based education policy making.
Typically large-scale assessments use questionnaires to collect background information on student and school-level factors (e.g. using questionnaires for students, parents, teachers and school principals). Less frequently, quantitative data collection is combined with qualitative methods such as classroom or school observations. Data at the system level can be collected through questionnaires for major stakeholders (e.g. a curriculum questionnaire) or derived from existing education databases, such as a school census or an Education Management Information System (EMIS).
The contextual instruments should be developed based on a conceptual framework (see assessment framework). All questions to be included should undergo rigorous development, similarly to the cognitive item writing process. Quality assurance mechanisms such as cognitive labs and piloting are essential for the development of the contextual instruments, to ensure respondents understand the meaning of the questions and are able to provide useable responses.
An assessment framework:
- describes the why, what and how of the assessment
- guides the development of the cognitive instruments/tests
- guides contextual instruments development
- gives stakeholders a common language to discuss the assessment
- communicates the purpose and features of the assessment to a broader audience.
An assessment framework includes information about:
the cognitive domains (e.g. reading literacy, mathematical literacy):
- definition of the domains
- knowledge and skills to be assessed
- test time
- test mode (e.g. paper-and-pencil, computer-based)
- distribution of item types (e.g. multiple choice, constructed response).
the contextual data collection:
- description of the purposes of the contextual data collection, the policy priorities and education issues to be addressed and investigated
- description of the underlying theoretical model or conceptual framework for the contextual data collection
- discussion of the content of the contextual data collection
- description of which instruments will be used (e.g. paper or online questionnaires, interviews) and for which respondents (e.g. students, school principals, teachers, parents) and other data sources.
Technical standards should be set at the start of the project, adhered to during the project lifetime, and reported against at the end of the project.
Technical standards ensure that conclusions drawn about the sample of assessed individuals are valid for the entire population(s) to which those individuals belong.
Technical standards may cover:
- school and student response rates
- survey sample size
- degree of statistical accuracy required
- policy on language of assessment
- psychometric quality of instruments
- item security
- test administration procedures
Project team and infrastructure
The project team must be motivated and committed, and have enough working time to dedicate to their key responsibilities.
Key responsibilities include:
- project management
- policy liaison
- coordination of test development and translation
- field operations and project administration
- data management, sampling and analysis
- reporting and communications
The project team needs adequate systems and infrastructure:
- financial resources:
  - adequate budget
  - efficient procurement practices
- support services:
  - IT support
  - administrative support
- physical infrastructure:
  - suitable office space
  - communication resources (phone, fax, email)
  - a networked computer environment
  - up-to-date computer software and hardware
Policy goals and issues
The assessment should be designed with the aim of obtaining data to help address policy issues of interest and intended policy goals.
Examples of policy issues of interest:
- performance in mathematics – girls vs. boys
- effect on performance if language spoken at home is different to language of instruction
- educational value – private schools vs. government schools
- teacher professional development – rural vs. urban
Examples of intended policy goals:
- ensure quality of the education system – diagnose strengths and weaknesses
- ensure equity of the education system – comparisons between groups
- ensure accountability of the education system – report results to stakeholders or evaluate policy interventions
The assessment program should also include a follow-up so that results can inform and influence a policy response.
The follow-up should involve:
- an analysis of policy options responding to the results
- the formulation of additional research questions that may be answered by further data analysis
- a discussion of policy issues that might feed into the design of subsequent assessment cycles