Developing and monitoring valid assessments

Research | 4 minute read
ACER research analyses the organisation's tests to ensure they are fair for different groups of test takers. Three recently concluded studies examined item design and placement, and the effects of applicant age and gender.

In one study, ACER Research Fellow Dr Van Nguyen investigated the effect of age on gender differences in large-scale testing. The study used data from the 2010 Special Tertiary Admissions Test (STAT), a test of verbal and quantitative reasoning used to assist university admissions processes in Australia. STAT 2010 involved about 16 000 candidates aged between 18 and 60 years, of whom 60 per cent were women and 40 per cent men.

To conduct the study, candidates were classified into four groups by gender and by whether they were younger or older than 22 years of age. Psychometric analysis was performed on 2000 randomly selected candidates from each of the four groups. It revealed that men performed better than women on both verbal and quantitative reasoning. Younger candidates performed better than older candidates on quantitative reasoning, but worse on verbal reasoning. In both domains, the gender gap favouring men was wider among older candidates than among younger ones.
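The article does not describe the analysis in detail; the core comparison is simply the men-versus-women score gap computed within each age band. A minimal sketch with simulated scores (the real study used psychometric scaling, and every number here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scaled scores for 2000 randomly selected candidates per group,
# keyed by (gender, age band) as in the STAT study design.
groups = {(g, a): rng.normal(50, 10, size=2000)
          for g in ("men", "women") for a in ("<=22", ">22")}

for a in ("<=22", ">22"):
    # The study's finding corresponds to this gap being wider in the older band.
    gap = groups[("men", a)].mean() - groups[("women", a)].mean()
    print(f"age {a}: gender gap (men - women) = {gap:+.2f} score points")
```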

The study revealed that, when matching for ability, men performed better than women on test items that included graphs or pictures and on items with low verbal loading. Women, on the other hand, performed better than men on items requiring logical reasoning with words or letters, and on items with high verbal loading.
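"Matching for ability" points to a differential item functioning (DIF) analysis. The article does not name the method used, but a common approach is logistic regression DIF: predict success on each item from a candidate's total score and group membership, so a group effect that survives the ability match flags an item that behaves differently for equally able men and women. A sketch, with simulated responses standing in for the STAT data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_candidates, n_items = 4000, 26
# Simulated data: responses[i, j] = 1 if candidate i answered item j correctly.
responses = rng.integers(0, 2, size=(n_candidates, n_items))
is_male = rng.integers(0, 2, size=n_candidates)  # 1 = man, 0 = woman

for j in range(n_items):
    # Match on the rest score (total score excluding item j) so the item
    # under test does not contaminate the ability measure.
    rest = responses.sum(axis=1) - responses[:, j]
    X = np.column_stack([rest, is_male])
    fit = LogisticRegression().fit(X, responses[:, j])
    # A group coefficient far from zero flags uniform DIF on item j.
    print(f"item {j:2d}: gender effect = {fit.coef_[0][1]:+.3f} log-odds")
```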

In another study, ACER researchers Drs Luc Le and Van Nguyen investigated how item difficulty varies with an item's position in a test. This study used data from the 2010 Colombian Graduate Skills Assessment (GSA), a test of problem solving, critical thinking and interpersonal understandings. Twenty-six items in each test domain were arranged in different orders to create eight different test forms. Analysis was performed on 1000 randomly selected female candidates and 1000 randomly selected male candidates from a total pool of 8000 candidates.

The results showed that items generally became more difficult when located towards the end of the test. This positive relationship between the change in an item's difficulty and the change in its position was strongest for problem-solving items and weakest for interpersonal understandings items. The relationship was also stronger for men than for women, and differed between lower- and higher-ability groups.
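One simple way to quantify such a position effect with the GSA design is to estimate each item's difficulty separately in every form and regress the difficulty shift on the position shift. A sketch using classical logit difficulties in place of the study's actual (unstated) scaling method, with simulated inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_forms = 26, 8

# Hypothetical inputs: positions[f, j] = position of item j in form f (1-26),
# p_correct[f, j] = proportion answering item j correctly in form f.
positions = np.array([rng.permutation(n_items) + 1 for _ in range(n_forms)])
p_correct = np.clip(rng.uniform(0.3, 0.9, size=(n_forms, n_items)), 1e-3, 1 - 1e-3)

# Classical difficulty on a logit scale: higher values mean harder items.
difficulty = np.log((1 - p_correct) / p_correct)

# Centre each item's eight (position, difficulty) estimates, then fit one
# pooled slope: the change in difficulty per position moved later in the test.
d_pos = (positions - positions.mean(axis=0)).ravel()
d_diff = (difficulty - difficulty.mean(axis=0)).ravel()
slope = np.polyfit(d_pos, d_diff, 1)[0]
print(f"difficulty increase per position later in the test: {slope:+.4f} logits")
```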

In a third study, ACER Research Fellow Ms Dulce Lay investigated variation in item difficulty between the computer-based and paper-based versions of the International Student Admission Test (ISAT), a cross-curricular skills test for international students applying to a selection of Australian universities. The study analysed more than 1000 candidates from each of the 2009 paper-based and 2010 computer-based administrations, which shared 47 common items. The analysis revealed small variations in item difficulty between the two administration methods, with lengthy items proving more difficult in the computer-based test than in the paper-based test.
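The ISAT comparison hinges on those 47 common items: estimate each item's difficulty under each administration mode, remove any overall shift between the two cohorts, and inspect the item-by-item differences. A sketch under the same classical-logit assumption, with invented proportion-correct values:

```python
import numpy as np

rng = np.random.default_rng(2)
n_common = 47

# Hypothetical proportion-correct estimates for the 47 common items,
# one set per administration mode.
p_paper = rng.uniform(0.35, 0.9, size=n_common)
p_computer = np.clip(p_paper - rng.normal(0.0, 0.03, size=n_common), 0.01, 0.99)

b_paper = np.log((1 - p_paper) / p_paper)            # logit difficulty, paper
b_computer = np.log((1 - p_computer) / p_computer)   # logit difficulty, computer

# Centre each set to remove any overall cohort shift, then flag the items
# whose relative difficulty changed most between modes.
delta = (b_computer - b_computer.mean()) - (b_paper - b_paper.mean())
for j in np.argsort(-np.abs(delta))[:5]:
    print(f"item {j:2d}: difficulty shift (computer - paper) = {delta[j]:+.3f} logits")
```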

Findings from studies such as these highlight important considerations for the development of assessments. Results from the STAT study show that test developers must give careful consideration to gender differences when managing large-scale testing and assessment where test takers come from a range of age groups. Findings from the GSA study suggest that, when using multiple parallel test forms, common items should be located in similar positions within each form. Findings from the ISAT study provide useful information in an age where assessments are increasingly administered online.
