Item response theory

[1] Unlike simpler alternatives for creating scales and evaluating questionnaire responses, item response theory (IRT) does not assume that each item is equally difficult.

Items might be multiple-choice questions with correct and incorrect responses, but they are also commonly statements on questionnaires that allow respondents to indicate a level of agreement (a rating or Likert scale), patient symptoms scored as present/absent, or diagnostic information in complex systems.

Parameters on which items are characterized include their difficulty (known as "location" for their location on the difficulty range); discrimination (slope or correlation), representing how steeply the rate of success of individuals varies with their ability; and a pseudo-guessing parameter, characterizing the (lower) asymptote at which even the least able persons will score due to guessing (for instance, 25% for pure chance on a multiple-choice item with four possible responses).

Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord,[4] the Danish mathematician Georg Rasch, and the Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently.

Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich.

In the 1990s Margaret Wu developed two item response software programs that analyse PISA and TIMSS data: ACER ConQuest (1998) and the R package TAM (2010).

For tasks that can be accomplished using classical test theory (CTT), IRT generally brings greater flexibility and provides more sophisticated information.

Unidimensionality should be interpreted as homogeneity, a quality that should be defined or empirically demonstrated in relation to a given purpose or use, but not a quantity that can be measured.

The topic of dimensionality is often investigated with factor analysis, while the item response function (IRF) is the basic building block of IRT and is the center of much of the research and literature.

The exact value of the probability depends, in addition to ability, on a set of item parameters for the IRF.

For example, in the three parameter logistic model (3PL), the probability of a correct response to a dichotomous item i, usually a multiple-choice question, is:

$$ p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}} $$

where θ indicates that the person's abilities are modeled as a sample from a normal distribution for the purpose of estimating the item parameters, b_i is the item's difficulty (location), and a_i is its discrimination (slope).

The parameter c_i is the pseudo-guessing parameter: it indicates the probability that very low ability individuals will get this item correct by chance, mathematically represented as a lower asymptote.
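As an illustration, the following is a minimal sketch of the 3PL IRF in Python; the item parameter values are hypothetical, chosen only to show how the curve behaves, not drawn from any real test.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """Three-parameter logistic IRF: probability of a correct response given
    ability theta, discrimination a, difficulty b, and pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, and a
# chance floor of 0.25 (four response options).
theta = np.linspace(-4, 4, 9)
print(irf_3pl(theta, a=1.5, b=0.0, c=0.25))
# The probabilities rise from near the lower asymptote 0.25 toward 1 as theta increases.
```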

However, because of the greatly increased complexity, the majority of IRT research and applications utilize a unidimensional model.

"), or where the concept of guessing does not apply, such as personality, attitude, or interest items (e.g., "I like Broadway musicals.

The normal-ogive model derives from the assumption of normally distributed measurement error and is theoretically appealing on that basis.

One can estimate a normal-ogive latent trait model by factor-analyzing a matrix of tetrachoric correlations between items.

[10] This means it is technically possible to estimate a simple IRT model using general-purpose statistical software.

With rescaling of the ability parameter, it is possible to make the 2PL logistic model closely approximate the cumulative normal ogive.
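As a rough numerical check, the sketch below compares the rescaled logistic curve with the cumulative normal ogive; it assumes the conventional scaling constant D ≈ 1.702, a value not stated in the text above.

```python
import numpy as np
from scipy.stats import norm

D = 1.702  # conventional scaling constant; an assumption, not given in the text above

def logistic_2pl(theta, a, b):
    # 2PL IRF with the discrimination rescaled by D
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def normal_ogive(theta, a, b):
    # Normal-ogive IRF: Phi(a * (theta - b))
    return norm.cdf(a * (theta - b))

theta = np.linspace(-4.0, 4.0, 801)
gap = np.max(np.abs(logistic_2pl(theta, 1.0, 0.0) - normal_ogive(theta, 1.0, 0.0)))
print(f"maximum absolute difference: {gap:.4f}")  # roughly 0.01 across the range
```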

However, proponents of Rasch modeling prefer to view the Rasch model as a completely different approach to conceptualizing the relationship between data and theory.

[16] Operationally, this means that the IRT approaches include additional model parameters to reflect the patterns observed in the data (e.g., allowing items to vary in their correlation with the latent trait), whereas in the Rasch approach, claims regarding the presence of a latent trait can only be considered valid when both (a) the data fit the Rasch model, and (b) test items and examinees conform to the model.

As the noise is randomly distributed, it is assumed that, provided sufficient items are tested, the rank-ordering of persons along the latent trait by raw score will not change, but will simply undergo a linear rescaling.
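A minimal simulation sketch of this claim under the Rasch model follows; the sample sizes, difficulty range, and random seed are arbitrary choices made only for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_persons, n_items = 500, 60              # arbitrary sizes, for illustration only
theta = rng.normal(0.0, 1.0, n_persons)   # latent trait levels
b = rng.uniform(-2.0, 2.0, n_items)       # item difficulties (locations)

# Rasch model: the success probability depends only on theta - b.
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
responses = rng.binomial(1, p)
raw_scores = responses.sum(axis=1)

# With enough items, the raw score preserves the rank ordering of persons on theta.
rho, _ = spearmanr(raw_scores, theta)
print(f"Spearman rank correlation: {rho:.3f}")  # typically well above 0.9
```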

The first advantage is the primacy of Rasch's specific requirements,[19] which (when met) provide fundamental person-free measurement (where persons and items can be mapped onto the same invariant scale).

Traditionally, reliability is measured using a single index defined in various ways, such as the ratio of true-score variance to observed-score variance.
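In classical test theory notation, one such index is the proportion of observed-score variance attributable to true scores:

$$ \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} $$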

Using the additivity of item information functions with a large item bank, test information functions can be shaped to control measurement error very precisely.

Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT.

In place of reliability, IRT offers the test information function, which shows the degree of precision at different values of theta, θ.

Theta, θ, represents the magnitude of the latent trait of the individual, which is the human capacity or attribute measured by the test.
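As a sketch of how a test information function might be computed for a small bank of 2PL items (the item parameters below are hypothetical, and for the 3PL the item information formula is more involved), the test information and the corresponding standard error of θ can be obtained as follows:

```python
import numpy as np

def item_info_2pl(theta, a, b):
    """Fisher information of a single 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

# Hypothetical item bank (discriminations a, difficulties b).
a = np.array([0.8, 1.2, 1.5, 1.0, 2.0])
b = np.array([-1.5, -0.5, 0.0, 0.7, 1.5])

theta = np.linspace(-3, 3, 7)
# Item information functions are additive, so the test information is the
# sum over items; the standard error of theta is 1 / sqrt(information).
test_info = item_info_2pl(theta[:, None], a[None, :], b[None, :]).sum(axis=1)
se = 1.0 / np.sqrt(test_info)
for t, i, s in zip(theta, test_info, se):
    print(f"theta={t:+.1f}  information={i:.2f}  SE={s:.2f}")
```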

In fact, a portion of IRT research focuses on the measurement of change in trait level.

Although the two paradigms are generally consistent and complementary, there are a number of points of difference. It is also worth mentioning some specific similarities between CTT and IRT that help in understanding the correspondence between concepts.

Figure 1: Example of 3PL IRF, with dotted lines overlaid to demonstrate parameters.