Differential item functioning

Differential item functioning (DIF) manifests when individuals from different groups, with comparable levels of the underlying skill, do not have an equal likelihood of answering an item correctly.

The DIF characteristic of an item is not determined solely by differing probabilities of selecting a specific response across groups; rather, DIF is present when members of different groups who share the same underlying ability have unequal probabilities of giving that response.

To build a general understanding of DIF, or measurement bias, consider the following example offered by Osterlind and Everson (2009).

This indicates that there is no DIF or item bias, because members of the reference and focal groups with the same underlying ability or attribute have the same probability of responding correctly.
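This condition can be stated formally. Writing \(P(u = 1 \mid \theta, G)\) for the probability of a correct response given ability \(\theta\) and group membership \(G\) (the notation here is illustrative; texts vary in their symbols), an item shows no DIF when

\[ P(u = 1 \mid \theta, G = r) = P(u = 1 \mid \theta, G = f) \quad \text{for all } \theta, \]

where \(r\) and \(f\) denote the reference and focal groups; DIF is present whenever this equality fails at some level of \(\theta\).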

Rather than the reference group holding a consistent advantage across the ability continuum, the conditional dependency shifts and changes direction at different locations on the ability continuum.
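In the notation above, writing \(P_r(\theta)\) and \(P_f(\theta)\) for the two groups' conditional probabilities of a correct response, uniform DIF corresponds to a difference that keeps the same sign across the ability range (shown here favoring the reference group), whereas nonuniform DIF corresponds to a difference that changes sign, so that the two curves cross:

\[ P_r(\theta) - P_f(\theta) > 0 \ \text{for all } \theta \quad (\text{uniform DIF}), \qquad P_r(\theta) - P_f(\theta) \ \text{changes sign in } \theta \quad (\text{nonuniform DIF}). \]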

Differences in item characteristic curves (ICCs) indicate that examinees from the two groups with identical ability levels have unequal probabilities of responding correctly to an item.

The Mantel-Haenszel (MH) procedure is a chi-squared, contingency-table-based approach that examines differences between the reference and focal groups on each item of the test, one at a time.[13]

The next step in the calculation of the MH statistic is to use data from the contingency table to obtain an odds ratio for the two groups on the item of interest at a particular score level.
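In a standard presentation, the responses at each score level \(k\) form a 2\(\times\)2 table: \(A_k\) and \(B_k\) are the numbers of reference-group members answering correctly and incorrectly, \(C_k\) and \(D_k\) are the corresponding focal-group counts, and \(T_k\) is the total at that level. The common odds ratio across levels is then estimated as

\[ \hat{\alpha}_{\mathrm{MH}} = \frac{\sum_k A_k D_k / T_k}{\sum_k B_k C_k / T_k}, \]

where a value of 1 indicates no DIF. In educational testing this is often reported on the ETS delta scale, \(\Delta_{\mathrm{MH}} = -2.35 \ln \hat{\alpha}_{\mathrm{MH}}\), on which negative values indicate DIF against the focal group.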

As noted earlier, DIF examines the probability of correctly responding to or endorsing an item conditioned on the latent trait or ability.

After examination of ICCs and subsequent suspicion of DIF, statistical procedures are implemented to test differences between parameter estimates.

ICCs are mathematical functions relating position on the latent trait continuum to the probability of giving a particular response.

Thus, examinees higher on the latent trait, or higher in ability, have a greater chance of responding correctly or endorsing the item.

The inflection point is determined by the difficulty of the item, which corresponds to a value on the ability or latent trait continuum.

This corresponds to a lower asymptote, which allows for the possibility that an individual low in ability answers a moderate or difficult item correctly, for example by guessing.
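These features can be illustrated with the widely used three-parameter logistic (3PL) model, whose ICC has the form

\[ P(\theta) = c + (1 - c) \, \frac{1}{1 + e^{-a(\theta - b)}}, \]

where \(b\) is the difficulty parameter locating the inflection point, \(a\) is the discrimination parameter governing the slope at that point, and \(c\) is the lower asymptote, or pseudo-guessing, parameter.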

Using a method similar to a Student's t-test, the next step is to determine whether the difference in difficulty is statistically significant.
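A test of this kind, in a form often attributed to Lord (1980), divides the difference between the difficulty estimates obtained by fitting the model separately in each group by the standard error of that difference:

\[ d = \frac{\hat{b}_f - \hat{b}_r}{\sqrt{\mathrm{SE}(\hat{b}_r)^2 + \mathrm{SE}(\hat{b}_f)^2}}, \]

with large absolute values of \(d\) indicating DIF in item difficulty.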

Logistic regression approaches to DIF detection involve running a separate analysis for each item.

In this model, individuals are matched on ability by a matching variable, in this case a total test score, similar to the matching employed by the Mantel-Haenszel procedure.
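A common form of this model, following Swaminathan and Rogers (1990), predicts the probability of a correct response from the matching total score \(\theta\), a group indicator \(g\), and their interaction:

\[ P(u = 1 \mid \theta, g) = \frac{e^{\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)}}{1 + e^{\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)}}, \]

where a nonzero \(\beta_2\) (a group effect at equal ability) indicates uniform DIF and a nonzero \(\beta_3\) (an ability-by-group interaction) indicates nonuniform DIF.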

However, the issue revolves less around whether the groups are equal in size than around whether the number of people per group is sufficient to provide enough statistical power to identify DIF.

Therefore, in such instances, it may be appropriate to modify or adjust the data so that the groups being compared for DIF are equal or closer in size.

Dummy coding or recoding is a common practice employed to adjust for disparities in the size of the reference and focal groups.
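As a minimal sketch of how the dummy coding and the logistic regression model above fit together (the file, column names, and use of the statsmodels library are assumptions for illustration):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per examinee, with a scored item (0/1),
    # a total test score, and a group label.
    df = pd.read_csv("responses.csv")

    # Dummy-code group membership: reference group = 0, focal group = 1.
    df["g"] = (df["group"] == "focal").astype(int)

    # Logistic regression DIF model: total score, group, and their interaction.
    fit = smf.logit("item1 ~ total + g + total:g", data=df).fit()
    print(fit.summary())

    # A significant coefficient on g suggests uniform DIF;
    # a significant coefficient on total:g suggests nonuniform DIF.

If the reference group is far larger than the focal group, the data could also be subsampled at this stage to bring the group sizes closer, as discussed above.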

Another issue pertaining to sample size relates directly to the statistical procedure being used to detect DIF.

Within the logistic regression approach, high-leverage values and outliers are of particular concern and must be examined prior to DIF detection.

As noted in earlier sections, a total test score is typically used to match individuals on ability.

Using a minimum of 20 items allows for greater variance in the score distribution, which results in more meaningful ability-level groups.

Test items need to tap accurately into the construct of interest in order to derive meaningful ability-level groups.

Gadermann et al. (2012),[31] Revelle and Zinbarg (2009),[32] and John and Soto (2007)[33] offer more information on modern approaches to structural validation and on more precise and appropriate methods for assessing reliability.

As with all psychological research and psychometric evaluation, statistics play a vital role but should by no means be the sole basis for decisions and conclusions reached.

This type of bias can often be addressed by using separate test norms for different groups to ensure fairness in assessment.

Factors such as socioeconomic status, cultural differences, language barriers, and disparities in knowledge access can contribute to nonuniform DIF.

The presence of DIF does not necessarily mean the entire test is biased; instead, it signals that specific items may be biased, requiring attention to maintain test integrity and fairness for all examinees.