Classical test theory

Classical test theory (CTT) is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers.

The description of classical test theory below follows these seminal publications.

Classical test theory was born only after three achievements or ideas had been conceptualized. In 1904, Charles Spearman showed how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction. [2]

Spearman's finding is thought by some to be the beginning of classical test theory (Traub 1997).
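Spearman's correction for attenuation can be illustrated with a short sketch. It assumes the usual textbook form of the correction, in which the observed correlation is divided by the square root of the product of the two reliabilities; the function name and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Correct an observed correlation for attenuation.

    r_xy: observed correlation between two measures
    r_xx, r_yy: reliabilities of the two measures (assumed known)
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed r = 0.42, reliabilities 0.70 and 0.80.
corrected = disattenuate(0.42, 0.70, 0.80)
print(f"corrected correlation: {corrected:.3f}")
```

Because both reliabilities are at most 1, the corrected value is never smaller in magnitude than the observed correlation.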

Others who influenced the framework of classical test theory include George Udny Yule, Truman Lee Kelley, Fritz Kuder and Marion Richardson (who developed the Kuder–Richardson formulas), Louis Guttman, and, most recently, Melvin Novick, along with others working in the quarter century after Spearman's initial findings.

Classical test theory assumes that each person has a true score, T, that would be obtained if there were no errors in measurement.

The square root of the reliability is the absolute value of the correlation between true and observed scores.
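The relation between reliability and the true-observed correlation can be checked with a small simulation. The sketch below assumes the standard CTT decomposition of an observed score X into a true score T plus an independent error E, and takes reliability to be the ratio of true-score variance to observed-score variance. The sample size, means, and standard deviations are hypothetical, and true scores are available here only because the data are simulated.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation computed from population moments."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 20_000                # hypothetical number of examinees

true_scores = [rng.gauss(50, 10) for _ in range(n)]       # T, variance 100
errors = [rng.gauss(0, 5) for _ in range(n)]              # E, variance 25
observed = [t + e for t, e in zip(true_scores, errors)]   # X = T + E

# Reliability as var(T) / var(X); in expectation 100 / 125 = 0.8.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
r_tx = pearson(true_scores, observed)

print(f"reliability       = {reliability:.3f}")
print(f"sqrt(reliability) = {math.sqrt(reliability):.3f}")
print(f"|corr(T, X)|      = {abs(r_tx):.3f}")
```

In a finite sample the two printed quantities agree only approximately, because the sample covariance between T and E is not exactly zero.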

Reliability cannot be estimated directly since that would require one to know the true scores, which according to classical test theory is impossible.

One way of estimating reliability is by constructing a so-called parallel test.
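The idea behind the parallel-test approach can be sketched by simulation. The example assumes the usual definition of parallel tests, which is not spelled out in the text above: both forms share the same true score for each examinee and have independent errors with equal variance. Under that assumption the correlation between the two forms estimates the reliability. All numeric settings are hypothetical.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation computed from population moments."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

rng = random.Random(7)    # fixed seed so the run is repeatable
n = 20_000                # hypothetical number of examinees

true_scores = [rng.gauss(50, 10) for _ in range(n)]    # shared true scores
form_a = [t + rng.gauss(0, 5) for t in true_scores]    # first form
form_b = [t + rng.gauss(0, 5) for t in true_scores]    # parallel form

# With var(T) = 100 and var(E) = 25, the reliability is 0.8 in expectation.
r_parallel = pearson(form_a, form_b)
print(f"correlation between parallel forms: {r_parallel:.3f}")
```

No true scores enter the final computation, which is what makes the approach usable with real data whenever a genuinely parallel form exists.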

Cronbach's α can be shown to provide a lower bound for reliability under rather mild assumptions.

Thus, the reliability of test scores in a population is at least as high as the value of Cronbach's α.

Because α can be computed from a single test administration, this method is empirically feasible and, as a result, very popular among researchers.
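A minimal sketch of the computation follows, assuming the usual formula for Cronbach's α: with k items, α equals k/(k − 1) times one minus the ratio of the summed item variances to the variance of the total score. The function name and the small response matrix are hypothetical.

```python
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for a matrix with one row per examinee
    and one column per item."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [statistics.pvariance(col) for col in zip(*rows)]
    total_var = statistics.pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 examinees, 4 dichotomously scored items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

For this matrix the item variances sum to 0.8 and the total-score variance is 2, so α works out to (4/3)(1 − 0.4) = 0.8. Using sample rather than population variances gives the same value, since the correction factor cancels in the ratio.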

[3] As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability.

Reliability is supposed to say something about the general quality of the test scores in question.

Classical test theory does not say how high reliability is supposed to be.

A value around .8 is recommended for personality research, while .9 or higher is desirable for individual high-stakes testing. [4]

These 'criteria' are not based on formal arguments, but rather are the result of convention and professional practice.

The extent to which they can be mapped to formal principles of statistical inference is unclear.

The P-value represents the proportion of examinees responding in the keyed direction, and is typically referred to as item difficulty.
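The item P-value is simple to compute from scored responses, as the following sketch shows. It assumes dichotomous scoring, with 1 marking a response in the keyed direction; the function name and data are hypothetical. This P-value is a proportion and is unrelated to the p-value of significance testing.

```python
def item_p_values(rows):
    """Proportion of examinees responding in the keyed direction,
    computed per item (one row per examinee, one column per item)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical responses: 5 examinees, 4 items, 1 = keyed response.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

p_values = item_p_values(responses)
print(p_values)
```

Here the first item has a P-value of 0.8 and the last a P-value of 0.2. A higher value means more examinees gave the keyed response, so despite the name "difficulty", higher P-values indicate easier items.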

However, general statistical packages often do not provide a complete classical analysis (Cronbach's α is only one of many important statistics), and in many cases specialized software for classical analysis is also necessary.

A further shortcoming concerns the standard error of measurement: according to classical test theory, it is assumed to be the same for all examinees.

However, as Hambleton explains in his book, scores on any test are unequally precise measures for examinees of different ability, thus making the assumption of equal errors of measurement for all examinees implausible (Hambleton, Swaminathan & Rogers 1991, p. 4).
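The assumption being criticized can be made concrete with a short sketch. It uses the standard CTT expression for the standard error of measurement, the standard deviation of observed scores times the square root of one minus the reliability; this formula is not given in the text above. The score scale, reliability, and example scores are hypothetical, and the band shown is a rough normal-theory interval.

```python
import math

def standard_error_of_measurement(sd_observed, reliability):
    """CTT standard error of measurement: SD_X * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie in [0, 1]")
    return sd_observed * math.sqrt(1 - reliability)

# Hypothetical test: observed-score SD of 15, reliability of 0.91.
sem = standard_error_of_measurement(15, 0.91)

# The same SEM is applied to every examinee, whatever the score level.
for score in (70, 100, 130):
    low, high = score - 1.96 * sem, score + 1.96 * sem
    print(f"score {score}: approx. 95% band {low:.1f} to {high:.1f}")
```

Every examinee receives a band of identical width, which is exactly the uniform-precision assumption that Hambleton, Swaminathan and Rogers describe as implausible.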