Krippendorff's alpha coefficient,[1] named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis.
Since the 1970s, alpha has been used in content analysis where textual units are categorized by trained readers, in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychological testing where alternative tests of the same phenomena need to be compared, or in observational studies where unstructured happenings are recorded for subsequent analysis.
Krippendorff's alpha generalizes several known statistics, often called measures of inter-coder agreement, inter-rater reliability, or reliability of coding given sets of units (as distinct from unitizing), but it also distinguishes itself from statistics that are called reliability coefficients yet are unsuitable to the particulars of coding data generated for subsequent analysis.
Krippendorff's alpha is applicable to any number of coders, each assigning one value to one unit of analysis, to incomplete (missing) data, to any number of values available for coding a variable, to binary, nominal, ordinal, interval, ratio, polar, and circular metrics (note that this is not a metric in the mathematical sense, but often the square of a mathematical metric, see levels of measurement), and it adjusts itself to small sample sizes of the reliability data.
The virtue of a single coefficient with these variations is that computed reliabilities are comparable across any numbers of coders, values, different metrics, and unequal sample sizes.[2][3][4][5][6][7][8][9]
Reliability data are generated in a situation in which m ≥ 2 jointly instructed (e.g., by a code book) but independently working coders assign any one of a set of values 1,...,V to a common set of N units of analysis.
In their canonical form, reliability data are tabulated in an m-by-N matrix containing N values vij that coder ci has assigned to unit uj.
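A minimal sketch of how such a canonical matrix might be represented in code, assuming Python and purely hypothetical values (None stands in for a missing assignment; all names are illustrative), and of how the number of values per unit and the total number of pairable values n could be counted:

```python
# Hypothetical canonical reliability data: 3 coders (rows) by 10 units (columns).
# None marks a value a coder did not assign (missing data).
reliability_data = [
    [1, 2, 3, 3, 2, 1, 4, 1, 2, None],   # coder 1
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],      # coder 2
    [None, 3, 3, 3, 2, 3, 4, 2, 2, 5],   # coder 3
]

N = len(reliability_data[0])
# Number of values actually assigned to each unit; only units with at least
# two values yield pairable values, and n is their total.
m_per_unit = [sum(row[u] is not None for row in reliability_data) for u in range(N)]
n = sum(m for m in m_per_unit if m >= 2)
print(m_per_unit, n)   # -> [2, 3, 3, 3, 3, 3, 3, 3, 3, 2] 28
```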
The responses of all observers for a given example are together called a unit (they form a multiset). In its general form, alpha is defined as

\alpha = 1 - \frac{D_o}{D_e},

where D_o is the disagreement observed among the values assigned to the units and D_e is the disagreement expected when the assignment of values is attributable to chance rather than to the properties of the units. With m_j \le m denoting the number of values actually assigned to unit j and n = \sum_j m_j the total number of pairable values (counting only units with m_j \ge 2), the observed disagreement can be rearranged and interpreted in a conceptual way as the weighted average of the disagreements of the individual units, weighted by the number of coders assigned to unit j:

D_o = \frac{1}{n} \sum_{j=1}^{N} m_j D_j, \qquad D_j = \frac{1}{m_j (m_j - 1)} \sum_{i \ne i'} \delta(v_{ij}, v_{i'j}),

where \delta(v, v') is a difference function between two values (see below) and the inner sum runs over the pairs of values actually assigned to unit j. This can be seen to be the average distance from the diagonal of all possible pairs of responses that could be derived from the multiset of all observations.
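To make the weighted-average reading concrete, the following sketch computes D_o directly from all value pairs within each unit; `delta` is a placeholder difference function and the data layout follows the hypothetical sketch above (both are assumptions, not part of the original definitions):

```python
def observed_disagreement(data, delta):
    """D_o as a weighted average of within-unit disagreements: each unit
    contributes the sum of delta over its ordered value pairs divided by
    (m_u - 1); the grand total is divided by the number of pairable values n."""
    total, n = 0.0, 0
    for u in range(len(data[0])):
        values = [row[u] for row in data if row[u] is not None]
        m = len(values)
        if m < 2:
            continue            # unpaired units carry no reliability information
        n += m
        pair_sum = sum(delta(a, b)
                       for i, a in enumerate(values)
                       for j, b in enumerate(values) if i != j)
        total += pair_sum / (m - 1)
    return total / n

nominal = lambda a, b: 0.0 if a == b else 1.0   # simplest difference function
# e.g. observed_disagreement([[1, 2, 2, None], [1, 3, 2, 1]], nominal)
```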
In this general form, disagreements D_o and D_e may be conceptually transparent but are computationally inefficient.
They can be simplified algebraically, especially when expressed in terms of the visually more instructive coincidence matrix representation of the reliability data.
A coincidence matrix cross tabulates the n pairable values from the canonical form of the reliability data into a v-by-v square matrix, where v is the number of values available in a variable.
The matrix of observed coincidences contains the frequencies

o_{ck} = \sum_{j=1}^{N} \frac{\sum_{i} \sum_{i' \ne i} I(v_{ij} = c)\, I(v_{i'j} = k)}{m_j - 1},

omitting unpaired values, where I(∘) = 1 if ∘ is true, and 0 otherwise.
Because a coincidence matrix tabulates all pairable values and its contents sum to the total n, when four or more coders are involved, o_{ck} may be fractions.
The matrix of expected coincidences contains the frequencies

e_{ck} = \frac{n_c\,(n_k - I(c = k))}{n - 1} =
\begin{cases}
\dfrac{n_c\, n_k}{n - 1} & \text{if } c \ne k,\\
\dfrac{n_c\,(n_c - 1)}{n - 1} & \text{if } c = k,
\end{cases}

where n_c = \sum_k o_{ck} and n = \sum_c n_c; these expected coincidences sum to the same n_c, n_k, and n as does o_{ck}.
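As an illustration, and again assuming the hypothetical data layout used above, the two coincidence matrices could be tabulated as follows; the dictionary representation and the function name `coincidence_matrices` are illustrative choices, not a prescribed implementation:

```python
from collections import defaultdict

def coincidence_matrices(data):
    """Return observed coincidences o[c, k], expected coincidences e[c, k],
    the marginals n_c, and the total n of pairable values."""
    o = defaultdict(float)
    for u in range(len(data[0])):
        values = [row[u] for row in data if row[u] is not None]
        m = len(values)
        if m < 2:
            continue                       # omit unpaired values
        for i, c in enumerate(values):     # all ordered pairs within the unit
            for j, k in enumerate(values):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)
    n_c = defaultdict(float)
    for (c, _k), freq in o.items():
        n_c[c] += freq                     # marginal frequency of value c
    n = sum(n_c.values())
    e = {(c, k): n_c[c] * (n_c[k] - (1.0 if c == k else 0.0)) / (n - 1)
         for c in n_c for k in n_c}
    return o, e, n_c, n
```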
In terms of these coincidences, Krippendorff's alpha becomes

\alpha = 1 - \frac{D_o}{D_e} = 1 - (n - 1)\,\frac{\sum_{c} \sum_{k > c} o_{ck}\, \delta(c, k)}{\sum_{c} \sum_{k > c} n_c\, n_k\, \delta(c, k)}.

Difference functions
Difference functions[11] \delta(v, v') between values v and v' reflect the metric properties (levels of measurement) of their variable.
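For illustration, the most commonly used difference functions for nominal, interval, and ratio data can be written as below; the ordinal, polar, and circular metrics additionally depend on the frequencies of the values and are left out of this sketch:

```python
# Nominal data: values either match or they do not.
def delta_nominal(c, k):
    return 0.0 if c == k else 1.0

# Interval data: squared numerical difference.
def delta_interval(c, k):
    return (c - k) ** 2

# Ratio data: squared difference relative to the magnitude of the values.
def delta_ratio(c, k):
    return ((c - k) / (c + k)) ** 2
```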
Alpha's distribution[12][13] gives rise to two indices: the confidence intervals of a computed alpha at various levels of statistical significance, and the probability that alpha falls short of a chosen minimum value required for data to be considered sufficiently reliable. The minimum acceptable alpha coefficient should be chosen according to the importance of the conclusions to be drawn from imperfect data.[14]
Let the canonical form of reliability data be a 3-coder-by-15-unit matrix with 45 cells. Suppose “*” indicates a default category like “cannot code,” “no answer,” or “lacking an observation.” Then, * provides no information about the reliability of data in the four values that matter.
Thus, these reliability data consist not of mN = 45 but of n = 26 pairable values, not in N = 15 but in 12 multiply coded units.
Because a coincidence matrix is symmetrical about its diagonal, only the entries in one of its off-diagonal triangles need to be tabulated when computing alpha.
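Putting the pieces together, here is a self-contained sketch of the whole computation for nominal data, using small hypothetical reliability data with missing values rather than the 3-coder-by-15-unit example above:

```python
from collections import defaultdict

def krippendorff_alpha(data, delta):
    """alpha = 1 - D_o/D_e, computed from the coincidence matrix as
    1 - (n - 1) * sum_ck o_ck*delta(c,k) / sum_ck n_c*n_k*delta(c,k)."""
    o = defaultdict(float)
    for u in range(len(data[0])):
        values = [row[u] for row in data if row[u] is not None]
        m = len(values)
        if m < 2:
            continue                       # unpaired values are omitted
        for i, c in enumerate(values):
            for j, k in enumerate(values):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)
    n_c = defaultdict(float)
    for (c, _k), freq in o.items():
        n_c[c] += freq
    n = sum(n_c.values())
    d_obs = sum(freq * delta(c, k) for (c, k), freq in o.items())
    d_exp = sum(n_c[c] * n_c[k] * delta(c, k) for c in n_c for k in n_c) / (n - 1)
    return 1.0 - d_obs / d_exp

nominal = lambda c, k: 0.0 if c == k else 1.0
data = [                                   # 3 hypothetical coders, 12 units
    [1, 2, 3, 3, 2, 1, 4, 1, 2, None, None, None],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5,    None, 3],
    [None, 3, 3, 3, 2, 3, 4, 2, 2, 5,  1,   None],
]
print(krippendorff_alpha(data, nominal))
```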
Krippendorff's alpha brings several known statistics under a common umbrella; each of them has its own limitations but no additional virtues.
It adjusts to varying sample sizes and affords comparisons across a wide variety of reliability data, mostly ignored by the familiar measures.
Semantically, reliability is the ability to rely on something, here on coded data for subsequent analysis.
When a sufficiently large number of coders agree perfectly on what they have read or observed, relying on their descriptions is a safe bet.
Judgments of this kind hinge on the number of coders duplicating the process and how representative the coded units are of the population of interest.
Problems of interpretation arise when agreement is less than perfect, especially when reliability is absent.
Naming a statistic as one of agreement, reproducibility, or reliability does not make it a valid index of whether one can rely on coded data in subsequent decisions.
Its mathematical structure must fit the process of coding units into a system of analyzable terms.