Scott's Pi

Automatically annotating text is a common task in natural language processing, and the goal is for the computer program being developed to agree with human annotators. Assessing the extent to which the humans agree with each other is therefore important for establishing a reasonable upper limit on the program's performance.

Like Cohen's kappa, Scott's pi corrects the observed agreement between two annotators for the agreement that would be expected by chance; however, the two statistics calculate the expected agreement slightly differently.
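Both statistics share the same overall form, where Pr(a) is the observed agreement and Pr(e) is the agreement expected by chance:

$$\pi = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$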

Scott's pi takes as its chance baseline annotators who are not only independent but also share the same distribution of responses; Cohen's kappa takes as its baseline annotators who are independent but each have their own, different distribution of responses.
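Concretely, writing $p_i$ for the pooled (joint) proportion of category $i$ across both annotators, and $p_{i,1}$ and $p_{i,2}$ for each annotator's individual proportions, the two chance terms are

$$\Pr(e)_{\pi} = \sum_i p_i^2, \qquad \Pr(e)_{\kappa} = \sum_i p_{i,1}\, p_{i,2}.$$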

Scott's pi is extended to more than two annotators by Fleiss' kappa.

To calculate observed agreement, divide the number of items on which the annotators agreed by the total number of items. To calculate expected agreement, pool both annotators' responses, take the proportion of annotations falling into each category (the joint proportions), then square and total these.
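A minimal sketch of this procedure in Python for two annotators; the function name scotts_pi and the use of plain lists of labels are illustrative choices for this example, not an established API:

```python
from collections import Counter

def scotts_pi(ann1, ann2):
    """Scott's pi for two annotators labelling the same set of items."""
    if len(ann1) != len(ann2) or not ann1:
        raise ValueError("Annotators must label the same, non-empty set of items.")
    n_items = len(ann1)

    # Observed agreement: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n_items

    # Joint proportions: pool both annotators' labels and take each
    # category's share of the 2 * n_items total annotations.
    pooled = Counter(ann1) + Counter(ann2)
    joint = {cat: count / (2 * n_items) for cat, count in pooled.items()}

    # Expected agreement: square and total the joint proportions.
    expected = sum(p ** 2 for p in joint.values())

    return (observed - expected) / (1 - expected)

# Example: two annotators labelling six items as "pos" or "neg".
a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(scotts_pi(a, b))
```

On this toy input the annotators agree on four of six items (Pr(a) ≈ 0.67) and the pooled proportions are 0.5 for each category, so Pr(e) = 0.5 and the function prints roughly 0.33.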