Categorical variable

Commonly (though not in this article), each of the possible values of a categorical variable is referred to as a level.

Regression analysis often treats category membership with one or more quantitative dummy variables.
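As a minimal sketch of dummy coding (the category names are purely illustrative, not from any particular dataset), one binary indicator can be constructed for each non-reference category:

```python
# Sketch: hand-rolled dummy (indicator) coding for a categorical variable.
# Category names are illustrative only.
nationalities = ["French", "Italian", "German", "Italian", "French"]

# One 0/1 dummy per non-reference category; "French" serves as the reference level.
levels = ["Italian", "German"]
dummies = [[1 if x == lvl else 0 for lvl in levels] for x in nationalities]

for x, row in zip(nationalities, dummies):
    print(x, row)
```

A reference level is left without a dummy of its own; otherwise the indicators would sum to a constant column and the design matrix would be rank-deficient.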

In general, however, the numbers are arbitrary, and have no significance beyond simply providing a convenient label for a particular value.

In other words, the values in a categorical variable exist on a nominal scale: they each represent a logically separate concept, cannot necessarily be meaningfully ordered, and cannot be otherwise manipulated as numbers could be.

As a result, the central tendency of a set of categorical data is given by its mode; neither the mean nor the median can be defined.
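A short sketch, with illustrative values, of computing the mode of nominal data using only the standard library:

```python
# The mode is the only well-defined measure of central tendency for
# nominal data; the color labels here are illustrative.
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]
mode, count = Counter(colors).most_common(1)[0]
print(mode, count)
```

Note that no arithmetic is performed on the values themselves; the mode only requires counting occurrences, which is why it survives on a nominal scale.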

This ignores the concept of alphabetical order, which is a property that is not inherent in the names themselves, but in the way we construct the labels.

For example, if we write the names in Cyrillic and consider the Cyrillic ordering of letters, we might get a different result when evaluating "Smith < Johnson" than if we write the names in the standard Latin alphabet; and if we write the names in Chinese characters, we cannot meaningfully evaluate "Smith < Johnson" at all, because no consistent ordering is defined for such characters.

However, if we do consider the names as written, e.g., in the Latin alphabet, and define an ordering corresponding to standard alphabetical order, then we have effectively converted them into ordinal variables defined on an ordinal scale.
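This conversion can be sketched in Python, whose string comparison uses Unicode code points and therefore coincides with standard alphabetical order for Latin letters (the surnames are the ones used as examples above):

```python
# Comparing labels as written in the Latin alphabet imposes an ordering
# on them, effectively treating the nominal labels as ordinal.
print("Smith" < "Johnson")  # False: 'S' sorts after 'J'

names = ["Smith", "Johnson", "Adams"]
print(sorted(names))
```

The ordering lives in the encoding of the labels, not in the categories themselves, which is exactly the point made above.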

Such multiple-category categorical variables are often analyzed using a multinomial distribution, which counts the frequency of each possible combination of numbers of occurrences of the various categories.
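A minimal sketch of the multinomial probability mass function, written with only the standard library; the counts and probabilities are illustrative:

```python
# Multinomial pmf: n! / (x1! ... xk!) * p1^x1 * ... * pk^xk,
# giving the probability of a particular combination of category counts.
from math import factorial, prod

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # exact integer division at each step
    return coef * prod(p ** c for p, c in zip(probs, counts))

# Probability of observing 2 "red", 1 "blue", 1 "green" in 4 draws,
# with hypothetical category probabilities 0.5, 0.3, 0.2:
p = multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2])
print(p)
```

Here the multinomial coefficient counts the orderings that yield the same combination of category counts, which is what "each possible combination of numbers of occurrences" refers to.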

It is also possible to consider categorical variables where the number of categories is not fixed in advance.

Standard statistical models, such as those involving the categorical distribution and multinomial logistic regression, assume that the number of categories is known in advance, and changing the number of categories on the fly is tricky.

The regression equation takes the form Y = bX + a, where b is the slope and gives the weight empirically assigned to an explanator, X is the explanatory variable, and a is the Y-intercept; these values take on different meanings based on the coding system used.

However, one chooses a coding system based on the comparison of interest since the interpretation of b values will vary.

To illustrate this, suppose that we are measuring optimism among several nationalities and we have decided that French people would serve as a useful control.
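A sketch of such a dummy-coded regression, with French as the reference (control) group; the optimism scores and the other two nationalities are hypothetical, chosen only to make the interpretation of the coefficients visible:

```python
# Sketch: dummy-coded regression with "French" as the reference group.
# Optimism scores are hypothetical, purely for illustration.
import numpy as np

scores = {
    "French":  [4.0, 5.0, 6.0],
    "Italian": [6.0, 7.0, 8.0],
    "German":  [5.0, 5.0, 8.0],
}

# Design matrix: intercept column + one dummy per non-reference group.
rows, y = [], []
for group, vals in scores.items():
    for v in vals:
        rows.append([1.0, float(group == "Italian"), float(group == "German")])
        y.append(v)

X = np.array(rows)
b, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)

# Under dummy coding, the intercept is the mean of the reference
# (French) group, and each slope is that group's mean minus it.
print(b)
```

With this coding, each b value directly answers "how does this group differ from the French control?", which is why the choice of reference group is driven by the comparison of interest.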

Unweighted effects coding is most appropriate in situations where differences in sample size are the result of incidental factors.

Unlike in ANOVA, where the researcher may at their discretion choose either orthogonal or non-orthogonal coefficient values, in regression it is essential that the coefficient values assigned in contrast coding be orthogonal.
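Orthogonality of two contrast codes means that the sum of the products of their coefficients is zero; a minimal check, with illustrative coefficients for three groups, might look like:

```python
# Sketch: verifying that two contrast codes are orthogonal.
# Coefficients are illustrative, for three groups
# (e.g. French, Italian, German).
c1 = [2, -1, -1]   # French vs. the average of Italian and German
c2 = [0, 1, -1]    # Italian vs. German

dot = sum(a * b for a, b in zip(c1, c2))
print(dot)  # 0 indicates the contrasts are orthogonal
```

Each code also sums to zero across groups, so the two contrasts capture non-overlapping comparisons.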

The construction of contrast codes is restricted by three rules. Violating rule 2 still produces accurate R2 and F values, indicating that we would reach the same conclusions about whether or not there is a significant difference; however, we can no longer interpret the b values as mean differences.

This is illustrated through assigning the same coefficient to the French and Italian categories and a different one to the German category.

Although it produces correct mean values for the variables, the use of nonsense coding is not recommended as it will lead to uninterpretable statistical results.

In order to probe this type of interaction, one would code using the system that addresses the researcher's hypothesis most appropriately.

We cannot simply choose values to probe the interaction as we would in the continuous variable case, because of the nominal nature of the data (i.e., in the continuous case, one could analyze the data at high, moderate, and low levels by taking one standard deviation above the mean, the mean itself, and one standard deviation below the mean, respectively).