Multi-label classification

In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes an instance can be assigned to.

The formulation of multi-label learning was first introduced by Shen et al. in the context of Semantic Scene Classification,[1][2] and later gained popularity across various areas of machine learning.

Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y; that is, it assigns a value of 0 or 1 for each element (label) in y.
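As a concrete illustration of the binary-vector encoding (the label names here are invented for the example), a news article tagged with "politics" and "economy" out of a fixed label set of four topics would be encoded as follows:

```python
# Hypothetical label set, invented for this example; any fixed ordering works.
LABELS = ["politics", "economy", "sports", "science"]

def encode(tags):
    """Encode a set of tags as a binary indicator vector over LABELS."""
    return [1 if label in tags else 0 for label in LABELS]

y = encode({"politics", "economy"})
print(y)  # [1, 1, 0, 0]
```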

Several problem transformation methods exist for multi-label classification. The baseline approach, called the binary relevance method,[3] amounts to independently training one binary classifier for each label.
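A minimal sketch of binary relevance in plain Python, using a deliberately trivial base learner (a majority-vote stump, invented here so the wrapper logic stays visible); in practice any binary classifier can be plugged in:

```python
class MajorityStump:
    """Trivial base learner: always predicts the majority class seen in training."""
    def fit(self, X, y):
        self.prediction = 1 if sum(y) * 2 >= len(y) else 0
        return self

    def predict(self, X):
        return [self.prediction for _ in X]

class BinaryRelevance:
    """Train one independent binary classifier per label."""
    def __init__(self, base_factory):
        self.base_factory = base_factory

    def fit(self, X, Y):
        n_labels = len(Y[0])
        # One classifier per label column, trained without regard to other labels.
        self.models = [self.base_factory().fit(X, [row[j] for row in Y])
                       for j in range(n_labels)]
        return self

    def predict(self, X):
        columns = [model.predict(X) for model in self.models]
        return [list(row) for row in zip(*columns)]

X = [[0], [1], [2], [3]]
Y = [[1, 0], [1, 0], [1, 1], [0, 1]]
clf = BinaryRelevance(MajorityStump).fit(X, Y)
print(clf.predict([[5]]))  # [[1, 1]]
```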

Although this method of dividing the task into multiple binary tasks may superficially resemble the one-vs.-all (OvA) and one-vs.-rest (OvR) methods for multiclass classification, it is essentially different from both: under binary relevance, each classifier deals with a single label, without any regard to the other labels.

The classifier chains method differs from binary relevance in that labels are predicted sequentially, and the output of all previous classifiers (i.e. positive or negative for a particular label) is fed as additional input features to subsequent classifiers.
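The chaining idea can be sketched in plain Python. A 1-nearest-neighbour base learner is used here only because it is self-contained; each link in the chain sees the original features plus the predictions of all earlier links:

```python
class OneNN:
    """Minimal 1-nearest-neighbour binary classifier (illustrative base learner)."""
    def fit(self, X, y):
        self.X, self.y = [list(x) for x in X], list(y)
        return self

    def predict(self, X):
        def nearest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(x) for x in X]

class ClassifierChain:
    """Predict labels sequentially; earlier outputs become extra input features."""
    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.models = []
        extended = [list(x) for x in X]
        for j in range(n_labels):
            target = [row[j] for row in Y]
            self.models.append(OneNN().fit(extended, target))
            # During training, the *true* previous labels extend the features.
            extended = [x + [row[j]] for x, row in zip(extended, Y)]
        return self

    def predict(self, X):
        extended = [list(x) for x in X]
        for model in self.models:
            preds = model.predict(extended)
            # At prediction time, the *predicted* previous labels are appended.
            extended = [x + [p] for x, p in zip(extended, preds)]
        return [x[len(X[0]):] for x in extended]

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
chain = ClassifierChain().fit(X, Y)
print(chain.predict(X))  # reproduces Y on the training points
```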

Classifier chains have been applied, for instance, in HIV drug resistance prediction.[6]

When the problem is transformed into multiple binary classifications, the likelihood function factorizes over the labels and reads

L(θ | x, y) = ∏_{j=1}^{k} p_j(x)^{y_j} (1 − p_j(x))^{1 − y_j},

where p_j(x) is the probability that the j-th binary classifier assigns to label j being present.
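The factorized likelihood of independent binary classifiers can be computed directly; the per-label probabilities below are made-up numbers purely for illustration:

```python
def binary_relevance_likelihood(probs, y):
    """Likelihood of label vector y under independent per-label probabilities.

    probs[j] is the probability the j-th classifier assigns to label j
    being present; each label contributes an independent Bernoulli factor.
    """
    likelihood = 1.0
    for p, yj in zip(probs, y):
        likelihood *= p if yj == 1 else (1.0 - p)
    return likelihood

# Three labels with assumed predicted probabilities 0.9, 0.2, 0.5:
print(binary_relevance_likelihood([0.9, 0.2, 0.5], [1, 0, 1]))  # 0.9 * 0.8 * 0.5 ≈ 0.36
```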

The predictions of the individual classifiers are then combined by an ensemble method, usually a voting scheme in which every class that receives a requisite percentage of votes from the individual classifiers (often referred to as the discrimination threshold[8]) is predicted as a present label in the multi-label output.

Another variation is the random k-labelsets (RAKEL) algorithm, which uses multiple LP classifiers, each trained on a random subset of the actual labels; label prediction is then carried out by a voting scheme.
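The RAKEL voting step can be sketched on its own. Here, `subset_predictions` is assumed to pair each random k-labelset (indices into the full label set) with its LP classifier's predicted presence/absence for those labels; the subsets below are chosen by hand for the example:

```python
def rakel_vote(subset_predictions, n_labels, threshold=0.5):
    """Aggregate votes from LP classifiers trained on random label subsets.

    subset_predictions: list of (subset, prediction) pairs, where subset is a
    tuple of label indices and prediction gives 0/1 for each index in subset.
    """
    votes = [0] * n_labels
    counts = [0] * n_labels
    for subset, prediction in subset_predictions:
        for label, vote in zip(subset, prediction):
            counts[label] += 1
            votes[label] += vote
    # A label is predicted present if its vote share reaches the threshold.
    return [1 if counts[j] and votes[j] / counts[j] >= threshold else 0
            for j in range(n_labels)]

# Four labels; three classifiers trained on 2-labelsets:
preds = [((0, 1), (1, 0)), ((1, 2), (0, 1)), ((0, 3), (1, 1))]
print(rakel_vote(preds, n_labels=4))  # [1, 0, 1, 1]
```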

Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations.

Online learning algorithms, by contrast, build their models incrementally over sequential iterations.

In iteration t, an online algorithm receives a sample x_t and predicts its label(s) ŷ_t using the current model; the algorithm then receives y_t, the true label(s) of x_t, and updates its model based on the sample-label pair (x_t, y_t).
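This protocol can be sketched with one perceptron per label (a simple choice for illustration, not a specific published algorithm); each iteration predicts from the current weights, then learns from the revealed true labels:

```python
class OnlineMultiLabelPerceptron:
    """One perceptron per label, updated sample by sample."""
    def __init__(self, n_features, n_labels, lr=1.0):
        self.w = [[0.0] * n_features for _ in range(n_labels)]
        self.b = [0.0] * n_labels
        self.lr = lr

    def predict(self, x):
        return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
                for w, b in zip(self.w, self.b)]

    def update(self, x, y):
        y_hat = self.predict(x)          # predict with the current model ...
        for j, (p, t) in enumerate(zip(y_hat, y)):
            if p != t:                   # ... then learn from the true labels
                sign = 1 if t == 1 else -1
                self.w[j] = [wi + self.lr * sign * xi
                             for wi, xi in zip(self.w[j], x)]
                self.b[j] += self.lr * sign
        return y_hat

# Toy stream: label 0 fires when the feature is positive, label 1 when negative.
stream = [([2.0], [1, 0]), ([-2.0], [0, 1])] * 5
model = OnlineMultiLabelPerceptron(n_features=1, n_labels=2)
for x_t, y_t in stream:
    model.update(x_t, y_t)
print(model.predict([3.0]), model.predict([-3.0]))  # [1, 0] [0, 1]
```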

The difficulties of multi-label classification (an exponential number of possible label sets, capturing dependencies between labels) are combined with the difficulties of data streams (time and memory constraints, addressing an infinite stream with finite means, concept drift).

Given the set of labels of a data sample (not to be confused with a one-hot vector; it is simply the collection of all labels that belong to this sample), the extent to which a dataset is multi-label can be captured in two statistics: label cardinality, the average number of labels per example, and label density, the label cardinality divided by the total number of labels.

Evaluation metrics for multi-label classification performance are necessarily different from those used in multi-class (or binary) classification, owing to the differences in the underlying classification problem.
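Both statistics are straightforward to compute; the tiny label matrix below is invented for the example:

```python
def label_cardinality(Y):
    """Average number of labels per example."""
    return sum(sum(row) for row in Y) / len(Y)

def label_density(Y):
    """Label cardinality normalised by the size of the label set."""
    return label_cardinality(Y) / len(Y[0])

Y = [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 1, 0]]
print(label_cardinality(Y))  # (2 + 1 + 3) / 3 = 2.0
print(label_density(Y))      # 2.0 / 4 = 0.5
```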

Java implementations of multi-label algorithms are available in the Mulan and Meka software packages, both based on Weka.[26]

The scikit-learn Python package implements some multi-label algorithms and metrics.
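A minimal usage sketch, assuming scikit-learn is installed (the toy data is invented; `MultiOutputClassifier` fits one independent estimator per label, i.e. the binary relevance strategy):

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss

# Toy multi-label data, repeated to give the learner a few samples per pattern.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]] * 5
Y = [[1, 0], [0, 1], [1, 1], [0, 0]] * 5

# One LogisticRegression per label column (binary relevance).
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)
print(pred.shape)             # (20, 2): one row per sample, one column per label
print(hamming_loss(Y, pred))  # fraction of individual label assignments that are wrong
```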

The scikit-multilearn Python package caters specifically to multi-label classification.

It provides multi-label implementations of several well-known techniques, including SVM and kNN.

The binary relevance method, classifier chains, and other multi-label algorithms with many different base learners are implemented in the R package mlr.[27] A list of commonly used multi-label datasets is available at the Mulan website.