Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction.
For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours.
Other examples where CRFs are used are: labeling or parsing of sequential data for natural language processing or biological sequences,[1] part-of-speech tagging, shallow parsing,[2] named entity recognition,[3] gene finding, peptide critical functional region finding,[4] and object recognition[5] and image segmentation in computer vision.[6]
CRFs are a type of discriminative undirected probabilistic graphical model.
Formally, a CRF on observations X and random variables Y can be defined as follows. Let G = (V, E) be a graph such that Y = (Y_v)_{v∈V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when each random variable Y_v, conditioned on X, obeys the Markov property with respect to the graph; that is, its probability is dependent only on its neighbours in G:

P(Y_v | X, {Y_w : w ≠ v}) = P(Y_v | X, {Y_w : w ∼ v}),

where w ∼ v means that w and v are neighbours in G.
What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets X and Y, the observed and output variables, respectively; the conditional distribution P(Y | X) is then modeled.
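Written out explicitly (a standard clique factorization for strictly positive distributions; the potential notation ψ_C is introduced here only for illustration and does not appear elsewhere in this article), this conditional distribution takes the form

$$
P(Y \mid X) = \frac{1}{Z(X)} \prod_{C \in \mathcal{C}(G)} \psi_C(Y_C, X),
\qquad
Z(X) = \sum_{Y'} \prod_{C \in \mathcal{C}(G)} \psi_C(Y'_C, X),
$$

where $\mathcal{C}(G)$ denotes the cliques of $G$, each $\psi_C$ is a non-negative potential function, and $Z(X)$ is the observation-dependent normalization constant (partition function).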
For general graphs, the problem of exact inference in CRFs is intractable.
The inference problem for a CRF is essentially the same as for a Markov random field (MRF), and the same arguments hold.
In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables X represents a sequence of observations, and Y represents a hidden (or unknown) state variable that needs to be inferred given the observations. The Y_i are structured to form a chain, with an edge between each Y_{i−1} and Y_i. As well as having a simple interpretation of the Y_i as "labels" for each element in the input sequence, this layout admits efficient algorithms for model training (learning the conditional distributions between the Y_i and feature functions from some corpus of training data), decoding (determining the probability of a given label sequence Y given X), and inference (determining the most likely label sequence Y given X).
The conditional dependency of each Y_i on X is defined through a fixed set of feature functions of the form f(i, Y_{i−1}, Y_i, X), which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for Y_i. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for Y_i.
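As a concrete illustration, here is a minimal, hedged Python sketch of this scoring scheme. The names (score, log_partition, feature_fns, weights) are invented for the example, and the partition function is computed by brute-force enumeration purely for clarity; a practical implementation would use the forward algorithm instead.

```python
import itertools
import math

START = -1  # dummy label for the position before the sequence

def score(y, x, feature_fns, weights):
    """Unnormalised score of label sequence y for input x:
    sum over positions i and features k of weights[k] * f_k(i, y_{i-1}, y_i, x)."""
    total, prev = 0.0, START
    for i, y_i in enumerate(y):
        total += sum(w * f(i, prev, y_i, x) for f, w in zip(feature_fns, weights))
        prev = y_i
    return total

def log_partition(x, n, labels, feature_fns, weights):
    """log Z(x), summed by brute force over all |labels|**n label sequences.
    (Exponential cost; a real linear-chain CRF uses the forward algorithm.)"""
    return math.log(sum(math.exp(score(y, x, feature_fns, weights))
                        for y in itertools.product(labels, repeat=n)))

def log_prob(y, x, labels, feature_fns, weights):
    """log P(y | x) = score(y, x) - log Z(x)."""
    return score(y, x, feature_fns, weights) - log_partition(x, len(y), labels, feature_fns, weights)

# Toy usage with two labels, a "label persistence" feature and an
# "observation matches label" feature (weights chosen arbitrarily).
labels = (0, 1)
feature_fns = [
    lambda i, y_prev, y_cur, x: 1.0 if y_prev == y_cur else 0.0,
    lambda i, y_prev, y_cur, x: 1.0 if x[i] == y_cur else 0.0,
]
weights = [0.5, 2.0]
x = [0, 0, 1]
print(math.exp(log_prob([0, 0, 1], x, labels, feature_fns, weights)))  # P(y | x) for one labelling
```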
Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions.
An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions.
Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence.
Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence X at any point during inference, and the range of the feature functions need not have a probabilistic interpretation.
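To make this correspondence concrete, the following hedged sketch (reusing the f(i, y_prev, y_cur, x) feature convention from the earlier example, with made-up probability tables) encodes HMM transition and emission probabilities as indicator features with constant log-probability weights, alongside one feature that inspects the whole input sequence, which an HMM cannot express.

```python
import math

# Hypothetical HMM parameters (illustrative numbers only).
trans = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7}         # P(y_i | y_{i-1})
emit = {(0, 'a'): 0.9, (0, 'b'): 0.1, (1, 'a'): 0.2, (1, 'b'): 0.8}  # P(x_i | y_i)

# An HMM viewed as a CRF: one indicator feature per transition/emission entry,
# each paired with a *constant* weight equal to the corresponding log-probability.
# (The initial-state distribution is omitted for brevity.)
hmm_features = (
    [(lambda i, y_prev, y_cur, x, a=a, b=b: 1.0 if (y_prev, y_cur) == (a, b) else 0.0,
      math.log(p))
     for (a, b), p in trans.items()]
    + [(lambda i, y_prev, y_cur, x, a=a, s=s: 1.0 if (y_cur, x[i]) == (a, s) else 0.0,
        math.log(p))
       for (a, s), p in emit.items()]
)

# A feature only a CRF can use: it looks at the *entire* input sequence,
# not just the observation at position i.
global_feature = lambda i, y_prev, y_cur, x: (
    1.0 if y_cur == 1 and x.count('b') > len(x) / 2 else 0.0
)
```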
CRFs can be extended into higher order models by making each Y_i dependent on a fixed number k of previous variables Y_{i−k}, ..., Y_{i−1}. In conventional formulations of higher order CRFs, training and inference are only practical for small values of k (such as k ≤ 5), since their computational cost increases exponentially with k.
However, a more recent advance has managed to ameliorate these issues by leveraging concepts and tools from the field of Bayesian nonparametrics.
Specifically, the CRF-infinity approach[9] constitutes a CRF-type model that is capable of learning infinitely-long temporal dynamics in a scalable fashion.
This is effected by introducing a novel potential function for CRFs that is based on the Sequence Memoizer (SM), a nonparametric Bayesian model for learning infinitely-long dynamics in sequential observations.[10] To render such a model computationally tractable, CRF-infinity employs a mean-field approximation[11] of the postulated novel potential functions (which are driven by an SM).
This allows for devising efficient approximate training and inference algorithms for the model, without undermining its capability to capture and model temporal dependencies of arbitrary length.
There exists another generalization of CRFs, the semi-Markov conditional random field (semi-CRF), which models variable-length segmentations of the label sequence Y.[12] This provides much of the power of higher-order CRFs to model long-range dependencies of the Y_i, at a reasonable computational cost.
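For intuition, a semi-CRF scores entire variable-length segments rather than single positions; the following hedged toy snippet (invented labels and spans) shows how one candidate segmentation of a sequence might be represented.

```python
# One candidate segmentation of a 7-token input: each segment is
# (label, start, end) with `end` exclusive, and segment lengths may vary.
segmentation = [("PER", 0, 2), ("O", 2, 5), ("LOC", 5, 7)]

# A semi-CRF feature can inspect a whole segment at once, e.g. all the
# tokens x[start:end] it covers, rather than a single position.
def segment_length_feature(label, start, end, x):
    return float(end - start) if label != "O" else 0.0
```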
Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent variable models (DPLVM) are a type of CRF for sequence tagging tasks.
In an LDCRF, as in any sequence tagging task, given a sequence of observations x = x_1, ..., x_n, the main problem the model must solve is how to assign a sequence of labels y = y_1, ..., y_n from one finite set of labels Y. Instead of directly modeling P(y|x) as an ordinary linear-chain CRF would do, a set of latent variables h is "inserted" between x and y using the chain rule of probability:[13]

P(y|x) = Σ_h P(y|h, x) P(h|x)

This allows capturing latent structure between the observations and labels.[13]
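As a toy numeric illustration of this marginalization (made-up probability tables over a three-value latent variable at a single position; in an actual LDCRF, P(h|x) is itself a chain-structured CRF and each label is associated with its own set of hidden states):

```python
# Toy marginalisation P(y|x) = sum_h P(y|h, x) * P(h|x) at a single position.
# The probability tables below are invented numbers purely for illustration.
p_h_given_x = {'h1': 0.5, 'h2': 0.3, 'h3': 0.2}            # P(h | x)
p_y_given_hx = {                                            # P(y | h, x)
    ('A', 'h1'): 0.9, ('B', 'h1'): 0.1,
    ('A', 'h2'): 0.4, ('B', 'h2'): 0.6,
    ('A', 'h3'): 0.2, ('B', 'h3'): 0.8,
}

p_y_given_x = {
    y: sum(p_y_given_hx[(y, h)] * p_h_given_x[h] for h in p_h_given_x)
    for y in ('A', 'B')
}
print(p_y_given_x)  # approximately {'A': 0.61, 'B': 0.39}
```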
These models find applications in computer vision, specifically gesture recognition from video streams[14] and shallow parsing.