[1] Collective classification problems are defined in terms of networks of random variables, where the network structure determines the relationship between the random variables.
Approaches that use collective classification can make use of relational information when performing inference.
Examples of collective classification include predicting attributes (e.g., gender, age, political affiliation) of individuals in a social network, classifying webpages in the World Wide Web, and inferring the research area of a paper in a scientific publication dataset.
Traditionally, a major focus of machine learning has been to solve classification problems. Many machine learning models approach this task by categorizing each item independently, predicting each class label in isolation.
Many researchers have proposed techniques that attempt to classify samples in a joint or collective manner, instead of treating each sample in isolation; these techniques have enabled significant gains in classification accuracy.
An approach which performs collective classification might assume that users who are friends tend to have similar political views, and could then jointly infer all unobserved party affiliations while making use of the rich relational structure of the social network.
In this setting, the nodes of the network are partitioned into two sets: the set of nodes for which we know the correct label values (observed variables), and the set of nodes whose labels must be inferred (target variables).
In such settings, traditional classification algorithms assume that the data are drawn independently and identically (i.i.d.) from some distribution. Instead, there are three distinct types of correlations that can be utilized to determine the classification or label of a node: correlations with the node's own observed attributes, correlations with the observed attributes and labels of its neighbors, and correlations with the unobserved labels of its neighbors.
Though convergence and optimality are not always mathematically guaranteed, in practice these approaches typically converge quickly to a good solution, depending on the graph structure and problem complexity.
The methods presented in this section are representative of this iterative approach.
A natural assumption in network classification is that adjacent nodes are likely to have the same label (i.e., contagion or homophily).
[4] While label propagation is surprisingly effective, it may sometimes fail to capture complex relational dynamics.
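As a concrete illustration, the following is a minimal sketch of label propagation under the homophily assumption; the dense adjacency-matrix representation, the row-normalized update, and the clamping of observed labels after each step are simplifying assumptions for exposition rather than a canonical implementation:

```python
import numpy as np

def label_propagation(adj, labels, n_iter=100):
    """Propagate known labels through a graph.

    adj    : (n, n) symmetric adjacency matrix
    labels : length-n integer array; -1 marks unlabeled nodes
    """
    n_classes = labels.max() + 1
    # Label distributions: observed nodes are one-hot, the rest uniform.
    f = np.full((len(labels), n_classes), 1.0 / n_classes)
    observed = labels >= 0
    f[observed] = np.eye(n_classes)[labels[observed]]

    # Row-normalized transition matrix: each node averages its neighbors.
    t = adj / adj.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        f = t @ f
        # Clamp observed nodes back to their known labels.
        f[observed] = np.eye(n_classes)[labels[observed]]
    return f.argmax(axis=1)
```

On a path graph 0–1–2–3 with the endpoints labeled 0 and 1, the two middle nodes are pulled toward their nearer labeled endpoint, which is exactly the homophily behavior described above.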
Iterative classification uses a local classifier for each node, which takes as input information about the current predictions and ground-truth labels of the node's neighbors, and iterates until the local predictions converge to a global solution.
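A minimal sketch of this iterative scheme is given below. The voting-based local classifier and the per-node attribute scores (standing in for a trained classifier over node features) are illustrative assumptions, not a fixed part of the method:

```python
import numpy as np

def iterative_classification(adj, node_scores, labels, n_iter=10):
    """Iterative classification sketch (hypothetical local classifier).

    adj         : (n, n) adjacency matrix
    node_scores : (n, k) per-node class scores from node attributes alone,
                  standing in for a trained local classifier
    labels      : length-n integer array; -1 marks nodes to be inferred
    """
    mask = labels < 0
    pred = labels.copy()
    # Bootstrap: classify unobserved nodes from their attributes alone.
    pred[mask] = node_scores[mask].argmax(axis=1)
    k = node_scores.shape[1]
    for _ in range(n_iter):
        new = pred.copy()
        for v in np.flatnonzero(mask):
            # Local classifier: attribute scores plus neighbor-label votes.
            votes = np.bincount(pred[adj[v] > 0], minlength=k)
            new[v] = (node_scores[v] + votes).argmax()
        if np.array_equal(new, pred):  # local predictions have converged
            break
        pred = new
    return pred
```

Each pass re-classifies every unobserved node using its neighbors' current predictions, so information flows outward from the observed nodes over successive iterations.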
[5][6][7] Another approach to collective classification is to represent the problem with a graphical model, and to use learning and inference techniques for graphical models to arrive at the correct classifications.
Graphical models are tools for joint, probabilistic inference, making them ideal for collective classification.
Graphical models can be broadly categorized by whether the underlying graph is directed (e.g., Bayesian networks or collections of local classifiers) or undirected (e.g., Markov random fields (MRF)).
Gibbs sampling is a general framework for approximating a distribution.
It is a Markov chain Monte Carlo algorithm, in that it iteratively samples from the current estimate of the distribution, constructing a Markov chain that converges to the target (stationary) distribution.
In the collective classification setting, the basic idea is to repeatedly sample a label for each unobserved node conditioned on the current labels of its neighbors, and to maintain count statistics for the number of times each label was sampled at each node. After collecting a predefined number of such samples, we output the best label assignment for each node by choosing the label that was assigned to it the maximum number of times.
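The sampling-and-counting procedure can be sketched as follows. The conditional distribution used here, which weights each candidate label by the exponential of the number of agreeing neighbors, is an assumed homophily potential chosen for illustration; any local conditional could be substituted:

```python
import math
import random
from collections import Counter

def gibbs_collective(neighbors, labels, n_classes,
                     n_samples=500, burn_in=100, seed=0):
    """Gibbs-sampling sketch for collective classification.

    neighbors : dict node -> list of neighboring nodes
    labels    : dict node -> class id, for observed nodes only
    """
    rng = random.Random(seed)
    unobserved = [v for v in neighbors if v not in labels]
    state = dict(labels)
    for v in unobserved:                 # random initialization
        state[v] = rng.randrange(n_classes)
    counts = {v: Counter() for v in unobserved}

    for step in range(burn_in + n_samples):
        for v in unobserved:
            # Unnormalized conditional: favor agreement with neighbors
            # (assumed homophily potential, for illustration only).
            weights = [math.exp(sum(state[u] == c for u in neighbors[v]))
                       for c in range(n_classes)]
            state[v] = rng.choices(range(n_classes), weights=weights)[0]
        if step >= burn_in:              # maintain count statistics
            for v in unobserved:
                counts[v][state[v]] += 1

    # Output the label sampled most often for each unobserved node.
    return {v: counts[v].most_common(1)[0][0] for v in unobserved}
```

Discarding the initial burn-in samples before counting is standard practice, since early samples still reflect the random initialization rather than the stationary distribution.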
[8][9] For certain undirected graphical models, it is possible to efficiently perform exact inference via message passing, or belief propagation, algorithms.
[10] These algorithms follow a simple iterative pattern: each variable passes its "beliefs" about its neighbors' marginal distributions, then uses the incoming messages about its own value to update its beliefs.
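This message-passing pattern can be sketched for a pairwise Markov random field as follows. The shared edge potential across all edges and the simultaneous ("flooding") update schedule are simplifying assumptions; on a tree-structured graph the messages stabilize after a number of iterations equal to the graph's diameter, giving exact marginals:

```python
import numpy as np

def belief_propagation(edges, unary, pairwise, n_iter=10):
    """Sum-product message passing for a pairwise MRF (exact on trees).

    edges    : list of (u, v) node pairs
    unary    : dict node -> length-k array of local evidence
    pairwise : (k, k) edge potential shared by all edges (an assumption)
    """
    k = len(pairwise)
    # One message per directed edge, initialized uniform.
    msgs = {(u, v): np.ones(k) / k
            for a, b in edges for (u, v) in [(a, b), (b, a)]}
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, []).append(b)
        nbrs.setdefault(b, []).append(a)

    for _ in range(n_iter):
        new = {}
        for (u, v) in msgs:
            # Evidence at u times messages into u from all neighbors but v.
            incoming = unary[u] * np.prod(
                [msgs[(w, u)] for w in nbrs[u] if w != v], axis=0)
            m = pairwise.T @ incoming      # marginalize over u's states
            new[(u, v)] = m / m.sum()
        msgs = new

    # Beliefs: evidence times all incoming messages, normalized.
    beliefs = {}
    for v in nbrs:
        b = unary[v] * np.prod([msgs[(w, v)] for w in nbrs[v]], axis=0)
        beliefs[v] = b / b.sum()
    return beliefs
```

On graphs with cycles the same updates give loopy belief propagation, which is approximate but often effective in practice, as noted above for iterative approaches in general.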
Statistical relational learning (SRL) is often used to address collective classification problems. A variety of SRL methods have been applied to the collective classification setting. Some of the methods include direct methods such as probabilistic relational models (PRM),[11] coupled conditional models such as link-based classification,[12] and indirect methods such as Markov logic networks (MLN)[13] and Probabilistic Soft Logic (PSL).
[14] Collective classification is applied in many domains which exhibit relational structure, such as: