The task becomes significantly complicated by factors such as background clutter, occlusion, and variations in viewpoint, illumination, and scale.
At the same time, however, particular parts of an object, such as the headlights or tires of a car, still have consistent appearances and relative positions.
Model parameters are estimated using an unsupervised learning algorithm, meaning that the visual concept of an object class can be extracted from an unlabeled set of training images, even if that set contains "junk" images or instances of objects from multiple categories.
It can also account for the absence of model parts due to appearance variability, occlusion, clutter, or detector error.
The idea for a "parts and structure" model was originally introduced by Fischler and Elschlager in 1973.
The Constellation Model, as introduced by Perona and colleagues, was a probabilistic adaptation of this approach.
In the late '90s, Burl et al.[2][3][4][5] revisited the Fischler and Elschlager model for the purpose of face recognition.
In their work, Burl et al. used manual selection of constellation parts in training images to construct a statistical model for a set of detectors and the relative locations at which they should be applied.
In 2000, Weber et al.[6][7][8][9] took the significant step of training the model through an unsupervised learning process, which eliminated the need for tedious hand-labeling of parts.
Their algorithm was particularly remarkable because it performed well even on cluttered and occluded image data.
In the first step, interest points are detected in the training images. Image features generated from the vicinity of these points are then clustered using k-means or another appropriate algorithm.
In this process of vector quantization, one can think of the centroids of these clusters as being representative of the appearance of distinctive object parts.
Appropriate feature detectors are then trained using these clusters, which can be used to obtain a set of candidate parts from images.
Each part has a type, corresponding to one of the aforementioned appearance clusters, as well as a location in the image space.
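A minimal sketch of this vector-quantization step, using scikit-learn's k-means in place of whatever clustering implementation Weber et al. actually used; the function and parameter names are illustrative:

```python
from sklearn.cluster import KMeans

def build_part_vocabulary(patch_descriptors, n_parts=100, seed=0):
    """Vector-quantize patch descriptors into candidate part types.

    patch_descriptors: (N, D) array, one descriptor per image patch
    sampled around a detected interest point.
    Each cluster centroid serves as the canonical appearance of one
    candidate part type.
    """
    km = KMeans(n_clusters=n_parts, n_init=10, random_state=seed)
    km.fit(patch_descriptors)
    return km

# Assign a new patch to its part type (nearest centroid):
# part_type = vocab.predict(descriptor.reshape(1, -1))[0]
```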
To parametrize the joint probability density $p(X^o, x^m, h \mid \Theta)$, Weber & Welling introduce the auxiliary variables $b$ and $n$, where $b$ is a binary vector encoding the presence or absence of each part in the detection (with $b_i = 1$ if $h_i > 0$ and $b_i = 0$ otherwise), and $n$ is a vector whose entry $n_i$ counts the background detections of type $i$. Since $b$ and $n$ are completely determined by $h$ and the size of $X^o$, by decomposition,

$$p(X^o, x^m, h \mid \Theta) = p(X^o, x^m \mid h, n, \Theta)\, p(h \mid n, b, \Theta)\, p(n \mid \Theta)\, p(b \mid \Theta)$$

The probability density over the number of background detections can be modeled by a Poisson distribution,

$$p(n \mid \Theta) = \prod_{i=1}^{F} \frac{1}{n_i!} M_i^{n_i} e^{-M_i}$$

where $M_i$ is the average number of background detections of type $i$ per image and $F$ is the number of part types.
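As a concrete illustration, the sketch below evaluates this factorized Poisson density in log space; the function and argument names are assumptions made for the example, with mean_counts standing in for the per-type averages $M_i$:

```python
import math

def background_log_density(n_counts, mean_counts):
    """Log of the factorized Poisson density over background detections.

    n_counts:    background-detection counts, one per part type (n_i).
    mean_counts: average background detections per image (M_i).
    Returns sum_i [ n_i * log(M_i) - M_i - log(n_i!) ].
    """
    return sum(
        n * math.log(m) - m - math.lgamma(n + 1)
        for n, m in zip(n_counts, mean_counts)
    )

# Example: three part types with observed background counts (2, 0, 1)
# and per-image averages (1.5, 0.5, 1.0):
# log_p = background_log_density([2, 0, 1], [1.5, 0.5, 1.0])
```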
To classify a test image, Weber & Welling run the part detectors from the learning step exhaustively over the image, examining different combinations of detections.
The goal is then to select the class with maximum a posteriori probability, by considering the ratio

$$\frac{p(C_1 \mid X^o)}{p(C_0 \mid X^o)} \propto \frac{\sum_h p(X^o, h \mid C_1)}{p(X^o, h_0 \mid C_0)}$$

where $C_1$ denotes the object class, $C_0$ the background class, and $h_0$ the null hypothesis, which explains all detections as background noise.
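A minimal sketch of this decision rule, assuming the per-hypothesis likelihoods have already been evaluated; object_log_liks, background_log_lik, and the threshold are illustrative names rather than part of the original formulation:

```python
import math

def classify(object_log_liks, background_log_lik, log_threshold=0.0):
    """Likelihood-ratio decision between object and background classes.

    object_log_liks:    log p(X_obs, h | C_1) for every hypothesis h.
    background_log_lik: log p(X_obs, h_0 | C_0) for the null hypothesis,
                        which explains all detections as background.
    Accept the object class when the log-ratio exceeds the threshold.
    """
    # log sum_h exp(l_h), computed stably via the log-sum-exp trick.
    m = max(object_log_liks)
    log_numerator = m + math.log(sum(math.exp(l - m) for l in object_log_liks))
    return (log_numerator - background_log_lik) > log_threshold
```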
After the preliminary steps of interest point detection, feature generation, and clustering, we have a large set of candidate parts over the training images.
EM proceeds by maximizing the likelihood of the observed data,

$$L(X^o \mid \Theta) = \sum_{i=1}^{I} \log \sum_{h_i} \int p(X_i^o, x_i^m, h_i \mid \Theta)\, dx_i^m$$

with respect to the model parameters $\Theta$. Since this is difficult to achieve analytically, EM iteratively maximizes a sequence of cost functions,

$$Q(\tilde{\Theta} \mid \Theta) = \sum_{i=1}^{I} E[\log p(X_i^o, x_i^m, h_i \mid \tilde{\Theta})]$$

Taking the derivative of this with respect to the parameters and equating to zero produces the update rules:

$$\tilde{\mu} = \frac{1}{I} \sum_{i=1}^{I} E[x_i] \qquad \tilde{\Sigma} = \frac{1}{I} \sum_{i=1}^{I} E[x_i x_i^T] - \tilde{\mu}\tilde{\mu}^T$$

$$\tilde{p}(\bar{b}) = \frac{1}{I} \sum_{i=1}^{I} E[\delta_{b,\bar{b}}] \qquad \tilde{M} = \frac{1}{I} \sum_{i=1}^{I} E[n_i]$$

The update rules in the M-step are thus expressed in terms of the sufficient statistics $E[x]$, $E[xx^T]$, $E[\delta_{b,\bar{b}}]$, and $E[n]$, which are calculated in the E-step by considering the posterior density

$$p(h_i, x_i^m \mid X_i^o, \Theta) = \frac{p(h_i, x_i^m, X_i^o \mid \Theta)}{\sum_{h_i \in H_b} \int p(h_i, x_i^m, X_i^o \mid \Theta)\, dx_i^m}$$
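Concretely, the M-step amounts to averaging per-image expectations. The sketch below, with illustrative variable names, assumes the E-step expectations have already been computed and stored as NumPy arrays:

```python
import numpy as np

def m_step(E_x, E_xxT, E_n):
    """M-step updates from per-image sufficient statistics.

    E_x:   (I, D)    E[x_i], expected part positions per image.
    E_xxT: (I, D, D) E[x_i x_i^T] per image.
    E_n:   (I, F)    E[n_i], expected background counts per image.
    Returns the updated mean, covariance, and Poisson rates.
    """
    mu = E_x.mean(axis=0)                          # shape-model mean
    Sigma = E_xxT.mean(axis=0) - np.outer(mu, mu)  # shape covariance
    M = E_n.mean(axis=0)                           # background Poisson rates
    return mu, Sigma, M
```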
In Weber et al., shape and appearance models are constructed separately. The innovation of Fergus et al. is to learn not only two, but three model parameters simultaneously: shape, appearance, and relative scale.[10] Whereas the preliminary step in the Weber et al. method is to search for the locations of interest points, Fergus et al. use the detector of Kadir and Brady[12] to find salient regions in the image over both location (center) and scale (radius).
Fergus et al. then normalize the squares bounding these circular regions to 11 × 11 pixel patches, or equivalently, 121-dimensional vectors in the appearance space. These are then reduced to 10–15 dimensions by principal component analysis, giving the appearance information for each salient region.[11]
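A minimal sketch of this appearance pipeline, assuming the salient regions have already been cropped and resized to 11 × 11 grayscale patches; the function name and component count are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def appearance_features(patches, n_components=12):
    """Reduce normalized patches to a low-dimensional appearance space.

    patches: (N, 11, 11) array of grayscale patches, each cropped from
    the square bounding a detected salient region and resized to 11x11.
    Returns an (N, n_components) array of appearance vectors.
    """
    X = patches.reshape(len(patches), -1)   # flatten to (N, 121) vectors
    pca = PCA(n_components=n_components)    # keep 10-15 components
    return pca.fit_transform(X)
```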
The Constellation Model as conceived by Fergus et al. achieves successful categorization rates consistently above 90% on large datasets of motorbikes, faces, airplanes, and spotted cats.[13] For each of these datasets, the Constellation Model is able to capture the "essence" of the object class in terms of appearance and/or shape.
It is important to note that the Constellation Model does not generally account for significant changes in orientation.
Because the computation of sufficient statistics in the E-step of expectation maximization necessitates evaluating the likelihood for every hypothesis, learning becomes a major bottleneck: the number of hypotheses grows combinatorially with the number of candidate detections and model parts.
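To make the combinatorics concrete, the toy calculation below counts hypotheses under the simplifying assumption that each of $P$ parts independently selects one of $N$ candidate detections or is marked missing (an illustration, not the exact hypothesis space of the model):

```python
def hypothesis_count(n_candidates, n_parts):
    """Hypotheses when each of P parts independently picks one of
    N candidate detections, or is declared missing (the +1)."""
    return (n_candidates + 1) ** n_parts

# Example: 20 candidate detections and 5 parts already give
# 21**5 = 4,084,101 hypotheses to score in every E-step.
```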