Scale-invariant feature transform

The determination of consistent clusters is performed rapidly by using an efficient hash table implementation of the generalised Hough transform.

Each cluster of 3 or more features that agree on an object and its pose is then subject to further detailed model verification and subsequently outliers are discarded.

Similarly, features located in articulated or flexible objects would typically not work if any change in their internal geometry happens between two images in the set being processed.

[1] This section summarizes the original SIFT algorithm and mentions a few competing techniques available for object recognition under clutter and partial occlusion.

The SIFT features are local and based on the appearance of the object at particular interest points, and are invariant to image scale and rotation.

In addition to these properties, they are highly distinctive, relatively easy to extract and allow for correct object identification with low probability of mismatch.

They are relatively easy to match against a (large) database of local features but, however, the high dimensionality can be an issue, and generally probabilistic algorithms such as k-d trees with best bin first search are used.

These features share similar properties with neurons in the primary visual cortex that encode basic forms, color, and movement for object detection in primate vision.

[13] Key locations are defined as maxima and minima of the result of difference of Gaussians function applied in scale space to a series of smoothed and resampled images.

Lowe used a modification of the k-d tree algorithm called the best-bin-first search method[14] that can identify the nearest neighbors with high probability using only a limited amount of computation.

For a database of 100,000 keypoints, this provides a speedup over exact nearest neighbor search by about 2 orders of magnitude, yet results in less than a 5% loss in the number of correct matches.

The similarity transform implied by these 4 parameters is only an approximation to the full 6 degree-of-freedom pose space for a 3D object and also does not account for any non-rigid deformations.

Each identified cluster is then subject to a verification procedure in which a linear least squares solution is performed for the parameters of the affine transformation relating the model to the image.

Given the linear least squares solution, each match is required to agree within half the error range that was used for the parameters in the Hough transform bins.

The next step in the algorithm is to perform a detailed fit to the nearby data for accurate location, scale, and ratio of principal curvatures.

A similar subpixel determination of the locations of scale-space extrema is performed in the real-time implementation based on hybrid pyramids developed by Lindeberg and his co-workers.

Therefore, in order to increase stability, we need to eliminate the keypoints that have poorly determined locations but have high edge responses.

[2] This processing step for suppressing responses at edges is a transfer of a corresponding approach in the Harris operator for corner detection.

To test the distinctiveness of the SIFT descriptors, matching accuracy is also measured against varying number of keypoints in the testing database, and it is shown that matching accuracy decreases only very slightly for very large database sizes, thus indicating that SIFT features are highly distinctive.

[19] The main results are summarized below: The evaluations carried out suggests strongly that SIFT-based descriptors, which are region-based, are the most robust and distinctive, and are therefore best suited for feature matching.

[21] Recently, a slight variation of the descriptor employing an irregular histogram grid has been proposed that significantly improves its performance.

The Euclidean distance between SIFT-Rank descriptors is invariant to arbitrary monotonic changes in histogram bin values, and is related to Spearman's rank correlation coefficient.

Finally for each connected component bundle adjustment is performed to solve for joint camera parameters, and the panorama is rendered using multi-band blending.

Because of the SIFT-inspired object recognition approach to panorama stitching, the resulting system is insensitive to the ordering, orientation, scale and illumination of the images.

This is used with bundle adjustment initialized from an essential matrix or trifocal tensor to build a sparse 3D model of the viewed scene and to simultaneously recover camera poses and calibration parameters.

[31][32] Extensions of the SIFT descriptor to 2+1-dimensional spatio-temporal data in context of human action recognition in video sequences have been studied.

[36] The Feature-based Morphometry (FBM) technique[37] uses extrema in a difference of Gaussian scale-space to analyze and classify 3D magnetic resonance images (MRIs) of the human brain.

In an extensive experimental evaluation on a poster dataset comprising multiple views of 12 posters over scaling transformations up to a factor of 6 and viewing direction variations up to a slant angle of 45 degrees, it was shown that substantial increase in performance of image matching (higher efficiency scores and lower 1-precision scores) could be obtained by replacing Laplacian of Gaussian interest points by determinant of the Hessian interest points.

This study therefore shows that discregarding discretization effects the pure image descriptor in SIFT is significantly better than the pure image descriptor in SURF, whereas the underlying interest point detector in SURF, which can be seen as numerical approximation to scale-space extrema of the determinant of the Hessian, is significantly better than the underlying interest point detector in SIFT.

Wagner et al. developed two object recognition algorithms especially designed with the limitations of current mobile phones in mind.

After scale space extrema are detected (their location being shown in the uppermost image) the SIFT algorithm discards low-contrast keypoints (remaining points are shown in the middle image) and then filters out those located on edges. Resulting set of keypoints is shown on last image.