SURF was first published by Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, and presented at the 2006 European Conference on Computer Vision.
[1] An "upright" version of SURF (called U-SURF) is not invariant to image rotation and is therefore faster to compute and better suited to applications where the camera remains more or less horizontal.
This produces a progressively blurred representation of the original image, called a scale space, and ensures that the points of interest are scale invariant.
(The SIFT approach uses cascaded filters to detect scale-invariant characteristic points, where the difference of Gaussians (DoG) is computed progressively on rescaled images.)
In contrast to the Hessian-Laplacian detector by Mikolajczyk and Schmid, SURF also uses the determinant of the Hessian for selecting the scale, as done by Lindeberg.
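As a minimal sketch of the blob-response measure, the function below combines three second-derivative filter responses into an approximate Hessian determinant. The names `dxx`, `dyy`, and `dxy` are illustrative, and the relative weight of 0.9 on the mixed term is the value reported in the original SURF paper to compensate for the box-filter approximation; treat it as an assumption here rather than something stated in this text.

```python
def hessian_determinant(dxx, dyy, dxy, weight=0.9):
    """Approximate determinant of the Hessian from box-filter responses.

    dxx, dyy, dxy: responses of the second-derivative filters at one pixel
    and one scale (hypothetical inputs for illustration).
    weight: assumed balancing factor for the mixed derivative term.
    """
    return dxx * dyy - (weight * dxy) ** 2


# A bright or dark blob gives same-signed dxx and dyy, so the determinant is positive.
response = hessian_determinant(2.0, 3.0, 1.0)
```

The same response map is used both to locate interest points in the image and to choose their scale, which is the point made in the sentence above.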
The box filter of size 9×9 is an approximation of a Gaussian with σ=1.2 and represents the lowest level (highest spatial resolution) for blob-response maps.
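The correspondence between filter size and scale can be sketched as follows. The sketch assumes that the scale grows linearly with the side length of the box filter, anchored at the 9×9 filter with σ=1.2; the function name and its parameters are illustrative only.

```python
def box_filter_scale(size, base_size=9, base_sigma=1.2):
    """Approximate Gaussian scale represented by a box filter of a given size.

    Assumes linear scaling from the base filter (9x9 <-> sigma = 1.2).
    """
    return base_sigma * size / base_size


# Scales for a few filter sizes; larger filters correspond to coarser scales.
scales = {size: box_filter_scale(size) for size in (9, 15, 21, 27)}
```

Because the filter grows rather than the image shrinking, every level of the scale space keeps the resolution of the input image.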
Hence, unlike previous methods, scale spaces in SURF are implemented by applying box filters of different sizes.
This results in filters of size 9×9, 15×15, 21×21, 27×27, and so on. Non-maximum suppression in a 3×3×3 neighborhood is applied to localize interest points in the image and over scales.
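The non-maximum suppression step can be illustrated with a small sketch in plain Python. It assumes the blob responses are stored as a nested list indexed as `stack[scale][row][column]`, and it adds a hypothetical `threshold` parameter to discard weak responses; neither detail comes from the text above.

```python
def local_maxima_3x3x3(stack, threshold=0.0):
    """Return (scale, row, col) positions that exceed all 26 neighbours.

    stack: nested list of response maps, one per scale, all the same size.
    Border positions (in scale and in the image) are skipped because they
    do not have a full 3x3x3 neighbourhood.
    """
    offsets = [(ds, dy, dx)
               for ds in (-1, 0, 1)
               for dy in (-1, 0, 1)
               for dx in (-1, 0, 1)
               if (ds, dy, dx) != (0, 0, 0)]
    maxima = []
    for s in range(1, len(stack) - 1):
        for y in range(1, len(stack[s]) - 1):
            for x in range(1, len(stack[s][y]) - 1):
                value = stack[s][y][x]
                if value <= threshold:
                    continue
                if all(value > stack[s + ds][y + dy][x + dx]
                       for ds, dy, dx in offsets):
                    maxima.append((s, y, x))
    return maxima


# Three 3x3 response maps with a single peak in the middle of the middle scale.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = 1.0
peaks = local_maxima_3x3x3(stack)
```

Comparing against the scale above and below, as well as the eight spatial neighbours, is what makes the detected point a maximum over scale and not only over image position.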
The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space with the method proposed by Brown et al.
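The idea behind this interpolation can be shown in one dimension. The sketch below is a deliberate simplification and not the full method: it fits a parabola through a discrete peak and its two neighbours along a single axis, whereas the cited approach fits a quadratic jointly in both image coordinates and scale. The function name is illustrative.

```python
def refine_peak_1d(prev_value, peak_value, next_value):
    """Sub-sample offset of the true maximum relative to the discrete peak.

    Fits a parabola through three equally spaced samples and returns the
    position of its vertex, measured in sample units from the middle one.
    """
    curvature = prev_value - 2.0 * peak_value + next_value
    if curvature == 0.0:
        return 0.0  # flat neighbourhood: keep the discrete position
    return 0.5 * (prev_value - next_value) / curvature


# The right-hand neighbour is larger, so the refined peak shifts to the right.
offset = refine_peak_1d(1.0, 3.0, 2.0)
```

Interpolation matters most between scales, since adjacent levels of the scale space are separated by comparatively large steps in filter size.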
The goal of a descriptor is to provide a unique and robust description of an image feature, e.g., by describing the intensity distribution of the pixels within the neighbourhood of the point of interest.
A short descriptor may be more robust against appearance variations, but may not offer sufficient discrimination and thus give too many false positives.
The first step consists of fixing a reproducible orientation based on information from a circular region around the interest point.
The size of the sliding window is a parameter that has to be chosen carefully to achieve a desired balance between robustness and angular resolution.
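A minimal sketch of the sliding-window search is given below. It assumes the wavelet responses around the interest point are already available as `(dx, dy)` pairs, uses a window of π/3 as in the original SURF paper, and samples window positions in 72 steps; the function name and the step count are illustrative assumptions. Gaussian weighting of the responses is omitted.

```python
import math


def dominant_orientation(responses, window=math.pi / 3, steps=72):
    """Estimate the dominant orientation from (dx, dy) wavelet responses.

    For each window position, the responses whose angle falls inside the
    window are summed; the longest summed vector defines the orientation.
    """
    two_pi = 2.0 * math.pi
    best_length_sq = -1.0
    best_angle = 0.0
    for k in range(steps):
        start = two_pi * k / steps
        sum_x = sum_y = 0.0
        for dx, dy in responses:
            angle = math.atan2(dy, dx) % two_pi
            if (angle - start) % two_pi < window:
                sum_x += dx
                sum_y += dy
        length_sq = sum_x * sum_x + sum_y * sum_y
        if length_sq > best_length_sq:
            best_length_sq = length_sq
            best_angle = math.atan2(sum_y, sum_x)
    return best_angle


# Three strong responses pointing roughly along +x and one weak outlier.
samples = [(1.0, 0.1), (1.0, -0.1), (0.9, 0.0), (-0.2, 0.1)]
theta = dominant_orientation(samples)
```

A narrow window follows individual responses closely and is sensitive to noise, while a wide window averages over many responses and blurs distinct gradient directions, which is the trade-off described above.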
The interest region is split into 4×4 smaller square sub-regions, and for each one, the Haar wavelet responses are extracted at 5×5 regularly spaced sample points.
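The layout of the descriptor can be sketched as follows, assuming the wavelet responses have already been sampled on a 20×20 grid (4×4 sub-regions of 5×5 samples each) and stored as nested lists `dx` and `dy`. Summing the responses and their absolute values per sub-region, which yields a 64-element vector, follows the original SURF paper; the final unit-length normalization, the function name, and the omission of Gaussian weighting are simplifying assumptions.

```python
import math


def surf_descriptor(dx, dy, grid=4, samples=5):
    """Build a descriptor from per-sample Haar responses.

    dx, dy: (grid * samples) x (grid * samples) nested lists of responses.
    Each sub-region contributes (sum dx, sum dy, sum |dx|, sum |dy|).
    """
    descriptor = []
    for block_row in range(grid):
        for block_col in range(grid):
            s_dx = s_dy = s_abs_dx = s_abs_dy = 0.0
            for y in range(block_row * samples, (block_row + 1) * samples):
                for x in range(block_col * samples, (block_col + 1) * samples):
                    s_dx += dx[y][x]
                    s_dy += dy[y][x]
                    s_abs_dx += abs(dx[y][x])
                    s_abs_dy += abs(dy[y][x])
            descriptor.extend([s_dx, s_dy, s_abs_dx, s_abs_dy])
    # Normalize to unit length to reduce sensitivity to contrast changes.
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm > 0.0:
        descriptor = [v / norm for v in descriptor]
    return descriptor


# Synthetic input: a uniform horizontal gradient and no vertical response.
dx_grid = [[1.0] * 20 for _ in range(20)]
dy_grid = [[0.0] * 20 for _ in range(20)]
vector = surf_descriptor(dx_grid, dy_grid)
```

Keeping the signed and absolute sums separately lets the descriptor distinguish a uniform gradient from an oscillating pattern whose signed responses cancel out.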