In the presence of round-off error, many FFT algorithms are much more accurate than evaluating the DFT definition directly or indirectly.
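As a rough illustration of this point (a hedged sketch assuming NumPy, not part of the original text; the recursive radix-2 routine below is a textbook construction, not any particular library's implementation), the round-off error of a single-precision FFT can be compared with a single-precision evaluation of the DFT definition, using a double-precision transform as the reference:

```python
import numpy as np

def fft_radix2(x):
    """Textbook recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return x.copy()
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n).astype(x.dtype)
    t = twiddle * odd
    return np.concatenate([even + t, even - t])

rng = np.random.default_rng(0)
n = 1024
x64 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x32 = x64.astype(np.complex64)

# Direct O(n^2) evaluation of the DFT definition, in single precision.
k = np.arange(n)
dft_matrix = np.exp(-2j * np.pi * np.outer(k, k) / n).astype(np.complex64)
direct32 = dft_matrix @ x32

fft32 = fft_radix2(x32)   # O(n log n) FFT, also in single precision
ref = np.fft.fft(x64)     # double-precision reference

def rel_err(y):
    return np.linalg.norm(y - ref) / np.linalg.norm(ref)

print("relative error, direct DFT (float32):", rel_err(direct32))
print("relative error, radix-2 FFT (float32):", rel_err(fft32))
```

The direct sum typically loses noticeably more precision than the FFT, which is the effect described above.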
The development of fast algorithms for the DFT was prefigured in Carl Friedrich Gauss's unpublished 1805 work on the orbits of the asteroids Pallas and Juno.
Gauss wanted to interpolate the orbits from sample observations;[6][7] his method was very similar to the one that would be published in 1965 by James Cooley and John Tukey, who are generally credited for the invention of the modern generic FFT algorithm.
In 1932, Frank Yates published his version, called the interaction algorithm, which provided efficient computation of Hadamard and Walsh transforms.
In 1942, G. C. Danielson and Cornelius Lanczos published their version to compute the DFT for x-ray crystallography, a field where calculation of Fourier transforms presented a formidable bottleneck.
[11] James Cooley and John Tukey independently rediscovered these earlier algorithms[7] and published a more general FFT in 1965 that is applicable when n is composite and not necessarily a power of 2, as well as analyzing the O(n log n) scaling.
[12] Tukey came up with the idea during a meeting of President Kennedy's Science Advisory Committee, where a discussion topic involved detecting nuclear tests by the Soviet Union with sensors set up to surround the country from outside.
[13] Richard Garwin gave Tukey's idea to Cooley (both worked at IBM's Watson labs) for implementation.
In practice, actual performance on modern computers is usually dominated by factors other than the speed of arithmetic operations and the analysis is a complicated subject (for example, see Frigo & Johnson, 2005),[17] but the overall improvement from O(n^2) to O(n log n) remains.
[18] This method (and the general idea of an FFT) was popularized by a publication of Cooley and Tukey in 1965,[12] but it was later discovered[1] that those two authors had together independently re-invented an algorithm known to Carl Friedrich Gauss around 1805[19] (and subsequently rediscovered several times in limited forms).
The Rader–Brenner algorithm (1976)[20] is a Cooley–Tukey-like factorization but with purely imaginary twiddle factors, reducing multiplications at the cost of increased additions and reduced numerical stability; it was later superseded by the split-radix variant of Cooley–Tukey (which achieves the same multiplication count but with fewer additions and without sacrificing accuracy).
The Winograd FFT algorithm factorizes z^n − 1 into cyclotomic polynomials—these often have coefficients of 1, 0, or −1, and therefore require few (if any) multiplications, so Winograd can be used to obtain minimal-multiplication FFTs and is often used to find efficient algorithms for small factors.
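A small illustration of the underlying property (a hedged sketch assuming SymPy, not part of the original text): z^n − 1 is the product of the cyclotomic polynomials Phi_d(z) over the divisors d of n, and for small d these have only 0 and ±1 coefficients (the first exception is Phi_105):

```python
from sympy import symbols, cyclotomic_poly, divisors, expand

z = symbols("z")
n = 15

# z^n - 1 = product of the cyclotomic polynomials Phi_d(z) over the divisors d of n.
factors = [cyclotomic_poly(d, z) for d in divisors(n)]
for d, p in zip(divisors(n), factors):
    print(f"Phi_{d}(z) = {p}")

product = 1
for p in factors:
    product *= p
assert expand(product - (z**n - 1)) == 0   # the factorization is exact
```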
Another prime-size FFT is due to L. I. Bluestein, and is sometimes called the chirp-z algorithm; it also re-expresses a DFT as a convolution, but this time of the same size (which can be zero-padded to a power of two and evaluated by radix-2 Cooley–Tukey FFTs, for example), via the identity nk = −(k − n)^2/2 + n^2/2 + k^2/2.
Hexagonal fast Fourier transform (HFFT) aims to compute an efficient FFT for hexagonally sampled data by using a new addressing scheme for hexagonal grids, called Array Set Addressing (ASA).
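A hedged sketch of Bluestein's construction described above (assuming NumPy; `bluestein_dft` is an illustrative name, not a library routine): the identity turns the DFT into a convolution of chirp-premultiplied sequences, evaluated here with zero-padded power-of-two FFTs:

```python
import numpy as np

def bluestein_dft(x):
    """DFT of arbitrary length n via the chirp-z identity
    nk = n^2/2 + k^2/2 - (k - n)^2/2, evaluated with power-of-two FFTs."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x.copy()
    j = np.arange(n)
    chirp = np.exp(-1j * np.pi * (j * j % (2 * n)) / n)   # exp(-i*pi*j^2/n)
    a = x * chirp
    # Conjugate chirp, mirrored so that circular convolution of the padded
    # sequences reproduces the required linear convolution.
    m = 1 << (2 * n - 1).bit_length()                     # power of two >= 2n - 1
    b = np.zeros(m, dtype=complex)
    b[:n] = np.conj(chirp)
    b[m - n + 1:] = np.conj(chirp[1:][::-1])
    conv = np.fft.ifft(np.fft.fft(a, m) * np.fft.fft(b))
    return chirp * conv[:n]

# Check against NumPy's FFT for a prime length.
rng = np.random.default_rng(1)
x = rng.standard_normal(17) + 1j * rng.standard_normal(17)
print(np.allclose(bluestein_dft(x), np.fft.fft(x)))       # True
```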
In many applications, the input data for the DFT are purely real, in which case the outputs satisfy the conjugate (Hermitian) symmetry X_{n−k} = X_k*, and efficient FFT algorithms have been designed for this situation (see e.g. Sorensen, 1987).
[23][24] One approach consists of taking an ordinary algorithm (e.g. Cooley–Tukey) and removing the redundant parts of the computation, saving roughly a factor of two in time and memory.
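One concrete way to see the factor-of-two saving (a hedged sketch assuming NumPy, not the specific algorithms cited above) is the classic trick of packing two real sequences into one complex sequence: a single complex FFT then yields both real-input spectra, using the symmetry X_{n−k} = X_k*:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x, y = rng.standard_normal(n), rng.standard_normal(n)

Z = np.fft.fft(x + 1j * y)            # one complex FFT for two real inputs
Zr = np.conj(np.roll(Z[::-1], 1))     # conj(Z[(n - k) mod n]) for each k
X = (Z + Zr) / 2                      # Hermitian (even) part  -> FFT of x
Y = (Z - Zr) / 2j                     # anti-Hermitian part    -> FFT of y

print(np.allclose(X, np.fft.fft(x)), np.allclose(Y, np.fft.fft(y)))   # True True
```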
It was once believed that real-input DFTs could be more efficiently computed by means of the discrete Hartley transform (DHT), but it was subsequently argued that a specialized real-input DFT algorithm (FFT) can typically be found that requires fewer operations than the corresponding DHT algorithm (FHT) for the same number of inputs.
[23] Bruun's algorithm (above) is another method that was initially proposed to take advantage of real inputs, but it has not proved popular.
Instead of directly modifying an FFT algorithm for these cases, DCTs/DSTs can also be computed via FFTs of real data combined with O(n) pre- and post-processing.
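As an illustration of this route (a hedged sketch assuming NumPy; the function names are illustrative), a DCT-II can be obtained from an FFT of a reordered real sequence plus O(n) pre- and post-processing, checked here against a direct O(n^2) evaluation of the cosine sum:

```python
import numpy as np

def dct2_via_fft(x):
    """Unnormalized DCT-II, C[k] = 2 * sum_j x[j] * cos(pi*(2j+1)*k/(2n)),
    computed with one FFT plus O(n) pre- and post-processing."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Pre-processing: even-index samples forward, odd-index samples reversed.
    v = np.concatenate([x[0::2], x[1::2][::-1]])
    V = np.fft.fft(v)                      # an rfft of the real data would also do
    # Post-processing: a phase rotation and the real part.
    k = np.arange(n)
    return 2 * np.real(np.exp(-1j * np.pi * k / (2 * n)) * V)

def dct2_direct(x):
    """Direct O(n^2) evaluation of the same cosine sum, for comparison."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    j = np.arange(n)
    return np.array([2 * np.sum(x * np.cos(np.pi * (2 * j + 1) * k / (2 * n)))
                     for k in range(n)])

x = np.random.default_rng(3).standard_normal(10)
print(np.allclose(dct2_via_fft(x), dct2_direct(x)))   # True
```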
A fundamental question of longstanding theoretical interest is to prove lower bounds on the complexity and exact operation counts of fast Fourier transforms, and many open problems remain.
In particular, the count of arithmetic operations is usually the focus of such questions, although actual performance on modern-day computers is determined by many other factors such as cache or CPU pipeline optimization.
However, these algorithms require too many additions to be practical, at least on modern computers with hardware multipliers (Duhamel, 1990;[26] Frigo & Johnson, 2005).
Since 1968, however, the lowest published count for power-of-two n was long achieved by the split-radix FFT algorithm, which requires 4n log2(n) − 6n + 8 real multiplications and additions for n > 1.
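For a sense of scale (a worked-arithmetic sketch, not from the original text; the direct-DFT figure of roughly 8n^2 real operations assumes 4 real multiplications and 2 real additions per complex multiplication plus the complex additions), the split-radix count can be tabulated against a direct evaluation:

```python
from math import log2

for n in (64, 1024, 65536):
    split_radix = 4 * n * log2(n) - 6 * n + 8   # real multiplications + additions
    direct = 8 * n * n                          # rough count for the DFT definition
    print(f"n = {n:>6}:  split-radix ~ {split_radix:>13,.0f}   direct ~ {direct:>16,.0f}")
```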
A slightly larger count (but still better than split radix for n ≥ 256) was shown to be provably optimal for n ≤ 512 under additional restrictions on the possible algorithms (split-radix-like flowgraphs with unit-modulus multiplicative factors), by reduction to a satisfiability modulo theories problem solvable by brute force (Haynal & Haynal, 2011).
[31] Most of the attempts to lower or prove the complexity of FFT algorithms have focused on the ordinary complex-data case, because it is the simplest.
For example, an approximate FFT algorithm by Edelman et al. (1999)[33] achieves lower communication requirements for parallel computing with the help of a fast multipole method.
Computing only a subset of k outputs in this way has been demonstrated to lead to practical speedups compared to an ordinary FFT for n/k > 32 in a large-n example (n = 2^22), using a probabilistic approximate algorithm (which estimates the largest k coefficients to several decimal places).
[38] Achieving this accuracy requires careful attention to scaling to minimize loss of precision, and fixed-point FFT algorithms involve rescaling at each intermediate stage of decompositions like Cooley–Tukey.
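A minimal sketch of that per-stage rescaling (an illustrative construction, not from the original text, assuming a Q15-style representation simulated with Python integers; `fxp_fft` is a hypothetical name): each radix-2 stage right-shifts the butterfly outputs by one bit, so values never grow beyond the input word length, and the final result equals the exact DFT divided by n up to quantization error:

```python
import math
import numpy as np

def fxp_fft(re, im, scale=1 << 15):
    """Radix-2 DIT FFT on integer (simulated fixed-point) data.
    Twiddle factors are quantized to 'scale'; each stage right-shifts by one
    bit, so the output is the exact DFT divided by n, up to quantization error."""
    n = len(re)
    if n == 1:
        return re[:], im[:]
    ev_r, ev_i = fxp_fft(re[0::2], im[0::2], scale)
    od_r, od_i = fxp_fft(re[1::2], im[1::2], scale)
    out_r, out_i = [0] * n, [0] * n
    for k in range(n // 2):
        # Quantized twiddle factor w = exp(-2*pi*i*k/n).
        wr = round(math.cos(2 * math.pi * k / n) * scale)
        wi = round(-math.sin(2 * math.pi * k / n) * scale)
        # Fixed-point complex multiply t = w * odd[k], rescaled by 'scale'.
        tr = (wr * od_r[k] - wi * od_i[k]) // scale
        ti = (wr * od_i[k] + wi * od_r[k]) // scale
        # Butterfly with a one-bit right shift: the per-stage rescaling.
        out_r[k], out_i[k] = (ev_r[k] + tr) >> 1, (ev_i[k] + ti) >> 1
        out_r[k + n // 2], out_i[k + n // 2] = (ev_r[k] - tr) >> 1, (ev_i[k] - ti) >> 1
    return out_r, out_i

rng = np.random.default_rng(0)
n = 256
x = (rng.standard_normal(n) * 8000).astype(int)           # integer input samples

fr, fi = fxp_fft(list(x), [0] * n)
approx = np.array(fr) + 1j * np.array(fi)
exact = np.fft.fft(x) / n                                  # the shifts contribute the 1/n
print("max abs deviation from exact DFT/n:", np.max(np.abs(approx - exact)))
```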
The correctness of an FFT implementation can be verified in O(n log n) time by a simple procedure checking the linearity, impulse-response, and time-shift properties of the transform on random inputs (Ergün, 1995).
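A hedged sketch of that style of randomized check (assuming NumPy, with np.fft.fft standing in for the transform under test; `spot_check_fft` is an illustrative name, and the actual procedure in Ergün's paper is more elaborate):

```python
import numpy as np

def spot_check_fft(fft, n, trials=10, tol=1e-9, seed=0):
    """Randomized checks of linearity, the impulse response, and the
    time-shift property of a candidate DFT implementation."""
    rng = np.random.default_rng(seed)
    k = np.arange(n)
    # Impulse response: the DFT of a unit impulse is the all-ones vector.
    impulse = np.zeros(n)
    impulse[0] = 1.0
    if not np.allclose(fft(impulse), np.ones(n), atol=tol):
        return False
    for _ in range(trials):
        x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        y = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        a, b = rng.standard_normal(2)
        # Linearity: F(a*x + b*y) == a*F(x) + b*F(y).
        if not np.allclose(fft(a * x + b * y), a * fft(x) + b * fft(y), atol=tol):
            return False
        # A cyclic time shift multiplies output k by exp(-2*pi*i*k/n).
        if not np.allclose(fft(np.roll(x, 1)), fft(x) * np.exp(-2j * np.pi * k / n), atol=tol):
            return False
    return True

print(spot_check_fft(np.fft.fft, 128))   # True
```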
Equivalently, it is the composition of a sequence of d sets of one-dimensional DFTs, performed along one dimension at a time (in any order).
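A small check of this row–column statement (a sketch assuming NumPy): one-dimensional FFTs applied along each axis in turn, in any order, reproduce the multidimensional transform:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8)) + 1j * rng.standard_normal((4, 6, 8))

# One-dimensional FFTs along one dimension at a time, in an arbitrary order.
y = x
for axis in (2, 0, 1):
    y = np.fft.fft(y, axis=axis)

print(np.allclose(y, np.fft.fftn(x)))   # True
```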
More generally, an asymptotically optimal cache-oblivious algorithm consists of recursively dividing the dimensions into two groups that are each transformed recursively (rounding if d is not even).