Cooley–Tukey FFT algorithm

The Cooley–Tukey algorithm, named after J. W. Cooley and John Tukey, is the most common fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of an arbitrary composite size N = N1N2 in terms of N1 smaller DFTs of sizes N2, recursively, to reduce the computation time to O(N log N) for highly composite N (smooth numbers).

This algorithm, including its recursive application, was invented around 1805 by Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but his work was not widely recognized (being published only posthumously and in Neo-Latin).[2]

FFTs became popular after James Cooley of IBM and John Tukey of Princeton published a paper in 1965 reinventing[2] the algorithm and describing how to perform it conveniently on a computer.[3]

Tukey reportedly came up with the idea during a meeting of President Kennedy's Science Advisory Committee discussing ways to detect nuclear-weapon tests in the Soviet Union by employing seismometers located outside the country.

However, analysis of these data would require fast algorithms for computing DFTs, owing to the number of sensors and the length of the time series involved.

This task was critical for the ratification of the proposed nuclear test ban, so that any violations could be detected without the need to visit Soviet facilities.[4][5]

Another participant at that meeting, Richard Garwin of IBM, recognized the potential of the method and put Tukey in touch with Cooley.

Cooley and Tukey subsequently published their joint paper, and wide adoption quickly followed due to the simultaneous development of analog-to-digital converters capable of sampling at rates up to 300 kHz.

The fact that Gauss had described the same algorithm (albeit without analyzing its asymptotic cost) was not realized until several years after Cooley and Tukey's 1965 paper.

The DFT of length N can then be written as:

    X_k = E_k + e^{−2πi k/N} O_k
    X_{k+N/2} = E_k − e^{−2πi k/N} O_k        for 0 ≤ k < N/2,

where E_k and O_k denote the length-N/2 DFTs of the even-indexed and odd-indexed samples of x, respectively. This result, expressing the DFT of length N recursively in terms of two DFTs of size N/2, is the core of the radix-2 DIT fast Fourier transform.
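This identity follows from splitting the DFT sum over even- and odd-indexed samples (a standard derivation, using the same notation as above):

    X_k = Σ_{n=0}^{N−1} x_n e^{−2πi nk/N}
        = Σ_{m=0}^{N/2−1} x_{2m} e^{−2πi mk/(N/2)} + e^{−2πi k/N} Σ_{m=0}^{N/2−1} x_{2m+1} e^{−2πi mk/(N/2)}
        = E_k + e^{−2πi k/N} O_k.

The second half of the output then follows from the periodicity E_{k+N/2} = E_k and O_{k+N/2} = O_k together with e^{−2πi (k+N/2)/N} = −e^{−2πi k/N}.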

This process is an example of the general technique of divide and conquer algorithms; in many conventional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadth-first fashion.

The above re-expression of a size-N DFT as two size-N/2 DFTs is sometimes called the Danielson–Lanczos lemma, since the identity was noted by those two authors in 1942[7] (influenced by Runge's 1903 work[2]).

They applied their lemma in a "backwards" recursive fashion, repeatedly doubling the DFT size until the transform spectrum converged (although they apparently didn't realize the linearithmic [i.e., order N log N] asymptotic complexity they had achieved).

The Danielson–Lanczos work predated widespread availability of electronic computers and required hand calculation; they reported a computation time of 140 minutes for a size-64 DFT operating on real inputs to 3–5 significant digits. Cooley and Tukey's 1965 paper, by contrast, reported a running time of 0.02 minutes for a size-2048 complex DFT on an IBM 7094 (probably in 36-bit single precision, ~8 digits).

(To put the time for the hand calculation in perspective, 140 minutes for size 64 corresponds to an average of at most 16 seconds per floating-point operation, around 20% of which are multiplications.)

In pseudocode, the procedure can be written as follows.[8] Here, ditfft2(x, N, 1) computes X = DFT(x) out-of-place by a radix-2 DIT FFT, where N is an integer power of 2 and s = 1 is the stride of the input array x.
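A sketch of such a procedure, following the standard radix-2 DIT recursion described above (exp denotes the complex exponential, and x+s denotes the array x offset by s elements):

    X_{0,…,N−1} ← ditfft2(x, N, s):                (DFT of x_0, x_s, x_{2s}, …, x_{(N−1)s})
        if N = 1 then
            X_0 ← x_0                              (trivial size-1 DFT base case)
        else
            X_{0,…,N/2−1} ← ditfft2(x, N/2, 2s)    (DFT of the even-indexed elements)
            X_{N/2,…,N−1} ← ditfft2(x+s, N/2, 2s)  (DFT of the odd-indexed elements)
            for k = 0 to N/2−1 do                  (combine the two half-size DFTs)
                p ← X_k
                q ← exp(−2πi k/N) · X_{k+N/2}
                X_k ← p + q
                X_{k+N/2} ← p − q
            end for
        end if

Halving N while doubling the stride s at each recursive call selects the even- and odd-indexed subsequences without copying the input; the combine loop is the butterfly stage.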

The split-radix algorithm merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, to achieve what was long the lowest known arithmetic operation count for power-of-two sizes,[10] although recent variations achieve an even lower count.[11][12]
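Concretely, writing ω = e^{−2πi/N}, the split-radix decomposition takes the form (a standard identity, with U, Z, Z′ named here for illustration):

    X_k = U_{k mod N/2} + ω^k Z_{k mod N/4} + ω^{3k} Z′_{k mod N/4},

where U is the length-N/2 DFT of the even-indexed samples x_{2m}, and Z, Z′ are the length-N/4 DFTs of x_{4m+1} and x_{4m+3}: the even half recurses with no twiddle factor, while the two odd quarters absorb the twiddles ω^k and ω^{3k}.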

(On present-day computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size.[13])

The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices.
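As a small illustration (in Python; the helper name is ours), reversing a 3-bit index for N = 8 yields the permutation [0, 4, 2, 6, 1, 5, 3, 7]:

    def bit_reverse(k, bits):
        # Reverse the lowest `bits` bits of k, e.g. 0b011 -> 0b110 for bits = 3.
        r = 0
        for _ in range(bits):
            r = (r << 1) | (k & 1)  # shift the result left, append k's low bit
            k >>= 1
        return r

    print([bit_reverse(k, 3) for k in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]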

If, instead of using a small radix, one employs a radix of roughly √N together with explicit input/output matrix transpositions, the result is called a four-step FFT algorithm (or six-step, depending on the number of transpositions). This approach was initially proposed to improve memory locality,[14][15] e.g. for cache optimization or out-of-core operation, and was later shown to be an optimal cache-oblivious algorithm.

That is, the algorithm re-indexes the input (n) and output (k) as N1 × N2 two-dimensional arrays in column-major and row-major order, respectively; the difference between these two indexings is a transposition, as mentioned above.
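Written out under that indexing (here taking n = n1 + N1·n2 for the column-major input and k = N2·k1 + k2 for the row-major output, a convention chosen to match the description above), the factorization reads:

    X_{N2·k1 + k2} = Σ_{n1=0}^{N1−1} [ e^{−2πi n1 k2/N} ( Σ_{n2=0}^{N2−1} x_{n1 + N1·n2} e^{−2πi n2 k2/N2} ) ] e^{−2πi n1 k1/N1}.

The inner sums are N1 independent DFTs of size N2, the factors e^{−2πi n1 k2/N} are the twiddle factors, and the outer sums are N2 independent DFTs of size N1.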

An arbitrary radix r (as well as mixed radices) can be employed, as was shown by Cooley and Tukey[3] as well as by Gauss (who gave examples of radix-3 and radix-6 steps).[14][15][16]

Although the abstract Cooley–Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT.

Of special interest is the problem of devising an in-place algorithm that overwrites its input with its output data using only O(1) auxiliary storage.

Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements.

To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.[23][24]
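For reference, here is a minimal Python sketch (function name ours) of the classic scheme these alternatives aim to improve on: a separate in-place bit-reversal pass followed by breadth-first butterfly stages, using only O(1) auxiliary storage:

    import cmath

    def fft_inplace(a):
        # In-place iterative radix-2 DIT FFT; len(a) must be a power of two.
        n = len(a)
        # Pass 1: bit-reversal permutation, swapping disjoint pairs in place.
        j = 0
        for i in range(1, n):
            bit = n >> 1
            while j & bit:          # increment j as a bit-reversed counter
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        # Pass 2: breadth-first butterflies: size-2, size-4, ..., size-n DFTs.
        m = 2
        while m <= n:
            wm = cmath.exp(-2j * cmath.pi / m)  # principal m-th root of unity
            for start in range(0, n, m):
                w = 1.0
                for k in range(m // 2):
                    u = a[start + k]
                    t = w * a[start + k + m // 2]
                    a[start + k] = u + t
                    a[start + k + m // 2] = u - t
                    w *= wm
            m *= 2

After fft_inplace(a) returns, a holds DFT(a) in natural order; because the permutation swaps disjoint pairs and each butterfly overwrites only the two elements it reads, no auxiliary array is needed.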

Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the Pease algorithm,[25] which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(N log N) storage.

One can also directly apply the Cooley–Tukey factorization definition with explicit (depth-first) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step (as in the pseudocode above) and can be argued to have cache-oblivious locality benefits on systems with hierarchical memory.[9][13][26]

A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data.[13][27][28][29][30]

Note that when the FFT is used to calculate a convolution, the data reordering can be avoided altogether if a suitable combination of forward and backward transforms is used.
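A Python sketch of such a combination (function names ours): a decimation-in-frequency forward transform leaves its output in bit-reversed order, the pointwise product is taken in that order, and a decimation-in-time inverse transform consumes bit-reversed input, so no reordering pass is needed anywhere:

    import cmath

    def dif_fft(a):
        # Forward radix-2 DIF FFT, in place: natural-order input,
        # bit-reversed-order output (no reordering pass).
        n = len(a)
        m = n
        while m >= 2:
            half = m // 2
            for start in range(0, n, m):
                for j in range(half):
                    w = cmath.exp(-2j * cmath.pi * j / m)
                    u, v = a[start + j], a[start + j + half]
                    a[start + j] = u + v
                    a[start + j + half] = (u - v) * w
            m = half

    def dit_ifft(a):
        # Inverse radix-2 DIT FFT, in place: bit-reversed-order input,
        # natural-order output (again no reordering pass); unscaled.
        n = len(a)
        m = 2
        while m <= n:
            half = m // 2
            for start in range(0, n, m):
                for j in range(half):
                    w = cmath.exp(2j * cmath.pi * j / m)
                    u, t = a[start + j], a[start + j + half] * w
                    a[start + j] = u + t
                    a[start + j + half] = u - t
            m *= 2

    def circular_convolution(x, y):
        # Pointwise product in bit-reversed order equals the bit-reversed
        # pointwise product, so the two permutations cancel end to end.
        a, b = list(x), list(y)
        dif_fft(a)
        dif_fft(b)
        c = [ai * bi for ai, bi in zip(a, b)]
        dit_ifft(c)
        n = len(c)
        return [ci / n for ci in c]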

Data flow diagram for N = 8: a decimation-in-time radix-2 FFT breaks a length-N DFT into two length-N/2 DFTs, followed by a combining stage consisting of N/2 size-2 DFTs called "butterfly" operations (so called because of the shape of the data-flow diagrams).
The basic step of the Cooley–Tukey FFT for general factorizations can be viewed as re-interpreting a 1d DFT as something like a 2d DFT. The 1d input array of length N = N1N2 is reinterpreted as a 2d N1 × N2 matrix stored in column-major order. One performs smaller 1d DFTs along the N2 direction (the non-contiguous direction), then multiplies by phase factors (twiddle factors), and finally performs 1d DFTs along the N1 direction. The transposition step can be performed in the middle, as shown here, or at the beginning or end. This is done recursively for the smaller transforms.
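This step can be sketched in a few lines of Python/NumPy (a toy illustration; np.fft.fft stands in for the recursively applied sub-transforms, and the names are ours):

    import numpy as np

    def cooley_tukey_step(x, N1, N2):
        # Reinterpret the length-N input as an N1 x N2 matrix in column-major order.
        A = np.asarray(x, dtype=complex).reshape((N1, N2), order='F')
        # 1) Length-N2 DFTs along the N2 direction (stride N1 in the original array).
        A = np.fft.fft(A, axis=1)
        # 2) Multiply by the twiddle factors exp(-2*pi*i*n1*k2/N).
        N = N1 * N2
        n1 = np.arange(N1).reshape(N1, 1)
        k2 = np.arange(N2).reshape(1, N2)
        A = A * np.exp(-2j * np.pi * n1 * k2 / N)
        # 3) Length-N1 DFTs along the N1 direction.
        A = np.fft.fft(A, axis=0)
        # Read the output in row-major order: X[N2*k1 + k2].
        return A.reshape(N, order='C')

For any valid factorization N = N1N2, cooley_tukey_step(x, N1, N2) agrees with np.fft.fft(x) up to floating-point rounding; the transposition is implicit here in reading the input column-major and the output row-major.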