Zipf's law (/zɪf/; German pronunciation: [tsɪpf]) is an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n. The best known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language: the most common word occurs approximately twice as often as the second most common word, three times as often as the third most common, and so on.
It has been found to apply to many other types of data studied in the physical and social sciences.
In 1913, the German physicist Felix Auerbach observed an inverse proportionality between the population sizes of cities and their ranks when sorted in decreasing order of population.[10] The same relation for the frequencies of words in natural-language texts was observed by George Zipf in 1932,[4] but he never claimed to have originated it.
Indeed, the only mathematical expression Zipf used was ab² = constant, which he "borrowed" from Alfred J. Lotka's 1926 publication.[13] The same relation is found for personal incomes (where it is called the Pareto principle[14]), the number of people watching the same TV channel,[15] notes in music,[16] cell transcriptomes,[17][18] and more.
In 1992, bioinformatician Wentian Li published a short paper[19] showing that Zipf's law emerges even in randomly generated texts.
The generalized Zipf distribution can be extended to infinitely many items (N = ∞) only if the exponent s exceeds 1. The infinite-item case is characterized by the Zeta distribution and is called Lotka's law.
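In the notation conventionally used for this distribution (the symbols below follow that convention rather than anything defined in this passage), Zipf's law with exponent s over N items assigns to the item of rank k the frequency

\[ f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}, \qquad f(k; s) = \frac{1}{\zeta(s)\, k^{s}} \quad (N = \infty), \]

where ζ(s) is the Riemann zeta function; the normalizing sum converges only for s > 1, which is why the infinite case requires the exponent to exceed 1.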
The data conform to Zipf's law with exponent s to the extent that the log-log plot of frequency against rank approximates a linear (more precisely, affine) function with slope −s.[3]
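As an illustration of how such a fit can be carried out in practice, here is a minimal sketch in Python; the file name is hypothetical, and the ordinary least-squares fit on log-log data used here is a common but crude estimator of s, not a method prescribed by this article.

```python
import re
from collections import Counter

import numpy as np


def zipf_slope(text: str) -> float:
    """Estimate the Zipf exponent s of a text from the slope of its
    log(frequency) versus log(rank) relation (the fitted slope is -s)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Frequencies in decreasing order; rank 1 is the most common word.
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    # Ordinary least-squares fit of log f = c - s * log r.
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope


# Hypothetical usage with a plain-text corpus file:
# with open("corpus.txt", encoding="utf-8") as fh:
#     print(f"fitted exponent s = {zipf_slope(fh.read()):.2f}")
```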
Although Zipf's law holds for most natural languages, and even certain artificial ones such as Esperanto[22] and Toki Pona,[23] the reason is still not well understood.[24] Recent reviews of generative processes for Zipf's law include Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions",[25] and Simkin, "Re-inventing Willis".
Wentian Li has shown that in a document in which each character is chosen randomly from a uniform distribution over all letters (plus a space character), the "words" of different lengths follow the macro-trend of Zipf's law (shorter "words" are more probable, and all "words" of the same length are equally probable).[19] In 1959, Vitold Belevitch observed that if any of a large class of well-behaved statistical distributions (not only the normal distribution) is expressed in terms of rank and expanded into a Taylor series, the first-order truncation of the series results in Zipf's law.[27][28]
The principle of least effort is another possible explanation: Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and the process that results in an approximately equal distribution of effort leads to the observed Zipf distribution.[5][29]
A minimal explanation assumes that words are generated by monkeys typing randomly. If language is generated by a single monkey typing randomly, with fixed and nonzero probability of hitting each letter key or white space, then the words (letter strings separated by white spaces) produced by the monkey follow Zipf's law.
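A small simulation makes this concrete. The sketch below is not the construction from the cited papers: the alphabet, the space probability, and the text length are arbitrary choices, and the slope is estimated with the same rough least-squares fit used in the earlier sketch.

```python
import random
from collections import Counter

import numpy as np


def monkey_text(n_chars: int, alphabet: str = "abcdefgh",
                p_space: float = 0.2, seed: int = 0) -> str:
    """Generate text by 'monkey typing': each keystroke is a white space
    with probability p_space, otherwise a letter drawn uniformly."""
    rng = random.Random(seed)
    keys = list(alphabet)
    return "".join(" " if rng.random() < p_space else rng.choice(keys)
                   for _ in range(n_chars))


text = monkey_text(1_000_000)
counts = Counter(text.split())

# Rank-frequency data; fit only over words seen more than once to avoid
# the noisy tail of singletons at the bottom of the ranking.
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)
keep = freqs > 1
slope, _ = np.polyfit(np.log(ranks[keep]), np.log(freqs[keep]), 1)
print(f"log-log slope = {slope:.2f}")  # macro-trend close to -1, i.e. Zipf-like
```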
Preferential attachment is another proposed mechanism: it was originally derived by Yule to explain population versus rank in species, and applied to cities by Simon.
It has been shown mathematically that Zipf's law holds for Atlas models that satisfy certain natural regularity conditions.
Following Auerbach's 1913 observation, there has been substantial examination of Zipf's law for city sizes.[39] However, more recent empirical[40][41] and theoretical[42] studies have challenged the relevance of Zipf's law for cities.
The actual rank-frequency plot of a natural-language text deviates to some extent from the ideal Zipf distribution, especially at the two ends of the range.
At the low-frequency end, where the rank approaches N, the plot takes a staircase shape, because each word can occur only an integer number of times.
The rank-frequency table for those morphemes deviates significantly from the ideal Zipf's law at both ends of the range.[citation needed]
Even in English, the deviations from the ideal Zipf's law become more apparent as one examines large collections of texts.[45]
In these cases, the observed frequency-rank relation can be modeled more accurately by separate Zipf–Mandelbrot distributions for different subsets or subtypes of words. In particular, the frequencies of the closed class of function words in English are better described with s lower than 1, while open-ended vocabulary growth with document and corpus size requires s greater than 1 for convergence of the generalized harmonic series.[citation needed]
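For reference, the Zipf–Mandelbrot form mentioned here is conventionally written (the parameter names follow the usual convention rather than this passage) as

\[ f(k; q, s) \propto \frac{1}{(k + q)^{s}}, \]

which reduces to the plain Zipf form when the offset q = 0; normalizing it over an unbounded vocabulary again requires s > 1, the convergence condition of the generalized harmonic series noted above.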
Zipf's law has been used for the extraction of parallel fragments of text from comparable corpora.[46] Laurance Doyle and others have suggested applying Zipf's law to the detection of alien language in the search for extraterrestrial intelligence.[49][50] The word-like sign groups of the Voynich Manuscript, a 15th-century codex, have been found to satisfy Zipf's law, suggesting that the text is most likely not a hoax but rather written in an obscure language or cipher.