Quantitative comparative linguistics

Probably the first published quantitative historical linguistics study was by Sapir in 1916,[1] while Kroeber and Chretien in 1937 [2] investigated nine Indo-European (IE) languages using 74 morphological and phonological features (extended in 1939 by the inclusion of Hittite).

Swadesh, using word lists, developed lexicostatistics and glottochronology in a series of papers [4] published in the early 1950s, but these methods were widely criticised [5], although other scholars saw some of the criticisms as unjustified.

Embleton published a book on "Statistics in Historical Linguistics" in 1986 which reviewed previous work and extended the glottochronological method.

Such projects often involved collaboration between linguistic scholars and colleagues with expertise in information science and/or biological anthropology.

These projects often sought to arrive at an optimal phylogenetic tree (or network) representing a hypothesis about the evolutionary ancestry of a group of languages and perhaps the contacts between them.

Greater media attention was generated in 2003 after the publication by anthropologists Russell Gray and Quentin Atkinson of a short study on Indo-European languages in Nature.

Gray and Atkinson attempted to quantify, in a probabilistic sense, the age and relatedness of modern Indo-European languages and, sometimes, the preceding proto-languages.

The proceedings of an influential 2004 conference, Phylogenetic Methods and the Prehistory of Languages, were published in 2006, edited by Peter Forster and Colin Renfrew.

[29] The steps in quantitative analysis are (i) to devise a procedure based on theoretical grounds, on a particular model or on past experience, etc.

[30] Applying phylogenetic methods to languages is a multi-stage process: (a) the encoding stage - getting from real languages to some expression of the relationships between them in the form of numerical or state data, so that those data can then be used as input to phylogenetic methods; (b) the representation stage - applying phylogenetic methods to extract from those numerical and/or state data a signal that is converted into some useful form of representation, usually two-dimensional graphical ones such as trees or networks, which synthesise and "collapse" what are often highly complex multi-dimensional relationships in the signal; (c) the interpretation stage - assessing those tree and network representations to extract from them what they actually mean for real languages and their relationships through time.
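
The encoding stage can be illustrated with a minimal Python sketch. It assumes that cognate judgements have already been made by hand; the language names, meaning slots and class labels below are hypothetical, and the simple "proportion of differing slots" distance is only one of many possible choices.

```python
from itertools import combinations

# Hypothetical cognate judgements: for each meaning slot, every language is
# assigned a cognate class label. All names and labels are illustrative only.
cognate_classes = {
    "hand":  {"LangA": "c1", "LangB": "c1", "LangC": "c2"},
    "water": {"LangA": "c1", "LangB": "c2", "LangC": "c2"},
    "two":   {"LangA": "c1", "LangB": "c1", "LangC": "c1"},
}

def encode_binary(classes):
    """Turn multistate cognate data into binary presence/absence characters."""
    languages = sorted({lang for slot in classes.values() for lang in slot})
    # One binary character per (meaning, cognate class) pair.
    characters = sorted({(meaning, label)
                         for meaning, slot in classes.items()
                         for label in slot.values()})
    matrix = {lang: [1 if classes[meaning].get(lang) == label else 0
                     for meaning, label in characters]
              for lang in languages}
    return characters, matrix

def shared_cognate_distance(classes, a, b):
    """Distance = 1 - proportion of meaning slots with the same cognate class."""
    slots = [m for m, slot in classes.items() if a in slot and b in slot]
    same = sum(classes[m][a] == classes[m][b] for m in slots)
    return 1 - same / len(slots)

characters, matrix = encode_binary(cognate_classes)
distances = {(a, b): shared_cognate_distance(cognate_classes, a, b)
             for a, b in combinations(sorted(matrix), 2)}
```

The binary matrix is the kind of state data that character-based methods take as input, while the pairwise distances are the kind of numerical data that distance-based methods take as input.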

A rooted tree explicitly identifies a common ancestor, often by specifying a direction of evolution or by including an "outgroup" that is known to be only distantly related to the set of languages being classified.

A further type is the reticular network, which shows incompatibilities (due, for example, to contact) as reticulations, and whose internal nodes do represent ancestors.

As originally devised by Swadesh, the single most common word for a slot was to be chosen, which can be difficult and subjective because of semantic shift.

Some methods allow constraints to be placed on language contact geography (isolation by distance) and on sub-group split times.

This showed that chance resemblances were critical to the technique and that Greenberg's conclusions could not be justified, though the mathematical procedure used by Ringe was later criticised.

In some cases, with a large database, an exhaustive search of all possible trees or networks is not feasible because of running time limitations.
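
A short Python sketch shows why exhaustive search quickly becomes infeasible. It uses the standard combinatorial result that the number of distinct binary trees on n labelled leaves is (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees; the function names are illustrative.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def count_unrooted_binary_trees(n_leaves):
    """Number of unrooted binary tree topologies on n labelled leaves."""
    if n_leaves < 3:
        return 1
    return double_factorial(2 * n_leaves - 5)

def count_rooted_binary_trees(n_leaves):
    """Number of rooted binary tree topologies on n labelled leaves."""
    if n_leaves < 2:
        return 1
    return double_factorial(2 * n_leaves - 3)

# Ten languages already give over two million unrooted topologies.
print(count_unrooted_binary_trees(10))  # 2027025
print(count_rooted_binary_trees(10))    # 34459425
```

Because the count grows faster than exponentially with the number of languages, heuristic searches are used for larger datasets.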

The simplest assumption is that all characters evolve at a single constant rate with time and that this is independent of the tree branch.
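
The classic glottochronological formula is one concrete instance of a single constant rate, and can be sketched in Python as follows. If each language independently retains a fraction r of its items per unit of time, two languages that separated t units ago are expected to share a fraction r^(2t). The retention rate of 0.86 per millennium used here is only an illustrative assumption, as are the function names.

```python
import math

def expected_shared_fraction(t_millennia, retention_rate):
    """Expected proportion of shared items after t millennia of separation.

    Each of the two languages retains items independently at the same
    constant rate, hence the factor of 2 in the exponent.
    """
    return retention_rate ** (2 * t_millennia)

def divergence_time(shared_fraction, retention_rate):
    """Invert the formula above: t = ln(c) / (2 ln(r))."""
    return math.log(shared_fraction) / (2 * math.log(retention_rate))

r = 0.86  # assumed retention rate per millennium (illustrative value)
c = expected_shared_fraction(2.0, r)
t = divergence_time(c, r)  # recovers 2.0 millennia, up to rounding
```

More elaborate models relax this assumption by letting the rate vary between characters or between branches.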

A Markov Chain Monte Carlo algorithm[49] generates a sample of trees as an approximation to the posterior probability distribution.
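
The idea can be sketched with a toy Metropolis sampler in Python. This is a deliberately simplified illustration, not the algorithm of any particular package: it samples only among the three unrooted topologies of four languages, and the unnormalised posterior scores are made-up numbers standing in for likelihood times prior. Real implementations also sample branch lengths and model parameters.

```python
import random
from collections import Counter

# Hypothetical unnormalised posterior scores for the three unrooted
# topologies of four languages A, B, C and D (made-up numbers).
UNNORMALISED_POSTERIOR = {
    "((A,B),(C,D))": 6.0,
    "((A,C),(B,D))": 3.0,
    "((A,D),(B,C))": 1.0,
}

def sample_topologies(n_steps, seed=0):
    """Metropolis sampler over tree topologies with a symmetric proposal."""
    rng = random.Random(seed)
    topologies = list(UNNORMALISED_POSTERIOR)
    current = rng.choice(topologies)
    samples = []
    for _ in range(n_steps):
        # Propose one of the other topologies uniformly at random.
        proposal = rng.choice([t for t in topologies if t != current])
        ratio = UNNORMALISED_POSTERIOR[proposal] / UNNORMALISED_POSTERIOR[current]
        # Accept with probability min(1, ratio); otherwise stay put.
        if rng.random() < min(1.0, ratio):
            current = proposal
        samples.append(current)
    return samples

samples = sample_topologies(20000)
frequencies = {t: n / len(samples) for t, n in Counter(samples).items()}
```

The relative frequency of each topology in the sample approximates its posterior probability, here roughly 0.6, 0.3 and 0.1 given the assumed scores.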

The "Unweighted Pair Group Method with Arithmetic Mean" (UPGMA) is a clustering technique which operates by repeatedly joining the two languages (or clusters of languages) that have the smallest distance between them.
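
A minimal Python sketch of UPGMA follows. The language names and distances are hypothetical, and the representation of the tree as a list of merges with heights is a simplification chosen for brevity.

```python
from itertools import combinations

def upgma(names, pairwise):
    """Cluster languages by UPGMA.

    names: list of language names.
    pairwise: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of merges (cluster_a, cluster_b, height) in join order.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of languages
    dist = dict(pairwise)
    merges = []
    while len(sizes) > 1:
        # Find the two current clusters with the smallest distance.
        a, b = min(combinations(sorted(sizes), 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ab = dist[frozenset((a, b))]
        merged = f"({a},{b})"
        # Distance from the new cluster to every other cluster is the
        # size-weighted mean, i.e. the arithmetic mean over all member pairs.
        for other in sizes:
            if other not in (a, b):
                dist[frozenset((merged, other))] = (
                    sizes[a] * dist[frozenset((a, other))]
                    + sizes[b] * dist[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        merges.append((a, b, d_ab / 2))  # node height is half the distance
    return merges

# Hypothetical distances between four languages.
names = ["L1", "L2", "L3", "L4"]
pairwise = {
    frozenset(("L1", "L2")): 0.2,
    frozenset(("L3", "L4")): 0.3,
    frozenset(("L1", "L3")): 0.6,
    frozenset(("L1", "L4")): 0.6,
    frozenset(("L2", "L3")): 0.6,
    frozenset(("L2", "L4")): 0.6,
}
merges = upgma(names, pairwise)
```

With these distances L1 and L2 are joined first, then L3 and L4, and finally the two resulting clusters. Placing each node at half the joining distance reflects the assumption, built into UPGMA, that all languages change at the same rate.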

The weighted splits are then represented in a tree or network based on minimising the number of changes between each pair of taxa.

Fitch and Kitsch are distance-matrix (least-squares) programs in PHYLIP that allow a tree to be rearranged after each addition, unlike NJ.

Later he introduced a refined method, called SLD, to take account of the variable word distribution across languages.

A similar evaluation of the phonetics had earlier been carried out by Grimes and Agard for Romance languages, but this used only six points of comparison.

The families suggested for this analysis by Nichols and Warnow [73] are Germanic, Romance, Slavic, Common Turkic, Chinese, and Mixe-Zoque, as well as older groups such as Oceanic and IE.

They later founded the CPHL project, the goals of which include: "producing and maintaining real linguistic datasets, in particular of Indo-European languages", "formulating statistical models that capture the evolution of historical linguistic data", "designing simulation tools and accuracy measures for generating synthetic data for studying the performance of reconstruction methods", and "developing and implementing statistically-based as well as combinatorial methods for reconstructing language phylogenies, including phylogenetic networks".

They produced a standard multistate matrix in which the 141 character states correspond to individual cognate classes, allowing polymorphism.

The PAUP software package was used for UPGMA, NJ, and MC as well as computing the majority consensus trees.

Then a screened database was produced by excluding all characters that clearly exhibited parallel development, thereby eliminating 38 features.

[80] Cysouw et al. (2006) [81] compared Holm's original method with NJ, Fitch, MP and SD.

In 2013, François Barbançon, Warnow, Evans, Ringe and Nakhleh studied various tree reconstruction methods using simulated data.