[1] Language models are useful for a variety of tasks, including speech recognition,[2] machine translation,[3] natural language generation (generating more human-like text), optical character recognition, route optimization,[4] handwriting recognition,[5] grammar induction,[6] and information retrieval.
[7][8] Large language models, currently their most advanced form, are a combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers.
Noam Chomsky did pioneering work on language models in the 1950s by developing a theory of formal grammars, which became fundamental to the field of programming languages.
[9] In the 1980s, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars.
Discrete representations like word n-gram language models, with probabilities for discrete combinations of words, made significant advances.
[10] Typically, the representation of a word is a real-valued vector that encodes its meaning in such a way that words closer together in the vector space are expected to be similar in meaning, and common relationships between pairs of words, such as plurality or gender, are expected to be preserved.
In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
[13] Special tokens are introduced to denote the start and end of a sentence.
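As a minimal sketch of how such boundary tokens fit into a word n-gram model, the following estimates bigram probabilities by simple counting. The token names `<s>` and `</s>`, the helper names, and the three-sentence corpus are assumptions made for illustration; practical models also apply smoothing, which is omitted here.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed names for the start/end-of-sentence tokens


def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        context_counts.update(tokens[:-1])            # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return context_counts, bigram_counts


def prob(word, prev, context_counts, bigram_counts):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


# Toy corpus, invented for this example.
corpus = ["I saw the cat", "I saw the dog", "the cat sat"]
context_counts, bigram_counts = train_bigram(corpus)

p_cat_given_the = prob("cat", "the", context_counts, bigram_counts)
p_i_at_start = prob("I", BOS, context_counts, bigram_counts)
p_end_after_cat = prob(EOS, "cat", context_counts, bigram_counts)
```

Because the boundary tokens are treated as ordinary vocabulary items, the same counting procedure yields the probability that a sentence starts with a given word or ends after one.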
Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions.
In the simplest case, the feature function is just an indicator of the presence of a certain n-gram.
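One common way to write such a model is shown below. The symbols are assumptions of this sketch rather than notation taken from the text: $f$ is the feature function, $a$ is the parameter vector, $V$ is the vocabulary, and $Z$ is the normalizing partition function.

```latex
P(w_m \mid w_1, \ldots, w_{m-1})
  = \frac{1}{Z(w_1, \ldots, w_{m-1})}
    \exp\!\bigl(a^{\mathsf{T}} f(w_1, \ldots, w_m)\bigr),
\qquad
Z(w_1, \ldots, w_{m-1})
  = \sum_{w \in V} \exp\!\bigl(a^{\mathsf{T}} f(w_1, \ldots, w_{m-1}, w)\bigr)
```

With indicator features, each component of $f$ is 1 when its associated n-gram is present and 0 otherwise, so the exponent is the sum of the weights of the n-grams that fire.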
Words represented in an embedding vector were not necessarily consecutive anymore, but could leave gaps that are skipped over.
[14] Formally, a k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.
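The definition above can be sketched in a few lines. This version assumes that "distance at most k" means at most k tokens are skipped between adjacent components; the function name and the sample sentence are illustrative only, and the brute-force enumeration is meant for short inputs.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return all k-skip-n-grams of `tokens` as tuples.

    Assumption: a subsequence qualifies when at most k tokens are skipped
    between each pair of adjacent components.
    """
    result = []
    for indices in combinations(range(len(tokens)), n):
        gaps_ok = all(b - a - 1 <= k for a, b in zip(indices, indices[1:]))
        if gaps_ok:
            result.append(tuple(tokens[i] for i in indices))
    return result


tokens = "the rain in Spain".split()
bigrams = skip_grams(tokens, n=2, k=0)   # ordinary 2-grams
one_skip = skip_grams(tokens, n=2, k=1)  # 2-grams plus pairs with one word skipped
```

With k = 0 the function returns ordinary n-grams, so every bigram also appears among the 1-skip-2-grams.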
For example, the set of 1-skip-2-grams of an input text includes all the bigrams (2-grams) and, in addition, the subsequences formed by skipping one word between the two components.
In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality.
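For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then v(king) - v(male) + v(female) ≈ v(queen), where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.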
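A toy illustration of this kind of vector arithmetic follows. The 2-dimensional vectors are hand-picked for the example, not learned embeddings, and the choice of cosine similarity for the nearest-neighbor search is an assumption of the sketch.

```python
import math

# Hypothetical hand-made embeddings: axis 0 loosely "royalty", axis 1 loosely "gender".
v = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "male": (0.0, 1.0),
    "female": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Word whose vector is nearest to v[a] - v[b] + v[c], excluding a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(v[a], v[b], v[c]))
    candidates = [w for w in v if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(v[w], target))


result = analogy("king", "male", "female")
```

Excluding the query words from the candidates is a common convention in such searches, since the input vectors themselves often lie closest to the computed target.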
Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models).
[17] Such continuous space embeddings help to alleviate the curse of dimensionality, which is the consequence of the number of possible sequences of words increasing exponentially with the size of the vocabulary, further causing a data sparsity problem.
LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
Although they sometimes match human performance, it is not clear whether language models are plausible cognitive models.
At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.
[22] Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks.
[23] Various data sets have been developed for use in evaluating language processing systems.