Because natural language reflects neurological reality, shaped by the abilities provided by the brain's neural networks, computer science has faced a long-term challenge in developing the ability of computers to do natural language processing and machine learning.[1]
Later, Bar-Hillel (1960) argued[2] that WSD could not be solved by "electronic computer" because of the need in general to model all world knowledge.
In the 1990s, the statistical revolution advanced computational linguistics, and WSD became a paradigm problem on which to apply supervised machine learning techniques.
The question of whether these tasks should be kept together or decoupled has still not been unanimously resolved, but recently researchers have tended to evaluate them separately (e.g., in the Senseval/SemEval competitions, parts of speech are provided as input for the text to be disambiguated).
However, while it is relatively easy to assign parts of speech to text, training people to tag senses has proven to be far more difficult.
These approaches are generally not considered to be very successful in practice, mainly because such a body of knowledge does not exist in a computer-readable format, outside very limited domains.
This attempt used as data a punched-card version of Roget's Thesaurus and its numbered "heads" as indicators of topics, and looked for repetitions in text using a set-intersection algorithm.
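For illustration, the core idea can be sketched as follows; the words, head numbers, and mapping below are invented stand-ins for the actual Roget heads.

```python
# A minimal sketch of topic-based disambiguation by set intersection.
# The mapping of words to numbered thesaurus "heads" is hypothetical.
HEADS = {
    "bank":    {802, 341},   # e.g. a "money" topic and a "land/shore" topic
    "deposit": {802},
    "money":   {802},
    "river":   {341},
    "water":   {341},
}

def disambiguate(target, context_words):
    """Pick the head of `target` shared with the most context words."""
    best_head, best_score = None, -1
    for head in HEADS.get(target, set()):
        score = sum(1 for w in context_words if head in HEADS.get(w, set()))
        if score > best_score:
            best_head, best_score = head, score
    return best_head

print(disambiguate("bank", ["deposit", "money"]))  # -> 802 (the "money" topic)
print(disambiguate("bank", ["river", "water"]))    # -> 341 (the "shore" topic)
```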
In recent research, kernel-based methods such as support vector machines have shown superior performance in supervised learning.
Graph-based approaches have also gained much attention from the research community, and currently achieve performance close to the state of the art.[3][22]
Recently, it has been reported that simple graph connectivity measures, such as degree, perform state-of-the-art WSD in the presence of a sufficiently rich lexical knowledge base.[23]
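As a toy illustration of the degree-based idea (a generic sketch, not any particular published system), one can build a graph over the candidate senses of the words in a sentence, linked by relations drawn from a lexical knowledge base, and pick the best-connected sense; the senses and edges below are invented.

```python
# Degree-based sense selection over a toy sense graph.
# In practice the graph is induced from a lexical knowledge base such as
# WordNet, restricted to the senses of the words in the target sentence.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("bank#finance", "deposit#finance"),
    ("bank#finance", "money#n"),
    ("deposit#finance", "money#n"),
    ("bank#river", "water#n"),
])

def pick_sense(candidate_senses, graph):
    """Choose the candidate with the highest degree (number of neighbours)."""
    return max(candidate_senses,
               key=lambda s: graph.degree(s) if s in graph else 0)

print(pick_sense(["bank#finance", "bank#river"], G))  # -> "bank#finance"
```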
Also, automatically transferring knowledge in the form of semantic relations from Wikipedia to WordNet has been shown to boost simple knowledge-based methods, enabling them to rival the best supervised systems and even outperform them in a domain-specific setting.
Supervised methods are based on the assumption that the context can provide enough evidence on its own to disambiguate words (hence, common sense and reasoning are deemed unnecessary).
Support vector machines and memory-based learning have been shown to be the most successful approaches to date, probably because they can cope with the high dimensionality of the feature space.
However, these supervised methods are subject to a new knowledge acquisition bottleneck since they rely on substantial amounts of manually sense-tagged corpora for training, which are laborious and expensive to create.
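A minimal sketch of such a supervised classifier is shown below, using a linear SVM over bag-of-words context features; the handful of sense-tagged examples is invented, whereas real systems are trained on large annotated corpora with richer features (collocations, part-of-speech tags, etc.).

```python
# Supervised WSD sketch: linear SVM over bag-of-words context features.
# The tiny sense-tagged training set is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

contexts = [
    "deposit money in a bank account",
    "the bank approved the loan",
    "fishing from the bank of the river",
    "grass grows on the river bank",
]
senses = ["bank/finance", "bank/finance", "bank/river", "bank/river"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(contexts, senses)

print(clf.predict(["she opened a bank account"]))         # ['bank/finance']
print(clf.predict(["they walked along the river bank"]))  # ['bank/river']
```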
In the bootstrapping approach, an initial classifier trained on a small amount of seed data is then used on the untagged portion of the corpus to extract a larger training set, in which only the most confident classifications are included.
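The loop can be sketched roughly as follows, assuming a probabilistic classifier pipeline (for instance CountVectorizer plus LogisticRegression rather than the SVM above, so that confidence scores are available); the 0.9 threshold and the number of rounds are arbitrary illustrative choices.

```python
# Self-training (bootstrapping) sketch: a classifier trained on a small
# seed set labels untagged contexts, and only its most confident
# predictions are added back to the training set on each round.
import numpy as np

def bootstrap(clf, seed_X, seed_y, untagged, threshold=0.9, rounds=5):
    X, y, pool = list(seed_X), list(seed_y), list(untagged)
    for _ in range(rounds):
        clf.fit(X, y)
        if not pool:
            break
        proba = clf.predict_proba(pool)              # needs a probabilistic model
        best = np.argmax(proba, axis=1)
        confident = np.max(proba, axis=1) >= threshold
        labels = clf.classes_[best]
        # move the confidently labelled contexts from the pool into training
        X += [c for c, keep in zip(pool, confident) if keep]
        y += [l for l, keep in zip(labels, confident) if keep]
        pool = [c for c, keep in zip(pool, confident) if not keep]
    return clf
```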
If a mapping to a set of dictionary senses is not desired, cluster-based evaluations (including measures of entropy and purity) can be performed.[30][31][32]
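A sketch of these two measures, computed from invented cluster assignments against gold senses, could look like this:

```python
# Purity and entropy of induced sense clusters against gold senses.
import math
from collections import Counter

def purity(clusters, gold):
    """Fraction of items whose cluster's majority gold sense matches their own."""
    n, total = len(gold), 0
    for c in set(clusters):
        members = [g for cl, g in zip(clusters, gold) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def entropy(clusters, gold):
    """Size-weighted average entropy of gold senses within each cluster."""
    n, h = len(gold), 0.0
    for c in set(clusters):
        members = [g for cl, g in zip(clusters, gold) if cl == c]
        counts, size = Counter(members).values(), len(members)
        h += (size / n) * -sum((k / size) * math.log2(k / size) for k in counts)
    return h

clusters = [0, 0, 0, 1, 1]                                # induced clusters
gold = ["finance", "finance", "river", "river", "river"]  # gold senses
print(purity(clusters, gold))   # 0.8
print(entropy(clusters, gold))  # ≈ 0.551
```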
Even though most traditional word-embedding techniques conflate words with multiple meanings into a single vector representation, they can still be used to improve WSD.[34][35]
In addition to word-embedding techniques, lexical databases (e.g., WordNet, ConceptNet, BabelNet) can also serve as dictionaries that assist unsupervised systems in mapping words to their senses.
Some techniques that combine lexical databases and word embeddings are presented in AutoExtend[36][37] and Most Suitable Sense Annotation (MSSA).
In its improved version, MSSA can make use of word sense embeddings to repeat its disambiguation process iteratively.
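The general idea behind such embedding-based disambiguation can be sketched generically (this is not the published AutoExtend or MSSA algorithm): represent the context as the centroid of its word vectors, score each candidate sense against a precomputed sense vector, and optionally repeat the pass with the chosen sense vectors standing in for the context. Here `word_vecs`, `sense_vecs`, and `candidates` are assumed to come from pretrained embeddings and a lexical database.

```python
# Generic embedding-based sense selection with an optional second pass.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_sense(context_vecs, sense_vecs):
    """sense_vecs: {sense_id: vector}; return the sense closest to the context centroid."""
    ctx = np.mean(context_vecs, axis=0)
    return max(sense_vecs, key=lambda s: cosine(ctx, sense_vecs[s]))

def disambiguate(words, word_vecs, candidates, sense_vecs, passes=2):
    """candidates: {word: [sense_id, ...]} drawn from a lexical database."""
    context = [word_vecs[w] for w in words if w in word_vecs]
    chosen = {}
    for _ in range(passes):
        chosen = {w: choose_sense(context, {s: sense_vecs[s] for s in candidates[w]})
                  for w in candidates}
        # next pass: represent the context by the previously chosen sense vectors
        context = [sense_vecs[s] for s in chosen.values()]
    return chosen
```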
Other approaches vary in their methods.
The knowledge acquisition bottleneck is perhaps the major impediment to solving the WSD problem.
Unsupervised methods rely on knowledge about word senses, which is only sparsely formulated in dictionaries and lexical databases.
Supervised methods depend crucially on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words, as in the Senseval/SemEval evaluation exercises.
One of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically.[50]
WSD has traditionally been understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR).
The historic lack of training data has prompted the development of new algorithms and techniques, as described in Automatic acquisition of sense-tagged corpora.
These knowledge sources can vary from corpora of texts, either unlabeled or annotated with word senses, to machine-readable dictionaries, thesauri, glossaries, ontologies, etc.
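For example, a machine-readable dictionary such as WordNet can be queried for the candidate senses of a word, here through the NLTK interface (the WordNet data must be downloaded first):

```python
# Listing the candidate senses of "bank" from WordNet via NLTK
# (run nltk.download("wordnet") once beforehand).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())
# e.g. bank.n.01 - sloping land (especially the slope beside a body of water)
#      depository_financial_institution.n.01 - a financial institution that ...
```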
The systems submitted for evaluation to these competitions usually integrate different techniques and often combine supervised and knowledge-based methods (especially to avoid poor performance when training examples are lacking).