[2] An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation MergerBetween(company1, company2, date), from an online news sentence such as: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp." A broad goal of IE is to allow computation to be done on the previously unstructured data.
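To make the mapping concrete, here is a minimal sketch of pattern-based relation extraction in Python; the MergerBetween tuple, the single regular expression, and the hard-coded date are illustrative assumptions, not a standard or prescribed IE method.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Illustrative target relation: MergerBetween(company1, company2, date)."""
    company1: str
    company2: str
    date: str


# A deliberately simple hand-written pattern; practical IE systems rely on far
# more robust linguistic or statistical models than one regular expression.
ACQUISITION_PATTERN = re.compile(
    r"(?P<buyer>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*)\s+announced\s+(?:their|its)\s+"
    r"acquisition\s+of\s+(?P<target>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*)"
)


def extract_merger(sentence: str, date: str) -> Optional[MergerBetween]:
    """Map an unstructured news sentence to the structured relation, if possible."""
    match = ACQUISITION_PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(match.group("buyer"), match.group("target"), date)


print(extract_merger(
    "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.",
    date="2010-03-14",  # placeholder publication date
))
# MergerBetween(company1='Foo Inc.', company2='Bar Corp.', date='2010-03-14')
```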
Information extraction is part of a greater puzzle that deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display.
MUC is a competition-based conference[6] that focused on domains such as naval operations messages, terrorism in Latin American countries, joint ventures and microelectronics, news articles on management changes, and satellite launch reports. Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.
Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form, or by marking up with XML tags.
A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.
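As a sketch of what "transformation into relational form" and database population might look like in practice, the snippet below stores extracted merger facts in a SQLite table; the schema, table name, and file name are assumptions made for illustration (marking up the source text with XML tags would be the alternative route mentioned above).

```python
import sqlite3
from typing import Iterable, Tuple

# Each fact is a (company1, company2, date) triple produced by an upstream
# extraction step, e.g. the extract_merger() sketch shown earlier.
MergerFact = Tuple[str, str, str]


def populate_database(db_path: str, facts: Iterable[MergerFact]) -> None:
    """Write extracted relations into a relational table for later querying."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS merger ("
            "company1 TEXT, company2 TEXT, date TEXT)"
        )
        conn.executemany(
            "INSERT INTO merger (company1, company2, date) VALUES (?, ?, ?)",
            facts,
        )


populate_database("mergers.db", [("Foo Inc.", "Bar Corp.", "2010-03-14")])
```

Once the facts are in relational form, ordinary SQL queries can be run over what was previously unstructured text.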
Typical IE tasks and subtasks include named entity recognition, coreference resolution, relationship extraction, event or template filling, and terminology extraction. This list is not exhaustive; the exact scope of IE activities is not commonly agreed upon, and many approaches combine multiple IE subtasks in order to achieve a wider goal.
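For instance, named entity recognition, one of the subtasks just listed, can be prototyped with an off-the-shelf NLP library; the sketch below assumes spaCy and its small English model are installed, and is only one of many possible approaches.

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.")

# Each recognized entity carries its surface text and a coarse type label
# (e.g. ORG for organizations, GPE for geopolitical entities, DATE for dates).
for ent in doc.ents:
    print(ent.text, ent.label_)
```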
The proliferation of the Web, however, intensified the need for IE systems that help people cope with the enormous amount of data available online.
Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains.
As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content.
Wrappers typically handle highly structured collections of web pages, such as product catalogs and telephone directories.
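A minimal wrapper might look like the sketch below; the HTML template and field names are hypothetical, and real wrapper-induction systems typically learn such page-specific rules rather than hard-coding them.

```python
import re
from typing import Dict, List

# Hypothetical, rigidly templated catalog markup of the kind wrappers target:
#   <tr><td class="name">Widget</td><td class="price">$9.99</td></tr>
ROW_RULE = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="price">\$(?P<price>[\d.]+)</td>'
)


def wrap_catalog_page(html: str) -> List[Dict[str, str]]:
    """Apply the page-specific extraction rule to every product row."""
    return [m.groupdict() for m in ROW_RULE.finditer(html)]


page = (
    '<tr><td class="name">Widget</td><td class="price">$9.99</td></tr>'
    '<tr><td class="name">Gadget</td><td class="price">$24.50</td></tr>'
)
print(wrap_catalog_page(page))
# [{'name': 'Widget', 'price': '9.99'}, {'name': 'Gadget', 'price': '24.50'}]
```

Because the rules are tied to one page layout, wrappers achieve high precision on that layout but break when the site's template changes.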