Data scraping

Data interchange formats and protocols are typically rigidly structured, well documented, easily parsed, and designed to keep ambiguity to a minimum.
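
As a minimal illustration of why such formats are easy to consume programmatically, the following Python sketch parses a hypothetical JSON record; the field names and values are invented for the example.

    import json

    # A hypothetical record in a structured interchange format (JSON);
    # the field names and values are invented for this illustration.
    payload = '{"account": "12345", "balance": 1024.50, "currency": "USD"}'

    # The format's grammar makes parsing unambiguous and mechanical.
    record = json.loads(payload)
    print(record["balance"])   # 1024.5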

Aside from incurring higher programming and processing overhead, screen scraping has a further drawback: output displays intended for human consumption often change structure frequently.[1]

Although the use of physical "dumb terminal" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s—the dawn of computerized data processing.

Computer-to-user interfaces from that era were often simply text-based dumb terminals, which were not much more than virtual teleprinters (such systems are still in use today, for various reasons).

In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal.
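
A minimal sketch of this technique appears below, in Python; the screen contents and field coordinates are hypothetical, but the core move of classic screen scraping is the same: read fields from fixed row and column positions of a captured terminal screen.

    # A minimal screen-scraping sketch: fields are read from fixed
    # row/column positions of a captured terminal screen, just as a
    # human operator would read them. The screen text and coordinates
    # below are hypothetical.
    SCREEN = [
        "ACCOUNT INQUIRY                                    PAGE 1 OF 1",
        "",
        "  ACCOUNT NO: 00123456       NAME: DOE, JANE",
        "  BALANCE:    1,024.50       STATUS: ACTIVE",
    ]

    def field(row, col, width):
        """Extract a fixed-position field from the captured screen."""
        line = SCREEN[row] if row < len(SCREEN) else ""
        return line[col:col + width].strip()

    account = field(2, 14, 8)     # "00123456"
    balance = field(3, 14, 12)    # "1,024.50"
    print(account, balance)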

A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of robotic process automation software, called RPA, or RPAAI in the case of self-guided RPA 2.0 based on artificial intelligence.

The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder.

Internally, Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.[4]

In the case of GUI applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying programming objects.[5]
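
The Python sketch below illustrates this object-level approach using the pywinauto library on Windows, one of several libraries that expose such control queries; the window title and automation ID are hypothetical.

    from pywinauto import Application

    # Attach to a running GUI application and obtain references to its
    # controls as objects, rather than reading characters off the screen.
    # The window title and automation ID below are hypothetical.
    app = Application(backend="uia").connect(title="Invoice Entry")
    window = app.window(title="Invoice Entry")

    # Query the edit control's text through its underlying object.
    customer = window.child_window(auto_id="CustomerName",
                                   control_type="Edit").window_text()
    print(customer)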

Web pages are built using text-based mark-up languages (HTML and XHTML) and frequently contain a wealth of useful data in text form.[6]

Companies such as Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end users.[7]

More recently, companies have developed web scraping systems that rely on techniques such as DOM parsing, computer vision, and natural language processing to simulate the human processing that occurs when viewing a webpage, automatically extracting useful information.

This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.
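
A minimal sketch of this approach in Python, using the requests and BeautifulSoup libraries, is shown below; the URL and the page's table structure are hypothetical.

    import requests
    from bs4 import BeautifulSoup

    # Fetch a page and walk its DOM rather than its rendered screen.
    # The URL, CSS selector, and column layout are hypothetical.
    html = requests.get("https://example.com/prices", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Each table row yields one (name, price) record.
    for row in soup.select("table.prices tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            print(cells[0], cells[1])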

[Figure: a screen fragment and a screen-scraping interface (blue box with red arrow) used to customize the data capture process.]