Web scraping

An example of web scraping is finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping).

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form.

There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages.
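One common way a site declares which paths bots may not crawl is its robots.txt file. As a minimal sketch (the domain and user-agent string below are placeholders), Python's standard urllib.robotparser module can check these rules before a scraper fetches a page:

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt rules (the domain is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a given user agent is allowed to crawl a given path.
allowed = rp.can_fetch("ExampleBot/1.0", "https://example.com/some/page.html")
print("crawl permitted" if allowed else "crawl disallowed by robots.txt")
```

Honoring robots.txt is voluntary, which is why sites that want to enforce such restrictions also resort to bot detection and blocking.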

In the early days of the web, when fewer websites were available, search engines relied on human administrators to collect and format links.

Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste; indeed, manual copying may be the only workable solution when the target websites explicitly set up barriers to prevent machine automation.

A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python).
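As a minimal sketch of this approach in Python (the HTML fragment is invented, and the two patterns, matching e-mail addresses and US-style phone numbers, are illustrative assumptions):

```python
import re

# A fragment of fetched HTML; in practice this would come from an HTTP
# request, but a literal snippet keeps the sketch self-contained.
html = """
<p>Sales: sales@example.com, (555) 123-4567</p>
<p>Support: support@example.com, (555) 987-6543</p>
"""

# Regular expressions match recognizable textual patterns directly in the
# markup, with no HTML parsing step at all.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
phones = re.findall(r"\(\d{3}\)\s*\d{3}-\d{4}", html)  # US-style numbers only

print(emails)  # ['sales@example.com', 'support@example.com']
print(phones)  # ['(555) 123-4567', '(555) 987-6543']
```

Such pattern matching works well for simple, regular data, but it is brittle when the markup surrounding the data varies from page to page.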

Many websites generate large numbers of pages from a common template, filled in with records from a database. In data mining, a program that detects such templates in a particular information source, extracts their content, and translates it into a relational form is called a wrapper.
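A minimal sketch of the idea in Python, assuming a listing page whose entries all follow one invented template: the pattern's capture groups mark the slots holding the data, and each match of the template becomes one relational row.

```python
import re
from typing import NamedTuple

# A hypothetical page whose entries are all stamped from the same template.
html = """
<div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">24.50</span></div>
"""

class Product(NamedTuple):
    name: str
    price: float

# The "wrapper": a description of the page template, with capture groups
# marking the slots where the data of interest appears.
TEMPLATE = re.compile(
    r'<span class="name">(.*?)</span><span class="price">(.*?)</span>'
)

# Translate each occurrence of the template into a relational row.
rows = [Product(name, float(price)) for name, price in TEMPLATE.findall(html)]
print(rows)  # [Product(name='Widget', price=9.99), Product(name='Gadget', price=24.5)]
```

Real wrapper-induction systems learn the template from example pages rather than having it written by hand, but the output is the same: tabular records extracted from templated markup.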

On vertical aggregation platforms, the preparation involves establishing the knowledge base for the entire vertical; the platform then creates the bots automatically.

The platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and by its scalability (how quickly it can scale up to hundreds or thousands of sites).

This scalability is mostly used to target the Long Tail of sites that common aggregators find too complicated or labor-intensive to harvest content from.

The pages being scraped may include metadata or semantic markup and annotations, which can be used to locate specific data snippets.

This method enables more intelligent and flexible data extraction, accommodating complex and dynamic web content.[6]
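For instance, many pages embed schema.org records as JSON-LD blocks, which a scraper can read directly instead of inferring structure from the page layout. A minimal sketch in Python (the embedded product record is invented for illustration):

```python
import json
import re

# A page fragment carrying semantic annotations as a JSON-LD block.
html = """
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}}
</script>
"""

# Pull out each JSON-LD block and parse it as structured data.
blocks = re.findall(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
)
for block in blocks:
    record = json.loads(block)
    print(record["@type"], record["name"], record["offers"]["price"])
```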

In the United States, website owners can use three major legal claims to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the Computer Fraud and Abuse Act ("CFAA"), and (3) trespass to chattel.[10]

One of the first major tests of screen scraping involved American Airlines (AA) and a firm called FareChase.

The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data.[12]

Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim.

Southwest also claimed that screen scraping constitutes what is legally known as "Misappropriation and Unjust Enrichment", as well as a breach of the website's user agreement.

In Craigslist v. 3Taps, the court held that the cease-and-desist letter and IP blocking were sufficient for Craigslist to properly claim that 3Taps had violated the Computer Fraud and Abuse Act (CFAA).[22]

The Internet Archive collects and distributes a significant number of publicly available web pages without being considered to be in violation of copyright laws.

In contrast to the findings of the United States District Court for the Eastern District of Virginia and those of the Danish Maritime and Commercial Court, Justice Michael Hanna of the Irish High Court ruled that the hyperlink to Ryanair's terms and conditions was plainly visible, and that placing the onus on the user to agree to terms and conditions in order to gain access to online services is sufficient to comprise a contractual relationship.[28][29]

Leaving aside a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping.