Focused crawler

A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train[11] classifiers that guide the crawl.
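As a rough illustration of how an incrementally trained classifier can guide a crawl, the sketch below scores outgoing links from their anchor text and DOM position and keeps the frontier ordered by that score. It is a simplification under stated assumptions, not the method of the cited work: it uses a plain online logistic regression update in place of a full reinforcement-learning formulation, and the names `OnlineLinkScorer`, `link_features`, and `Frontier`, the feature scheme, and the example URLs are all hypothetical.

```python
import heapq
import math


class OnlineLinkScorer:
    """Online logistic regression over sparse string features (illustrative)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}

    def score(self, features):
        # Predicted probability that following the link leads to a relevant page.
        z = sum(self.weights.get(f, 0.0) for f in features)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, relevant):
        # One stochastic gradient step on the log loss, applied after the
        # linked page has been fetched and judged relevant or not.
        error = (1.0 if relevant else 0.0) - self.score(features)
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error


def link_features(anchor_text, dom_path):
    """Hypothetical feature extractor: anchor-text tokens plus the DOM path."""
    tokens = ["tok=" + t for t in anchor_text.lower().split()]
    return tokens + ["dom=" + dom_path]


class Frontier:
    """Priority queue of URLs ordered by predicted relevance, highest first."""

    def __init__(self):
        self._heap = []

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]


def demo():
    scorer = OnlineLinkScorer()
    # Made-up feedback from pages that were already fetched.
    history = [
        ("deep learning tutorial", "div.content/a", True),
        ("privacy policy", "footer/a", False),
        ("learning rate schedules", "div.content/a", True),
        ("terms of service", "footer/a", False),
    ]
    for anchor, path, relevant in history:
        scorer.update(link_features(anchor, path), relevant)

    frontier = Frontier()
    candidates = {
        "http://example.org/a": ("machine learning notes", "div.content/a"),
        "http://example.org/b": ("cookie policy", "footer/a"),
    }
    for url, (anchor, path) in candidates.items():
        frontier.push(url, scorer.score(link_features(anchor, path)))
    return frontier.pop()
```

Calling `demo()` returns the candidate URL that the scorer currently ranks highest. In a real crawler the update step would run continually as each fetched page is judged, so that the ordering of the frontier tracks what the crawl has learned so far.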

Dong et al.[15] introduced such an ontology-learning-based crawler, which uses a support vector machine to update the content of ontological concepts while crawling web pages.

Cho et al.[16] studied a variety of crawl prioritization policies and their effects on the link popularity of fetched pages.

Refinements involving detection of stale (poorly maintained) pages have been reported by Eiron et al.[18] Meusel et al.[19] introduced a semantic focused crawler based on the idea of reinforcement learning; it combines online classification algorithms with a bandit-based selection strategy to efficiently crawl pages that embed markup languages such as RDFa, Microformats, and Microdata.
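A bandit-based selection strategy can be sketched as follows. This is a minimal example under stated assumptions rather than the algorithm of the cited work: it assumes each arm is a host, that the reward is 1 when a fetched page contains structured markup and 0 otherwise, and it uses an epsilon-greedy rule chosen only for simplicity. The class `EpsilonGreedyBandit`, the function `simulate`, the host names, and the markup rates are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit with incremental mean rewards."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.mean_reward = {arm: 0.0 for arm in arms}

    def select(self):
        # Try every arm once before exploiting the estimates.
        untried = [arm for arm, n in self.pulls.items() if n == 0]
        if untried:
            return untried[0]
        # Explore a random arm with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        # Otherwise exploit the arm with the best observed mean reward.
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen arm.
        self.pulls[arm] += 1
        n = self.pulls[arm]
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / n


def simulate(steps=500, seed=0):
    # Made-up probabilities that a page from each host carries markup.
    markup_rate = {
        "shop.example": 0.8,
        "blog.example": 0.3,
        "forum.example": 0.05,
    }
    bandit = EpsilonGreedyBandit(markup_rate, epsilon=0.1, seed=seed)
    environment = random.Random(seed + 1)
    for _ in range(steps):
        host = bandit.select()
        # Simulated fetch: reward 1.0 if the page contained markup.
        found_markup = environment.random() < markup_rate[host]
        bandit.update(host, 1.0 if found_markup else 0.0)
    return bandit
```

After a run, `simulate().pulls` shows how many fetches each host received, and `mean_reward` holds the estimated fraction of its pages that carried markup. In a crawler, the simulated fetch would be replaced by an actual download followed by a check for RDFa, Microformats, or Microdata in the page.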

These high-quality seeds should be selected from a list of candidate URLs accumulated over a sufficiently long period of general web crawling.