Heritrix

The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries, based on specifications written in early 2003.

[2] Starting in 2008, the Internet Archive began improving Heritrix's performance so that it could do its own wide-scale crawling, and it now collects most of its own content.

More recent versions of Heritrix save by default in the WARC file format, which is similar to ARC but more precisely specified and more flexible.
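
As a rough illustration of what "precisely specified" means in practice, a WARC record consists of a version line, a block of named header fields, a blank line, and a content block whose size is given by the Content-Length field. The following Python sketch reads one such record from an uncompressed stream. It is not part of Heritrix; the function name read_warc_record and the sample data are hypothetical, and the sketch assumes headers are not folded across lines and that any gzip compression has already been removed.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Simplifying assumptions: no folded (continuation) header lines,
    and header names are unique within a record.
    """
    # Skip the blank lines that separate consecutive records.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    # Named fields of the form "Name: value", ended by a blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# Hypothetical sample record, for illustration only.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n"
    b"\r\n"
)

version, headers, payload = read_warc_record(io.BytesIO(SAMPLE))
print(version, headers["WARC-Type"], payload)
```

Calling the function repeatedly on the same stream yields successive records until it returns None.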

Heritrix includes a command-line tool called arcreader, which can be used to extract the contents of an ARC file.
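
To give a sense of what extracting the contents of an ARC file involves, the sketch below walks the records of an already-decompressed ARC stream in Python. It does not show arcreader itself or its command-line options. It assumes the version 1 record layout, in which each record starts with a single header line of five space-separated fields (URL, IP address, archive date, content type, length) followed by that many bytes of content. The function names parse_arc_header and iter_arc_records and the sample data are hypothetical.

```python
import io


def parse_arc_header(line):
    """Parse an ARC version 1 record header line into a dict."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError("expected 5 header fields, got %d" % len(fields))
    url, ip, date, mime, length = fields
    return {"url": url, "ip": ip, "date": date,
            "mime": mime, "length": int(length)}


def iter_arc_records(stream):
    """Yield (header, body) pairs from an uncompressed ARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank line separating two records
        header = parse_arc_header(line.decode("utf-8", "replace"))
        # The body is exactly `length` bytes, as declared in the header.
        body = stream.read(header["length"])
        yield header, body


# Hypothetical sample data, for illustration only.
SAMPLE = b"http://example.com/ 192.0.2.1 20040101000000 text/plain 5\nhello\n"

for header, body in iter_arc_records(io.BytesIO(SAMPLE)):
    print(header["url"], header["mime"], body)
```

ARC files are commonly stored gzip-compressed, so a real reader would decompress the data before applying a loop like this one.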