The standard was proposed by Martijn Koster,[1][2] while working for Nexor,[3] in February 1994[4] on the www-talk mailing list, the main communication channel for WWW-related activities at the time.
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.
If the robots.txt file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
A site owner might want to exclude certain files or directories from crawling, for example, out of a preference for privacy from search engine results, the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data.
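For illustration, rules of this kind can be evaluated with Python's standard-library robots.txt parser (urllib.robotparser); the rules, bot name, and URLs in the following sketch are hypothetical:

    from urllib.robotparser import RobotFileParser

    # Hypothetical rules of the kind a site owner might publish at /robots.txt.
    sample_rules = [
        "User-agent: *",
        "Disallow: /private/",
        "Disallow: /tmp/",
    ]

    parser = RobotFileParser()
    parser.parse(sample_rules)

    # Paths outside the disallowed directories may be fetched.
    print(parser.can_fetch("ExampleBot", "https://www.example.com/index.html"))      # True
    print(parser.can_fetch("ExampleBot", "https://www.example.com/private/a.html"))  # False

    # When reading a live site with parser.set_url(...) and parser.read(),
    # an HTTP 404 for /robots.txt is treated as placing no restrictions on
    # crawling, matching the convention described above.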
Some major search engines following this standard include Ask,[11] AOL,[12] Baidu,[13] Bing,[14] DuckDuckGo,[15] Kagi,[16] Google,[17] Yahoo!,[18] and Yandex.
Archive Team co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context."
According to Digital Trends, the Internet Archive's 2017 decision to stop complying with robots.txt directives[22][6] followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.[23]
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI.
In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".[6]
GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful".[6]
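OpenAI's published guidance amounts to a short robots.txt stanza of roughly the following form (a sketch; the current user-agent token should be checked against the operator's documentation):

    User-agent: GPTBot
    Disallow: /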
404 Media reported that companies like Anthropic and Perplexity.ai circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular blocklists.[24]
Despite the use of the terms allow and disallow, the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file.
Malicious robots are unlikely to honor the file and may even use it as a guide to find disallowed links, so relying on robots.txt to hide content amounts to security through obscurity. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components."
Many robots pass a special user-agent to the web server when fetching content,[29] and a web administrator can configure the server to automatically return failure (or serve alternative content) when it detects a connection using one of these user agents.
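The following is a minimal sketch of such a user-agent check using Python's standard-library http.server; the blocked token "BadBot", the port, and the response bodies are placeholders, and production sites would normally implement the same rule in the web server or a reverse proxy:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    BLOCKED_AGENTS = ("BadBot",)  # hypothetical crawler names to refuse

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            agent = self.headers.get("User-Agent", "")
            if any(token in agent for token in BLOCKED_AGENTS):
                # Return failure instead of content for the blocked robot.
                self.send_response(403)
                self.end_headers()
                self.wfile.write(b"Forbidden\n")
                return
            # All other clients receive the normal (placeholder) content.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Hello, human or well-behaved robot.\n")

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), Handler).serve_forever()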
Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.