Understanding robots.txt

Author: hwahyeon

robots.txt is a text file placed in the root directory of a website that tells search engine crawlers (bots) which pages or directories they may or may not access. The file follows the Robots Exclusion Protocol and serves as a set of guidelines for search engines when they crawl a site.
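Crawlers only look for the file at the root of the host. For a site served at https://example.com (a placeholder domain), that means:

https://example.com/robots.txt

A robots.txt placed in a subdirectory is not consulted.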

Key Components

User-agent

Specifies which search engine crawler the following rules apply to.

Example: User-agent: Googlebot

To apply the rule to all crawlers, use User-agent: *.
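Each User-agent line starts a group of rules, and the Disallow/Allow lines beneath it (described below) apply only to that crawler. A small sketch with placeholder paths:

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/

Here Googlebot is asked to stay out of /drafts/, while every other crawler is asked to stay out of /tmp/.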

Disallow

Specifies paths that are restricted from crawling.

Example: Disallow: /admin/ prevents the /admin/ directory from being crawled, while Disallow: / blocks the entire site.

Allow

Explicitly permits crawling of specific paths, often used to specify exceptions within restricted paths.

Example: Allow: /public/ allows crawling of the /public/ directory.
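Allow is most useful combined with Disallow to carve out an exception inside a blocked directory; the paths here are placeholders:

User-agent: *
Disallow: /admin/
Allow: /admin/help/

With these rules, /admin/ is off limits, but /admin/help/ may still be crawled.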

Sitemap

Specifies the URL of an XML sitemap, helping search engines better understand the structure of the site.

Example: Sitemap: https://example.com/sitemap.xml

Example

User-agent: Google-Extended
Disallow: /

In this case, the Google-Extended user agent is blocked from every page on the site. (Google-Extended is the token Google uses to control whether a site's content may be used for its AI models, separate from regular Search crawling.)

Note

robots.txt does not enforce any restrictions; it only provides guidelines that well-behaved crawlers follow voluntarily. Malicious crawlers may simply ignore it.
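For reference, a well-behaved client can check these rules before fetching a page. Below is a minimal sketch using Python's standard-library urllib.robotparser; the rules, user agent, and URLs are illustrative.

from urllib import robotparser

# Parse a small set of example rules directly, without fetching anything.
rules = [
    "User-agent: *",
    "Allow: /admin/help/",
    "Disallow: /admin/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Ask whether a given crawler may fetch a given URL.
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))       # False
print(rp.can_fetch("Googlebot", "https://example.com/admin/help/"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/about"))        # True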