Understanding robots.txt
Author: hwahyeon
robots.txt is a plain text file placed in the root directory of a website that tells search engine crawlers (bots) which pages or directories they may or may not access. The file follows the Robots Exclusion Protocol and serves as a guideline for search engines when crawling a site.
Key Components
User-agent
Specifies a particular search engine crawler.
Example: User-agent: Googlebot
To apply the rule to all crawlers, use User-agent: *.
Disallow
Specifies paths that are restricted from crawling.
Example: Disallow: /admin/ prevents the /admin/ directory from being crawled, and Disallow: / prevents the entire site from being crawled.
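For instance, a file can define separate groups of rules for different crawlers; the paths below are only illustrative:

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /admin/

Each group applies only to the crawlers named in its User-agent line.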
Allow
Explicitly permits crawling of specific paths, often used to specify exceptions within restricted paths.
Example: Allow: /public/ allows crawling of the /public/ directory.
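For example, a site might block a directory while still permitting a single path inside it (the paths here are made up for illustration):

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

Here everything under /private/ is off limits except the one page that is explicitly allowed.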
Sitemap
Specifies the URL of an XML sitemap, helping search engines better understand the structure of the site.
Example: Sitemap: https://example.com/sitemap.xml
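Putting these directives together, a complete robots.txt might look something like this (the paths and sitemap URL are hypothetical):

User-agent: *
Disallow: /admin/
Allow: /admin/help/
Sitemap: https://example.com/sitemap.xml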
Example
User-agent: Google-Extended
Disallow: /
In this case, the Google-Extended crawler cannot crawl any pages of this site.
Note
robots.txt does not enforce any restrictions. It merely provides guidelines for crawlers, and malicious crawlers may ignore it.
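Well-behaved crawlers are expected to check these rules themselves before fetching a page. A minimal sketch of that check, using Python's standard urllib.robotparser module against an assumed example.com rule set:

from urllib import robotparser

# Rules as they might appear in https://example.com/robots.txt (assumed content for illustration).
rules = """
User-agent: *
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())  # parse in-memory rules; use set_url() + read() to fetch a live file

# A compliant crawler asks before fetching; nothing stops a crawler that never asks.
print(parser.can_fetch("MyBot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("MyBot", "https://example.com/about"))           # True

The check is entirely on the crawler's side, which is exactly why robots.txt is a convention rather than an access control mechanism.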