Robots.txt: A Complete Guide for 2026

Robots.txt is a plain text file that lives at the root of your domain and tells search engine crawlers which parts of your site they are allowed to access. It is usually the first file Google, Bing, and other bots request when they arrive on your site, and a single wrong line in it can hide your entire site from search results. Used correctly, robots.txt helps crawlers focus on important content, protects server resources, and keeps low-value URLs out of crawl queues.

What Robots.txt Actually Does

The Robots Exclusion Protocol was created in 1994 as a way for site owners to communicate crawl preferences to automated bots. The protocol is voluntary: well-behaved bots like Googlebot honor it, while badly behaved scrapers ignore it. Robots.txt is therefore a crawler guidance system, not a security mechanism. If you have sensitive data, password-protect it; do not just disallow it in robots.txt.

The file uses simple directives. User-agent identifies which bot a rule applies to. Disallow blocks a path. Allow creates an exception. Sitemap points to your XML sitemap. Together these directives shape how crawlers spend time on your site.

Where Robots.txt Lives

Robots.txt must be at the root of your domain: https://example.com/robots.txt. It cannot live in a subdirectory and cannot be renamed. Subdomains need their own files — blog.example.com/robots.txt is separate from example.com/robots.txt.
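
To make the location rule concrete, here is a minimal Python sketch that derives the robots.txt URL governing any page. The helper name robots_txt_url is hypothetical, and the sketch assumes the input is an absolute URL.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    The file always sits at the root of the same scheme and host,
    so only the path changes; query and fragment are dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=7"))
print(robots_txt_url("https://blog.example.com/archive/2024/"))
```

The two calls return different files because the subdomain is a different host, which is why blog.example.com needs its own robots.txt.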

Basic Robots.txt Syntax

The syntax is intentionally simple. Here is a minimal robots.txt that allows all crawlers to access everything and points to a sitemap:

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Here is one that blocks a specific folder from all bots:

User-agent: *
Disallow: /admin/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml

And here is one that blocks one specific bot while letting others through:

User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
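
If you want to see how a parser reads the last example, the sketch below feeds it to Python's standard urllib.robotparser module. The constant ROBOTS_TXT and the URL being checked are illustrative. This shows how one parser interprets the file, not how any particular search engine does.

```python
from urllib.robotparser import RobotFileParser

# The "block one bot, allow the rest" example from above.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request

# BadBot matches its own group and is blocked everywhere.
print(parser.can_fetch("BadBot", "https://example.com/pricing"))     # False
# Any other bot falls through to the wildcard group.
print(parser.can_fetch("Googlebot", "https://example.com/pricing"))  # True
```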

Directive Reference

User-agent — the bot name the following rules apply to. * means all bots. Googlebot means only Google’s main crawler.
Disallow — path the bot must not crawl. / blocks everything. An empty value allows everything.
Allow — path the bot may crawl, used to create exceptions within a Disallow.
Sitemap — absolute URL of your XML sitemap. Multiple Sitemap lines are allowed.
Crawl-delay — seconds between requests. Google ignores this; Bing honors it, and support among other crawlers varies.
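
The directives above can be read back programmatically. This is a small sketch using Python's standard urllib.robotparser (site_maps() needs Python 3.8 or later). The file contents, including the second sitemap URL and the 5-second delay, are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file exercising each directive from the reference list.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 5
Disallow: /tmp/

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("bingbot"))    # delay set in the bingbot group
print(parser.crawl_delay("Googlebot"))  # None: the wildcard group sets no delay
print(parser.site_maps())               # every Sitemap line, in file order
# An empty Disallow value allows everything for bots in the wildcard group.
print(parser.can_fetch("Googlebot", "https://example.com/tmp/x"))
```

Note that Sitemap lines are not tied to any User-agent group; they apply to the whole file.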

Common Patterns

Most sites use one of a few well-known patterns. Pick whichever matches your situation.

Allow Everything

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

The default for marketing sites and blogs. Let Google crawl freely and point it at the sitemap.

Block Admin and Internal Sections

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

Useful for WordPress, ecommerce, and SaaS sites. Allows the admin AJAX endpoint Google needs for proper rendering while blocking the rest of the admin area.
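
Why does the Allow line win even though it comes after the Disallow? Google documents, and RFC 9309 specifies, that the most specific (longest) matching rule applies, with Allow winning a tie. The sketch below is a simplified illustration of that precedence for plain path prefixes only. The names RULES and is_allowed are hypothetical, and wildcards are left out on purpose.

```python
from typing import List, Tuple

# (directive, path prefix) pairs taken from the example above.
Rule = Tuple[str, str]

RULES: List[Rule] = [
    ("disallow", "/wp-admin/"),
    ("disallow", "/cart/"),
    ("disallow", "/checkout/"),
    ("disallow", "/account/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Longest-match precedence: the longest matching prefix decides,
    and Allow wins when an Allow and a Disallow match with equal length."""
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        # Empty values match nothing here; non-matching prefixes are skipped.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


print(is_allowed("/wp-admin/admin-ajax.php", RULES))  # True
print(is_allowed("/wp-admin/options.php", RULES))     # False
print(is_allowed("/blog/hello-world/", RULES))        # True
```

Not every parser follows this precedence. Python's standard urllib.robotparser, for instance, applies rules in file order, so if you test with it, put Allow exceptions before the broader Disallow.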

Block Search Result Pages

User-agent: *
Disallow: /search?
Disallow: /*?s=
Sitemap: https://example.com/sitemap.xml

Internal site search pages are usually low-value duplicates of category pages. Blocking them keeps crawl budget on the pages that matter.

Block Faceted Navigation

User-agent: *
Disallow: /*?filter=
Disallow: /*?sort=
Disallow: /*&utm_

Ecommerce sites generate millions of URLs from filters. Blocking those URL patterns prevents crawl waste, though canonical tags handle this more elegantly.

Robots.txt vs Noindex vs Canonical

Three tools control what shows up in search results, and they are easy to confuse.

Robots.txt Disallow

Blocks crawling. Google never fetches the URL. Side effect: Google cannot see noindex tags or canonical tags on a blocked URL, so the URL can still appear in search results with a generic snippet if other sites link to it.

Meta Noindex

Lets Google crawl the page but tells it not to include the URL in search results. This is the right tool for thank-you pages, internal tools, and thin pages you want hidden. Crucially, the URL must be crawlable for Google to see the noindex tag.
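
To confirm a page actually carries the tag, you can scan its HTML. The following is a rough sketch using Python's standard html.parser. The class and function names are hypothetical, and it only looks at meta tags named "robots", so it will miss an X-Robots-Tag HTTP header or a bot-specific meta name.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```

Remember that this check only matters if the URL is crawlable; a crawler that is blocked by robots.txt never downloads the HTML in the first place.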

Canonical Tag

Tells Google which URL is the primary version when multiple URLs serve the same or similar content. Both URLs remain crawlable; Google consolidates ranking signals on the canonical version. See our meta tags SEO guide for the full canonical pattern.

Rule of thumb: use noindex for pages you want hidden from results, canonical for duplicates, and robots.txt only when you specifically want to prevent crawling (such as protecting heavy server endpoints).

Common Robots.txt Mistakes That Hurt SEO

Robots.txt mistakes are some of the most catastrophic SEO failures because a single line can hide your entire site for weeks before anyone notices.

Blocking the Whole Site Accidentally

User-agent: *
Disallow: /

This blocks everything. Staging sites often have this in robots.txt; if it ships to production by accident, the site disappears from Google. Always audit robots.txt the moment you launch.
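
One way to catch this before launch is a small automated check in your deploy pipeline. The sketch below uses Python's standard urllib.robotparser; the function name blocks_entire_site is hypothetical, and it only tests whether the homepage is crawlable for the given user agent, which is a proxy for a full-site block rather than a complete audit.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if user_agent is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n"

print(blocks_entire_site(STAGING))     # True: fail the deploy
print(blocks_entire_site(PRODUCTION))  # False: safe to ship
```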

Blocking CSS and JavaScript

Google renders pages like a browser. If you block the CSS or JavaScript files needed to render the page, Google sees a broken layout and may treat the page as low quality. Never disallow /wp-content/, /assets/, or any directory containing render-critical files.

Trying to Hide Sensitive Data

Robots.txt is public — anyone can read it. Listing Disallow: /secret-admin-folder/ tells the world exactly where your admin lives. Hide sensitive paths with authentication, not robots.txt.

Using Robots.txt to Remove URLs from the Index

If a URL is already indexed and you block it in robots.txt, Google cannot crawl it to see your noindex tag, so the URL stays in the index. To remove a URL, leave it crawlable and add noindex, or use Google Search Console’s removal tool.

Forgetting the Sitemap Line

Every robots.txt should reference your XML sitemap. Bing, Yandex, and other engines use this to find the sitemap without you submitting it through their consoles.

Confusing Wildcards

Different bots interpret wildcards differently. Googlebot supports * for any sequence and $ for end-of-URL. Other bots may not. Keep wildcards minimal and test in Google Search Console’s URL Inspection tool.
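
To reason about what a wildcard pattern will catch, it can help to translate it into a regular expression. This is a simplified sketch of the * and $ semantics described above, not Google's actual matcher. The function names are hypothetical, and the /*.pdf$ pattern is an illustrative addition.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters and a trailing '$' pins the end of
    the URL. Everything else is literal, and matching starts at the
    beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start of the string, like a path prefix rule.
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*?filter=", "/shoes?filter=red"))         # True
print(matches("/*&utm_", "/page?a=1&utm_source=x"))       # True
print(matches("/*&utm_", "/page?utm_source=x"))           # False
print(matches("/*.pdf$", "/docs/guide.pdf"))              # True
print(matches("/*.pdf$", "/docs/guide.pdf?download=1"))   # False
```

The third call is worth noticing: /*&utm_ only matches when the UTM parameter follows another parameter. A URL where utm_source is the first parameter starts its query with ? rather than &, so it slips through unless you add a pattern such as /*?utm_.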

How to Test Robots.txt

Never push a new robots.txt to production without testing.

Google Search Console URL Inspection — paste any URL to see whether crawling is allowed or blocked by robots.txt.
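Robots.txt report — Google retired the legacy robots.txt Tester in late 2023. Its replacement, the robots.txt report under Settings in Search Console, lists the robots.txt files Google found for your site, when each was last crawled, and any warnings or errors.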
Manual fetch — open https://example.com/robots.txt in a browser and read it. If it shows what you expect, you are good.
Crawler simulators — tools like Screaming Frog and Sitebulb honor robots.txt and let you simulate a full crawl.
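
Alongside the tools above, you can keep a short list of URLs that must stay crawlable (or blocked) and check every candidate robots.txt against it before deploying. The sketch below assumes Python's standard urllib.robotparser; the function name and the expectation list are hypothetical. That parser does not understand * or $ wildcards and applies rules in file order, so this approach fits plain prefix rules best.

```python
from urllib.robotparser import RobotFileParser


def failed_expectations(robots_txt, expectations, user_agent="Googlebot"):
    """Return (url, expected, actual) for each URL whose crawlability differs."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for url, expected in expectations:
        actual = parser.can_fetch(user_agent, url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures


CANDIDATE = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
"""

# (URL, should it be crawlable?)
EXPECTATIONS = [
    ("https://example.com/", True),
    ("https://example.com/pricing", True),
    ("https://example.com/admin/users", False),
    ("https://example.com/tmp/cache.html", False),
]

print(failed_expectations(CANDIDATE, EXPECTATIONS))  # [] means every check passed
```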

Platform-Specific Robots.txt

Most platforms auto-generate robots.txt with reasonable defaults. You can usually override it.

WordPress

WordPress generates a virtual robots.txt by default. To customize, upload a physical robots.txt file to the web root or use Rank Math or Yoast SEO to edit it through the dashboard. Common WordPress additions: disallow /wp-admin/ (with an Allow for admin-ajax.php) and reference the sitemap.

Framer

Framer auto-generates robots.txt with the sitemap reference included. For most sites this is enough. If you need to disallow specific paths, Framer exposes a robots.txt editor in site settings.

Webflow, Squarespace, Wix, Shopify

All four expose robots.txt editing through SEO settings panels. Defaults are sensible. Avoid editing unless you know exactly what you are doing — a wrong line here can wipe out organic traffic.

Next.js and Custom Sites

Place robots.txt in your project's static assets directory: /public/ for Next.js, Astro, and most React setups, or /static/ for SvelteKit. The file is served as a static asset. Next.js 13+ also supports an app/robots.ts file that generates robots.txt dynamically.

Robots.txt and Crawl Budget

For small sites, crawl budget is not a concern. For large sites — millions of URLs — robots.txt is one of the most powerful crawl budget tools. Block low-value URLs and Google spends its crawl budget on the ones that drive traffic.

The most impactful patterns for crawl budget management: block faceted navigation, internal search pages, calendar URLs, infinite tag pages, and parameter-heavy filter URLs. Pair these blocks with strong internal linking to your money pages.

Monitoring Robots.txt

Audit robots.txt every quarter at minimum, and immediately after any platform change or migration. Check the Google Search Console Settings > Crawl Stats report monthly to spot sudden drops in crawled pages that might indicate a robots.txt regression.
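
A lightweight way to support these audits is to keep the last reviewed copy of the file and diff it against the live one. This is a minimal sketch using Python's standard difflib. The function name and file labels are hypothetical, and fetching the live file is left out so the example runs offline.

```python
import difflib


def robots_diff(previous: str, current: str) -> list:
    """Return unified-diff lines between two robots.txt versions.

    An empty list means nothing changed since the last audit.
    """
    return list(difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="robots.txt (last audit)",
        tofile="robots.txt (live)",
        lineterm="",
    ))


LAST_AUDIT = "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n"
LIVE = "User-agent: *\nDisallow: /\nSitemap: https://example.com/sitemap.xml\n"

for line in robots_diff(LAST_AUDIT, LIVE):
    print(line)
```

In this made-up case the diff surfaces the regression immediately: the Allow: / line was replaced by Disallow: /.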

FAQ

What happens if I do not have a robots.txt file?

Without a robots.txt, all crawlers assume they have full access to your site. For most small sites this is fine. For larger sites, an explicit robots.txt with a sitemap reference is better practice because it points crawlers at your priority URLs and prevents crawl waste on low-value paths.

Can robots.txt hide a page from Google?

Not reliably. Robots.txt blocks crawling, but a blocked URL can still appear in Google’s index if other sites link to it, just with no description. To truly hide a page, use a noindex meta tag and leave the URL crawlable so Google can read the tag.

How do I block a single bot like an AI scraper?

Add a specific block before the wildcard rule. For example: User-agent: GPTBot followed by Disallow: /. Note that only well-behaved bots honor robots.txt — malicious scrapers ignore it, so for real protection use server-side blocks via Cloudflare or your firewall.
