Robots.txt is a plain text file that lives at the root of your domain and tells search engine crawlers which parts of your site they are allowed to access. It is typically the first file Google, Bing, and other bots request when they arrive on your site, and a single wrong line in it can hide your entire site from search results. Used correctly, robots.txt helps crawlers focus on important content, protects server resources, and keeps low-value URLs out of crawl queues.
What Robots.txt Actually Does
The Robots Exclusion Protocol was created in 1994 as a way for site owners to communicate crawl preferences to automated bots. The protocol is voluntary: well-behaved bots like Googlebot honor it, while badly behaved scrapers ignore it. Robots.txt is therefore a crawler guidance system, not a security mechanism. If you have sensitive data, password-protect it; do not just disallow it in robots.txt.
The file uses simple directives. User-agent identifies which bot a rule applies to. Disallow blocks a path. Allow creates an exception. Sitemap points to your XML sitemap. Together these directives shape how crawlers spend time on your site.
Where Robots.txt Lives
Robots.txt must be at the root of your domain: https://example.com/robots.txt. It cannot live in a subdirectory and cannot be renamed. Subdomains need their own files — blog.example.com/robots.txt is separate from example.com/robots.txt.
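As a quick illustration of the "one file per host" rule, here is a minimal Python sketch that works out which robots.txt governs a given page. The helper name robots_txt_url is our own, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), swap the path for
    # /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?ref=nav"))
print(robots_txt_url("https://blog.example.com/2024/launch/"))
```

The second call resolves to the blog subdomain's own file, which is why rules on example.com do nothing for blog.example.com.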
Basic Robots.txt Syntax
The syntax is intentionally simple. Here is a minimal robots.txt that allows all crawlers to access everything and points to a sitemap:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Here is one that blocks specific folders for all bots:
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
And here is one that blocks one specific bot while letting others through:
User-agent: BadBot
Disallow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
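If you want to sanity-check a file like the last one without deploying it, Python's standard library ships a basic parser. The sketch below feeds it the BadBot example and asks what each bot may fetch. The /pricing URL is made up for the test, and the stdlib parser is a simplified reader, not a replica of Googlebot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

badbot_allowed = parser.can_fetch("BadBot", "https://example.com/pricing")
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/pricing")
sitemaps = parser.site_maps()  # available in Python 3.8+

print("BadBot allowed:", badbot_allowed)
print("Googlebot allowed:", googlebot_allowed)
print("Sitemaps:", sitemaps)
```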
Directive Reference
- User-agent: the bot name the following rules apply to. * means all bots. Googlebot means only Google’s main crawler.
- Disallow: path the bot must not crawl. / blocks everything. An empty value allows everything.
- Allow: path the bot may crawl, used to create exceptions within a Disallow.
- Sitemap: absolute URL of your XML sitemap. Multiple Sitemap lines are allowed.
- Crawl-delay: seconds between requests. Google ignores this; Bing and Yandex honor it.
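Two of these directives are easy to misread: an empty Disallow and Crawl-delay. The following sketch, again using Python's standard-library parser, shows how one reader interprets them. The five-second delay and the /tmp/ path are illustrative values, not recommendations.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 5
Disallow: /tmp/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl-delay is stored per group; bots without their own group fall back
# to the * group, which sets no delay here.
bing_delay = parser.crawl_delay("bingbot")
other_delay = parser.crawl_delay("Googlebot")

# An empty Disallow value means "nothing is disallowed".
bing_can_fetch_tmp = parser.can_fetch("bingbot", "https://example.com/tmp/cache.html")
other_can_fetch_tmp = parser.can_fetch("Googlebot", "https://example.com/tmp/cache.html")

print(bing_delay, other_delay, bing_can_fetch_tmp, other_can_fetch_tmp)
```

Note that a bot with its own group follows only that group, so bingbot here never sees the rules under *.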
Common Patterns
Most sites use one of a few well-known patterns. Pick whichever matches your situation.
Allow Everything
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
The default for marketing sites and blogs. Let Google crawl freely and point it at the sitemap.
Block Admin and Internal Sections
User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Useful for WordPress, ecommerce, and SaaS sites. Allows the admin AJAX endpoint Google needs for proper rendering while blocking the rest of the admin area.
Block Search Result Pages
User-agent: *
Disallow: /search?
Disallow: /*?s=
Sitemap: https://example.com/sitemap.xml
Internal site search pages are usually low-value duplicates of category pages. Blocking them keeps crawl budget on the pages that matter.
Block Faceted Navigation
User-agent: *
Disallow: /*?filter=
Disallow: /*?sort=
Disallow: /*&utm_
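Ecommerce sites generate millions of URLs from filters. Blocking those URL patterns prevents crawl waste. Note that /*?filter= only matches when filter is the first query parameter; add /*&filter= to catch it in later positions. Canonical tags are the alternative when you would rather consolidate ranking signals than stop crawling, but they do not save crawl budget, because Google must still fetch each variant to read the tag.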
Robots.txt vs Noindex vs Canonical
Three tools control what shows up in search results, and they are easy to confuse.
Robots.txt Disallow
Blocks crawling. Google never fetches the URL. Side effect: Google cannot see noindex tags or canonical tags on a blocked URL, so the URL can still appear in search results with a generic snippet if other sites link to it.
Meta Noindex
Lets Google crawl the page but tells it not to include the URL in search results. This is the right tool for thank-you pages, internal tools, and thin pages you want hidden. Crucially, the URL must be crawlable for Google to see the noindex tag.
Canonical Tag
Tells Google which URL is the primary version when multiple URLs serve the same or similar content. Both URLs remain crawlable; Google consolidates ranking signals on the canonical version. See our meta tags SEO guide for the full canonical pattern.
Rule of thumb: use noindex for pages you want hidden from results, canonical for duplicates, and robots.txt only when you specifically want to prevent crawling (such as protecting heavy server endpoints).
Common Robots.txt Mistakes That Hurt SEO
Robots.txt mistakes are some of the most catastrophic SEO failures because a single line can hide your entire site for weeks before anyone notices.
Blocking the Whole Site Accidentally
User-agent: *
Disallow: /
This blocks everything. Staging sites often have this in robots.txt; if it ships to production by accident, the site disappears from Google. Always audit robots.txt the moment you launch.
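One way to make that audit automatic is a small guard in your deploy pipeline. The sketch below is a minimal version, assuming you can read the robots.txt text that is about to ship; homepage_is_blocked is a name we made up, and it only checks the home page, which is enough to catch a stray sitewide Disallow.

```python
from urllib.robotparser import RobotFileParser


def homepage_is_blocked(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if user_agent may not fetch the site root (hypothetical guard)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


STAGING_FILE = "User-agent: *\nDisallow: /\n"
PRODUCTION_FILE = "User-agent: *\nDisallow: /admin/\n"

print(homepage_is_blocked(STAGING_FILE))     # the staging file blocks the root
print(homepage_is_blocked(PRODUCTION_FILE))  # the production file does not
```

Failing the build when this returns True for the production file is a cheap safety net.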
Blocking CSS and JavaScript
Google renders pages like a browser. If you block the CSS or JavaScript files needed to render the page, Google sees a broken layout and may treat the page as low quality. Never disallow /wp-content/, /assets/, or any directory containing render-critical files.
Trying to Hide Sensitive Data
Robots.txt is public — anyone can read it. Listing Disallow: /secret-admin-folder/ tells the world exactly where your admin lives. Hide sensitive paths with authentication, not robots.txt.
Using Robots.txt to Remove URLs from the Index
If a URL is already indexed and you block it in robots.txt, Google cannot crawl it to see your noindex tag, so the URL stays in the index. To remove a URL, leave it crawlable and add noindex, or use Google Search Console’s removal tool.
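The conflict described here is easy to detect mechanically. The sketch below assumes you already have a list of URLs that carry a noindex tag (the two URLs are invented) and flags the ones a crawler could never reach to read it. It uses Python's standard-library parser, which does plain prefix matching and does not understand * or $ wildcards.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_pages(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that robots.txt prevents user_agent from crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


ROBOTS_TXT = """\
User-agent: *
Disallow: /thank-you/
"""

# Hypothetical pages that carry <meta name="robots" content="noindex">.
NOINDEX_URLS = [
    "https://example.com/thank-you/",
    "https://example.com/internal/report",
]

conflicts = hidden_noindex_pages(ROBOTS_TXT, NOINDEX_URLS)
print(conflicts)
```

Every URL in the result needs one of two fixes: remove the Disallow so the noindex can be read, or accept that the URL may linger in the index.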
Forgetting the Sitemap Line
Every robots.txt should reference your XML sitemap. Bing, Yandex, and other engines use this to find the sitemap without you submitting it through their consoles.
Confusing Wildcards
Different bots interpret wildcards differently. Googlebot supports * for any sequence and $ for end-of-URL. Other bots may not. Keep wildcards minimal and test in Google Search Console’s URL Inspection tool.
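To make the Googlebot behavior concrete, here is a rough Python sketch of wildcard matching with the "longest matching rule wins, Allow wins ties" precedence that Google documents. It is our own simplified model, not Google's code; the function names and the /*.pdf$ rule are illustrative assumptions.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # * matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs for one user-agent group."""
    best_length, best_allow = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            allow = directive == "allow"
            longer = len(pattern) > best_length
            tie_goes_to_allow = len(pattern) == best_length and allow
            if longer or tie_goes_to_allow:
                best_length, best_allow = len(pattern), allow
    return best_allow


RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/*?filter="),
    ("disallow", "/*.pdf$"),  # hypothetical rule to show the $ anchor
]

for path in [
    "/wp-admin/admin-ajax.php",
    "/wp-admin/options.php",
    "/shoes?filter=red",
    "/shoes?color=red&filter=suede",
    "/guide.pdf",
    "/guide.pdf?download=1",
]:
    print(path, "->", "allowed" if is_allowed(RULES, path) else "blocked")
```

Two results are worth noticing. The /*?filter= rule does not catch filter when it is not the first query parameter, and the $ anchor stops matching as soon as a query string is appended. A parser without wildcard support, including Python's built-in urllib.robotparser, would read these rules differently again.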
How to Test Robots.txt
Never push a new robots.txt to production without testing.
- Google Search Console URL Inspection: paste any URL and check whether it is allowed or blocked. The result shows which rule blocked it.
- Robots.txt report: Google retired its legacy robots.txt Tester in 2023. Its replacement, the robots.txt report under Settings in Search Console, shows which robots.txt files Google found, when each was last fetched, and any warnings or errors.
- Manual fetch: open https://example.com/robots.txt in a browser and read it. If it shows what you expect, you are good.
- Crawler simulators: tools like Screaming Frog and Sitebulb honor robots.txt and let you simulate a full crawl.
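Alongside those tools, you can keep a small table of URLs whose crawl status must never change and check every candidate file against it before release. This is a minimal sketch using Python's standard-library parser; the URLs and the failed_expectations helper are invented for illustration, and because that parser ignores wildcards the check suits plain path rules best.

```python
from urllib.robotparser import RobotFileParser

CANDIDATE = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/

Sitemap: https://example.com/sitemap.xml
"""

# URL -> whether crawlers should be allowed to fetch it.
EXPECTATIONS = {
    "https://example.com/": True,
    "https://example.com/products/blue-widget": True,
    "https://example.com/cart/": False,
    "https://example.com/checkout/step-1": False,
}


def failed_expectations(robots_txt, expectations, user_agent="Googlebot"):
    """Return the subset of expectations that robots_txt does not satisfy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        url: expected
        for url, expected in expectations.items()
        if parser.can_fetch(user_agent, url) != expected
    }


failures = failed_expectations(CANDIDATE, EXPECTATIONS)
print("OK" if not failures else f"Unexpected results: {failures}")
```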
Platform-Specific Robots.txt
Most platforms auto-generate robots.txt with reasonable defaults. You can usually override it.
WordPress
WordPress generates a virtual robots.txt by default. To customize, upload a physical robots.txt file to the web root or use Rank Math or Yoast SEO to edit it through the dashboard. Common WordPress additions: disallow /wp-admin/ (with an Allow for admin-ajax.php) and reference the sitemap.
Framer
Framer auto-generates robots.txt with the sitemap reference included. For most sites this is enough. If you need to disallow specific paths, Framer exposes a robots.txt editor in site settings.
Webflow, Squarespace, Wix, Shopify
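Webflow and Wix expose robots.txt editing in their SEO settings. Shopify generates the file and lets you customize it through a robots.txt.liquid theme template. Squarespace generates the file automatically and, as far as we know at the time of writing, does not offer direct editing. Defaults are sensible. Avoid editing unless you know exactly what you are doing, because a wrong line here can wipe out organic traffic.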
Next.js and Custom Sites
Place robots.txt in the /public/ directory of your Next.js, React, or Astro project; SvelteKit uses /static/ instead. The file is served as a static asset. Next.js 13+ (App Router) also supports app/robots.ts, which generates the file from code.
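For a custom backend that is not built on one of those frameworks, the same "generate it from code" idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the render_robots_txt name, the is_production flag, and the choice to block all crawling outside production.

```python
def render_robots_txt(disallow_paths, sitemap_url, is_production):
    """Build robots.txt text for one environment (hypothetical helper)."""
    lines = ["User-agent: *"]
    if not is_production:
        # Keep staging and preview environments out of crawlers' reach.
        lines.append("Disallow: /")
    else:
        if disallow_paths:
            lines.extend(f"Disallow: {path}" for path in disallow_paths)
        else:
            lines.append("Disallow:")  # empty value: nothing is blocked
        lines.append("")
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(render_robots_txt(["/admin/", "/tmp/"], "https://example.com/sitemap.xml", True))
print(render_robots_txt(["/admin/", "/tmp/"], "https://example.com/sitemap.xml", False))
```

Tying the output to the environment means the blocking file cannot ship to production by being copied along with the rest of a staging build.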
Robots.txt and Crawl Budget
For small sites, crawl budget is not a concern. For large sites — millions of URLs — robots.txt is one of the most powerful crawl budget tools. Block low-value URLs and Google spends its crawl budget on the ones that drive traffic.
The most impactful patterns for crawl budget management: block faceted navigation, internal search pages, calendar URLs, infinite tag pages, and parameter-heavy filter URLs. Pair these blocks with strong internal linking to your money pages.
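Before writing any blocks, it helps to measure how much of your crawl actually goes to these URLs. The sketch below assumes you have exported a sample of crawled URLs, for example from server logs, and estimates the share that carries low-value parameters. The parameter names mirror the patterns earlier in this guide, and the sample URLs are invented.

```python
from urllib.parse import parse_qs, urlsplit

# Illustrative parameter names; adjust to match your own site.
LOW_VALUE_PARAMS = {"filter", "sort", "s"}


def is_low_value(url: str) -> bool:
    """Return True if the URL carries a facet, search, or tracking parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(
        name in LOW_VALUE_PARAMS or name.startswith("utm_") for name in query
    )


def low_value_share(urls) -> float:
    """Return the fraction of URLs that look like crawl waste."""
    urls = list(urls)
    if not urls:
        return 0.0
    return sum(is_low_value(url) for url in urls) / len(urls)


SAMPLE = [
    "https://example.com/shoes",
    "https://example.com/shoes?filter=red",
    "https://example.com/shoes?sort=price&filter=red",
    "https://example.com/shoes?utm_source=newsletter",
]

share = low_value_share(SAMPLE)
print(f"{share:.0%} of sampled URLs are low-value")
```

A high share suggests the blocks are worth adding; a low one suggests crawl budget is not your bottleneck.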
Monitoring Robots.txt
Audit robots.txt every quarter at minimum, and immediately after any platform change or migration. Check the Google Search Console Settings > Crawl Stats report monthly to spot sudden drops in crawled pages that might indicate a robots.txt regression.
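Between audits, a scheduled job can catch regressions by comparing the live file with the last known-good copy. This is a minimal sketch of the comparison step only; fetching and storing the file is left out, and the function names are our own.

```python
import difflib
import hashlib


def fingerprint(robots_txt: str) -> str:
    """Hash the file after trimming whitespace that does not change meaning."""
    normalized = "\n".join(line.rstrip() for line in robots_txt.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def describe_change(previous: str, current: str) -> list:
    """Return the added and removed lines, or an empty list if nothing changed."""
    if fingerprint(previous) == fingerprint(current):
        return []
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    # Keep only real additions and removals, not the diff's file headers.
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


KNOWN_GOOD = "User-agent: *\nDisallow: /admin/\n"
LIVE = "User-agent: *\nDisallow: /\n"

for line in describe_change(KNOWN_GOOD, LIVE):
    print(line)
```

Any non-empty result is worth an alert, since a changed robots.txt is rare and the cost of missing one is high.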
FAQ
What happens if I do not have a robots.txt file?
Without a robots.txt, all crawlers assume they have full access to your site. For most small sites this is fine. For larger sites, an explicit robots.txt with a sitemap reference is better practice because it points crawlers at your priority URLs and prevents crawl waste on low-value paths.
Can robots.txt hide a page from Google?
Not reliably. Robots.txt blocks crawling, but a blocked URL can still appear in Google’s index if other sites link to it, just with no description. To truly hide a page, use a noindex meta tag and leave the URL crawlable so Google can read the tag.
How do I block a single bot like an AI scraper?
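Add a group for that bot. For example: User-agent: GPTBot followed by Disallow: /. Compliant crawlers follow the most specific User-agent group that matches them, so the group's position in the file does not matter, though listing named bots before the wildcard group keeps the file readable. Note that only well-behaved bots honor robots.txt. Malicious scrapers ignore it, so for real protection use server-side blocks via Cloudflare or your firewall.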
