There is a small, plain-text file sitting at the root of almost every website on the internet that has enormous power over how search engines interact with that site. It is called robots.txt, and despite being one of the oldest standards in web development, it remains one of the most frequently misunderstood and misused SEO tools available.
Used correctly, robots.txt helps you direct search engine crawlers efficiently toward your important content and away from pages that should not be indexed. Used incorrectly, it can accidentally block your entire site from Google.
What Is robots.txt?
robots.txt is a plain text file located at the root of your website — always at yourdomain.com/robots.txt. When a search engine crawler arrives at your site, the very first thing it does is check this file for instructions about which parts of the site it is and is not allowed to crawl.
The file uses a simple syntax built around a few core directives:
User-agent: — specifies which crawler the rule applies to. Use * to apply the rule to all crawlers, or a specific crawler name like Googlebot to target only Google.
Disallow: — specifies which URL paths the crawler should not visit.
Allow: — used to explicitly permit access to a path within a broader disallowed directory.
A basic robots.txt that blocks all crawlers from accessing an admin directory looks like this:
User-agent: *
Disallow: /admin/
Disallow: /login/
Sitemap: https://yourdomain.com/sitemap.xml
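If you want to see how a crawler interprets rules like these, Python's standard library ships a parser that implements the matching logic. A minimal sketch, using the example file above (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, parsed locally rather than fetched.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /login/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) answers: may this crawler visit this URL?
print(parser.can_fetch("*", "https://yourdomain.com/admin/settings"))   # False
print(parser.can_fetch("*", "https://yourdomain.com/blog/some-article")) # True
```

The admin URL is blocked for all crawlers, while an ordinary blog URL remains crawlable because no rule matches it.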
What robots.txt Does NOT Do
This is where most confusion arises. robots.txt controls crawling — it does not control indexing. These are two different things.
If you block a page with robots.txt, Google will not crawl it. But if another website links to that blocked page, Google may still index it — showing it in search results — based on that external link, even though it has never read the page's content. The result is a search result with no title and no description, just a URL.
To prevent a page from appearing in search results at all, you need a noindex meta tag on the page itself — not a robots.txt block. You cannot block and noindex simultaneously, because if the page is blocked, Google cannot read the noindex tag.
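The noindex signal lives on the page itself, as a meta tag in the HTML head:

```html
<!-- In the page's <head>: tells crawlers not to include this page in search results -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the `X-Robots-Tag: noindex` HTTP response header. In both cases the page must remain crawlable so Google can actually read the directive.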
Common SEO Uses of robots.txt
Blocking admin and login pages. Pages like /wp-admin/, /dashboard/, and /login/ have no business appearing in search results and waste crawl budget. Block them in robots.txt.
Blocking duplicate content sections. Many CMS platforms generate URL parameter variants of pages — /products?sort=price, /products?color=blue. If these are creating thousands of near-duplicate URLs, blocking the parameter patterns conserves crawl budget for your canonical pages.
Blocking internal search results pages. Your site's own search results pages — /search?q=keyword — are typically thin content that should not be indexed. Block them.
Blocking staging or development directories. If your site has a /staging/ or /dev/ section that is publicly accessible, block it to prevent Google indexing test content.
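Putting those four cases together, a robots.txt for a typical CMS-backed site might look like the sketch below. The paths are illustrative, so substitute your own; note that Googlebot supports `*` wildcards in paths, and that an Allow line can re-open a specific file inside an otherwise blocked directory:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /login/
Disallow: /*?sort=
Disallow: /*?color=
Disallow: /search
Disallow: /staging/
Disallow: /dev/

Sitemap: https://yourdomain.com/sitemap.xml
```

The `admin-ajax.php` exception is a common WordPress pattern: the admin directory is blocked, but that one endpoint is needed by front-end features.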
Critical Mistakes to Avoid
Never block your CSS and JavaScript files. Google needs to render your pages correctly to understand them. If your stylesheets and scripts are blocked, Google may see a broken version of your page and rank it poorly. Many sites accidentally block /wp-content/, which contains WordPress theme and plugin files.
Do not use robots.txt to hide content you want to rank. If you want a page to rank, it must be crawlable. Any page blocked in robots.txt cannot rank.
Test before deploying. A single typo in robots.txt, such as a missing slash or an incorrect path, can block large sections of your site. Use Google Search Console's robots.txt report (which replaced the older robots.txt Tester tool) to validate your file before it goes live.
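A lightweight pre-deployment check can catch the worst typos automatically. This sketch assumes you maintain a list of URLs that must stay crawlable (the draft rules and URLs here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft of the robots.txt you are about to deploy.
draft = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""

# URLs that must remain crawlable -- adjust to your own site.
critical_urls = [
    "https://yourdomain.com/",
    "https://yourdomain.com/products/widget",
]

parser = RobotFileParser()
parser.parse(draft.splitlines())

# Flag any critical URL the draft would block before it goes live.
blocked = [url for url in critical_urls if not parser.can_fetch("*", url)]
if blocked:
    raise SystemExit(f"robots.txt draft would block critical pages: {blocked}")
print("robots.txt draft leaves all critical URLs crawlable")
```

Running this in a deployment pipeline turns a silent robots.txt mistake into a loud, failed build.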
How robots.txt Interacts With Your Crawl Budget
Blocking low-value pages in robots.txt is one of the most effective ways to improve crawl budget efficiency. Every crawl slot Google spends on your admin pages, duplicate parameter URLs, or internal search results is a slot not spent on your valuable content. Steering Googlebot away from waste through robots.txt means more crawl budget available for the pages that actually need to rank.
Combine your robots.txt review with our broken link scanner — fixing 404 errors reduces wasted crawl spend just as effectively as blocking low-value paths. Use our internal link checker to ensure your important pages are well-connected so Googlebot can find them efficiently even without relying on your sitemap.
How to Check Your Current robots.txt
Simply visit yourdomain.com/robots.txt in your browser; the plain text file will load immediately. Check for any Disallow rules that might be blocking important content, and verify that your sitemap URL is referenced. The Sitemap directive can appear anywhere in the file, though it is conventionally placed at the end.
Summary
robots.txt is a powerful but frequently misused tool. Use it to block crawling of admin pages, duplicate content URLs, and low-value sections. Never use it to try to hide pages from search results — use noindex for that instead. Always test your robots.txt file before deploying changes, and review it whenever you make significant changes to your site structure.
Missed the previous article? Read: How to Build a Sitemap and Submit It to Google