Hi SEO specialists! Let’s dive in and have a comprehensive chat about the importance and functionality of the robots.txt file. In this detailed guide, we will explore what robots.txt is, why it is crucial for your website and your SEO efforts, and when you should use it to optimize your SEO performance.
We’ll cover the basics, as well as some advanced tips and best practices to ensure that your site is properly indexed and managed by search engine crawlers. So, whether you’re new to SEO or looking to refine your strategies, this guide has got you covered.
What is robots.txt?
A robots.txt file is a small text file placed in the root directory of your website, making it accessible via the URL https://yoursite/robots.txt. In this file, you can provide clear and friendly instructions to well-behaved web robots, crawlers, and spiders like Google’s crawler. These instructions specify the URLs and sections of your site that they shouldn’t crawl, helping you manage how your site is crawled. In other words, the robots.txt file acts as a communication tool between your website and web crawlers, providing guidelines on which parts of your site should not be accessed.
With robots.txt, you can ensure that only the parts of your site you want are accessed and indexed by these automated agents. Therefore, it’s a fantastic way to optimize your site’s SEO performance by controlling which pages search engines can see.
What happens if the crawler can’t find this file? In this case, the crawler will just go ahead and explore all the discovered pages on your site without any restrictions. This means every page it finds might be crawled, indexed, and included in search engine results.
So, if you want to keep certain pages or sections private, it’s super important to add the robots.txt file. By setting up the robots.txt file correctly, you can make sure the crawler only checks out the pages you want it to and skips the ones you’d rather keep hidden.
It’s important to remember that this is just a guideline and not a strict rule, so not all crawlers will follow it. Therefore, if you have private files you don’t want publicly accessible, robots.txt isn’t the solution you’re looking for. Instead, you should look into more secure methods, such as password-protecting the files or using server-side authentication. Furthermore, the robots.txt file is publicly available, so anyone can see which sections of your server you don’t want crawlers to access.
Let’s dive into an example! Here, we have instructions for several web crawlers.
# This is a robots.txt for abc.com
# section 1 - * for catch all
User-agent: *
Sitemap: https://abc.com/sitemap-main.xml
Disallow: /foo.html
Disallow: /bar/foo.html
Disallow: /daz/
# section 2 - Googlebot for google
User-agent: Googlebot
Sitemap: https://abc.com/sitemap-main.xml
Disallow: /foo.html
Disallow: /daz/
Allow: /daz/foo.html
# section 3 - Bingbot for bing
User-agent: Bingbot
Sitemap: https://abc.com/sitemap-main.xml
Disallow: /
# section 4 - Yandex for yandex
User-agent: Yandex
Sitemap: https://abc.com/sitemap-main.xml
Disallow: /foo.html
Clean-param: utm_source
Each section starts with a “User-agent” line followed by the specific crawler’s name. The instructions that follow apply to that crawler until the next “User-agent” line or the end of the file.
In the first section, the user-agent is *. This means the instructions here apply to all robots without a dedicated section. It’s a catch-all category to ensure that any unspecified crawlers follow these rules.
In the second section, the user-agent is Googlebot, which is the Google crawler. Thus, the instructions here are just for Google, letting web admins customize how Google indexes and crawls their site, potentially boosting their performance in Google search results.
Next, in the third section, the user agent is Bingbot, which refers to the Bing crawler. Therefore, the instructions here are specifically for Bing.
In the last section, the user agent is Yandex, which refers to the Yandex crawler. As you might guess, this section contains the instructions for the Yandex search engine.
As we can see, the # character starts a comment that runs to the end of the line, allowing you to add notes for human readers without affecting the instructions.
So, what can we tell a crawler? The main instruction is “Disallow” followed by a path. If the path ends with a slash, it refers to a directory; if not, it refers to a specific file.
As a rule of thumb: a disallow rule for a file tells the crawler not to crawl that file, and a disallow rule for a directory tells the crawler not to crawl anything inside it. Under the hood, crawlers match these paths as prefixes of the URL, which is why a directory rule also covers everything beneath it.
For example, “Disallow: /foo.html” disallows only “/foo.html”, while “Disallow: /daz/” disallows every file under “/daz/”, such as “/daz/foo.html” and any file in the “/daz/bar/” directory.
So, when you want to disallow all files on the site, you can use the “Disallow: /” rule. This ensures no files are crawled.
What if you want to disallow all files in a directory but allow one specific page? You can disallow the directory and then add an allow rule for that file. For example, “Disallow: /daz/” followed by “Allow: /daz/foo.html” will disallow all files in “/daz/” except “/daz/foo.html”. This gives you fine-tuned control over what gets indexed, keeping crucial pages accessible while hiding others.
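Putting that pattern together, the relevant part of the file would look something like this (the paths are the same illustrative ones used in the example above):

User-agent: *
# block everything under /daz/ ...
Disallow: /daz/
# ... except this one page, which stays crawlable
Allow: /daz/foo.html

For Google, the more specific (longer) matching rule generally wins, which is why the Allow line takes precedence over the broader Disallow here.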
Another common task is to provide the sitemap URL with the “Sitemap” instruction followed by the full URL of the sitemap. Note that a full URL is required here, not just a path as in the disallow rule. For example, “Sitemap: https://abc.com/sitemap-main.xml” helps crawlers find your sitemap more easily.
You may also find instructions in robots.txt that are specific to one crawler and have no meaning for others. For example, the Yandex crawler supports the “Clean-param” instruction, which tells Yandex to ignore the named query parameter in URLs. You can use this instruction when a query parameter doesn’t affect the page contents, helping you avoid duplicate content.
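As a quick sketch, a Clean-param rule can also name several parameters and be limited to a path prefix; the parameter names and the /blog/ path below are placeholders for this illustration:

User-agent: Yandex
# ask Yandex to ignore these tracking parameters on blog URLs (hypothetical path)
Clean-param: utm_source&utm_medium /blog/

Other crawlers will simply ignore directives they don’t recognize.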
When should you disallow files from search engine crawlers?
Now that we have an understanding of what robots.txt is, its purpose, and its structure, you might be wondering why you would want to disallow Google or other search engines from crawling some of your pages.
Disallow shallow content
Do you have pages with shallow content, duplicate content, or pages you think are not useful to be crawled and indexed by search engines? You can disallow those pages in robots.txt. This will save your crawl budget and improve the efficiency of your site by ensuring that crawlers focus on your more valuable content, and also decrease the load on your server as those pages will not be fetched by the crawler. For example, you might have temporary pages that you don’t want search engines to crawl. By disallowing them in robots.txt, you ensure that search engines only index your high-quality, relevant content, which can help improve user experience and your site’s search engine ranking.
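For instance, a site might block a few thin or duplicate sections like these; the paths below are placeholders, so adjust them to your own site structure:

User-agent: *
# thin tag and internal search-result pages (hypothetical paths)
Disallow: /tag/
Disallow: /search/
# printer-friendly duplicates of existing articles
Disallow: /print/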
Disallow a new site until it is ready to be crawled
Have you launched a new site? Maybe the site isn’t ready for indexing yet. For instance, it might not have enough content, or you might still be building it. If you let Google’s crawlers crawl it in this state, it can affect the site’s position in search results even after you finish building it.
Therefore, it’s important to wait before letting Google’s crawler index the site by disallowing it in robots.txt. Remember to remove this rule when the site is ready. This ensures that once your site is fully prepared, it can be properly crawled and indexed, leading to better search engine ranking and visibility.
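While the site is still under construction, a minimal file like the following blocks everything for all crawlers:

# temporary rule while the site is being built
User-agent: *
Disallow: /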
Disallow a new section in an existing site until it is ready to be crawled
Are you developing a new section on your existing site? Similar to a new site, this section might not be ready to be indexed by Google. Even though the site is already indexed, it’s better to disallow the developing section as Google may index incomplete pages, and it may take time to crawl them again after you finish.
When you finish developing the section, you can remove the rule and let Google properly crawl and index it. This way, you ensure that only completed and polished sections of your site are accessible to search engines, maintaining a high standard of quality and relevancy for your indexed content.
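For example, if the new area lives under a path such as /new-section/ (a placeholder name for this sketch), you could block only that part of the site:

User-agent: *
# block only the section that is still under development
Disallow: /new-section/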
Disallow admin pages
Do you have an admin section on your site that is only available to authorized users? If so, you should definitely add a rule to disallow these URLs in your robots.txt file.
This is important because, since those pages are valid only for authorized users, the server will typically respond with an unauthorized page or redirect to the login page when an unauthorized user tries to access them. Regardless of what the specific response is, you do not want the web crawler or search engine bot to fetch or index those pages.
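A typical rule for this looks like the following, assuming the admin area lives under paths such as /admin/ and /login/ (adjust these placeholders to match your site):

User-agent: *
# keep crawlers away from the admin and login pages (hypothetical paths)
Disallow: /admin/
Disallow: /login/

Keep in mind, as noted earlier, that robots.txt is publicly readable, so the real protection for these pages should still come from authentication on the server.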