How Important Is Robots.txt For Google?

Written by John Cow on December 22nd, 2007

Just how important is your blog’s robots.txt file if you want the world to find you and your blog in the search engines?

The robots.txt file is a file on your site that tells search engine spiders where they may and may not go. It is not a wall but a permission system, which means that you cannot force “bad” bots to listen to it. Bad bots are the bots that go all over your site but do not offer you any value at all.
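
To make the "permission system" idea concrete, here is a minimal sketch of the check a polite crawler runs before each request, using Python's standard `urllib.robotparser`. The rules, the bot name `ExampleBot` and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, not taken from any real site.
RULES = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def polite_fetch_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to request url."""
    return parser.can_fetch(user_agent, url)


print(polite_fetch_allowed("ExampleBot", "http://example.com/2007/12/some-post/"))    # True
print(polite_fetch_allowed("ExampleBot", "http://example.com/wp-admin/options.php"))  # False
```

Nothing forces a bot to run this check, which is exactly why the file is a request and not a barrier.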

The main reason to use a robots.txt file is that the majority of search engines respect it, and it helps ensure that your site gets spidered and indexed properly. That means the pages you want found can be found, and the pages you want left alone are skipped by well-behaved crawlers.

We do not want to go into a long lesson on this, as there are loads of resources available that explain the topic much better than we can. What we will share with you, however, is that you want to use one, and you want to upload it to the root directory of your server, in the same place as your index page.

You can see the robots.txt we use at http://www.johncow.com/robots.txt
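
Because the file always sits at the root, its address can be worked out from any page on the same site. A small sketch with Python's standard `urllib.parse`; the helper name `robots_url` is ours.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.johncow.com/how-important-is-robotstxt-for-google/"))
# http://www.johncow.com/robots.txt
```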

Out of curiosity we’ve been snooping around a little to see how others do it. Surprisingly enough, we found two approaches that are total opposites. We’re comparing four well-known blogs here.


Shoemoney.com / PR6 / Alexa 2,988 - We know Jeremy is pretty tech savvy, and he is probably the one with the most know-how about how this works. Then again, he might not give a poop about it and just let it be.

Here’s his robots.txt:

    User-agent: Googlebot

    Disallow: /wp-content/
    Disallow: /trackback/
    Disallow: /wp-admin/
    Disallow: /feed/
    Disallow: /archives/
    Disallow: /sitemap.xml
    Disallow: /index.php
    Disallow: /*?
    Disallow: /*.php$
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: */feed/
    Disallow: */trackback/
    Disallow: /page/
    Disallow: /tag/
    Disallow: /category/

    User-agent: Googlebot-Image
    Disallow: /wp-includes/

    User-agent: Mediapartners-Google*
    Disallow:

    User-agent: ia_archiver
    Disallow: /

    User-agent: duggmirror
    Disallow: /

    User-Agent: Googlebot
    Disallow: /link.php
    Disallow: /gallery2
    Disallow: /gallery2/
    Disallow: /category/
    Disallow: /page/
    Disallow: /pages/
    Disallow: /feed/
    Disallow: /feed
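
Jeremy’s file uses `*` and `$` inside paths. Google documents these as pattern extensions (`*` stands for any run of characters, `$` anchors the end of the URL), while Python's standard `urllib.robotparser` only does plain prefix matching, as far as we know. Here is a rough sketch of how such patterns could be matched. The helper name `rule_matches` and the sample paths are ours, and this is not how any particular search engine implements it.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path.

    Assumes Google-style semantics: '*' matches any sequence of characters,
    a trailing '$' anchors the end, and anything else is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None


print(rule_matches("/*.php$", "/index.php"))                # True
print(rule_matches("/*.php$", "/index.php5"))               # False
print(rule_matches("/*?", "/archives/?p=42"))               # True
print(rule_matches("*/feed/", "/2007/12/some-post/feed/"))  # True
print(rule_matches("/tag/", "/2007/tag/"))                  # False
```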

JohnChow.com / PR4 / Alexa 3,071 - Mr Chow has been around the block and we’re assuming he’s quite tech savvy too. Why else would he run a site called TheTechZone for over 8 years? His robots.txt is quite similar to Jeremy’s:

    sitemap: http://www.johnchow.com/sitemap.xml

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /go/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /author/
    Disallow: /page/
    Disallow: /category/
    Disallow: /wp-images/
    Disallow: /images/
    Disallow: /backup/
    Disallow: /banners/
    Disallow: /archives/
    Disallow: /trackback/
    Disallow: /feed/

    User-agent: Googlebot-Image
    Allow: /wp-content/uploads/

    User-agent: Mediapartners-Google
    Allow: /

    User-agent: duggmirror
    Disallow: /
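
One detail worth noticing: under the usual reading of robots.txt, a crawler follows the group that names it and ignores the `*` group. The sketch below feeds a shortened copy of the file above to Python's `urllib.robotparser`. The image and post paths are invented for the test.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the JohnChow.com rules quoted above.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /images/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: duggmirror
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "http://www.johnchow.com"
checks = [
    ("Googlebot", "/images/logo.png"),        # no group of its own, uses *
    ("Googlebot-Image", "/images/logo.png"),  # own group has no Disallow lines
    ("duggmirror", "/some-post/"),            # shut out completely
]
results = {
    (agent, path): parser.can_fetch(agent, site + path) for agent, path in checks
}
for (agent, path), allowed in results.items():
    print(agent, path, allowed)
```

If we read it right, this means Googlebot-Image is not bound by the `/images/` line in the `*` group, since its own group takes over.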

Problogger.net / PR6 / Alexa 2,600 - The Problogger seems to take a totally different approach. Since Darren is part of B5 Media, an organization that makes money by running blogs, we’re pretty sure that SEO know-how is readily available from a team of professionals. A copy of Darren’s robots.txt:

    User-agent: *
    Disallow:

That’s right. The Problogger doesn’t hold back any secrets from the search engines of this world. Anyone is allowed to crawl through all of Darren’s content.
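
An empty `Disallow:` and `Disallow: /` differ by a single character but mean the opposite. A quick sketch with Python's `urllib.robotparser`; the bot name `ExampleBot` and the post path are assumptions for the demo.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse rules (robots.txt text) and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


OPEN = "User-agent: *\nDisallow:"      # Problogger's file: nothing is off limits
CLOSED = "User-agent: *\nDisallow: /"  # one extra slash: everything is off limits

url = "http://www.problogger.net/some-post/"
print(allowed(OPEN, url))    # True
print(allowed(CLOSED, url))  # False
```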

MattCutts.com / PR7 / Alexa 5,059 - Matt has been working for Google for nearly eight years now and is currently head of Google’s webspam team. Surely it’s safe to assume that Matt knows what he’s doing. Like Problogger, Matt withholds almost nothing from the crawlers, just a files/ folder:

    User-agent: *
    Disallow: /files/

Even though files/ won’t be indexed, Matt has put an index.html saying ‘Sorry’ in place to keep nosy cows like us out of there. After all, a robots.txt file is available for anyone to see. Put one and one together and you can try to have a peek at the contents of a directory that’s specified in there.
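
Since the file is public, listing what a site asks crawlers to skip takes only a few lines. A minimal sketch; the helper name `disallowed_paths` is ours, and it simply collects every non-empty `Disallow` value without looking at which user agent the line belongs to.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Collect the non-empty Disallow paths from robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disallowed_paths("User-agent: *\nDisallow: /files/"))  # ['/files/']
```

That is why anything truly private needs real protection on the server, not just a line in robots.txt.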

As you can see, there seem to be two trains of thought on the subject.



26 Responses to “How Important Is Robots.txt For Google?”

  1. links from Technorati: Technorati Search for: seo wrote an interesting post today on How Important Is Robots.txt For Google? Here’s a quick excerpt: “Just how important is your blog’s robots.txt file if you want the world to find you and your blog in the search engines? We used to think there was

  2. Excellent Post!

    I’ve been wondering about how to set the Robots.txt for my site for some time now and this has given me a clearer picture on what I maybe should or shouldn’t be allowing Google to index.

    Thank You and a Merry Christmoos from S.A.

  3. Derrick Tan says:

    Great post on the robotic thing.

    Haha! Was always going to read stuffs on robot.txt but somehow just dropped the idea because of other stuffs.

    Will find out more on this.

    All the best!

    Regards,

    Derrick Tan

    http://www.learn-internet-marketing-free.com

  4. Trackback says:

    links from Technorati: Original post: How Important Is Robots.txt For Google? Online Video Demos Do Work! by at Google Blog Search: trackback Blog tag: Trackback Technorati tag: Trackback

  5. Ruchir says:

    I don’t think it makes much of a difference. Shoe and John just disallow all those directories for security purposes…

  6. Is there any way to hide robots.txt from prying eyes and make it only viewable to search robots?

  7. Clog Money says:

    No it needs to be publicly accessible, you would be surprised how many robots actually look for this file, not just google but tons of other scrapers. Who knows you may even get the odd back link here and there from it.

  8. Simon says:

    I take the Shoemoney / Chow approach, as I only allow the indexing of posts themselves, and a couple of pages.

    Shoemoney was given a few tips by Aaron Wall on how to get out of supplementals (when they were public), which mostly revolved around the robots.txt - it was to stop duplicate content being indexed, and according to Shoemoney:

    “I am happy to report not only am I out of supplemental hell but my Google traffic has increased 1400% in only 1 month after implementing his list of stuff.”

    I would expect that ProBlogger has been better organised from the start, and has never needed robots.txt to help with duplicate content. All his archives (categories, dates etc) are just excerpts, which has always been thought to help.

  9. Mike Huang says:

    Milk Man, even though this is an interesting post…we all know you can do better. Where’s the mojo? :)

    -Mike

  10. Althaf says:

    Good to see a useful post after a long time.

  11. [...] we’ve decided to disable the comments section for a while, or at least till some people can make up their minds. We know, this post is probably better than most rubbish that we’ve written in a long time, [...]

  12. [...] Cow after a long time has published a very good post related to SEO. His blog is in my Top 10 Useless Blog list. Maybe I have to start re-considering it. John, keep [...]

  13. vhxn.com says:

    Thanks for the interesting Post :arrow:

  14. links from Technorati: How Important Is Robots.txt For Google? Posted: 22 Dec 2007 06:46 AM CST

  15. lol. I have been using Shoe’s robots.txt file for my blog since its inception. Figured, he probably knows what he’s doing when it comes to SEO.

  16. Excellent post.

    I completely forgot about robots.txt and this reminded me for my blog :wink:

  17. links from Technorati: to need you guys to point me in the right direction. This is a hint for “give me interesting links in the comments section.” I’ll start by posting a link to a very interesting article by The Cow about robots.txt and different ways of using it. LINK! Post a comment to a useful article and maybe I’ll post it!

  18. Allyn Paul says:

    This is way over my head! I have a plugin that creates a Google sitemap…does that suffice in terms of this posting and the robot txt?

  19. John Cow says:

    The sitemap will make it easier for Google to index your site. The robots.txt allows you to tell the search engine crawlers what they can and can’t put on file.

    Because a crawler could pick up duplicate content (a post in your archives has a different URL but still holds the exact same content as its main URL), Google, for example, might think you’re spamming. (Although we’re not convinced by that. Surely Google is smarter than that.)

  20. [...] Cow writes a detailed analysis of the different methods that some of the top bloggers are using their robots file. Definitely and [...]

  21. ajacx says:

    can i put this robot.txt in my blog spot?

  22. Allyn Paul says:

    Cow–thanks for taking some time to reply to my question…I appreciate the solid information!

    AL

  23. Yeah, I wouldn’t blame you. Problogger is probably the best choice. The reason he doesn’t hold anything back from Google is that he has so many incoming links, which probably lead to his older posts.

  24. Kramer auto Pingback[...] minutes ago Kiwipulse You can look an article that johncow did. http://www.johncow.com/how-important-is-robotstxt-for-google/ Well I allow google bot for image, which I receive lots of traffic from google image. Also I allow [...]