How Important Is Robots.txt For Google?

Written by John Cow on December 22nd, 2007


Just how important is your blog’s robots.txt file if you want the world to find you and your blog in the search engines?

The robots.txt file is a file on your site that tells search engine spiders where they may and may not go. It is not a wall but a permission system, which means you cannot force “bad” bots to obey it. Bad bots are the ones that crawl all over your site without offering you any value in return.

The real strength of the robots.txt file is that the majority of search engines obey it, which helps ensure that your site gets spidered and indexed properly. That means the pages you want found can be found, and the pages you want kept out of the search results stay out, at least as far as well-behaved crawlers are concerned.
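
To make the “permission system” idea concrete, here is a minimal sketch of the check a well-behaved crawler runs before fetching a page, using Python’s standard urllib.robotparser module. The rules, the bot name ExampleBot and the example.com URLs are made up for illustration. A bad bot simply skips this step.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, not taken from any real site.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
"""

def is_allowed(rules_text, user_agent, url):
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(RULES, "ExampleBot", "http://www.example.com/my-first-post/"))        # True
print(is_allowed(RULES, "ExampleBot", "http://www.example.com/wp-admin/options.php"))  # False
```

Nothing in the file enforces the answer. The crawler has to ask the question and then choose to respect it.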

We do not want to go into a long lesson on this, as there are loads of resources on the topic that explain it much better than we can. What we will share with you, however, is that you want to use one, and you want to upload it to the root directory of your server, in the same place as your index page.
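
The root directory matters because crawlers only ask for the file at one fixed address per host. A copy tucked away in a subfolder is never read. As a small sketch (the function name robots_url and the example.com address are ours, purely for illustration), this is how a crawler might work out where to look, starting from any page on a site:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the path is always /robots.txt at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/2007/12/some-post/?page=2"))
# → http://www.example.com/robots.txt
```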

You can see the robots.txt we use at http://www.johncow.com/robots.txt

Out of curiosity we’ve been snooping around a little to see how others do it. Surprisingly enough, we found that there seem to be two approaches that are total opposites. We’re comparing four well-known blogs here.


Shoemoney.com / PR6 / Alexa 2,988 - We know Jeremy is pretty tech savvy, and he is probably the one with the most know-how about how this works. Then again, he might not give a poop about it and just let it be.

Here’s his robots.txt:

    User-agent: Googlebot

    Disallow: /wp-content/
    Disallow: /trackback/
    Disallow: /wp-admin/
    Disallow: /feed/
    Disallow: /archives/
    Disallow: /sitemap.xml
    Disallow: /index.php
    Disallow: /*?
    Disallow: /*.php$
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: */feed/
    Disallow: */trackback/
    Disallow: /page/
    Disallow: /tag/
    Disallow: /category/

    User-agent: Googlebot-Image
    Disallow: /wp-includes/

    User-agent: Mediapartners-Google*
    Disallow:

    User-agent: ia_archiver
    Disallow: /

    User-agent: duggmirror
    Disallow: /

    User-Agent: Googlebot
    Disallow: /link.php
    Disallow: /gallery2
    Disallow: /gallery2/
    Disallow: /category/
    Disallow: /page/
    Disallow: /pages/
    Disallow: /feed/
    Disallow: /feed
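
Several of these lines use * and $ as wildcards. Those were not part of the original robots.txt convention but are an extension that Google and the other big engines understand, and as far as we know Python’s built-in parser only does plain prefix matching. The sketch below is our own rough reading of the Google-style rules (any run of characters for *, end of URL for a trailing $), not Google’s actual code. The helper names are made up.

```python
import re

def rule_to_regex(path_pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    Assumes Google-style wildcards: '*' matches any run of characters
    and a trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = path_pattern.endswith("$")
    if anchored:
        path_pattern = path_pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in path_pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def rule_matches(path_pattern, url_path):
    """Return True if url_path falls under path_pattern."""
    return rule_to_regex(path_pattern).match(url_path) is not None

print(rule_matches("/*.php$", "/index.php"))        # True
print(rule_matches("/*.php$", "/index.php?p=12"))   # False, '$' anchors the end
print(rule_matches("/*?", "/index.php?p=12"))       # True, any URL with a query string
print(rule_matches("/tag/", "/tag/seo/"))           # True, plain prefix match
print(rule_matches("/*.css$", "/2007/12/robots/"))  # False
```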

JohnChow.com / PR4 / Alexa 3,071 - Mr Chow has been around the block and we’re assuming he’s quite tech savvy too. Why else would he run a site called TheTechZone for over 8 years? His robots.txt is quite similar to Jeremy’s:

    sitemap: http://www.johnchow.com/sitemap.xml

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /go/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /author/
    Disallow: /page/
    Disallow: /category/
    Disallow: /wp-images/
    Disallow: /images/
    Disallow: /backup/
    Disallow: /banners/
    Disallow: /archives/
    Disallow: /trackback/
    Disallow: /feed/

    User-agent: Googlebot-Image
    Allow: /wp-content/uploads/

    User-agent: Mediapartners-Google
    Allow: /

    User-agent: duggmirror
    Disallow: /
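
One detail worth knowing when reading a file like this: a crawler that finds a group with its own name follows that group and ignores the * group entirely. So here Googlebot-Image is not bound by the long Disallow list at all. The sketch below shows how Python’s urllib.robotparser reads a shortened copy of the rules above, which to our understanding is also how Google’s crawlers treat it. ExampleBot is a made-up name standing in for any crawler without its own group.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules quoted above.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# ExampleBot falls back to the '*' group; Googlebot-Image uses its own group.
for agent in ("ExampleBot", "Googlebot-Image"):
    print(agent, parser.can_fetch(agent, "/images/logo.jpg"))
# ExampleBot False
# Googlebot-Image True
```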

Problogger.net / PR6 / Alexa 2,600 - The Problogger seems to take a totally different approach to things. Since he is part of B5 Media, an organization that makes money by running blogs, we’re pretty sure that SEO know-how is widely available from a team of professionals. A copy of Darren’s robots.txt:

    User-agent: *
    Disallow:

That’s right. The Problogger doesn’t hold back any secrets from the search engines of this world. Anyone is allowed to crawl through all of Darren’s content.
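
It is easy to misread those two lines, because a single slash flips the meaning: an empty Disallow blocks nothing, while Disallow: / blocks everything. A quick sketch with Python’s urllib.robotparser shows the difference. The helper name can_crawl, the bot name ExampleBot and the sample path are ours.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(rules_text, url_path, user_agent="ExampleBot"):
    """Return True if the rules let user_agent fetch url_path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url_path)

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is off limits
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # single slash: the whole site is off limits

print(can_crawl(ALLOW_ALL, "/archives/2007/12/"))  # True
print(can_crawl(BLOCK_ALL, "/archives/2007/12/"))  # False
```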

MattCutts.com / PR7 / Alexa 5,059 - Matt has been working for Google for nearly eight years now and is currently head of Google’s webspam team. Surely it’s safe to assume that Matt knows what he’s doing. Like Problogger, Matt withholds almost nothing from the crawlers, just a files/ folder:

    User-agent: *
    Disallow: /files/

Even though files/ won’t be crawled, Matt has put an index.html saying ‘Sorry’ in place to keep nosy cows like us out of there. After all, a robots.txt file is available for anyone to see. Put one and one together and you can try to have a peek at the contents of a directory that’s specified in there.
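
Because the file is plain text at a known address, listing the directories a site owner would rather keep quiet takes only a few lines, which is exactly why robots.txt should never be your only protection. A rough sketch (the function name disallowed_paths is ours):

```python
def disallowed_paths(rules_text):
    """Collect every non-empty Disallow path from robots.txt text."""
    paths = []
    for line in rules_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

RULES = "User-agent: *\nDisallow: /files/\n"
print(disallowed_paths(RULES))  # ['/files/']
```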

As you can see, there seem to be two trains of thought on the subject.



26 Comments »

2007-12-22 13:24:19

Excellent Post!

I’ve been wondering about how to set the Robots.txt for my site for some time now and this has given me a clearer picture on what I maybe should or shouldn’t be allowing Google to index.

Thank You and a Merry Christmoos from S.A.

 
Comment by Derrick Tan
2007-12-22 13:42:02

Great post on the robotic thing.

Haha! Was always going to read stuffs on robot.txt but somehow just dropped the idea because of other stuffs.

Will find out more on this.

All the best!

Regards,

Derrick Tan

http://www.learn-internet-marketing-free.com

 
Comment by Ruchir
2007-12-22 14:10:44

I don’t think it makes much of a difference. Shoe and John just disallow all those directories for security purposes…

 
Comment by Chris Jacobson
2007-12-22 14:12:16

Is there any way to hide robots.txt from prying eyes and make it only viewable to search robots?

 
Comment by Clog Money
2007-12-22 14:37:11

No it needs to be publicly accessible, you would be surprised how many robots actually look for this file, not just google but tons of other scrapers. Who knows you may even get the odd back link here and there from it.

 
Comment by Simon
2007-12-22 15:41:42

I take the Shoemoney / Chow approach, as I only allow the indexing of posts themselves, and a couple of pages.

Shoemoney was given a few tips by Aaron Wall on how to get out of supplementals (when they were public), which mostly revolved around the robots.txt - it was to stop duplicate content being indexed, and according to Shoemoney:

“I am happy to report not only am I out of supplemental hell but my Google traffic has increased 1400% in only 1 month after implementing his list of stuff.”

I would expect that ProBlogger has been better organised from the start, and has never needed robots.txt to help with duplicate content. All his archives (categories, dates etc) are just excerpts, which has always been thought to help.

 
Comment by Mike Huang
2007-12-22 17:20:14

Milk Man, even though this is an interesting post…we all know you can do better. Where’s the mojo? :)

-Mike

 
Comment by John Cow
2007-12-22 17:34:06

Mojo, Mike?

 
Comment by Althaf
2007-12-22 17:37:09

Good to see a useful post after a long time.

 
Comment by vhxn.com
2007-12-23 18:12:55

Thanks for the interesting Post :arrow:

 
Comment by Think Like An SOB
2007-12-23 18:42:24

lol. I have been using Shoe’s robots.txt file for my blog since its inception. Figured, he probably knows what he’s doing when it comes to SEO.

 
Comment by Nicholas James
2007-12-24 00:00:42

Excellent post.

I completely forgot about robots.txt and this reminded me for my blog :wink:

 
Comment by Allyn Paul
2007-12-24 19:47:22

This is way over my head! I have a plugin that creates a Google sitemap…does that suffice in terms of this posting and the robot txt?

 
Comment by John Cow
2007-12-24 20:11:05

The sitemap will make it easier for Google to index your site. The robots.txt allows you to tell the search engine crawlers what they can and can’t put on file.

Because of duplicate content that could be picked up by a crawler (a post in your archives has a different URL but still holds the exact same content as its main URL), Google for example might think you’re spamming. (Although we’re not convinced by that. Surely Google is smarter than that.)

 
Comment by ajacx
2007-12-25 12:41:01

can i put this robot.txt in my blog spot?

 
Comment by Allyn Paul
2007-12-25 14:05:07

Cow–thanks for taking some time to reply to my question…I appreciate the solid information!

AL

 
Comment by allen johnson
2008-01-07 16:39:56

Yah, I wouldn’t blame you. Problogger is probably the best choice. The reason he doesn’t hold anything back from Google is that he has so many incoming links that probably lead to his older posts.

 