How Important Is Robots.txt For Google?
Just how important is your blog’s robots.txt file if you want the world to find you and your blog in the search engines?
The robots.txt file is a file on your site that tells search engine spiders where they may and may not go. It is not a wall but a set of requests, which means that you cannot force “bad” bots to listen to it. Bad bots are the bots that go all over your site but do not offer you any value at all.
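The real power of the robots.txt file is that the major search engines honor it, which helps to ensure that your site gets spidered and indexed properly. That means the pages you want to be found can be found, and well-behaved crawlers will stay out of the pages you want them to skip. Keep in mind that a disallowed URL can still show up in search results if other sites link to it, so robots.txt is not a way to keep something truly hidden.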
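To make the permission idea concrete, here is a minimal sketch of how a polite crawler might consult a robots.txt before fetching a page. It uses Python's standard urllib.robotparser module. The rules, the bot name ExampleBot and the example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /feed/
"""

def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)

# A well-behaved bot asks first and skips anything that is disallowed.
print(is_allowed(RULES, "ExampleBot", "http://www.example.com/my-post/"))
print(is_allowed(RULES, "ExampleBot", "http://www.example.com/wp-admin/options.php"))
```

Nothing in this check is enforced by the server. A bad bot can simply skip the call and fetch the page anyway.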
We do not want to go into a long lesson on this, as there are loads of resources available that explain the topic much better than we can. What we will share with you, however, is that you want to use one, and you want to upload it to the root directory on your server, in the same place as your index page.
You can see the robots.txt we use at http://www.johncow.com/robots.txt
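Crawlers only look for the file at the root of the host, which is why the location matters. As a small sketch, this is how a crawler could work out where to look, starting from any page URL. The helper name robots_txt_url and the sample post URL are our own inventions for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.johncow.com/some/post/?page=2"))
# prints http://www.johncow.com/robots.txt
```

A robots.txt uploaded into a subfolder would never be found this way.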
Out of curiosity we’ve been snooping around a little to see how others do it. Surprisingly enough, we found that there seem to be two approaches that are total opposites. We’re comparing four well-known blogs here.

Shoemoney.com / PR6 / Alexa 2,988 - We know Jeremy is pretty tech savvy and he is probably the one with the most know-how about how this works. Then again, he might not give a poop about it and just let it be.
Here’s his robots.txt:
- User-agent: Googlebot
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /archives/
Disallow: /sitemap.xml
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /category/
User-agent: Googlebot-Image
Disallow: /wp-includes/
User-agent: Mediapartners-Google*
Disallow:
User-agent: ia_archiver
Disallow: /
User-agent: duggmirror
Disallow: /
User-Agent: Googlebot
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /category/
Disallow: /page/
Disallow: /pages/
Disallow: /feed/
Disallow: /feed
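Several of the lines above, such as /*? and /*.php$, rely on wildcards. Here is a simplified sketch of how such patterns can be matched, assuming the commonly documented extension where * matches any run of characters and a trailing $ anchors the end of the URL. The function names are ours, and real crawlers may differ in the details.

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    means the URL must end there. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to the URL path."""
    # re.match anchors at the start, so plain rules act as prefixes.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*.php$", "/index.php"))       # True: ends in .php
print(matches("/*.php$", "/index.php?x=1"))   # False: does not end in .php
print(matches("/*?", "/a-post/?replytocom=5"))  # True: contains a query string
print(matches("*/feed/", "/a-post/feed/"))    # True: feed of a single post
```

Read this way, a rule like /*? keeps URLs with query strings out of the crawl, which fits the duplicate content concern.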
JohnChow.com / PR4 / Alexa 3,071 - Mr Chow has been around the block and we’re assuming he’s quite tech savvy too. Why else would he run a site called TheTechZone for over 8 years? His robots.txt is quite similar to Jeremy’s:
- sitemap: http://www.johnchow.com/sitemap.xml
User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Mediapartners-Google
Allow: /
User-agent: duggmirror
Disallow: /
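A file like this has separate groups for separate bots, and a bot follows the group that names it rather than the * group. The sketch below feeds a shortened excerpt of the file above to Python's standard urllib.robotparser to show the effect. SomeOtherBot and the page paths are made up, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the file above; the full file has more rules.
RULES = """\
sitemap: http://www.johnchow.com/sitemap.xml

User-agent: *
Disallow: /wp-admin/
Disallow: /images/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: duggmirror
Disallow: /
"""

def load_rules(rules_text: str) -> RobotFileParser:
    """Parse robots.txt text into a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser

parser = load_rules(RULES)

# An unnamed bot falls back to the * group.
print(parser.can_fetch("SomeOtherBot", "/images/photo.jpg"))     # False
# Googlebot-Image has its own group, so the * rules do not apply to it.
print(parser.can_fetch("Googlebot-Image", "/images/photo.jpg"))  # True
# duggmirror is shut out of everything.
print(parser.can_fetch("duggmirror", "/a-post/"))                # False
# The sitemap line is independent of any user-agent group.
print(parser.site_maps())
```

One thing this shows: the Googlebot-Image group only contains an Allow line, so by this parser's reading the image bot is not held back by /images/ at all.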
Problogger.net / PR6 / Alexa 2,600 - The Problogger seems to take a totally different approach to things. He is part of B5 Media, an organization that makes money by running blogs, so we’re pretty sure that SEO know-how is widely available there in a team of professionals. A copy of Darren’s robots.txt:
- User-agent: *
Disallow:
That’s right. The Problogger doesn’t hold back any secrets from the search engines of this world. Anyone is allowed to crawl through all of Darren’s content.
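The difference between an empty Disallow and a Disallow with a single slash is easy to miss, so here is a quick sketch using Python's standard urllib.robotparser. ExampleBot and the path are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(rules_text: str, path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules let user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)

OPEN_TO_ALL = "User-agent: *\nDisallow:"     # empty value: nothing is blocked
CLOSED_TO_ALL = "User-agent: *\nDisallow: /"  # single slash: everything is blocked

print(can_crawl(OPEN_TO_ALL, "/any/page/"))    # True
print(can_crawl(CLOSED_TO_ALL, "/any/page/"))  # False
```

One character separates a fully open site from a fully closed one, so it pays to double check the file after every edit.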
MattCutts.com / PR7 / Alexa 5,059 - Matt has been working for Google nearly eight years now and is currently head of Google’s webspam team. Surely it’s safe to assume that Matt knows what he’s doing. Like Problogger, Matt withholds almost nothing from the crawlers, just a files/ folder:
- User-agent: *
Disallow: /files/
Even though files/ won’t be crawled, Matt has put an index.html saying ‘Sorry’ in place to keep nosy cows like us out of there. After all, a robots.txt file is available for anyone to see. Put one and one together and you can try to have a peek at the contents of a directory that’s specified in there.
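To show how little effort that snooping takes, here is a small sketch that pulls every Disallow path out of a robots.txt. The function name disallowed_paths is our own, and the parsing is deliberately simple.

```python
from typing import List

def disallowed_paths(rules_text: str) -> List[str]:
    """List every non-empty Disallow value found in robots.txt text."""
    paths = []
    for raw_line in rules_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

print(disallowed_paths("User-agent: *\nDisallow: /files/\n"))
# prints ['/files/']
```

Anything listed in robots.txt is effectively advertised, so a directory that really needs protecting should have real access control on the server as well.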
As you can see, there seem to be two trains of thought on the subject.
26 Comments »
Excellent Post!
I’ve been wondering about how to set the Robots.txt for my site for some time now and this has given me a clearer picture on what I maybe should or shouldn’t be allowing Google to index.
Thank You and a Merry Christmoos from S.A.
Great post on the robotic thing.
Haha! Was always going to read stuffs on robot.txt but somehow just dropped the idea because of other stuffs.
Will find out more on this.
All the best!
Regards,
Derrick Tan
http://www.learn-internet-marketing-free.com
I don’t think it makes much of a difference. Shoe and John just disallow all those directories for security purposes…
Is there any way to hide robots.txt from prying eyes and make it only viewable to search robots?
No, it needs to be publicly accessible. You would be surprised how many robots actually look for this file: not just Google but tons of other scrapers. Who knows, you may even get the odd backlink here and there from it.
I take the Shoemoney / Chow approach, as I only allow the indexing of posts themselves, and a couple of pages.
Shoemoney was given a few tips by Aaron Wall on how to get out of supplementals (when they were public), which mostly revolved around the robots.txt - it was to stop duplicate content being indexed, and according to Shoemoney:
“I am happy to report not only am I out of supplemental hell but my Google traffic has increased 1400% in only 1 month after implementing his list of stuff.”
I would expect that ProBlogger has been better organised from the start, and has never needed robots.txt to help with duplicate content. All his archives (categories, dates etc) are just excerpts, which has always been thought to help.
Milk Man, even though this is an interesting post…we all know you can do better. Where’s the mojo?
-Mike
Mojo, Mike?
Good to see a useful post after a long time.
Thanks for the interesting Post
lol. I have been using Shoe’s robots.txt file for my blog since its inception. Figured, he probably knows what he’s doing when it comes to SEO.
Excellent post.
I completely forgot about robots.txt, and this reminded me to sort it out for my blog.
This is way over my head! I have a plugin that creates a Google sitemap…does that suffice in terms of this posting and the robot txt?
The sitemap will make it easier for Google to index your site. The robots.txt allows you to tell the search engine crawlers what they can and can’t crawl.
Because of duplicate content that could be picked up by a crawler (a post in your archives has a different URL but still holds the exact same content as its main URL), Google, for example, might think you’re spamming. (Although we’re not convinced by that. Surely Google is smarter than that.)
Can I put this robots.txt on my Blogspot blog?
Cow–thanks for taking some time to reply to my question…I appreciate the solid information!
AL
Yah, I wouldn’t blame you. Problogger is probably the best choice. The reason why he doesn’t hold anything back from Google is that he has so many incoming links that probably lead to his older posts.