Redesigning my robots.txt file. Have you done yours?
Some months ago, I created my first robots.txt, and I even asked someone to help me make it. A few months on, I’m a bit more confident with WordPress, so I revisited my robots.txt and did some more research to optimize my search engine traffic even further.
I found out that I still had a lot of duplicate content that needed to be filtered out to get good SEO (Search Engine Optimization)!
If you don’t know what robots.txt is, it’s the file that search engine bots/crawlers look for first when they visit your site/blog. The file tells them what they may crawl and what they may not.
The robots.txt file has to be placed in the root of your site, even if WordPress is installed in a sub-folder! So although my blog’s URL is http://www.michaelaulia.com/blogs/ , I still have to put robots.txt under http://www.michaelaulia.com/ (i.e. your public_html/ folder).
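As a rough illustration of that rule, here is a small Python sketch that works out where a crawler would look for robots.txt given any page URL. The helper name robots_url is made up for this example; it only relies on the standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, drop the path/query/fragment: robots.txt
    # always sits directly under the root of the host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.michaelaulia.com/blogs/"))
```

For the blog URL above this prints http://www.michaelaulia.com/robots.txt, i.e. the root of the site and not the /blogs/ sub-folder.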
Here is my new robots.txt file: (Feel free to comment about it)
# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.michaelaulia.com/blogs/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN

User-agent: Googlebot-Image
Disallow:

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:

# Internet Archiver Wayback Machine
User-agent: ia_archiver
Disallow: /

# digg mirror
User-agent: duggmirror
Disallow: /

User-agent: *
Disallow: /blogs/cgi-bin/
Disallow: /blogs/wp-admin/
Disallow: /blogs/wp-includes/
Disallow: /blogs/wp-content/plugins/
Disallow: /blogs/wp-content/cache/
Disallow: /blogs/wp-content/themes/
Disallow: /blogs/author/
Disallow: /blogs/archives/
Disallow: /blogs/trackback/
Disallow: /blogs/feed/
Disallow: /blogs/tag/
Disallow: /blogs/search-result/
Disallow: /blogs/smilies/
Disallow: /blogs/wp-au-backup/
Disallow: /blogs/category/
Disallow: /blogs/page/
Disallow: /blogs/2007/
Disallow: /blogs/2008/
———-
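To get a feel for how the per-bot groups in a file like this interact, here is a minimal sketch using Python's standard urllib.robotparser. The rules are a shortened copy of the file above, and Python's parser is only an approximation of what a real crawler does, so treat the results as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt above; only a few Disallow lines are kept.
RULES = """\
User-agent: Googlebot-Image
Disallow:

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /blogs/wp-admin/
Disallow: /blogs/feed/
"""

BASE = "http://www.michaelaulia.com"

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on this site."""
    return parser.can_fetch(agent, BASE + path)


for agent, path in [
    ("ia_archiver", "/blogs/"),
    ("Googlebot-Image", "/blogs/feed/"),
    ("Googlebot", "/blogs/feed/"),
    ("Googlebot", "/blogs/"),
]:
    print(agent, path, allowed(agent, path))
```

One thing this makes visible: a bot with its own group (Googlebot-Image here, with an empty Disallow) follows that group only, so the Disallow lines under User-agent: * do not apply to it.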

NOTE: If you want to copy my robots.txt to your WordPress blog, feel free to do so, BUT! This only works if your permalink structure is similar to mine (www……./%posttitle%…….). If your permalink structure has the year or category in it, your posts will be blocked by this robots.txt configuration (e.g. by the Disallow: /blogs/2008/ line)!
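The reason is that a Disallow path is matched as a prefix of the URL path. Below is a simplified Python sketch of that prefix matching, assuming three made-up post URLs under different permalink structures; the function name blocking_rule and the example paths are hypothetical, and only a few of the Disallow lines are included.

```python
# Prefixes taken from a few of the Disallow lines in the robots.txt above.
DISALLOWED_PREFIXES = [
    "/blogs/category/",
    "/blogs/2007/",
    "/blogs/2008/",
]


def blocking_rule(path):
    """Return the first Disallow prefix that matches `path`, or None."""
    for prefix in DISALLOWED_PREFIXES:
        if path.startswith(prefix):
            return prefix
    return None


# Hypothetical post URLs under three different permalink structures.
for path in (
    "/blogs/my-post-title/",               # post title only
    "/blogs/2008/05/my-post-title/",       # year/month/post title
    "/blogs/category/seo/my-post-title/",  # category/post title
):
    print(path, "->", blocking_rule(path) or "not blocked")
```

Only the first, title-only permalink gets through; the other two are caught by the year and category rules.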
As always, check whether your posts are accessible by using Google Webmaster Tools.
Once there, go to Tools > Analyze robots.txt.
You should then see your robots.txt contents there. If you’ve just updated your robots.txt file, you may still see the old one. It will be refreshed on Google’s next crawl, which may take a day or two.
Then, test whether the crawler bots can access your actual content but not the duplicate content, and look at the results to confirm that the bot can reach only the actual content.
In my results, the bot can now only access the actual post content and not the copies of the posts in archives, feeds, navigation pages, etc.
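If you would like a quick local check before Google re-crawls, something along these lines can approximate the same test. It is only a sketch: it uses Python's standard urllib.robotparser rather than Google's own parser, the rules are a subset of my file, and the post URL and the helper name find_mismatches are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Subset of the "User-agent: *" group from the robots.txt above.
RULES = """\
User-agent: *
Disallow: /blogs/archives/
Disallow: /blogs/feed/
Disallow: /blogs/tag/
Disallow: /blogs/category/
Disallow: /blogs/page/
Disallow: /blogs/2008/
"""

# Path -> should a crawler be allowed to fetch it?
EXPECTED = {
    "/blogs/my-post-title/": True,  # hypothetical post URL
    "/blogs/feed/": False,
    "/blogs/tag/seo/": False,
    "/blogs/page/2/": False,
    "/blogs/2008/": False,
}


def find_mismatches(rules, expected, agent="Googlebot"):
    """Return the paths whose allowed/blocked status differs from `expected`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    base = "http://www.michaelaulia.com"
    return [
        path
        for path, want in expected.items()
        if parser.can_fetch(agent, base + path) != want
    ]


print(find_mismatches(RULES, EXPECTED))
```

An empty list means every path behaved as expected; any path that shows up is either blocked when it should be reachable, or reachable when it should be blocked.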
Have you re-visited your robots.txt? It’s very important for search engines, especially Google, that you get it right and optimized!

Nice, I love having the file ready to copy and paste.
I’ll see if my blog supports blocking robots or not.
ameos last blog post..manage passwords / arting ads [ firefox ]
Very good post. My permalink structure is year/month/articlename.
I think you are using the sitemap plugin; I am using the same one. I don’t have any special instructions in my robots.txt.
Let me know if I can set mine up like yours with this permalink structure?
Nihars last blog post..Get FREE Kaspersky Internet Security license key
Nice post, I always knew they existed but never had the time to verify the file, with your post i’ll be sure to give it a look today.
I’m in the same situation as Nihar with the permalink structure, so I will have to read up on how to configure my robots.txt file.
Thanks!
Chessmasters last blog post..Too Many Money Blogs
Hmm, I guess having a year/ in the permalink is a bit tricky for robots.txt. Worst case, leave out the
Disallow: /YEAR/ lines…
You’ll still get duplicate content, though, because I can go to
http://www.YOURSITE.com/2007 to see all of your 2007 archive posts.
I did not disallow the admin content either; you are right, I should follow your approach too.
But I did not edit the robots.txt in my root, I only edited the subdomain’s robots.txt.
I just realized that my blog doesn’t have a robots.txt file. So I’ve got to create one now.
Steve Yus last blog post..Quickly Adjust the Volume of Your Speaker with just a Mouse Scroll
Great tip - I would never have even known about this if it wasn’t for your post.
Regretful Mornings last blog post..Wingman of the Year
Wow, I didn’t know that most bloggers don’t have a robots.txt yet. Glad to help out. Hopefully more search engine visitors will now come to your sites!
@ICalvyn: I’m not sure how web crawlers handle subdomains, but if they grab the robots.txt under the subdomain, then I guess you don’t need the one in the root anymore
When I looked at my robots.txt file again, I realized I allow robots to access my yearly archive.
How important is it to disallow the yearly archive? If I leave it as it is now, are you suggesting that there will be a duplicate content issue?
Yan@Blog for Beginnerss last blog post..Optimize Your URL For Search Engines
Michael,
Having a good robots.txt file really helps with SEO and search engine traffic. Mine has improved a lot since I started cleaning up my robots.txt file.
You may want to validate your robots.txt file because there are some errors in it. I use this free online robots.txt validator for my site and it works very well. http://tool.motoricerca.info/robots-checker.phtml
I am working on my robots.txt file and still haven’t quite figured it all out, but for the most part it is better. If you get a chance, would you look at mine and see what you think? If you need some ec credits, let me know.
Thanks……
@Yan: Yeah, it is. If you type http://thoushallblog.com/2008, you’ll see all of your posts in 2008. It’s kind of duplicate, don’t you think?
@Squeaky: Thanks Squeaky! My goodness, there are so many errors in mine
It’s weird because I got some of the configuration from some blogs on the web (I can’t remember where now, I was planning to give them some link love)
I have been working on the Madmouse robots.txt for a few days now, and the Google crawl cycle is getting better. I have used the robots checker tool on many of the big bloggers’ sites and found lots of errors.
I am error free now, but I am sure that I have some items to address yet. But, for the most part it is better than what I had.
Once you get things to validate, it will be interesting to see if you notice any results as far as SEO, etc.
Squeakys last blog post..Stop! Blog Scrappers with the RSS Footer, WordPress Plugin
@Michael: Yup, you have a point. It’s time for an update. Anyway, I don’t understand why Disallow: /*?* is an error. I had that in my robots.txt file too after some advice from I-can’t-remember-who.
Yan@Blog for Beginnerss last blog post..If You Have Adsense, Use Section Targeting
Is there another tool that achieves the same thing? It’d be good to check whether the tool/checker itself has no bugs whatsoever
You can never trust an application 100% these days
Since we create robots.txt file mainly for Google, I would place my trust on the big G to analyze it using Webmaster Tool. You have used that too, haven’t you, Michael?
Yan@Blog for Beginnerss last blog post..If You Have Adsense, Use Section Targeting
@Yan: Yeah, but honestly the Webmaster Tool doesn’t really analyze your robots.txt file in detail.
It’s probably worth researching again if you got errors, and see what other SEO experts say about the error, though.
Thanks for the advice. If you do find any useful online tool to analyze robots.txt, do let us know.
Michael, help me write my robots.txt file
I’ve just updated this post with my latest robots.txt after following the web checker posted by Squeaky earlier
I think it’s a very good tool to analyze your robots.txt file. I’ll probably post something about it soon
@Arnold: You can copy and paste my robots.txt and change the paths to match your blog