Robots.txt is hardly new and is almost as old as the net itself. Having said that, it is very handy for making sure that search engines only spider the parts of your site you actually want showing up in Google, Yahoo, MSN, Altavista and the rest. You can write your own robots.txt file in a text editor, or you can use this handy online tool from Webtoolcentral. With the reports coming in about security flaws and data mining happening via specially crafted search engine queries, it makes more sense than ever to limit what information people can dig out of search engine indexes. It can also be handy for limiting the bandwidth eaten by excessive spidering, as I found out yesterday.
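As a minimal sketch of what that looks like (the directory names here are placeholders, not the ones from my server), a robots.txt that keeps every compliant crawler out of a couple of media directories is just a few lines:

```
# Example only: paths are placeholders for whatever you want kept out of the indexes
User-agent: *
Disallow: /media/video/
Disallow: /media/music/
```

The file goes in the root of the site, and well-behaved crawlers fetch it before they spider anything else.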
I’ve been trying to work out why so much of our normally sufficient bandwidth was suddenly getting used up for no immediately apparent reason. After much searching, tcpdumping and access_log watching, I discovered that one of our hosting clients had a huge directory of video and music files, some of them 250MB in size. It turns out much of the traffic was actually search engine bots downloading them, presumably to feed one of the new video search facilities the search engines have all jumped on. It also turned out that one of the worst offenders was ConveraMultiMediaCrawler, which showed up almost continuously in the access log. After crafting a nice robots.txt and adding code to my download manager program to block search engine referrers from downloading the files, the bandwidth usage has dropped dramatically. With my robots.txt and my modified downloader, none of the search engines can get at those video and music files unless I explicitly allow it. Robots.txt may be old tech, but don’t let that make you think it isn’t a useful tool. It should be added that for it to work, the bot in question has to support the robots exclusion standard, but all the big ones do, so you can still control where your information ends up.
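The download manager code itself isn’t shown here, and I blocked on the referrer; a common variant of the same idea is to refuse requests whose User-Agent looks like a crawler. The sketch below takes that approach, with the bot list and function names purely illustrative:

```python
# Hypothetical sketch of a crawler check in a download manager.
# The bot signatures and function names are illustrative, not from my actual code.

BOT_SIGNATURES = (
    "googlebot",
    "slurp",                      # Yahoo's crawler
    "msnbot",
    "converamultimediacrawler",   # the worst offender in my access log
)

def is_search_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string looks like a known crawler."""
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def serve_download(user_agent: str, path: str) -> str:
    """Decide whether to hand out the file or turn the request away."""
    if is_search_bot(user_agent):
        return "403 Forbidden"            # don't burn bandwidth on crawlers
    return f"200 OK, sending {path}"      # normal visitors get the file

if __name__ == "__main__":
    print(serve_download("Mozilla/5.0 (compatible; Googlebot/2.1)", "/media/video/clip.avi"))
    print(serve_download("Mozilla/5.0 (Windows NT; real browser)", "/media/video/clip.avi"))
```

The nice thing about doing the check in the downloader as well as in robots.txt is that it still catches any bot that ignores the exclusion standard.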