Robots.txt is hardly new and is almost as old as the net itself. Having said that, it is very handy for making sure that search engines only spider the parts of your site you actually want showing up in Google, Yahoo, MSN, Altavista and the rest. You can write your own robots.txt file in a text editor, or you can use this handy online tool from Webtoolcentral. With the reports coming in about security flaws and data mining happening via specially crafted search engine queries, it makes more sense than ever to limit what information people can dig out of search engine indexes. It can also be handy for limiting the bandwidth eaten by excessive spidering, as I found out yesterday.
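As a minimal sketch of what that looks like (the directory names here are placeholders, not the ones from my server), a robots.txt that keeps every compliant crawler out of a couple of media directories is just a few lines:

```
# Example only: paths are placeholders for whatever you want kept out of the indexes
User-agent: *
Disallow: /media/video/
Disallow: /media/music/
```

The file goes in the root of the site, and well-behaved crawlers fetch it before they spider anything else.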
I’ve been trying to work out why so much of our normally sufficient bandwidth was suddenly getting used up for no immediately apparent reason. After much searching, tcpdumping and access_log watching, I discovered that one of our hosting clients had a huge directory of video and music files, some of them 250MB in size. It turns out much of the traffic was actually search engine bots downloading them, presumably to feed one of the new video search facilities the search engines have all jumped on. It also turned out that one of the worst offenders was ConveraMultiMediaCrawler, which showed up almost continuously in the access log. After crafting a nice robots.txt and adding code to my download manager program to block search engine referrers from downloading the files, the bandwidth usage has dropped dramatically. With my robots.txt and my modified downloader, none of the search engines can get at those video and music files unless I explicitly allow it. Robots.txt may be old tech, but don’t let that make you think it isn’t a useful tool. It should be added that for it to work, the bot in question has to support the robots exclusion standard, but all the big ones do, so you can still control where your information ends up.
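The download manager code itself isn’t shown here, and I blocked on the referrer; a common variant of the same idea is to refuse requests whose User-Agent looks like a crawler. The sketch below takes that approach, with the bot list and function names purely illustrative:

```python
# Hypothetical sketch of a crawler check in a download manager.
# The bot signatures and function names are illustrative, not from my actual code.

BOT_SIGNATURES = (
    "googlebot",
    "slurp",                      # Yahoo's crawler
    "msnbot",
    "converamultimediacrawler",   # the worst offender in my access log
)

def is_search_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string looks like a known crawler."""
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def serve_download(user_agent: str, path: str) -> str:
    """Decide whether to hand out the file or turn the request away."""
    if is_search_bot(user_agent):
        return "403 Forbidden"            # don't burn bandwidth on crawlers
    return f"200 OK, sending {path}"      # normal visitors get the file

if __name__ == "__main__":
    print(serve_download("Mozilla/5.0 (compatible; Googlebot/2.1)", "/media/video/clip.avi"))
    print(serve_download("Mozilla/5.0 (Windows NT; real browser)", "/media/video/clip.avi"))
```

The nice thing about doing the check in the downloader as well as in robots.txt is that it still catches any bot that ignores the exclusion standard.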