Blocking Spam Bots Query


Post by Ganceann on Thu Nov 09, 2006 9:02 am

Hi,

I checked my logs and found a bot accessing just about all of the pages on my site without checking robots.txt. I looked the IP up on Google and found it is among a host of 'bad bots' that normally just scrape content, which in the long term could use up a lot of bandwidth.

I found some information on a way to trap bad bots that scrape content. However, I don't know the technical implications for the unflux servers or whether this would be OK, so I am just posting the link to the page that explains it, in the hope of getting some feedback on whether it is OK to try to implement it.

The IP for the spambot was: 63.80.56.36

http://www.kloth.net/internet/bottrap.php (there are two methods, one via htaccess and the other via PHP. I did see that it involves placing a 1px graphic on the page to trap the bad bots, so this may not be what I am looking for: the hidden link in the form of the 1px graphic could potentially get my site penalized by search engines, and I might just have to put bad bots on a deny list manually).


There is also a discussion topic on bad bot exclusion at the WebmasterWorld forums, although I don't know much about the methods used: http://www.webmasterworld.com/forum23/1281.htm

Overall, I'm just wondering whether it is better to ban them via cPanel, or whether any of the automated methods above would be effective without adversely affecting the unflux servers or their performance.

Thanks.

Post by Bigwebmaster on Thu Nov 09, 2006 10:39 pm

Banning by IP address wouldn't be very effective, as most bots change IPs frequently. Banning by user agent is semi-effective, but user agents can easily be modified to get around it. There really is no way to stop bots completely, but I usually just use .htaccess files to stop common scraping bots and programs. For instance, putting this in an .htaccess file will stop a good number of them:

SetEnvIfNoCase User-Agent "Download Ninja 2.0" bad_bot
SetEnvIfNoCase User-Agent "Fetch API Request" bad_bot
SetEnvIfNoCase User-Agent "HTTrack" bad_bot
SetEnvIfNoCase User-Agent "ia_archiver" bad_bot
SetEnvIfNoCase User-Agent "JBH Agent 2.0" bad_bot
SetEnvIfNoCase User-Agent "QuepasaCreep" bad_bot
SetEnvIfNoCase User-Agent "Program Shareware 1.0.0" bad_bot
SetEnvIfNoCase User-Agent "TestBED.6.3" bad_bot
SetEnvIfNoCase User-Agent "WebAuto" bad_bot
SetEnvIfNoCase User-Agent "WebCopier" bad_bot
SetEnvIfNoCase User-Agent "Wget/1.8.2" bad_bot
SetEnvIfNoCase User-Agent "Offline Explorer" bad_bot
SetEnvIfNoCase User-Agent "Franklin Locator" bad_bot
SetEnvIfNoCase User-Agent "LWP::Simple" bad_bot
SetEnvIfNoCase User-Agent "Larbin" bad_bot
SetEnvIfNoCase User-Agent "AA" bad_bot
SetEnvIfNoCase User-Agent "Rufus Web Miner" bad_bot
SetEnvIfNoCase User-Agent "Port Huron Labs" bad_bot
SetEnvIfNoCase User-Agent "Sphider" bad_bot
SetEnvIfNoCase User-Agent "voyager/1.0" bad_bot
SetEnvIfNoCase User-Agent "DynaWeb" bad_bot

SetEnvIfNoCase User-Agent "EmailCollector/1.0" spam_bot
SetEnvIfNoCase User-Agent "EmailSiphon" spam_bot
SetEnvIfNoCase User-Agent "EmailWolf 1.00" spam_bot
SetEnvIfNoCase User-Agent "ExtractorPro" spam_bot
SetEnvIfNoCase User-Agent "Crescent Internet ToolPak" spam_bot
SetEnvIfNoCase User-Agent "CherryPicker/1.0" spam_bot
SetEnvIfNoCase User-Agent "CherryPickerSE/1.0" spam_bot
SetEnvIfNoCase User-Agent "CherryPickerElite/1.0" spam_bot
SetEnvIfNoCase User-Agent "NICErsPRO" spam_bot
SetEnvIfNoCase User-Agent "WebBandit/2.1" spam_bot
SetEnvIfNoCase User-Agent "WebBandit/3.50" spam_bot
SetEnvIfNoCase User-Agent "webbandit/4.00.0" spam_bot
SetEnvIfNoCase User-Agent "WebEMailExtractor/1.0B" spam_bot
SetEnvIfNoCase User-Agent "autoemailspider" spam_bot

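# Apply the rules to GET, POST and HEAD requests.
<Limit GET POST HEAD>
# Allow rules are evaluated first, then Deny rules; a matching Deny wins.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from env=spam_bot
# Example IP bans, one address per line.
deny from 111.11.11.11
deny from 111.11.11.12
</Limit>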


I also included the deny from 111.11.11.11 lines as examples: you can add IP addresses there, one per line, if you want to ban by IP address. Using the .htaccess file is probably better than using PHP, as the .htaccess file will block access to everything, from files to images.
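One thing worth keeping in mind with the list above is that SetEnvIfNoCase treats each quoted value as a case-insensitive regular expression that can match anywhere in the User-Agent header, not as an exact string. The rough Python sketch below imitates that matching for a handful of the patterns so you can try user agent strings before deploying a rule. The function name `classify` and the shortened pattern lists are made up for illustration, and Python's `re` module is only assumed to behave like Apache's regex engine for simple patterns like these.

```python
import re

# A small subset of the patterns from the .htaccess list above.
BAD_BOT_PATTERNS = ["HTTrack", "WebCopier", "Wget/1.8.2", "AA"]
SPAM_BOT_PATTERNS = ["EmailSiphon", "ExtractorPro"]


def classify(user_agent):
    """Return 'bad_bot', 'spam_bot', or None for a User-Agent string.

    Hypothetical helper: it returns the first label that matches, whereas
    Apache would simply set every environment variable whose rule matches.
    """
    for label, patterns in (("bad_bot", BAD_BOT_PATTERNS),
                            ("spam_bot", SPAM_BOT_PATTERNS)):
        for pattern in patterns:
            # re.search is unanchored and IGNORECASE ignores case, which
            # approximates how SetEnvIfNoCase applies its regex.
            if re.search(pattern, user_agent, re.IGNORECASE):
                return label
    return None
```

Trying a few strings shows the effect of the unanchored match: `classify("Wget/1.8.2")` is flagged, `classify("wget/1.9")` is not, and a short pattern such as "AA" also flags any user agent that merely contains "aa", for example `classify("Kaa Browser/1.0")`. Anchoring or lengthening very short patterns may be worth considering for that reason.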

Post by Ganceann on Fri Nov 10, 2006 2:30 pm

OK, thanks for the reply. I knew my original site got scraped a lot, but with the stats I had access to (very basic analog stats, it seemed) I couldn't tell what was being accessed, only that an IP had accessed one or more pages from the site.

The Urchin stats and raw logs were something I had wanted in order to gain a greater understanding of who and what was visiting.
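For anyone wanting to run the same kind of check against raw logs, here is a minimal sketch that lists clients which requested many pages without ever fetching robots.txt. It assumes the logs are in the usual Apache common or combined format; the function name `suspicious_clients` and the `min_requests` threshold are invented for this example and would need tuning for a real site.

```python
import re
from collections import defaultdict

# Matches the start of a common/combined log line and captures the client
# IP and the requested path. Assumes the standard Apache layout.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')


def suspicious_clients(lines, min_requests=20):
    """Return {ip: request_count} for clients that made at least
    min_requests requests and never requested /robots.txt."""
    counts = defaultdict(int)
    fetched_robots = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, path = match.groups()
        counts[ip] += 1
        # Ignore any query string when checking for robots.txt.
        if path.split("?", 1)[0] == "/robots.txt":
            fetched_robots.add(ip)
    return {ip: n for ip, n in counts.items()
            if n >= min_requests and ip not in fetched_robots}
```

It can be fed an open log file directly, e.g. `suspicious_clients(open("access.log"))`, where the file name is just a placeholder. A client that skips robots.txt is not necessarily a scraper, so the result is better treated as a list of IPs to look up than as a ready-made deny list.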

Thanks again.

