The Problem
Recently I’ve been receiving lots of 404 errors on a site that I work on. All of these errors were caused by one bot: WeSee.
As an example, a valid URL on this site would look like this:
www.example.com/item/123/item-name/
WeSee would also try to access both of the following:
www.example.com/item/123/
www.example.com/item/
What Is WeSee
From the WeSee website, it looks like they are merely crawling for images so they can sell data to their customers.
“Our software is used so that visual content can be turned into machine-readable data so that the content can for the first time play a significant role in Digital Advertising, Content Verification, Ecommerce and Visual Search. Our software holds the key in turning visual content in to lucrative advertising friendly targetable real estate…”
Blocking WeSee Bot
This isn’t as easy as it should be. Most well-behaved bots and crawlers allow you to block them with a robots.txt file. WeSee ignores robots.txt (trust me, I tried).
What I had to do was block the IP addresses on the server manually. This isn’t an ideal solution, as WeSee could start using new IPs at any time. Below is a list of all the IPs I’ve seen WeSee crawl from:
199.115.116.97
199.115.116.88
199.115.115.144
178.162.199.101
178.162.199.98
178.162.199.86
178.162.199.77
178.162.199.69
178.162.199.35
95.211.156.228
95.211.159.93
95.211.159.68
95.211.159.66
If I see any more, I’ll be sure to add them to the list.
Good man! Thanks. I’ll let you know if I see any new IPs too.
They’re using more IP addresses than those, apparently all rented from Leaseweb. The following IP ranges (CIDR notation) cover a superset of the addresses they may be using, namely some of Leaseweb’s allocations. This may block more Leaseweb IPs than necessary, but datacenters usually don’t produce good traffic anyway.
199.115.112.0/21
178.162.192.0/21
95.211.144.0/20
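As a quick sanity check, the ranges above can be tested against the individual addresses listed in the post. This is only a minimal Python sketch using the standard ipaddress module; the names BLOCKED_NETWORKS, SEEN_IPS and is_blocked are made up for illustration, and it does not block anything by itself.

```python
import ipaddress

# CIDR ranges quoted in the comment above (some of Leaseweb's allocations).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("199.115.112.0/21"),
    ipaddress.ip_network("178.162.192.0/21"),
    ipaddress.ip_network("95.211.144.0/20"),
]

# Individual addresses listed in the post.
SEEN_IPS = [
    "199.115.116.97", "199.115.116.88", "199.115.115.144",
    "178.162.199.101", "178.162.199.98", "178.162.199.86",
    "178.162.199.77", "178.162.199.69", "178.162.199.35",
    "95.211.156.228", "95.211.159.93", "95.211.159.68", "95.211.159.66",
]


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any of the blocked ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    # Print each seen address and whether the ranges cover it.
    for ip in SEEN_IPS:
        print(ip, is_blocked(ip))
```

Every address from the post falls inside one of the three ranges, so the ranges could replace the individual entries, at the cost of also blocking unrelated Leaseweb customers.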
Depending on how you choose to block them, you could also block them based on user agent. Matching on WeSEE (maybe case-insensitively) should be sufficient, unless they fake their UA or change it in the future.
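For anyone filtering at the application level, the user agent match could look roughly like this in Python. It is a sketch only: the function name is_wesee is made up, and the sample user agent strings in the comments are hypothetical, not copied from real WeSee requests.

```python
import re

# Case-insensitive match on the substring "wesee" anywhere in the user agent.
WESEE_UA = re.compile(r"wesee", re.IGNORECASE)


def is_wesee(user_agent: str) -> bool:
    """Return True if the user agent string mentions WeSEE in any casing."""
    return bool(WESEE_UA.search(user_agent or ""))


if __name__ == "__main__":
    # Hypothetical user agent strings, for illustration only.
    print(is_wesee("ExampleCrawler/1.0 (WeSEE)"))   # contains "WeSEE"
    print(is_wesee("Mozilla/5.0 (X11; Linux x86_64)"))  # does not
```

As noted above, this only helps for as long as the bot identifies itself honestly.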
Hello,
If you wish to control which parts of your site can be accessed by our robot, you can do this by using the industry standard robots.txt. Full details, including the specification itself, can be found at http://www.robotstxt.org. It’s not overly complicated and it is very easy to implement.
As a simple example, you could have an entry in your robots.txt file such as
User-agent: *
Disallow: /cgi-bin
This would stop all robots (not just WeSEE’s) from accessing the /cgi-bin folder of your site.
If you want to stop only our robot, you could use the following:
User-agent: WeSEE
Disallow: /cgi-bin
We generally refresh our robot rules daily, so if you’ve changed your robots.txt file, please allow 24 hours for your changes to come into effect.
For more information, questions, or any help with your robots.txt and our robot, please do email [email protected]
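The WeSEE-only rule quoted above can be checked locally with Python’s standard urllib.robotparser. This sketch only shows how a compliant parser reads the rule; it says nothing about whether a particular crawler actually honours it. The helper name allowed and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The WeSEE-specific rule from the comment above.
ROBOTS_TXT = """\
User-agent: WeSEE
Disallow: /cgi-bin
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if a compliant robot with this user agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed("WeSEE", "/cgi-bin/script.cgi"))    # under the disallowed prefix
    print(allowed("WeSEE", "/item/123/item-name/"))   # not covered by the rule
    print(allowed("OtherBot", "/cgi-bin/script.cgi")) # rule names WeSEE only
```

To keep a compliant robot off the whole site rather than one folder, the Disallow line would be "Disallow: /".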
Thanks for your reply Anastasia. When I wrote this post, from my testing, your crawlers ignored the robots.txt file. Has this now changed?
I had your bot in the robots.txt file for over a week and it was still ignored.
Hi Alex,
Could you please let us know which website you blocked access to, how you did that, and when? We will need to investigate.
Thank you.
The other question is why they crawled my site in the first place. I don’t have any ads on it and I’m not planning to put any ads on it. (This is, btw, not the blog I put in the website field of this comment, where I actually do have Google ads, but a different site.) I haven’t seen WeSee in the logs since my last comment, though.
Forget tracking all the IPs.
In .htaccess:
RewriteCond %{HTTP_REFERER} \.wesee\.com
RewriteRule .* - [F]