OSVDB -How bad is the scraping problem?

[This was originally published on the OSVDB blog.]

Via Twitter, blogs, or talking with our people, you may have heard us mention the ‘scraping’ problem we have. In short, individuals and companies are using automated methods to harvest (or ‘scrape’) our data. They do it via a wide variety of methods but most boil down to a couple methods involving a stupid amount of requests made to our web server.

This is bad for everyone, including you. First, it grinds our poor server to a stand-still at times, even after several upgrades to larger hosting plans with more resources. Second, it violates our license as many of these people scraping our data are using it in a commercial capacity without returning anything to the project. Third, it forces us to remove functionality that you liked and may have been using in an acceptable manner. Over the years we’ve had to limit the API, restrict the information / tools you see unauthenticated (e.g. RSS feed, ‘browse’, ‘advanced search’), and implement additional protections to stop the scraping.

So just how bad is it? We enabled some CloudFlare protection mechanisms a few weeks back and then looked at the logs.

  • The attacks against OSVDB.org were so numerous, the logs being generated by CloudFlare were too big to be managed by their customer dashboard application. They quickly fixed that problem, which is great. Apparently they hadn’t run into this before, even for the HUGE sites getting DDoS’d. Think about it.
  • We were hit by requests with no user agent (a sign of someone scraping us via automated means) 1,060,599 times in a matter of days…
  • We got hit by 1,843,180 SQL injection attack attempts, trying to dump our entire database in a matter of weeks…
  • We got hit by ‘generic’ web app attacks only 688,803 times in a matter of weeks….
  • In the two-hour period of us chatting about the new protection mechanisms and looking at logs, we had an additional ~ 130,000 requests with no user-agent.

To put that in perspective, DatalossDB was hit only 218 times in the same time period by requests with no user agent. We want to be open and want to help everyone with security information. But we also need for them to play by the rules.

Leave a Reply

%d bloggers like this: