Vidar Hokstad V2.0


How to beat comment spam (for now, anyway)

2008-05-15 13:10 UTC

I've always loathed captchas, and they've gotten progressively worse. But I've noticed some blogs seem to do reasonably well with just a single static word, presumably because most spammers won't customize things for a single blog unless you get a lot of traffic.

I also have a soft spot for an idea that I believe originated with various schemes to reduce e-mail spam:

Make the client pay

"Pay" doesn't have to mean money. Paying by carrying out a computation is just as good. The overall idea is to either make spamming you "too expensive" or at least sufficiently more expensive to make you a less attractive (and hopefully a money-losing) target.

So since I've had to delete 50-100 comment spams every day for the last few weeks, I got fed up and did the following as a first step. It doesn't yet add a computational cost for the spammers, but it makes it trivial for me to make it more complex, and if I start seeing comment spam again, I'll ramp it up immediately:

  • First I added a "script" tag to the comment form. All it contains is this: document.write("");
  • Then I added a check for the "captcha" in my controller
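A minimal sketch of those two steps, assuming the script writes a hidden form field named "captcha" (the name is taken from the controller check above). The markup inside the document.write call is not shown in the post, so the hidden input, the token value and both function names below are hypothetical:

```javascript
// Hypothetical static token; any fixed value that the server also knows will do.
const STATIC_TOKEN = "not-a-robot";

// Builds the script tag for the comment form. Only a client that actually
// executes JavaScript ends up with the hidden "captcha" field in the form.
function captchaScriptTag(token) {
  const field = '<input type="hidden" name="captcha" value="' + token + '">';
  // JSON.stringify turns the markup into a safely quoted JavaScript string literal.
  return "<script>document.write(" + JSON.stringify(field) + ");</script>";
}

// Controller-side check: reject the comment unless the field came back intact.
function captchaPasses(params, token) {
  return params.captcha === token;
}
```

A bot that posts the form without running the script never sends the field, so captchaPasses returns false for it.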

That's it. So far it's stopped all but one comment spammer, who from the looks of things was an inept manual one.

Escalation

But this won't stop people for long, especially not if more people do it and it gets worthwhile to circumvent, so here's how I'll escalate things if I start seeing comment spam again:

Consider that for an average poster, waiting an additional second or two for a comment to successfully submit isn't a problem. Even a bit more is probably acceptable - you probably spend more than that trying to figure out captchas or even being forced to register on a regular basis.

For a comment spammer, though, if 2 seconds of CPU time is wasted per comment posted, that means a real cost. I don't know what kind of throughput these guys manage, but I know that when I worked at Edgeio we had no problems doing several million HTTP requests from a single dual or quad CPU box (can't remember) to retrieve feeds. I'd be surprised if you couldn't churn out a million comment spams per 24 hours if nothing is stopping you.

"Real" captchas slow that down, but they also inconvenience users. Feeding the client browser a JavaScript function that needs to be executed to find the captcha value achieves the same goal, but transparently to the user (assuming they are not using an ancient browser or have turned JavaScript off - I have no sympathy), and it can potentially massively raise the bar because the function need not be understandable by a human user.

"Costing" a comment spammer who can otherwise handle 1 million comments a day per core 2 seconds of time per comment translates to a reduction in rate from around 1 million to about 43,000 per day per core (86,400 seconds in a day divided by 2 seconds each), or a factor of about 23. That means the yield for each spam needs to be 23 times as high for the spammer to break even or make any money.

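The arithmetic behind those figures can be checked with a few lines. This is only a back-of-the-envelope sketch: it treats the 2 seconds as the entire per-comment cost, and the names are made up for illustration.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Comments one core can post per day if each one costs `secondsPerComment`.
function commentsPerDay(secondsPerComment) {
  return SECONDS_PER_DAY / secondsPerComment;
}

const baseline = 1000000;              // assumed unthrottled rate per core per day
const throttled = commentsPerDay(2);   // rate with a 2 second puzzle per comment
const slowdown = baseline / throttled; // how many times fewer comments get out

console.log(throttled, slowdown.toFixed(1)); // prints: 43200 23.1
```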
So how to go about it?

Generating semi-random functions. Creating a valid parse tree for a function that applies some semi-random permutation to an input value included with the script, and then serializing it as javascript instead of my dumb static function is easy. Yes, you need to execute the function yourself too, but only when the comment form is submitted, and the comment volume you deal with is likely to be far lower than what an aggressive spammer is dealing with, unless you're Slashdot or something.
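As a rough sketch of that idea, the code below builds a random expression tree, serializes it as JavaScript for the browser, and evaluates the same tree on the server. The node shapes, the choice of operators and every function name are assumptions made for illustration, and the generated functions here are cheap; this shows only the generate, serialize and verify steps, not how to make the function expensive.

```javascript
// Pick a random integer in [0, n).
function randInt(n) {
  return Math.floor(Math.random() * n);
}

// Build a random expression tree over the input variable "x".
// Leaves are either x or a small constant; inner nodes are 32-bit operations.
function randomTree(depth) {
  if (depth === 0) {
    return randInt(2) === 0 ? { op: "x" } : { op: "const", value: 1 + randInt(1000) };
  }
  const ops = ["add", "xor", "mul"];
  return {
    op: ops[randInt(ops.length)],
    left: randomTree(depth - 1),
    right: randomTree(depth - 1),
  };
}

// Serialize the tree as a JavaScript expression to send to the client.
function toJavaScript(node) {
  switch (node.op) {
    case "x": return "x";
    case "const": return String(node.value);
    case "add": return "((" + toJavaScript(node.left) + " + " + toJavaScript(node.right) + ") | 0)";
    case "xor": return "(" + toJavaScript(node.left) + " ^ " + toJavaScript(node.right) + ")";
    case "mul": return "Math.imul(" + toJavaScript(node.left) + ", " + toJavaScript(node.right) + ")";
    default: throw new Error("unknown node type: " + node.op);
  }
}

// Evaluate the same tree on the server when the form comes back.
function evaluate(node, x) {
  switch (node.op) {
    case "x": return x;
    case "const": return node.value;
    case "add": return (evaluate(node.left, x) + evaluate(node.right, x)) | 0;
    case "xor": return evaluate(node.left, x) ^ evaluate(node.right, x);
    case "mul": return Math.imul(evaluate(node.left, x), evaluate(node.right, x));
    default: throw new Error("unknown node type: " + node.op);
  }
}

// Package a puzzle: the expression sent to the browser plus the expected answer.
function makePuzzle(depth) {
  const tree = randomTree(depth);
  const input = randInt(1000000);
  const source = "(function (x) { return " + toJavaScript(tree) + "; })(" + input + ")";
  return { source: source, expected: evaluate(tree, input) };
}
```

In use, source would replace the static value written into the form, and the controller would compare the submitted field against expected. Keeping every operation within 32-bit integers is a design choice so that the browser and the server agree on the result.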

Why not just use a single hash function? Simple: a fixed function can be optimized on the client side. You want to generate something randomly to force them to take the full computational cost for every comment.

Retribution

One of the beautiful things about this is that you can make it even worse. You are forcing your adversaries to execute arbitrary JavaScript, and while you can't do anything really nasty, if you catch someone posting spam (or trying to) repeatedly from the same IP or in ways that are otherwise easily detectable, you can blow up the computational cost arbitrarily. Determining how soon (or whether) an arbitrary function will complete is the halting problem, which is undecidable. That means the only generic way to avoid falling into the equivalent of a tarpit is to aggressively time out the execution of the functions. But then they don't get their spam out the door.

In other words: If you're careful, the spammers can't know in advance whether they've been fed a "genuine" function or one aimed at keeping them stuck forever.
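One way this escalation could look, as a sketch only: the doubling policy, the cap, the mixing step inside the loop and all names below are assumptions, not something described in the post. A fixed loop body is used to keep the example short; in practice it would be a randomly generated function, for the reasons given earlier.

```javascript
// Hypothetical policy: double the work for every strike recorded against an IP,
// starting from a baseline that a normal visitor would barely notice.
function roundsFor(strikes, baseRounds) {
  const cappedStrikes = Math.min(strikes, 20); // keep the number finite
  return baseRounds * Math.pow(2, cappedStrikes);
}

// Builds puzzle source whose running time grows with `rounds`.
function loopPuzzleSource(seed, rounds) {
  return "(function () { var h = " + seed + ";" +
    " for (var i = 0; i < " + rounds + "; i++) {" +
    " h = (Math.imul(h ^ i, 2654435761) + 1) | 0; }" +
    " return h; })()";
}

// Server-side reference implementation of the same loop, used to verify answers.
function loopPuzzleAnswer(seed, rounds) {
  let h = seed;
  for (let i = 0; i < rounds; i++) {
    h = (Math.imul(h ^ i, 2654435761) + 1) | 0;
  }
  return h;
}
```

For an address that is already known to be spamming, the server never needs to call loopPuzzleAnswer at all: it can send an arbitrarily expensive puzzle and reject the comment regardless of the answer.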

Of course you need to be careful not to hit genuine posters this way, or they'll quickly learn to stay away.


Comments

2008-05-15 14:32 UTC
Steve
Interesting idea. We will be very interested to hear how well it works for you.

2008-05-15 16:26 UTC
So far so good - no spam to delete today, but the real challenge won't come until the spammers catch on and force me to actually start generating expensive functions.

2008-05-15 18:38 UTC
If you're lucky it will take years before that happens. By the looks of it using a custom solution will catch pretty much all automated spam for years to come.

Using one or two simple traps/honeypots seems to be enough. What I did over at one site was pretty simple:

1. validate the email address
2. check if the name is a valid email address (bots are stupid)
3. check if one hidden field (css) stayed empty
4. check if one hidden field (css) stayed unchanged
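Those four checks might look roughly like this on the server. The field names (trapEmpty, trapFixed), the function name and the deliberately loose e-mail pattern are all hypothetical; the comment does not say how the original site implemented them.

```javascript
// Deliberately loose pattern: something@something.tld with no whitespace.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Returns true when a submission trips any of the four traps.
// `form` holds the submitted fields; `expectedHiddenValue` is what the
// server originally rendered into the second CSS-hidden field.
function looksLikeBot(form, expectedHiddenValue) {
  if (!EMAIL_PATTERN.test(form.email || "")) return true;  // 1. e-mail is not valid
  if (EMAIL_PATTERN.test(form.name || "")) return true;    // 2. name is an e-mail address
  if ((form.trapEmpty || "") !== "") return true;          // 3. empty hidden field was filled in
  if (form.trapFixed !== expectedHiddenValue) return true; // 4. fixed hidden field was changed
  return false;
}
```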

Really simple stuff. And not even one spam message for about 4 years.

Spammers are always aiming at the low-hanging fruit (like everyone else). If your page requires some extra work, they won't bother. Their time is more effectively spent if they target some common spam protection instead.

Oh and you really should put some hint there that JS is mandatory.

2008-05-15 18:44 UTC
Frankly, so much of the web is broken these days if you use a client that doesn't support javascript, that I'm not sure I care, but point taken - I'll add a noscript section.

Interesting comments - I haven't bothered validating the e-mail addresses. Most of the bots I've seen here have been hitting me with seemingly valid e-mail addresses, so it didn't seem worthwhile.

Using fields hidden with CSS is cute. I'd have assumed the bots would leave fields they didn't understand unchanged, but on thinking about it I guess that would likely result in a higher failure rate from mandatory fields etc.

2008-05-15 18:47 UTC
Vidar Hokstad
Oh, and in a way I almost hope someone tries - I'd get a certain sense of pleasure from feeding them something nasty, as I see from my logs that most of the spambots that have hit me surprisingly do not appear to come from botnets but from a relatively small (compared to the number of comments) set of static IPs (and a lot of them stupidly keep sending the same referrer fields too).


About me

E-mail: vidar@hokstad.com
Skype: vhokstad
View my LinkedIn profile

I was born April 21st, 1975, in Oslo, Norway. Since 2000 I've been living in London, UK. I'm married.

I'm working for Aardvark Media as Director of Technology. I'm also currently on the board of SpatialQ, a startup in the GIS space, and an advisor to Skoach, a startup doing a time management app for people with ADD.
