Form protection (written on August 5th, 2011 by marko-mrdjenovic-2)


I’ve seen a discussion recently on how to protect your forms from spammers/bots that come and fill the forms to either fill your database with crap data or fill your page with porn links. When I read the answers I figured out that none of the people had read the amazing article I read years ago, so I decided to try to remember what it said. So, a big fat disclaimer: I read this in an article somewhere and I don’t remember where. If you know the original article please post it in the comments, I’d love to link to it, I bet it has way more info than this one.

The problem

Almost all websites now have some forms on them. Some of them are contact / registration forms, others take the submitted data and display it on the site itself (comment forms). But letting others submit data to your site/database opens you to all sorts of attacks. If you actually show the content of the submitted form, you’ll get a bunch of spammers posting comments with lots of links. If you only store data and don’t show it anywhere you’re still at risk – if you don’t notice, your disk can fill up, your database may grow beyond its limits,… So what we want to do is to prevent bogus form posting.

Spammer approach

If you think about writing a spam-bot that will try to spam as many sites as you possibly can, you have two basic automated approaches, plus plain manual entry.

Record / replay

This is a very simple approach – you use a person to submit the form, preferably with something that looks like real input, and record the request made. Then you hand that data off to the bot, which changes some content and tries to resubmit it.

Automation based on heuristics

I wanted to say AI, but it really isn’t. What it is is a set of simple rules and randomized values that the bot thinks might trick your site into accepting the submit. Let’s say you have three fields, two are inputs with field names “name” and “email” and the third field is “comment”. A simple script can fill these with “valid” data and try to submit it.

Human entry

By far the simplest, but also the most costly for spammers. Go on Amazon Mechanical Turk or whatever other service, send a link to a Google Spreadsheet and have people manually enter the stuff into your forms. This is the source of “Sorry, this is my job” endings to spam comments.

Protect yourself differently

These are the techniques that should prevent most bogus form entries from random passing bots, except “Human entry” – no protection for that, even though Captchas try hard. There is not much you can do when you’re targeted…

Honeypot field

Use this field to trick autoguessing bots into submitting something in a field you know should be empty.

Add an input field to your form with a regular name (state, maiden-name,…) that does not appear on your form otherwise.
Use a label that will clearly communicate that it needs to be empty.
Hide it with CSS, preferably not by adding class=”hidden”.

If the form post includes content in this field discard it and redirect back to the form. The trick is to make sure the bots don’t figure out this is a honeypot, so use valid looking but nonsensical classes…
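
A minimal sketch of the server-side check, in Python, might look like this. The field name maiden-name comes from the examples above; the function name and the plain dict standing in for the posted form are assumptions made for illustration.

```python
# Hypothetical honeypot field name; pick one that looks ordinary
# but is not used anywhere else on your form.
HONEYPOT_FIELD = "maiden-name"


def is_honeypot_tripped(form: dict) -> bool:
    """Return True if the hidden honeypot field came back with content."""
    value = form.get(HONEYPOT_FIELD, "")
    # Whitespace-only input is treated as empty (a design choice, not a rule).
    return bool(str(value).strip())
```

If this returns True, discard the post and redirect back to the form as described above. A real user never sees the field, so it should arrive empty.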

Date field

Use it to prevent resubmission of data too long after the form was created. Allow users a few hours to post the form.

To prevent manual modification you can either use proper encryption (symmetric or asymmetric) that will allow you to decode it on form post, or use this date in combination with the onetime token.
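
The age check itself could be sketched like this in Python. It assumes the date field carries a Unix timestamp as text and that “a few hours” means three; both the limit and the function name are assumptions. It does not cover tamper protection, which is left to the encryption or the onetime token.

```python
import time
from typing import Optional

# "A few hours" from the text; the exact value is an assumption.
MAX_FORM_AGE_SECONDS = 3 * 60 * 60


def is_form_fresh(issued_at: str, now: Optional[float] = None) -> bool:
    """Return True if the form was created recently enough to accept."""
    try:
        issued = float(issued_at)
    except (TypeError, ValueError):
        return False  # missing or garbled date field
    if now is None:
        now = time.time()
    age = now - issued
    # Reject dates in the future as well as dates that are too old.
    return 0 <= age <= MAX_FORM_AGE_SECONDS
```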

Onetime token

Use this field to prevent replay of request data. If you can, save it into the database. It is a good idea to build this token so that it cannot be faked (say, by changing one character and ending up with another valid one). This can be done by hashing data or with encryption.

This one can be as tricky as you want. What I usually do (disclaimer: I don’t know much about encryption so this might be crap advice) is use a plain datetime field with the onetime token generated from IP address, UserAgent and the date field with HMAC. There is no need for this token to be reversible – I can recreate the same thing with the data from the form post and check if it matches.
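
Here is a rough Python sketch of that approach, using the standard hmac module. The secret key, the function names, the SHA-256 choice and the in-memory set standing in for the database are all assumptions for illustration, so treat it as a starting point rather than vetted crypto.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load a long random value from configuration.
SECRET_KEY = b"replace-with-a-long-random-secret"


def make_token(ip: str, user_agent: str, issued_at: str) -> str:
    """Build the onetime token from IP address, UserAgent and the date field."""
    message = "\n".join([ip, user_agent, issued_at]).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def check_token(token: str, ip: str, user_agent: str, issued_at: str,
                used_tokens: set) -> bool:
    """Recreate the token from the posted data and accept it only once."""
    expected = make_token(ip, user_agent, issued_at)
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8")):
        return False
    if token in used_tokens:
        return False  # replayed request
    used_tokens.add(token)  # a database table would replace this set
    return True
```

Because the date field is part of the hashed message, changing the date by hand invalidates the token. Two forms rendered for the same IP and UserAgent with the same date value would share a token, so a finer-grained timestamp helps.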

When using these techniques make sure you take care of the user experience. If you detect a problem on what might be valid user input (“timeout” on the date field with an unused onetime token, wrong onetime token from an IP change by the service provider), you might want to display a second step from the “2-step process”. Whatever you do, don’t call your users spammers or bots – be nice, bots don’t read the text anyway.

Did I miss anything?

I know of no plugin that uses all of these techniques, but I haven’t really looked for one. What I do know is that I don’t want to ever use a Captcha, because it often keeps me out, and the 2-step process is just too weird sometimes. Hope this helps. And again – if you find the original article (must be at least some 5 years old now, if not more) or have any other solutions you use or endorse, do leave a comment.

This entry was posted on Friday, August 5th, 2011 at 04:34 and is filed under css, design, general, html, interface, technology.

8 Responses to “Form protection”

Magne Andersson says:

August 7th, 2011 at 02:27

I suppose you could also add a “honeypot submit-button”, tricking the bot into believing that it has submitted its message even though it just ends up on a “for bots only”-page. You could even have it submit to an actual database, building a “Wall of spam”, separate from the normal users. The honeypot button could use the classic id=”submit” etc, while the button a normal user is supposed to press has something a bit random. (And the honeypot-button would also be hidden from the user, of course). I’d even suggest having two of these, one before and one after the normal submit button, as we can’t know if the bot is written to choose the first or last button it finds.

(I wrote this in a bit of a rush, but I hope it makes sense.)
gasper_k says:

August 11th, 2011 at 09:42

I’m missing any kind of Javascript protection method from your list. Is it safe to assume that most users have JS enabled, and most bots can’t execute a complex enough (and possibly randomly generated) piece of code?
Marko Mrdjenovic says:

August 11th, 2011 at 11:13

I don’t want to add JavaScript into the mix if there is no need for it. The honeypot depends on JavaScript to some extent…

Do you have any suggestions on how to add JavaScript into the mix?
gasper_k says:

August 12th, 2011 at 03:05

Well, didn’t really think a lot about it, but it could be something very easy: JavaScript adds a field to the form in which it writes 1, but bots don’t know anything about it. Very similar to the honeypot technique, except that the field isn’t anywhere in the HTML.

Other methods may include changing the form action url upon submit to the correct one (and bots submit to the fake address written in the form tag). You could rename the input fields, or submit button value, or whatever. The thing can be complicated to include even Ajax, but that’s probably overdoing it. There are many simple tricks that may work very well, but I never tested them.
Marko Mrdjenovic says:

August 12th, 2011 at 09:16

That’s only optimizing the Honeypot technique, which I think works well without JavaScript already.

The only thing you do with JavaScript is have a bigger % of people that fall into the spam bucket and that can be a problem.
gasper_k says:

August 16th, 2011 at 04:21

It may be only optimizing, but honeypot is fairly easy to brute force; just post multiple comments, leaving out a different field upon each post, or even try different combinations. I wouldn’t be surprised if bots already did that.
Michael says:

September 13th, 2011 at 05:53

One JavaScript technique I find appealing is the use of hashcash. You provide a token, the user provides a hash of the token that has a specific number of leading zeros. The more leading zeros, the more CPU power required to compute a proper hash.

You have to make sure the CPU expenditure is significant enough to deter botists but not enough to deter regular users.

You should also combine this with something like the one-time token or a date-field as the token to be hashed.
Marko Mrdjenovic says:

January 31st, 2012 at 07:40

Found the article that most of this is from: http://nedbatchelder.com/text/stopbots.html

You can find a similar technique at http://jeffcroft.com/blog/2012/jan/31/shut-down-comment-spam/
