Monday, May 24, 2010

Captcha? What's that?

Captchas - just what are they? The coinage is slowly moving into the mainstream, but there are a number of people who do not know what they are or understand how they work. I mentioned using them in the post below, so it's worth spending a brief moment on this topic...

The intent of a Captcha is to prevent automated systems from scouring the web, scraping up information as they go, or posting spam en masse. The idea is that, by forcing a human to step in, we can slow these systems down and frustrate their vile intents.

But first, a brief word about the word itself. Strictly speaking, it should be rendered in all caps, because it supposedly is a backronym for Completely Automated Public Turing test to tell Computers and Humans Apart. Personally, I find the all-cap rendering ugly, so I've resorted to initial-capping it.

Computers are not very good at interpreting images. A Captcha is an image of (usually) a series of letters and/or numbers, which then have to be interpreted by human eyes, and re-keyed manually. Current technology is not at the point where machines can process these efficiently.
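To make the "interpret and re-key" step concrete, here is a minimal sketch in Python of the bookkeeping behind a Captcha. Drawing the distorted image is left out entirely. The function names, the six-character length and the choice to skip look-alike characters are my own assumptions for illustration, not how any particular Captcha service does it.

```python
import hmac
import secrets
import string

# Characters for the challenge; look-alikes such as 0/O and 1/I are left out
# (an illustrative choice, not something any particular service prescribes).
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in "0O1I"
)


def new_challenge(length=6):
    """Pick the random characters that would be drawn into the image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def check_answer(expected, typed):
    """Compare the re-keyed text with the stored text, ignoring case and outer spaces."""
    return hmac.compare_digest(
        expected.upper().encode(), typed.strip().upper().encode()
    )


# The server keeps `text` to itself and sends the visitor only the image.
text = new_challenge()
print(check_answer(text, text.lower()))  # a correct re-keying is accepted
print(check_answer(text, "not it"))      # a wrong one is rejected
```

The point is simply that the machine asking the question already knows the answer; the only hard part, for a computer on the other end, is reading it out of the picture.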

Needless to say, there are attempts to get around this, but they typically still involve humans - essentially, spammers pay a token sum to individuals (often in third-world countries) to interpret the images. This move is only partially successful because, no matter how low the pay is (typically about US$0.001 per image), the total still becomes intolerably large at the volumes spammers work in. At that rate, a million solved Captchas cost US$1,000.

Typically, a Captcha is a series of characters which are "warped" just enough that human eyes can still interpret them, but image-scanning software cannot. Techniques involve the use of color, skewing the image, or superimposing lines over it. There are a few variations on this theme, one being the use of pictures which then need to be identified (e.g. a picture of a camera, or an apple). A fascinating new one is the "Recaptcha", which deserves more explanation.

You may be aware that companies like Google have embarked on massive projects which involve scanning text and automatically converting it to computer characters, using optical character recognition (OCR). Much of this is the interpretation of old books, where the text is not always clear. As I said above, computers have trouble with this. So, why not use the legions of users out there who are asked to enter Captchas, and get them to assist with the interpretation of the unreadable stuff? Enter the Recaptcha (well, OK, officially reCAPTCHA!). You are probably familiar with seeing not one, but two words that must be interpreted.

The words are presented in random order. One of these words is already known by the requesting machine; the other one is not. The known word is used to perform the original purpose of the Captcha, while the other is the word the OCR software could not read. (It's often possible to guess which is which, based purely on the relative quality of each.) Your answer for the unknown word then gets sent back to Google, or wherever, and, when there is enough consensus that a word is what we say it is, it is assumed to be correct and is inserted into the text where the mystery blob originally appeared. Clever, huh?
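
For the curious, the two-word trick can be sketched in a few lines of Python. This is only my guess at the general shape of it, not the real reCAPTCHA code: the threshold of three matching answers, the scan id and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers count as "enough consensus".
AGREEMENT_NEEDED = 3

# Answers collected so far for each unknown word, keyed by a made-up scan id.
guesses = defaultdict(Counter)


def submit(known_word, known_answer, unknown_id, unknown_answer):
    """Grade the visitor on the known word; record the other answer as a vote."""
    if known_answer.strip().lower() != known_word.lower():
        return False  # failed the real test, so the second answer is discarded
    guesses[unknown_id][unknown_answer.strip().lower()] += 1
    return True


def accepted_reading(unknown_id):
    """Return the agreed text for an unknown word, or None if there is no consensus yet."""
    if not guesses[unknown_id]:
        return None
    word, count = guesses[unknown_id].most_common(1)[0]
    return word if count >= AGREEMENT_NEEDED else None


# Four visitors pass the known word "upon" and each offer a reading of the blob.
for answer in ["morning", "morning", "mourning", "morning"]:
    submit("upon", "upon", "scan-17", answer)
print(accepted_reading("scan-17"))  # prints "morning"
```

Notice that a visitor who gets the known word wrong contributes nothing, so only answers from people who passed the real test count toward the consensus.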

Click these Wikipedia links to learn more about Captchas and Recaptchas. Tell me what YOU think about this idea, as well as the use of Captchas in general.

