As long as internet technology keeps advancing, there will always be someone trying to figure out how to hack, crack or game it. The most recent victim is the CAPTCHA … See the article “Digital Deception” by Washington Post staff writer Peter Whoriskey, which summarizes the state of the problem.

What is a CAPTCHA, you may well ask? Carnegie Mellon’s version, ReCAPTCHA, is one example, and they’ve been popping up everywhere across the web.

CAPTCHA - Completely Automated Public Turing Test to Tell Computers and Humans Apart

The primary purpose is to prevent abuse by automated programs (bots), usually written to generate spam, blogspam, pingspam, etc. It works on the basis that a computer program cannot read distorted text as well as a human can, so a bot that fails the test cannot go any further into the web site.

To simplify, CAPTCHA works on the premise that Optical Character Recognition (OCR, the former and well-established state of the art in recognizing letters and words) cannot recognize fuzzy, distorted letters.
OCR relies upon letters and words … Cracking a CAPTCHA, however, doesn’t rely on OCR.

OCR was the state of the art when we were dealing in a character-based world. Sorry, that’s not our world any more … we advanced past ‘green-screens’ to billions of pixels and rich digital media.

The more recent state of the art drills down to the pixel, the spatial relationships between pixels and massive-scale pattern matching. While the CAPTCHA is a good deterrent, it is not unbreakable. Here are the highlights, if you have the time and energy:

  • Most CAPTCHAs are the same size, say, 300x60 pixels
  • Within the 300x60 pixel box there are distorted characters
  • You’ll need to develop Corpora (a massive database of examples for the pattern matching algorithms to compare against and solve) … think millions and millions of samples … By the way, this is what Carnegie Mellon is doing with their ReCAPTCHA project (they’re doing it for good reasons, to be able to digitize books; refer to the Carnegie Mellon ReCAPTCHA website)
  • Write a bot to access the CAPTCHA, attempt to solve against the Corpora with the pattern matching engines …
  • Crack the CAPTCHA and you’re in
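
To make the steps above a little more concrete, here is a toy sketch in Python of pixel-level pattern matching against a corpus. Everything in it is an assumption made for illustration: the 3x5 glyphs, the three-sample `CORPUS`, the fixed cell width and the helper names (`hamming`, `classify`, `split_cells`, `solve`) are all hypothetical. It uses simple nearest-neighbor matching on pixel differences, which is only a stand-in for the far heavier algorithms and corpora a real attempt would need.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (ink) and '.' (background) pixels

# Hypothetical labeled corpus. A real one would hold millions of samples.
CORPUS: List[Tuple[str, Glyph]] = [
    ("A", (".#.", "#.#", "###", "#.#", "#.#")),
    ("B", ("##.", "#.#", "##.", "#.#", "##.")),
    ("C", (".##", "#..", "#..", "#..", ".##")),
]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyph: Glyph) -> str:
    """Return the label of the corpus sample with the fewest differing pixels."""
    label, _ = min(CORPUS, key=lambda item: hamming(item[1], glyph))
    return label


def split_cells(image: Glyph, width: int) -> List[Glyph]:
    """Cut a fixed-size image into equal-width character cells."""
    cols = len(image[0])
    return [tuple(row[x:x + width] for row in image) for x in range(0, cols, width)]


def solve(image: Glyph, width: int = 3) -> str:
    """Classify each cell in turn and join the guesses into one answer."""
    return "".join(classify(cell) for cell in split_cells(image, width))


# The letters C, A, B side by side, with one pixel flipped in each letter
# to play the role of distortion.
NOISY: Glyph = (
    ".####.##.",
    "#..#.##.#",
    "#.######.",
    "#..#.##.#",
    ".###.##..",
)

print(solve(NOISY))
```

Each distorted letter here still sits only one pixel away from its clean sample, so the nearest match recovers it. Real CAPTCHAs warp, overlap and rotate their characters, which is exactly why the corpus and the math have to be so much bigger than this.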

Now, just when you’re thinking about taking a few hours on Saturday afternoon, you’ll need a little more than a MacGyver supply list and a spare server …

  • 2-3 PhD Mathematicians, Statisticians to develop and test the algorithms … btw, Mathematicians need to be deep in Graph Theory
  • Millions, if not hundreds of millions, of samples for the Corpora for the pattern matching algorithms
  • Some smart developers (OK, if you’re really smart, you’ll do it yourself)
  • Enough servers, infrastructure and technology to capture, process and initiate the hack

The lesson? Relying on old technology (such as OCR) to solve new problems will always have a limited shelf life.

The bad news for any innovation is that you have to spend as much time innovating as you do figuring out how someone will hack or game the system … and then the games begin.
