Spam Statistics

Or, “Why oh why don’t people make raw data accessible?”

The Akismet people have made some statistics available on how many spam messages their WordPress plugin has trapped. They use a Flash applet to display their graph, which I hope means that the graph is being updated (instead of merely implying horrible software design). Here’s a screen shot from a moment ago:

This graph shows a few features of interest. First, there’s a big jump, apparently of several hundred thousand, in legitimate messages in mid-May. I wonder if this actually represents a new spamming technique. Second, both “ham” and spam show periodicity. Running this time series through a Fourier transform might yield intriguing results.
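For the curious, here is a rough sketch of what "running it through a Fourier transform" could look like. It uses a made-up daily series with a weekly rhythm as a stand-in, not Akismet's numbers; the function name `dominant_period` and the 630-day length are my own assumptions for illustration. It computes a plain discrete Fourier transform and reports the period of the strongest component.

```python
import cmath
import math

def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # subtract the mean so bin 0 does not dominate
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient, computed directly (O(n^2) overall, fine for short series)
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic stand-in for a daily count with a weekly cycle (NOT real Akismet data).
days = 630
fake_ham = [1000 + 200 * math.sin(2 * math.pi * t / 7) for t in range(days)]
period = dominant_period(fake_ham)
print(period)  # 7.0 for this synthetic series
```

With real counts the peak would be broader and noisier, and a proper FFT routine would be the sensible choice for anything much longer than this.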

Sadly, the Akismet folks aren’t providing actual numbers to go along with the pretty pictures, and extracting them from a graph like this doesn’t sound like my idea of a fun Wednesday afternoon.

I’d also be curious to see what the ratio of spams caught to Akismet plugins installed looks like as a function of time.

UPDATE (12 July 2007): The algorithm always finds raw data! The numbers necessary to draw the chart can be retrieved in XML format here, and the snapshots I’ve been playing with are here and here.

6 thoughts on “Spam Statistics”

  1. The chart data as XML is here.

    (The almost orgasmic Firebug extension lets you see all the network traffic from a page, and lots of other goodies. Orgasmicity may vary.)

    I wonder what the jump is, too. New spamming technique is definitely possible, and I can’t come up with a slam-dunk other idea.

    At first I thought Akismet got a bunch of new users suddenly (Slashdot or something), but then spam would also spike and stay higher. If, when FFT’d, the “new ham” closely follows the weekday/weekend patterns of past ham rather than those of spam, then maybe it truly is ham and Akismet lowered its false positive rate, although that’s strange because it looks like the ham tripled overnight. Maybe some service with its own filtering started using Akismet on its “probably ham” messages? Or, quite possibly, some chart data is flat wrong or they changed how they counted something, though I can’t figure what that would be.

  2. Hmmm, upon cursory inspection (i.e., open up Octave and try the first things I think of), the spam spectrum seems to have a peak at a period of about 5 days, while the “ham” spectrum has a peak at a period of 7 days. I might be able to tell more, but I have to leave the office now to head to a birthday dinner.

    Ciao / chow!

  3. Aha — “a major site started running a type of content through it that wasn’t comments, it’s 95% legit rather than the other way around, so the ham spiked because of that.” Thanks.

    I’ll post again if the Fourier transform, autocorrelation etc. analyses show anything interesting.

  4. Next observation: the autocorrelation of the spam count decreases fairly steadily, but the autocorrelation of the “ham” count drops precipitously over about 50 days and then decreases much more slowly — almost a “hockey stick” kind of curve. If you leave out the part with the “new service” (which kicks in at day 578 of 630) then the autocorrelation drops smoothly.
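The autocorrelation comparison described in the last comment can be sketched along these lines. This is only an illustration on synthetic data: the weekly sine wave, the step size of 2000 and the helper name `autocorrelation` are assumptions of mine, and only the day counts (578 of 630) come from the comment. The idea is to compute the normalized autocorrelation for the full series and again with the final stretch left out.

```python
import math

def autocorrelation(series, max_lag):
    """Normalized sample autocorrelation for lags 0..max_lag (lag 0 is 1.0)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]

# Synthetic "ham": a weekly cycle plus a level shift starting at day 578 of 630.
# The shift size is invented; it only mimics a new source of legitimate messages.
days, shift_day = 630, 578
fake_ham = [
    1000 + 200 * math.sin(2 * math.pi * t / 7) + (2000 if t >= shift_day else 0)
    for t in range(days)
]

full_acf = autocorrelation(fake_ham, 60)                  # includes the level shift
trimmed_acf = autocorrelation(fake_ham[:shift_day], 60)   # leaves the shifted part out

print(round(full_acf[3], 2), round(trimmed_acf[3], 2))
```

In this toy series the level shift dominates the variance, so the full-series autocorrelation stays positive at short lags, while the trimmed series shows only the weekly oscillation. Whether the real data behaves the same way would need the actual numbers.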
