May, 2006 | Breaking Eggs And Making Omelettes

A brief digression: At a frequency of roughly once every 2 days, the MultimediaWiki sustains a drive-by spamming attack. It usually takes 2-3 minutes to clean up, although one morning I woke up to a massive spam attack that took me hours to revert; that’s what prompted me to enforce user registration. What strikes me is how much more serious this problem could possibly be. I occasionally get so annoyed that I investigate MediaWiki’s anti-spam features.

Second-order digression: If you think it’s hard to find good documentation on FFmpeg, try finding the documentation you need for a Wiki package, which is — in the time-honored tradition of eating one’s own dog food — all in Wiki form. Why is this a problem? It just feels so… “squishy”. It’s not all there, it’s always in flux, it can give you a general idea of what you want to know but never feels authoritative– the same controversial points as, for example, Wikipedia. In fact, my first encounter with the Wiki paradigm was the online documentation for some open source program or another. They constructed a Wiki outline and expected users to fill it in. That experience gave me a serious aversion to Wiki for a long time to come. That said, would it be hypocritical for me to mention that I very much want to set up a Wiki-based knowledge base for FFmpeg users and developers?

I have watched the email spam arms race with much interest for many years. I am fascinated by the technical challenges involved and the solutions proposed, each with its pros and cons. Every proposed measure could be thwarted with enough effort. A few years ago, Bayesian filtering caught on and it always struck me as the tactical nuclear weapon of spam filtering. It did a lot to solve the problem on the client side (though counter measures at various levels of the email network help matters).

Then blogs, with comments, and Wikis gained prevalence. The spam problem started all over again. What I can’t seem to understand is why people fighting the good fight on this new frontier have chosen to start the arms race from square one by banging at the problem with rocks instead of going straight to the nukes. I’m wondering why there aren’t any Bayesian solutions in the Wiki space. (Thankfully, it appears that there are Bayesian comment filtering plugins available for, at least, WordPress). How would it work? Perhaps initialize it by claiming that the entire set of existing pages is valid and then allow administrators to mark certain pages as spam, or certain users as known spammers. When an edit is submitted the Wiki runs the edit through the filter to determine if it “looks” like spam and rejects it. However, one of the underlying operating principles of the Bayesian method as applied to email is that every user’s mailbox looks very different than everyone else’s. A spammer would require knowledge of an individual mailbox in order to reliably thwart the filtering. Unfortunately, the “mailbox”, or body of messages, in this case would be unified and public. This would afford a spammer an ergonomic, interactive environment by which to test spams by dumping in the text of valid pages and tweaking them with spammy URLs until the pages get through.

Okay, so maybe the idea isn’t that straightforward after all. Forget I even brought it up.

Through it all, though, I still stand by the Wiki paradigm.

Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering

Monthly Archives: May 2006

Wiki Counterspam

Summer Of Code Reminder