Wiki Counterspam | Breaking Eggs And Making Omelettes

A brief digression: At a frequency of roughly once every 2 days, the MultimediaWiki sustains a drive-by spamming attack. It usually takes 2-3 minutes to clean up, although one morning I woke up to a massive spam attack that took me hours to revert; that’s what prompted me to enforce user registration. What strikes me is how much more serious this problem could possibly be. I occasionally get so annoyed that I investigate MediaWiki’s anti-spam features.

Second-order digression: If you think it’s hard to find good documentation on FFmpeg, try finding the documentation you need for a Wiki package, which is, in the time-honored tradition of eating one’s own dog food, all in Wiki form. Why is this a problem? It just feels so… “squishy”. It’s not all there, it’s always in flux, and it can give you a general idea of what you want to know but never feels authoritative: the same points of controversy that surround, for example, Wikipedia. In fact, my first encounter with the Wiki paradigm was the online documentation for some open source program or another. They constructed a Wiki outline and expected users to fill it in. That experience gave me a serious aversion to Wiki for a long time to come. That said, would it be hypocritical for me to mention that I very much want to set up a Wiki-based knowledge base for FFmpeg users and developers?

I have watched the email spam arms race with much interest for many years. I am fascinated by the technical challenges involved and the solutions proposed, each with its pros and cons. Every proposed measure could be thwarted with enough effort. A few years ago, Bayesian filtering caught on, and it always struck me as the tactical nuclear weapon of spam filtering. It did a lot to solve the problem on the client side (though countermeasures at various levels of the email network help matters).

Then blogs, with comments, and Wikis gained prevalence. The spam problem started all over again. What I can’t seem to understand is why people fighting the good fight on this new frontier have chosen to start the arms race from square one by banging at the problem with rocks instead of going straight to the nukes. I’m wondering why there aren’t any Bayesian solutions in the Wiki space. (Thankfully, it appears that there are Bayesian comment filtering plugins available for, at least, WordPress.) How would it work? Perhaps initialize it by claiming that the entire set of existing pages is valid and then allow administrators to mark certain pages as spam, or certain users as known spammers. When an edit is submitted, the Wiki runs the edit through the filter to determine if it “looks” like spam and, if so, rejects it. However, one of the underlying operating principles of the Bayesian method as applied to email is that every user’s mailbox looks very different from everyone else’s. A spammer would require knowledge of an individual mailbox in order to reliably thwart the filtering. Unfortunately, the “mailbox”, or body of messages, in this case would be unified and public. This would afford a spammer an ergonomic, interactive environment in which to test spams by dumping in the text of valid pages and tweaking them with spammy URLs until the pages get through.
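To make the idea concrete, here is a minimal sketch of what such a filter might look like: a tiny two-class naive Bayes classifier in Python, trained on existing pages as valid text and on administrator-flagged edits as spam. The names (`EditFilter`, `train`, `looks_like_spam`) are hypothetical and are not part of MediaWiki or any real plugin; the tokenizer and the add-one smoothing are simplifying assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class EditFilter:
    """Hypothetical two-class naive Bayes filter for wiki edits."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.docs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        """Record one document under the label 'ham' (valid) or 'spam'."""
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Return log(P(spam | text) / P(ham | text)) with add-one smoothing."""
        vocab = len(set(self.counts["ham"]) | set(self.counts["spam"])) or 1
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        # Prior odds from the number of documents seen in each class.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for token in tokenize(text):
            p_spam = (self.counts["spam"][token] + 1) / (totals["spam"] + vocab)
            p_ham = (self.counts["ham"][token] + 1) / (totals["ham"] + vocab)
            score += math.log(p_spam / p_ham)
        return score

    def looks_like_spam(self, text):
        """Flag the edit when spam is the more likely class."""
        return self.spam_log_odds(text) > 0.0
```

Training would amount to calling `train(page_text, "ham")` for every existing page and `train(edit_text, "spam")` for each edit an administrator flags. Note that this sketch shares the weakness described above: because the training text is public, a spammer could in principle probe the same model offline.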

Okay, so maybe the idea isn’t that straightforward after all. Forget I even brought it up.

Through it all, though, I still stand by the Wiki paradigm.

16 thoughts on “Wiki Counterspam”

Benjamin Larsson May 3, 2006 at 1:26 am

How about requiring a waiting period of 2-3 days after registering before you can post?

cartman May 3, 2006 at 3:00 am

Update to the latest MediaWiki; it now has CAPTCHA support, so it will stop those automated spam bots.

Multimedia Mike (Post author) May 3, 2006 at 9:22 am

Benjamin: I really don’t want to put up any further impediments to prevent people from editing. I think it was extreme enough to have to enforce registration.

cartman: I have been looking into the captcha support. I did not realize that it was part of the formal release now (and I dread having to figure out how to enable it). I was hesitant about it since I don’t want to have to type a captcha for each of my edits. But the experimental captcha system I saw only challenges the user if they try to add an external link to a page. So it may be worth investigating.
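The "challenge only when an external link is added" behavior can be sketched roughly as follows. This is an illustrative Python approximation and not the code of any actual captcha extension; the function names and the URL pattern are assumptions made for this example.

```python
import re

# Rough pattern for external URLs; an assumption for this sketch, not a
# complete URL grammar.
URL_RE = re.compile(r"https?://[^\s\]<>\"]+", re.IGNORECASE)


def new_external_links(old_text, new_text):
    """Return the URLs that appear in the new revision but not the old one."""
    return set(URL_RE.findall(new_text)) - set(URL_RE.findall(old_text))


def needs_captcha(old_text, new_text):
    """Challenge the editor only when the edit introduces a new URL."""
    return bool(new_external_links(old_text, new_text))
```

Under this scheme, ordinary edits that only touch prose or existing links would go through unchallenged, which addresses the worry about typing a captcha for every edit.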

cartman May 3, 2006 at 9:58 am

Hi Mike,

At least recent MediaWiki versions use a CAPTCHA on account registration.

Multimedia Mike (Post author) May 3, 2006 at 10:39 am

In that case, it could be a feature of the latest experimental 1.7 line. Wikipedia is on the bleeding edge with their own technology: http://en.wikipedia.org/wiki/Special:Version

VAG May 3, 2006 at 10:56 am

(Block log); 04:09 . . Multimedia Mike (Talk | contribs) (blocked “User:Multimedia Mike” with an expiry time of infinite: spammer)

And now you even decided to ban yourself? :)

Multimedia Mike (Post author) May 3, 2006 at 12:11 pm

Oh wow, that’s a stupid move even by my standards! Dang, I’ve never had to figure out how to unblock someone. I hope I’ll be able to undo this move. This came about because I decided to mark the offenders as spammers in their user pages, revert their changes, and THEN ban the offending users. I was toying with not banning them at all to see if the spammer actually would re-use the same account but decided against it. Unfortunately, since I deleted all of the newly-created spammed pages, the new spam usernames were no longer in any log entries, except on the ones where I edited their user pages. Thus, I accidentally banned myself when I clicked the entries with the spammer usernames.

Okay, I just figured out how to unblock myself…

cartman May 3, 2006 at 12:54 pm

No, actually a Google search[1] reveals that thousands of sites are already using MediaWiki with Captcha support.

[1] http://www.google.com/search?q=mediawiki+captcha&ie=UTF-8&oe=UTF-8

Benjamin Larsson May 3, 2006 at 1:41 pm

OK, I’d vote for a captcha on the registration page; it should be enough.

Multimedia Mike (Post author) May 3, 2006 at 2:30 pm

I will look into the captchas. Meanwhile, there was another drive-by a little while ago (it’s like someone runs in and takes a dump on the recent changes page). One curious point about this spam is that the spammy content always begins with “[We are delicate. We do not delete your content.]” and they probably think that makes it all right. Google for that phrase and find a bunch of spammed Wikis. However, as a very simple counterspam measure, I have entered that string into the Wiki’s $wgSpamRegex variable until I can get captchas installed correctly (and upgrade to 1.6.5 just for fun).
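One detail worth illustrating: the spam phrase contains square brackets and periods, which are regular expression metacharacters, so it has to be escaped before it can be matched literally. The real setting lives in MediaWiki's PHP configuration; the snippet below is only a Python sketch of the matching idea, with hypothetical names, and the case-insensitive flag is a choice made for this example.

```python
import re

SPAM_PHRASE = "[We are delicate. We do not delete your content.]"

# re.escape() turns "[", "]" and "." into literal characters. Without it,
# the brackets would be parsed as one big character class.
SPAM_REGEX = re.compile(re.escape(SPAM_PHRASE), re.IGNORECASE)


def edit_is_blocked(new_text):
    """Reject any edit whose text contains the known spam phrase."""
    return SPAM_REGEX.search(new_text) is not None
```

A fixed-string filter like this is trivially defeated once the spammer changes the phrase, which is why it only makes sense as a stopgap.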

VAG May 3, 2006 at 4:35 pm

You could do a simple user-name filter to kill all logins that were generated using the time() function.

Multimedia Mike (Post author) May 3, 2006 at 4:49 pm

Ah! So that’s how they’re generating those random usernames! I have been curious. Yes, I thought about checking if a username started with “114” and blocking those.
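The “114” prefix is consistent with Unix timestamps: midnight UTC on May 3, 2006 is 1146614400 seconds since the epoch. Rather than hard-coding the prefix, a slightly more general check could test whether a username begins with a ten-digit number that falls close to the current time. The sketch below is in Python with hypothetical names; the ten-digit assumption and the 30-day window are arbitrary choices for this example.

```python
import re
import time

# Assumption: the generated names start with a 10-digit Unix timestamp.
LEADING_TIMESTAMP = re.compile(r"(\d{10})")


def looks_like_timestamp_name(username, now=None, window_days=30):
    """Return True if the name starts with a timestamp near the current time."""
    now = time.time() if now is None else now
    match = LEADING_TIMESTAMP.match(username)
    if match is None:
        return False
    # Compare raw second counts; 86400 seconds per day.
    return abs(now - int(match.group(1))) <= window_days * 86400
```

For example, with `now=1146700000`, the name `"1146614400"` would be flagged while `"Multimedia Mike"` would not.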

These are all very basic progressions in the arms race. We presume there is a naive automated program on the other end committing these acts. I always ask myself, “Would it be very difficult to modify such a program to get around these simple roadblocks we’re putting up?” Of course not. But hopefully the viability of Wiki spam will go away soon thanks to the fact that MediaWiki software now instructs robots not to follow links outside the Wiki. This really defeats the purpose of the spamming. Not that the less intelligent spammers will ever figure that out.

Mat May 4, 2006 at 9:03 am

I don’t know how strong the MediaWiki captcha is, but some are easy to crack: http://sam.zoy.org/pwntcha/

Multimedia Mike (Post author) May 4, 2006 at 9:12 am

I found that pwntcha project the other day when I was investigating Wiki captcha possibilities. Very fascinating work, ratcheting up the technology. Sometimes I feel like attempting to write a Wiki spam script that will bypass all of the silly counter-measures I can think up, just to push the Wiki anti-spam technology forward more quickly.

cartman May 4, 2006 at 2:20 pm

Also see http://meta.wikimedia.org/wiki/ConfirmEdit_extension

SvdB May 6, 2006 at 5:39 pm

I’m using the ConfirmEdit extension myself, and it drastically reduces spam. About once a week there’s still the lone wannabe spammer who actually spams a page by hand, not realising we’ve got rel=nofollow.

I advise against requiring registration, as you’ll lose contributions.
