RegioWikiCamp 2011/Sessions/Surviving SPAM

About SPAM
I'll keep things short here: anyone who has access to the internet and uses e-mail knows about SPAM.

But there is a special kind of SPAM which is not widely known: weblink spam.

What's Weblink-Spam?
You may have noticed ugly weblinks in blog comments, guest books, bulletin boards and of course in wikis.

The weblinks may be surrounded by randomized text, Chinese or English, perhaps ad copy for lifestyle drugs (like v14gr4) from online pharmacies, to simulate real content.

The goal is to lead internet users who search for something directly to the web presence of online pharmacies or gambling sites instead of the product owner's web presence (today: the article on Wikipedia).

Websearch results
The goal is to be the first result on popular search engines like Google, or at least get listed on the first page (ranks 1-10 on Google). The calculation of the famous PageRank is somewhat complicated, but it is mainly based on the number of weblinks pointing to a site.

So basically you need a LOT of weblinks to get your page listed among the top results for a globally well-known term like viagra or gambling.

Accelerating web spiders with web 2.0
In web 1.0, web crawlers needed days or weeks to find a new term on your webpage or website.

In web 2.0 we have ATOM/RSS feeds on blogs, wikis and bulletin boards, and sometimes these sites ping special RSS aggregators about current activity. If your blog or wiki is big or famous enough and has lots of good content, the RSS feed will be polled by special crawlers at short intervals, like every hour.

If you write a new article or edit a known article about some word, chances are good that this word will be findable after a few hours when searching for it on Google, Yahoo or maybe MSN. If the word is very common, chances are bad that you will get listed in the top 100 or top 500 (pages 1..50 on Google).

getting on top with common words in a short time
Weblink spammers don't have much time to earn good page ranks for common words, so they need a brute force method: getting tons of weblinks with their word pointing to their sites.

Brute force means: writing content with special weblinks to tens of thousands of unmoderated guest books, weblogs, open bulletin boards and open wikis.

Recognizing SPAM
Every human is able to distinguish regular article content from spam content, but this is a very complicated task for computers and algorithms.

using bad-word-lists
Using static bad word lists is a very easy way to prevent some kinds of spam. But the lists will grow and grow, as the spam changes from time to time.

Some words may be used for spam but will also have a legitimate meaning.

For example, if you want to prevent spam for a lifestyle drug, just filter out viagra. But if you then want to write about the Pfizer plant in your city or region, you won't be able to document that they produce Viagra.

Static word filtering is not context sensitive and a static bad word list is never complete.
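
In MediaWiki such a static list typically ends up in $wgSpamRegex, a regular expression that is matched against every saved text. A minimal sketch, with the word list of course being only an example:

```php
# LocalSettings.php – a minimal static bad-word filter (example word list only).
# Any edit whose text matches this pattern is rejected and the user is shown
# the MediaWiki:Spamprotectiontext message instead.
$wgSpamRegex = '/\b(v[i1]agra|c[i1]al[i1]s|online-casino)\b/i';
```

The Pfizer example above shows the limitation: this pattern also rejects a perfectly legitimate sentence about the local Pfizer plant.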

manually
You may want to live without automated spam filtering and do everything manually. This is accurate but time consuming: you will have to do it every day, even when you are on holiday. You can revert spam with one click in MediaWiki, so each individual revert is a very easy task.

But you must do this every day, because a spammed wiki looks shabby and a shabby wiki will make users run away.

Conclusion
Automated SPAM filtering based on keywords is always incomplete and may (will!) prevent regular users from writing legitimate things. In the short term it is an easy and cheap way to stop some recurring spam bots, but it is not optimal as a prophylactic spam filter.

Identify friend or foe
Instead of just analyzing submitted content, you may want to analyze the current user and try to find out whether the user is a friend (real user) or foe (spambot). There are several views on a user, so let's discuss them:

... by IP address
Every user uses a web client and accesses the wiki via TCP/IP, so every request to our wiki originates from an IP address.

IP addresses come in different "types":
 * static IP addresses on leased lines, as companies have: the DNS reverse record may(!) point to names like proxy.example.com or gateway.example.com
 * dynamic IP addresses on dial-up, mobile internet or consumer DSL: the DNS reverse record looks like 254-253-123-4.dyn.area.provider.net or 23-42.net13.broadband.provider.net
 * static IP addresses of dedicated/leased servers: the DNS reverse record may look like www.example.com, server23.hostingprovider.net or customer23-42-55.yourserver.rrdns.hostingprovider.net

IP addresses belong to IP networks, and IP networks belong to network blocks that are delegated to providers.

Using whois will bring up information about the owner of an IP address and might provide information about network ranges, provider names and often abuse contacts.

If there is no detailed information about the provider, you may want to traceroute to this IP address and analyze the IP addresses and reverse DNS records of the hops before the targeted IP address.

You may find information about the internet carrier (backbone provider) the provider is connected to.

This information is sometimes a valuable source when analyzing weird behaviour on your wiki that originates from an anonymous user (identified only by an IP address).
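
The reverse DNS part can be automated quite cheaply; whois, by contrast, should stay a manual tool (see the note on rate limits below). A minimal sketch in PHP, where the hostname patterns are only examples and depend entirely on each provider's naming scheme:

```php
<?php
# Reverse-DNS sketch: roughly classify an IP address by its PTR record.
# The patterns are examples only – every provider names its hosts differently.
function classifyByReverseDns(string $ip): string {
    $host = gethostbyaddr($ip);   // returns the unmodified IP if no PTR record exists
    if ($host === false || $host === $ip) {
        return 'no reverse record';
    }
    if (preg_match('/(dyn|dial|dsl|pool|ppp)/i', $host)) {
        return "dynamic consumer line ($host)";
    }
    if (preg_match('/(server|host|static|colo)/i', $host)) {
        return "dedicated/leased server ($host)";
    }
    return "unclassified ($host)";
}

echo classifyByReverseDns('203.0.113.42'), "\n";   // documentation/example address
```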

You MUST NOT rely on this information alone when trying to identify friend or foe.

... by user agent
Practically every web browser sends a so-called User-Agent header with each request. This information may contain the browser's name and version, the operating system, plugins and extensions ...

Normal users use something like Mozilla Firefox, Microsoft Internet Explorer (MSIE), Apple Safari or something else.

All kinds of bots or scripts may announce themselves as wget, curl, libwww or anything else that doesn't look like a normal browser.
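
A hedged sketch of such a User-Agent check in PHP; the pattern list is an example only, and as the following paragraphs stress it is a weak signal at best:

```php
<?php
# User-Agent sketch: flag clients that announce themselves as scripts or tools.
# The pattern list is an example only, and the header can be faked trivially.
function looksLikeScript(string $userAgent): bool {
    return $userAgent === ''
        || (bool) preg_match('/(wget|curl|libwww|python|perl|java)/i', $userAgent);
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (looksLikeScript($ua)) {
    // don't block outright – just lower the request score (see scoring below)
}
```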

But: this header is just additional information and never a reliable source!

WWW libraries (e.g. libcurl) allow setting arbitrary user-agent strings, so a "bad bot" may mask itself as MSIE 8.0.

You MUST NOT rely on this information alone when trying to identify friend or foe.

... by behavioural analysis
You may look at the successive requests within a user session and will find different types of requests.

Looking at single requests:
 * Most web browsers will fetch /favicon.ico initially; bots usually won't.
 * Most web browsers will fetch images and other embedded objects; bots won't.
 * Most web browsers will fetch referenced external CSS stylesheets and JavaScript; bots often won't.
 * When fetching referenced external files (images, CSS, JS, ...), web browsers will usually send a Referer header, which identifies a request (URL) as a subsequent request of a loaded page. Referers may be stripped by proxies or faked by bots, so you cannot rely on this information.

Looking at subsequent requests:
 * A normal user visit begins with a GET request on a wiki page, maybe with a referer from a search engine, followed by subsequent requests for CSS, JS and images. Before submitting article content the user has to load the edit page (GET) and may use the preview function (POST) before submitting the changed text (POST).
 * A typical spambot session might not even GET an article before submitting new content: no CSS/JS/images, no referers – you will just see a POST request in the server's access.log. Even if the bot is not that stupid, it won't act like a normal user: even if it requests every CSS/JS/image file, accepts cookies, sets correct referers, uses a plausible user-agent string and so on, most bots do all of this within a few seconds, while a real user needs about 10 to 60 seconds before submitting article content. Even heavy users need more than 10 seconds per edit, which the timing sketch below exploits.
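
A minimal timing sketch in PHP, kept independent of MediaWiki internals; the 10-second threshold is only an example:

```php
<?php
# Timing sketch: measure the time between loading the edit form (GET)
# and submitting the text (POST). The 10-second threshold is an example only.
session_start();

if ($_SERVER['REQUEST_METHOD'] === 'GET') {
    // the edit form was loaded – remember when
    $_SESSION['edit_form_loaded'] = time();
} elseif ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $loaded  = $_SESSION['edit_form_loaded'] ?? 0;
    $elapsed = time() - $loaded;
    if ($loaded === 0 || $elapsed < 10) {
        // no edit form was loaded at all, or the text was submitted suspiciously
        // fast: lower the score or show a CAPTCHA instead of blocking outright
    }
}
```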

... by heuristics
You may (automatically) look at different things when you receive a POST request (saving an article, previewing, logging in) and combine technical characteristics of the previous requests into a friend/foe verdict that is right maybe 90% of the time. But the remaining 10% are false positives that block real users or false negatives that let a spambot through.

Some information cannot be retrieved within the client request, e.g. the whois data for the IP address or the domain in its reverse DNS record (if any). You may query DNS on every request (it caches its answers and usually responds within about 10 ms), but whois is a database lookup limited to very few requests per minute, so you MUST NEVER send whois requests on every POST to your wiki.

Information from the web server's log file can only be extracted asynchronously by external software, which might store its findings in a database. Information based on subsequent requests may therefore be available after five minutes to one day, but never within the current request.

... using a credit point account and scoring
You may keep a credit point account in your database which accumulates credit points from different actions. Every account can be queried, increased or decreased immediately (within a few milliseconds).

You may define events or actions which increase or decrease the account balance. These events or actions may be synchronous (within a request) or asynchronous (the result of a later request or of behaviour analysis).


Actions or events may be:
 * reading an article (+1p)
 * searching for articles (+5p)
 * creating a personal user page (+30p, asynchronous, by bot)
 * using preview before submitting articles (+5p each, synchronous)
 * resolving a CAPTCHA (+50p each, synchronous)
 * submitting plain weblinks (-5p each, asynchronous, by bot)
 * using Template:Weblink (+10p each, asynchronous, by bot)
 * setting up and validating a personal e-mail address (+100p once, asynchronous, by bot)
 * logging in (+10p each, asynchronous)
 * not logged in for one day (-5p, asynchronous, by cron)
 * not logged in for one week (-20p, asynchronous, by cron)
 * not logged in for one month (-100p / resetting the score, asynchronous, by cron)
 * attending a wiki meeting (in real life) (+100p, asynchronous, by admin)
 * rude user behaviour (-50p, asynchronous, by admin)

Additionally you may alter the account balance with per-request scores like:
 * access from a whitelisted/blacklisted IP address range (+20p / -20p)
 * access using whitelisted/blacklisted user agent patterns (+10p / -10p)
 * access with an up-/downscored (banned?) user name (+10p / -50p), e.g. "known good users" or "known trolls"
 * presenting a special black/white scoring cookie containing (encrypted, signed) offsets (-100p … 0 … +100p)

Based on their account balance plus request score the user may:
 * edit talk pages (needs >5p)
 * edit articles (needs >20p)
 * create new articles (needs >50p)
 * rename articles (needs >70p)
 * upload files (needs >90p)
 * bypass the spam filter (needs >50p)
 * or only read the wiki (with less than 5p)

You SHOULD make these credit points and the request scoring transparent to the user!
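
A minimal sketch of such a credit point account in PHP; the table name, the MySQL-style upsert and the thresholds are only examples, not code from the session:

```php
<?php
# Credit-point sketch: one integer balance per (anonymous) user profile.
# Table name, upsert syntax (MySQL) and thresholds are examples only.
class CreditAccount {
    public function __construct(private PDO $db) {}

    public function adjust(string $profileId, int $points): void {
        // accumulate points from synchronous and asynchronous events alike
        $stmt = $this->db->prepare(
            'INSERT INTO credit_account (profile_id, balance) VALUES (?, ?)
             ON DUPLICATE KEY UPDATE balance = balance + VALUES(balance)'
        );
        $stmt->execute([$profileId, $points]);
    }

    public function balance(string $profileId): int {
        $stmt = $this->db->prepare('SELECT balance FROM credit_account WHERE profile_id = ?');
        $stmt->execute([$profileId]);
        return (int) $stmt->fetchColumn();
    }

    public function may(string $profileId, string $action, int $requestScore = 0): bool {
        // thresholds taken from the list above; unknown actions are never allowed
        $needed = ['edit_talk' => 5, 'edit' => 20, 'create' => 50, 'rename' => 70, 'upload' => 90];
        return ($this->balance($profileId) + $requestScore) > ($needed[$action] ?? PHP_INT_MAX);
    }
}

# usage: $account->adjust($id, +5) after a preview; $account->may($id, 'edit', $requestScore)
```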

... manually
You may analyze the behaviour of users (or of anonymous users' IPs) on the wiki manually by
 * resolving the IP addresses behind user names (Extension:CheckUser)
 * reading the server log files
 * reading the MediaWiki logs
 * reading the IP's or user's contributions
 * looking at DNS reverse records and doing whois requests (maybe the client IP belongs to a "give us money, we won't ask questions" hosting provider in eastern Europe with a Western Union billing contact in Panama City)

Then you may decide whether or not to block a user, an IP address or an IP address range (based on whois data or just a /16 or /24 on gut feeling) using the normal block mechanisms in MediaWiki.

No additional development is required here, and no "automatic spring gun" can hit innocent users.

Hardening MediaWiki
MediaWiki has a fine-grained permission scheme where you can configure the possible actions for each user group (anonymous users, registered users, confirmed users, sysops).

... using Out-of-the-Box methods

 * Configure LocalSettings.php using $wgGroupPermissions (see the example below)
 * Read the in-depth documentation about User rights
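
A minimal example of such a configuration; which rights you actually restrict is of course up to your wiki:

```php
# LocalSettings.php – example only, pick the restrictions that fit your wiki.
# Anonymous users may read but not edit:
$wgGroupPermissions['*']['edit'] = false;
# Registered users may edit, but only autoconfirmed accounts (older than
# $wgAutoConfirmAge) may create pages, move pages or upload files:
$wgGroupPermissions['user']['edit'] = true;
$wgGroupPermissions['user']['createpage'] = false;
$wgGroupPermissions['user']['move'] = false;
$wgGroupPermissions['user']['upload'] = false;
$wgGroupPermissions['autoconfirmed']['createpage'] = true;
$wgGroupPermissions['autoconfirmed']['move'] = true;
$wgGroupPermissions['autoconfirmed']['upload'] = true;
```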

... using extensions

 * Manual:Combating spam
 * Category:Spam management extensions
 * Extension:AntiBot is a simple framework for spambot checks and trigger payloads. The aim is to allow for private development and limited collaboration on filters for common spam tools such as XRumer.
 * Extension:Bad_Behavior detects weird user actions
 * Extension:Check_Spambots checks the client IP address against some IP databases
 * Extension:ConfirmEdit lets you use various different CAPTCHA techniques, to try to prevent spambots and other automated tools from editing your wiki, as well as to foil automated login attempts that try to guess passwords.

Approach
Don't disturb real users (even anonymous!), effectively block bots and keep unknown users happy.

Combining some of the above methods will give us non-obtrusive integration for most real users. Normal users should never be disturbed by spam protection mechanisms such as CAPTCHAs, and ideally should not even come into contact with these things.

So what we need is passive detection of "user vs. bot", and only when in doubt whether someone is a real user do we raise the shields and ask them to prove they are not a bot. And once we can be sure it is a real user, we don't want to disturb them again.

Relying on users IP is sometimes useless, sometimes hazardous.

For example, mobile users may get a different IP address every time they reconnect via UMTS, while other mobile users sit permanently behind their provider's web proxy (with the same IP for all users of that provider). Therefore: binding "good user"/"bad user" to IP addresses is useless.

Relying on per-request information such as client request headers (user-agent, referer and so on) is useless, because well-designed bots can (and will) fake this information.

Relying on content analysis is never accurate, as the mechanisms are based on regular expressions or plain bad-word lists. Maintaining these rules is time consuming with minimal effect.

Analyzing a user's behaviour cannot be done within one request, as the needed information is extracted from many requests.

So what we need is a server-side database containing anonymous "user profiles" which can carry a score or a credit point account. Users will continuously contribute to their user profile ...

Permissions based on a score
On Rhein-Neckar-Wiki I experimented with a simple scoring mechanism based on IP address ranges, DNS information (TLD, patterns) and user agent strings. Depending on the score, editing was either prohibited entirely, prohibited only for anonymous edits, or open.

Permissions by scoring is a good idea, but always getting a reliable score is not trivial. The worst case is prohibiting a real user from editing articles in any way; such a user might be lost forever.

I then disabled the scoring-based mechanism and pursued another strategy: wiki spam is 99% about submitting URLs into articles, but practically every wiki article has legitimate weblinks.

Block plain Weblinks
So I developed some wiki templates for all kinds of weblinks and put the strings 'http' and 'ftp' into the static spam filter rules – actually these are the only two strings we filter!

Whenever someone submits article text containing http, they will see MediaWiki:Spamprotectiontext, which contains instructions on how to save articles with weblinks properly.
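
With MediaWiki's built-in static spam filter this boils down to something like the following sketch (the exact pattern on Rhein-Neckar-Wiki may differ):

```php
# LocalSettings.php – reject the raw strings 'http' and 'ftp' in submitted text.
# Anyone who tries to save them is shown MediaWiki:Spamprotectiontext instead,
# which explains how to add weblinks via the wiki's link templates.
$wgSpamRegex = '/(http|ftp)/i';
```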

Instead of submitting raw URLs, Rhein-Neckar-Wiki users have to use one of the following link templates:
 * short form:


Pros:
 * We don't need any other spam protection, because spam bots won't read and understand the instructions for making weblinks in Rhein-Neckar-Wiki. It's a kind of Turing test.
 * It's very easy to set up out of the box with MediaWiki.

Cons:
 * Users from other wikis, like Wikipedia, get confused or angry about this "barrier".
 * Acceptance among users is as different as day and night.
 * You will have to search for every occurrence of 'http' in existing articles and convert the raw links into templates, even on talk pages and everywhere else. On wikis with thousands of articles and talk pages this can be a huge task.
 * You need to explain this everywhere in the documentation.
 * You need to defend this whenever someone complains about it.
 * You might get frustrated, just like your users will.

Ideas and Forecast
I guess we will get back to score-based rules and re-enable the "http" strings.

But before that I will need to research and develop a more reliable scoring algorithm, using not just IP addresses, DNS information or other static client-side information like user agents (which may be faked!).

My idea is behavioural scoring: tracking users' activities and increasing or decreasing a long-term score based on a credit point account plus request scoring (see above).

Users will have to accept (encrypted) scoring cookies in their web browser (just as they must if they want to log in).
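
A hedged sketch of such a scoring cookie in PHP, signed rather than encrypted for brevity; the cookie name, lifetime and server-side secret are placeholders only:

```php
<?php
# Scoring-cookie sketch: store a score offset client-side, signed with an HMAC
# so it cannot be tampered with. Name, lifetime and secret are placeholders.
const SCORE_COOKIE = 'wiki_score';
const SCORE_SECRET = 'replace-with-a-long-random-server-side-secret';

function setScoreCookie(int $offset): void {
    $payload = (string) $offset;
    $sig     = hash_hmac('sha256', $payload, SCORE_SECRET);
    setcookie(SCORE_COOKIE, $payload . '|' . $sig, time() + 30 * 24 * 3600);
}

function readScoreCookie(): int {
    $raw = $_COOKIE[SCORE_COOKIE] ?? '';
    [$payload, $sig] = array_pad(explode('|', $raw, 2), 2, '');
    if (!hash_equals(hash_hmac('sha256', $payload, SCORE_SECRET), $sig)) {
        return 0;   // missing or forged cookie: no offset
    }
    return (int) $payload;
}
```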

Bots won't do any of these things, won't collect (many) positive points and therefore won't be allowed to edit or create articles.

The score might be defined to range from 0p to 100p.

Based on the user's credit point account balance plus request score, they will be allowed or not allowed to perform different actions in Rhein-Neckar-Wiki (editing/creating/renaming articles, uploading files, ...).

Q&A
Questions anyone?