RegioWikiCamp 2011/Sessions/Surviving SPAM

From Regiowiki

About SPAM

I'll keep things short here: anyone who has access to the internet and uses e-mail knows about SPAM.

But there is a special kind of SPAM that is much less widely known: Weblink-Spam.

What's Weblink-Spam?

You may have noticed ugly weblinks in blog comments, guest books, bulletin boards and of course in wikis.

The weblinks may be surrounded by randomized text in Chinese or English, or by advertising keywords for lifestyle drugs (like v14gr4) sold by online pharmacies, in order to simulate real content.

The goal is to lead internet users who search for something directly to the web presences of online pharmacies or online gambling sites instead of the product owner's web presence (today: the article on Wikipedia).

Websearch results

The goal is to be the first result on popular search engines like Google, or at least to be listed on the first page (ranks 1-10 on Google). The calculation of the famous PageRank is somewhat complicated, but it is mainly based on the number of weblinks pointing to a site.

So basically you need a LOT of weblinks to get your page listed among the top results when someone searches for a term that is well known worldwide, like viagra or gambling.

Accelerating web spiders with web 2.0

In web 1.0, web crawlers needed days or weeks to find a term and your webpage or website.

In web 2.0 we have ATOM/RSS feeds on blogs, wikis and bulletin boards, and sometimes these sites ping special RSS sites about current activity. If your blog or wiki is big or famous enough and has lots of good content, the RSS feed will be polled by special crawlers at short intervals, for example every hour.

If you write a new article or edit an existing one about some word, chances are good that a search for this word on Google, Yahoo or maybe MSN will find it after a few hours. If the word is very common, chances are poor that you will be listed in the top 100 or top 500 (pages 1..50 on Google).

getting to the top with common words in a short time

Weblink spammers don't have much time to earn a good PageRank for common words. So they need a brute force method for getting tons of weblinks that contain their word and point to their sites.

Brute force means: writing content with special weblinks to tens of thousands of unmoderated guest books, weblogs, open bulletin boards and open wikis.

Recognizing SPAM

Everyone is able to distinguish regular article content from spam content. But this is a very complicated task for computers and algorithms.

using bad-word-lists

Using static bad word lists is a very easy way to prevent some kinds of spam. But the lists will grow and grow, as the spam changes from time to time.

Some words may be used for spam but also have a legitimate meaning.

For example, if you want to prevent spam for a lifestyle drug, just filter out viagra. But if you then want to write about the Pfizer company in your city or region, you won't be able to document that they produce viagra.

Static word filtering is not context sensitive and a static bad word list is never complete.
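
The two weaknesses can be illustrated with a minimal Python sketch. The word list, the function name and the sample sentences are hypothetical and only serve as an illustration; this is not the filter of any particular wiki software.

```python
import re

# Hypothetical static bad-word list; real lists grow much longer over time.
BAD_WORDS = ["viagra", "casino"]

def contains_bad_word(text, bad_words=BAD_WORDS):
    """Return True if any listed word occurs in the text (case-insensitive)."""
    pattern = r"\b(" + "|".join(re.escape(w) for w in bad_words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

spam = "Buy cheap Viagra online now"
legit = "Pfizer, a company in our region, produces Viagra."
obfuscated = "Buy cheap v14gr4 online now"

print(contains_bad_word(spam))        # spam is caught
print(contains_bad_word(legit))       # legitimate sentence is caught too
print(contains_bad_word(obfuscated))  # obfuscated spelling slips through
```

The filter cannot tell the spam sentence from the legitimate one, and the obfuscated spelling is missed until someone adds it to the list.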

manually

You may want to live without automated spam filtering and do it manually. This is accurate but time consuming. You will have to do it every day, even when you are on holiday. You can revert spam with one click in MediaWiki, so basically this is a very easy task.

But you must do this every day, because a spammed wiki looks shabby, and a shabby wiki makes users run away.

Conclusion

Automated SPAM filtering based on keywords is always incomplete and may (will!) prevent regular users from writing legitimate things. In the short term it is an easy and cheap way to stop some recurring spam bots, but it is not optimal for preventive spam filtering.

Identify friend or foe

Instead of just analyzing submitted content, you may want to analyze the current user and try to find out whether the user is a friend (real user) or a foe (spambot). There are several ways to look at a user, so let's discuss them:

... by IP address

Every user uses a web client and accesses the wiki using TCP/IP. Every request to our wiki originates from an IP address.

IP addresses can have different "types":

IP addresses belong to IP networks, and IP networks belong to IP network blocks that are assigned to providers.

Using whois will bring up information about the owner of an IP address and might provide information about network ranges, provider names and often abuse contacts.

If there's no detailed information about the provider, you may want to run traceroute to this IP address and analyze the IP addresses and reverse DNS records of the hops before the target IP address.

You may find information about the internet carrier (backbone provider) to which the provider is connected.

This information is sometimes a valuable source when analyzing weird behaviour on your wiki that originates from an anonymous user (who has just an IP address).

You MUST NOT rely on this information alone when trying to identify friend or foe.

... by user agent

Practically every web browser sends a so-called user-agent header with each request. This information may contain the web browser's name and version, operating system, plugins and extensions ...

Normal users use something like Mozilla Firefox, Microsoft Internet Explorer (MSIE), Apple Safari or something else.

All kinds of bots or scripts may use wget, curl, libwww or anything else that doesn't look like a normal browser.

But: This header information is just additional information and never a reliable source!

WWW libraries (e.g. libcurl) allow setting arbitrary user-agent strings. A "bad bot" may disguise itself as MSIE 8.0.

You MUST NOT rely on this information alone when trying to identify friend or foe.
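
As a rough illustration, a user-agent check might look like the following Python sketch. The pattern and the function name are assumptions made for this example. The result is only a hint, because a bot that sends a browser-like string passes unnoticed.

```python
import re

# Hypothetical pattern: substrings typical of the tools named in the text.
SCRIPT_AGENT = re.compile(r"wget|curl|libwww", re.IGNORECASE)

def user_agent_hint(user_agent):
    """Return 'missing', 'script' or 'browser-like'. This is a hint only."""
    if not user_agent:
        return "missing"
    if SCRIPT_AGENT.search(user_agent):
        return "script"
    return "browser-like"

print(user_agent_hint("Wget/1.12 (linux-gnu)"))
# A bot faking this string gets the same answer as a real MSIE 8.0 user:
print(user_agent_hint("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"))
```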

... by behavioural analysis

You may look at successive requests in a user session and will find different types of requests:

looking at single requests
looking at subsequent requests

... by heuristics

You may (automatically) look at different things when you receive a POST request (= saving an article, preview, logging in) and combine technical characteristics of the previous requests to get a 90% friend/foe rate. But in the remaining 10% you will have misclassifications, which might block real users or grant access to a spam bot.

Some information cannot be retrieved from the client request itself, like the whois information on the IP address or the domain of the reverse record of the IP address (if any). You may ask DNS on every request (it caches information and answers within 10ms), but whois is a database request and is limited to very few requests per minute, so you MUST NEVER send whois requests on every POST request on your wiki.

Information that can be extracted from the web server's log file can only be extracted asynchronously by external software, which might save it in a database. Information based on subsequent requests may be available after 5 minutes to one day, but not within a request.
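
A minimal Python sketch of such an external log analysis is shown below. It assumes the web server writes the common log format and simply counts POST requests per IP address; the regular expression, the function name and the sample lines are hypothetical, and saving the result to a database is left out.

```python
import re
from collections import Counter

# Matches the beginning of a line in the common log format (assumed here).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3})')

def count_posts_per_ip(lines):
    """Count POST requests per client IP; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "POST":
            counts[match.group(1)] += 1
    return counts

sample_log = [
    '192.0.2.10 - - [10/Oct/2011:13:55:36 +0200] "POST /index.php?title=Foo&action=submit HTTP/1.1" 200 2326',
    '192.0.2.10 - - [10/Oct/2011:13:55:37 +0200] "POST /index.php?title=Bar&action=submit HTTP/1.1" 200 2101',
    '192.0.2.77 - - [10/Oct/2011:13:56:02 +0200] "GET /wiki/Hauptseite HTTP/1.1" 200 15320',
]
print(count_posts_per_ip(sample_log))
```

A job like this would run periodically outside the wiki, which is why its results are not available within the request that is being judged.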

... using a credit point account and scoring

You may use a credit point account in your database which accumulates credit points from different actions. Every account can be queried, increased or decreased immediately (within 5 milliseconds).

You may define events or actions which increase or decrease the account balance. These events or actions may be synchronous (within a request) or asynchronous (the result of a later request or behaviour analysis).

Actions or events may be
  • reading an article (+1P)
  • searching for articles (+5P)
  • creating a personal user page (+30P, asynchronous by bot)
  • using preview before submitting articles (+5P each, synchronous)
  • solving a CAPTCHA (+50P each, synchronous)
  • submitting plain weblinks (-5P each, asynchronous by bot)
  • using Template:Weblink (+10P each, asynchronous by bot)
  • setting up and validating a personal e-mail address (+100P once, asynchronous by bot)
  • logging in (+10P each, asynchronous)
  • not logged in for one day (-5P, asynchronous by cron)
  • not logged in for one week (-20P, asynchronous by cron)
  • not logged in for one month (-100P / resetting score, asynchronous by cron)
  • attending a wiki meeting (real life) (+100P, asynchronous by admin)
  • rude user behaviour (-50P, asynchronous by admin)
Additionally you may alter the account balance with per-request scores like
  • access from a whitelisted/blacklisted IP address range (+20P / -20P)
  • access using whitelisted/blacklisted user agent patterns (+10P / -10P)
  • access from upscored/downscored (banned?) user names (+10P / -50P), e.g. "known good users" or "known trolls"
  • having a special black/white scoring cookie containing (encrypted, signed) offsets (-100 … 0 … +100P)
Based on the account balance plus request score, the user may
  • edit talk pages (needs >5P)
  • edit articles (needs >20P)
  • create new articles (needs >50P)
  • rename articles (needs >70P)
  • upload files (needs >90P)
  • bypass the spam filter (needs >50P)
  • or may only act read-only here (less than 5P)
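
The lists above can be sketched in a few lines of Python. The point values and thresholds are taken from the lists; the names (EVENT_POINTS, THRESHOLDS, CreditAccount, allowed_actions) are hypothetical, only some of the events are included, and a real implementation would keep the balance in a database instead of in memory.

```python
# Points per event, taken from the list above (subset only).
EVENT_POINTS = {
    "read_article": 1,
    "search": 5,
    "preview": 5,
    "captcha_solved": 50,
    "plain_weblink": -5,
    "login": 10,
}

# Action -> balance that must be exceeded to perform it.
THRESHOLDS = {
    "edit_talk": 5,
    "edit_article": 20,
    "create_article": 50,
    "rename_article": 70,
    "upload_file": 90,
}

class CreditAccount:
    """In-memory stand-in for a credit point account."""

    def __init__(self):
        self.balance = 0

    def record(self, event):
        """Apply the points of one event and return the new balance."""
        self.balance += EVENT_POINTS[event]
        return self.balance

def allowed_actions(balance, request_score=0):
    """Return the actions whose threshold is exceeded by balance + request score."""
    total = balance + request_score
    return sorted(action for action, needed in THRESHOLDS.items() if total > needed)

account = CreditAccount()
for event in ["read_article", "read_article", "read_article", "search", "preview"]:
    account.record(event)
print(account.balance, allowed_actions(account.balance))
```

An empty result from allowed_actions corresponds to the read-only case.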

You SHOULD make these credit points and the request scoring transparent to the user!

... manually

You may analyze the behaviour of users (or of anonymous users' IPs) on the wiki manually by

Then you may decide whether or not to block a user, an IP address or an IP address range (based on whois, or just a /16 or /24 on gut feeling) using the normal block mechanisms in MediaWiki.
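
To see what a /16 or /24 block covers before entering it, the ranges can be computed with Python's standard ipaddress module. This is only a helper sketch with hypothetical function names and example addresses; the block itself is still set with the normal MediaWiki mechanisms.

```python
import ipaddress

def covering_range(ip, prefix_len):
    """Return the network of the given prefix length that contains ip."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)

def is_blocked(ip, blocked_ranges):
    """Return True if ip lies inside any of the blocked networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in blocked_ranges)

blocked = [covering_range("198.51.100.37", 24)]
print(blocked[0], blocked[0].num_addresses)
print(is_blocked("198.51.100.200", blocked))
print(is_blocked("198.51.101.1", blocked))
```

A /24 covers 256 addresses and a /16 covers 65536, so a gut-feeling /16 block can easily hit many innocent users of the same provider.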

No additional development is required here, and no "automatic spring gun" can kill innocent users.

Hardening MediaWiki

MediaWiki has a fine-grained permission scheme where you can configure the allowed actions for user groups (anonymous users, registered users, confirmed users, sysops).

... using Out-of-the-Box methods

... using extensions

Combine methods

approach

Don't disturb real users (even anonymous!), effectively block bots and keep unknown users happy.

Combining some of the above methods will give us unobtrusive integration for most real users. Normal users should never be disturbed by spam protection mechanisms such as CAPTCHAs, and ideally should not need to come into contact with these things at all.

So what we need is passive detection of "user vs. bot", and only when in doubt whether a user is real may we raise the shields, so that users need to prove they are not bots. And once we can be sure it's a real user, we don't want to disturb them again.

Relying on users' IP addresses is sometimes useless, sometimes hazardous.

For example, mobile users may get a different IP address every time they reconnect via UMTS, while other mobile users are permanently behind the provider's web proxy (with the same IP for all users of this provider). Therefore: binding "good user"/"bad user" to IP addresses is useless.

Relying on per-request information like client request headers such as user-agent, referer and so on is useless, because (well designed) bots can (and will) fake this information.

Relying on content analysis is never accurate, as the mechanisms are based on regular expressions or just bad word lists. Maintaining these rules is time consuming with minimal effect.

Analyzing user behaviour cannot be done within one request, as the needed information is extracted from many requests.

So what we need is a server-side database containing anonymous "user profiles" which can have a score or a credit point account. Users will continuously contribute to their user profile ...


Example: Rhein-Neckar-Wiki

Permissions based on a score

On Rhein-Neckar-Wiki I experimented with a simple scoring mechanism based on IP address ranges, DNS information (TLD, patterns) and user agent strings. Depending on the score, editing was prohibited entirely, prohibited only for anonymous edits, or open.

Permissions by scoring are a good idea, but always getting a reliable score is not trivial. The worst case is prohibiting a real user from editing articles in any way; that user might be lost forever.

I then disabled the scoring-based mechanism and pursued another strategy: wiki spam is 99% based on submitting URLs into articles. But practically every wiki article has legitimate weblinks.

Block plain Weblinks

So I developed some wiki templates for all kinds of weblinks and put the strings 'http' and 'ftp' into the static spam filter rules – actually we only have these two strings filtered!

Whenever someone submits article text containing http, they will see MediaWiki:Spamprotectiontext, which contains instructions on how to save articles with weblinks properly.

short form

Instead of submitting

[http://rhein-neckar-wiki.de Rhein-Neckar-Wiki]

users will have to use one of the following

{{Homepage2|rhein-neckar-wiki.de|Rhein-Neckar-Wiki}}
{{Weblink|rhein-neckar-wiki.de/Hauptseite|Rhein-Neckar-Wiki}}
{{Weblink|1=rhein-neckar-wiki.de/index.php?title=Hauptseite&oldid=58535|2=Rhein-Neckar-Wiki}}
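
The effect of the two filtered strings, and a possible helper for converting old-style links, can be sketched in Python as follows. Both functions are hypothetical illustrations and not part of MediaWiki or of the Rhein-Neckar-Wiki setup. The converter assumes that the Weblink template adds the scheme itself, and it uses the named parameters 1= and 2= so that an equals sign in the URL does not confuse the template.

```python
import re

# The two filtered strings named in the text.
BLOCKED = re.compile(r"http|ftp")

# Matches external links of the form [http://host/path Label].
BRACKET_LINK = re.compile(r"\[https?://([^\s\]]+)\s+([^\]]+)\]")

def is_rejected(wikitext):
    """Return True if the text contains one of the filtered strings."""
    return BLOCKED.search(wikitext) is not None

def to_weblink_template(wikitext):
    """Rewrite [http://target label] links as {{Weblink|1=target|2=label}}."""
    def replace_link(match):
        target, label = match.group(1), match.group(2)
        return "{{Weblink|1=" + target + "|2=" + label + "}}"
    return BRACKET_LINK.sub(replace_link, wikitext)

plain = "[http://rhein-neckar-wiki.de Rhein-Neckar-Wiki]"
converted = to_weblink_template(plain)
print(is_rejected(plain), converted, is_rejected(converted))
```

Targets that contain a pipe character, or the string http inside a query parameter, would need extra handling that this sketch leaves out.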
pros
cons

Ideas and Forecast

I guess we will get back to score-based rules and re-enable "http" strings.

But first I will need to research and develop a more reliable algorithm for scoring, one that uses not just IP addresses, DNS information or other static client-based information like user agents (which may be faked!).

My idea is behavioural scoring: tracking users' activities and increasing or decreasing a long-term score based on a credit point account and request scoring (see above).

Users will have to accept (encrypted) scoring cookies in their web browser (as they already must if they want to log in).
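
How such a cookie could be protected against manipulation is sketched below in Python, using an HMAC signature from the standard library. The cookie layout, the function names and the secret are assumptions made for this example. The sketch only signs the value; encrypting it as well would need an additional step that is not shown.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it must never be sent to the client.
SECRET_KEY = b"replace-with-a-long-random-secret"

def make_cookie(offset, key=SECRET_KEY):
    """Serialize a score offset (-100..100) together with an HMAC-SHA256 signature."""
    if not -100 <= offset <= 100:
        raise ValueError("offset out of range")
    payload = str(offset)
    signature = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + signature

def read_cookie(value, key=SECRET_KEY):
    """Return the offset, or None if the cookie is malformed or was tampered with."""
    payload, separator, signature = value.rpartition(".")
    if not separator:
        return None
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    try:
        offset = int(payload)
    except ValueError:
        return None
    return offset if -100 <= offset <= 100 else None

cookie = make_cookie(40)
print(read_cookie(cookie))
# A client that raises its own offset but cannot recompute the signature is rejected:
forged = "100." + cookie.split(".", 1)[1]
print(read_cookie(forged))
```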

Bots won't do any of these things, won't get (too many) positive points and therefore won't be allowed to edit or create articles.

The score might range from 0P to 100P.

Based on the user's credit point account balance plus request score, they will be allowed or denied different actions in Rhein-Neckar-Wiki (editing/creating/renaming articles, uploading files, ...).

Q&A

Questions anyone?
