User:Jwalling
From MySeattle
John Walling
Spam Patrol
- My latest anti-spam project is at Meta Wikimedia (http://meta.wikimedia.org/wiki/Spam_Filter)
- I make regular contributions at http://www.katrinahelp.info/
- I am working on this wiki article: Animal Rescue Resources (http://www.katrinahelp.info/wiki/index.php/Animal_Rescue_Resources)
- I set up Spam Patrol (http://www.katrinahelp.info/wiki/index.php/Spam_Patrol) guidelines for KatrinaHelp.info.
- Feel free to leave a message on my talk page
- I can be e-mailed here: mailto:wallingconsulting@yahoo-dot-com
- I am interested in these strategies to defeat wiki spam (http://meta.wikimedia.org/wiki/Wiki_Spam) see also [1] (http://www.communitywiki.org/en/WikiSpam)[2] (http://en.wikipedia.org/wiki/Referrer_spam)[3] (http://en.wikipedia.org/wiki/Link_spam)[4] (http://meta.wikimedia.org/wiki/Anti-spam_Features).
- SpammerBlockPattern (home grown)
- SpamBlacklist extension (http://meta.wikimedia.org/wiki/SpamBlacklist_extension)
- UsernameBlacklist (http://cvs.sourceforge.net/viewcvs.py/wikipedia/extensions/UsernameBlacklist/?only_with_tag=MAIN)
- Bad Behavior MediaWiki extension (http://www.ioerror.us/software/bad-behavior/installing-and-using-bad-behavior/on-mediawiki/)
- kses - PHP HTML/XHTML filter (http://sourceforge.net/projects/kses)
- Blocking spam hidden with CSS [5] (http://chongqed.info/wiki/CSS_Hidden_Spam)[6] (http://wiki.chongqed.org//CSSHiddenSpam)
- Chongqed spam [7] (http://chongqed.org/)[8] (http://lordmatt.co.uk/item/160/)[9] (http://lordmatt.co.uk/tools/chongqtool.php)
- I have websites at myseattle.com (http://www.myseattle.com) & fulgeo.com (http://www.fulgeo.com)
- I am bootstrapping a wiki at http://www.myseattle.com/mediawiki/
- --jwalling 22:08, 24 Jan 2006 (CET)
SpammerBlockPattern
Introduction
- The following Regex (http://en.wikipedia.org/wiki/Regular_expression) fragment list came from scraping affiliate marketing (http://en.wikipedia.org/wiki/Affiliate_marketing) spam [10] (http://en.wikipedia.org/wiki/Spamvertising)[11] (http://spamvertised.abusebutler.com/)
- The most useful regex fragment is height:\s*\dpx
- The list minimizes domain names since they change constantly
- The list is maintained with a regex editor: http://www.weitz.de/regex-coach/
- The list can be used as a:
  - spam blacklist [12] (http://meta.wikimedia.org/wiki/Spam_blacklist)[13] (http://www.communitywiki.org/en/BannedContent)[14] (http://wikitravel.org/bannedcontent.txt)
  - guide for spam search
  - regular expression ($wgSpamRegex = "/frag1|frag2|...|fragn/i";)
    - see LocalSettings.php (http://meta.wikimedia.org/wiki/LocalSettings.php) and Anti-spam features (http://meta.wikimedia.org/wiki/Anti-spam_Features)
- More details at KatrinaHelp.info [15] (http://www.katrinahelp.info/wiki/index.php/Spam_Patrol),[16] (http://www.katrinahelp.info/wiki/index.php/Talk:Spam_Patrol)
- Botspam posting has dropped by over 90% since the list was installed on Dec 15, 2005.
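As a rough illustration of how fragments combine into a single pattern, the Python sketch below joins fragments with `|`, in the same shape as the `$wgSpamRegex` setting. It is not the MediaWiki implementation. The two fragments are the ones quoted on this page; the names `FRAGMENTS`, `SPAM_REGEX`, and `is_spam` are made up for the example, and the comment on what `height:\s*\dpx` targets is my reading of it (single-digit pixel heights, as used in CSS-hidden spam).

```python
import re

# Fragments quoted on this page; a real list would be much longer.
FRAGMENTS = [
    r"height:\s*\dpx",  # presumably targets tiny CSS heights that hide spam links
    r"\.5g6y\.info",    # a spammer domain not reliably tied to spam text
]

# Same shape as: $wgSpamRegex = "/frag1|frag2|...|fragn/i";
SPAM_REGEX = re.compile("|".join(FRAGMENTS), re.IGNORECASE)


def is_spam(text: str) -> bool:
    """Return True if any fragment matches anywhere in the text."""
    return SPAM_REGEX.search(text) is not None
```

Because `\d` matches exactly one digit, `height: 1px` is caught while `height: 120px` is not. A legitimate `line-height: 2px` would also be caught, which is the kind of false positive the rationale below is about.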
Regex fragment list
Rationale: Sensitivity vs. Specificity
Balancing False Positives (http://en.wikipedia.org/wiki/False_positive) against False Negatives (http://en.wikipedia.org/wiki/False_negative) (FP:FN) is a classic problem in medical testing. This tension is expressed as Sensitivity (http://en.wikipedia.org/wiki/Sensitivity_%28tests%29) vs. Specificity (http://en.wikipedia.org/wiki/Specificity). The more sensitive a test is, the more likely you are to see false positives (Type I errors). The more specific a test is, the more likely you are to see false negatives (Type II errors). (This conundrum reminds me of the Uncertainty Principle (http://en.wikipedia.org/wiki/Uncertainty_principle).) We have the same problem with blacklists: an entry that blocks good content is a false positive, and spam that gets through is a false negative. (Whitelists counteract false positives.)
The true rates of error depend on testing accuracy and precision (http://en.wikipedia.org/wiki/Accuracy_and_precision), and on the frequency of true positives in the population of interest. These factors can be addressed with Bayes' theorem (http://en.wikipedia.org/wiki/Bayes%27_theorem), which is beyond the scope of this discussion.
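For readers who want the arithmetic anyway, here is a minimal Python sketch of the standard Bayes' theorem calculation for a binary test. The function name and the numbers in the usage note are illustrative assumptions, not measurements from any blacklist.

```python
def probability_spam_given_blocked(sensitivity: float,
                                   specificity: float,
                                   prevalence: float) -> float:
    """Probability that a blocked post really is spam (Bayes' theorem).

    sensitivity: fraction of spam posts that the filter blocks
    specificity: fraction of good posts that the filter lets through
    prevalence:  fraction of all posts that are spam
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)
```

With made-up inputs of 90% sensitivity and 90% specificity, the result is 0.9 when half of all posts are spam but only 0.5 when one post in ten is spam, which shows how strongly the frequency of true positives affects the error rate seen in practice.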
In medical diagnostic testing, a common strategy is to screen with high sensitivity tests and then to verify positives with high specificity tests. This strategy has the benefit of reducing the cost of testing and minimizing the risk of false positives and false negatives. In order to follow the medical model we would need a two-stage blacklist. The first blacklist would have high sensitivity and the second would have high specificity. A message that passes the first blacklist would be allowed to post. A message that fails the first blacklist would have to pass the second blacklist in order to post.
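The two-stage idea can be sketched in a few lines of Python. This is only an illustration of the decision rule described above, not an existing extension; `may_post` and both example patterns are hypothetical (the word `casino` in particular is just a stand-in for a broad spam word).

```python
import re

# Hypothetical stage 1: broad, high-sensitivity patterns (spam-like words).
SENSITIVE = re.compile(r"casino|height:\s*\dpx", re.IGNORECASE)

# Hypothetical stage 2: narrow, high-specificity patterns (known spammer URLs).
SPECIFIC = re.compile(r"\.5g6y\.info", re.IGNORECASE)


def may_post(text: str,
             sensitive: re.Pattern = SENSITIVE,
             specific: re.Pattern = SPECIFIC) -> bool:
    """Apply the screening list first, then the confirming list."""
    if not sensitive.search(text):
        return True  # passed the first blacklist: allowed to post
    # Failed the first blacklist: must pass the second one to post.
    return not specific.search(text)
```

A post such as "The casino downtown is closing" trips the first list but is cleared by the second, so it posts; a post that trips both lists is rejected.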
In blacklists we can measure indirect cost as the number of elements (words, URLs, patterns) that must be compared to new content. The direct costs are system loading, user inconvenience, and maintenance. Our goal is to keep the cost as low as possible without missing new spammers.
We compromise by using both types of tests in one or more regex blacklists, which include spam text and spammer URLs that are not reliably associated with spam text (e.g. \.5g6y\.info). We can complement the high sensitivity filtering with a domain table for high specificity filtering.
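A minimal Python sketch of pairing a text regex with a domain table follows. How the two should be combined is not spelled out here, so blocking on either hit is an assumption, as are the names `DOMAIN_TABLE`, `blocked_domains`, and `is_blocked` and the table's single entry.

```python
import re
from urllib.parse import urlparse

# Text pattern from this page (high-sensitivity side).
SPAM_TEXT = re.compile(r"height:\s*\dpx", re.IGNORECASE)

# Hypothetical domain table (high-specificity side).
DOMAIN_TABLE = {"5g6y.info"}

# Rough URL matcher; stops at whitespace, quotes, and brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>\[\]]+", re.IGNORECASE)


def blocked_domains(text: str) -> set:
    """Return the table entries matched by any URL host or parent domain."""
    hits = set()
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        labels = host.split(".")
        # Check the host and every parent domain against the table.
        for i in range(len(labels)):
            suffix = ".".join(labels[i:])
            if suffix in DOMAIN_TABLE:
                hits.add(suffix)
    return hits


def is_blocked(text: str) -> bool:
    """Block on a spam-text match or on a listed domain (an assumed rule)."""
    return bool(SPAM_TEXT.search(text)) or bool(blocked_domains(text))
```

A set lookup per URL keeps the domain side cheap even when the table grows, while the regex side stays short.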
To recap (more% indicates percentage of a finite resource):
- more% blacklisted URLs => more specificity
- more specificity => more false negatives
- more false negatives => more permission for bad content
- more% blacklisted words => more sensitivity
- more sensitivity => more false positives
- more false positives => more blocks to good content
One minimax (http://en.wikipedia.org/wiki/Minimax) strategy for a single blacklist:
- Reduce the 'cost' by relying more on spam words that are associated with many spammer URLs
- Reduce the number of false positives by tuning spam text patterns with regex (http://en.wikipedia.org/wiki/Regular_expression)
- Reduce the number of false negatives by including spammer URLs that do not associate with spam text
Most blacklists depend heavily on banning URLs. Spammers have an easy time finding new URLs, which makes the blocking effort open-ended. (A Google search for free web hosting (http://www.google.com/search?q=%22free+web+hosting%22) produces a few million hits.)
I am developing a blacklist that relies primarily on banned text and uses banned URLs only where unavoidable. The downside is that it may block legitimate text, which interferes with openness. (If a user attempts to post blocked text, feedback is solicited. [17] (http://www.katrinahelp.info/wiki/index.php/Spam_Protection_Comment)) The upside is that the blacklist should require less intervention by administrators to block unidentified spammers.
--jwalling 01:42, 25 Jan 2006 (CET)
