User:Jwalling

From MySeattle

John Walling

Spam Patrol

--jwalling 22:08, 24 Jan 2006 (CET)

SpammerBlockPattern

Introduction

  • The following Regex (http://en.wikipedia.org/wiki/Regular_expression) fragment list came from scraping affiliate marketing (http://en.wikipedia.org/wiki/Affiliate_marketing) spam [10] (http://en.wikipedia.org/wiki/Spamvertising)[11] (http://spamvertised.abusebutler.com/)
  • The most useful regex fragment is height:\s*\dpx
  • The list minimizes domain names since they change constantly
  • The list is maintained with a regex editor: http://www.weitz.de/regex-coach/
  • The list can be used as a
    • spam blacklist [12] (http://meta.wikimedia.org/wiki/Spam_blacklist)[13] (http://www.communitywiki.org/en/BannedContent)[14] (http://wikitravel.org/bannedcontent.txt)
    • guide for spam search
    • regular expression ($wgSpamRegex = "/frag1|frag2|...|fragn/i";)
  • More details at KatrinaHelp.info [15] (http://www.katrinahelp.info/wiki/index.php/Spam_Patrol),[16] (http://www.katrinahelp.info/wiki/index.php/Talk:Spam_Patrol)
    • Botspam posting has dropped over 90% since the list was installed on Dec 15, 2005.
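
As a rough illustration of the "frag1|frag2|...|fragn" form above, here is a minimal Python sketch that joins fragments into one case-insensitive pattern. The two fragments are the ones quoted on this page; the names FRAGMENTS, SPAM_REGEX and is_spam are made up for the example, and the real list is much longer.

```python
import re

# Only two fragments, both quoted on this page; the real list is much longer.
FRAGMENTS = [
    r"height:\s*\dpx",   # the fragment the introduction calls most useful
    r"\.5g6y\.info",     # a spammer domain, with the dots escaped
]

# Join the fragments with "|" and ignore case, mirroring the "/.../i" form.
SPAM_REGEX = re.compile("|".join(FRAGMENTS), re.IGNORECASE)


def is_spam(text: str) -> bool:
    """Return True if any fragment matches anywhere in the text."""
    return SPAM_REGEX.search(text) is not None
```

Note that \d matches exactly one digit, so a single-digit height such as "height: 1px" matches while "height: 100px" does not.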

Regex fragment list

Rationale: Sensitivity vs. Specificity

False Positives (http://en.wikipedia.org/wiki/False_positive) vs. False Negatives (http://en.wikipedia.org/wiki/False_negative) (FP:FN) is a classic problem in medical testing. This tension is expressed as Sensitivity (http://en.wikipedia.org/wiki/Sensitivity_%28tests%29) vs. Specificity (http://en.wikipedia.org/wiki/Specificity). The more sensitive a test is, the more likely you are to see false positives (Type I errors). The more specific a test is, the more likely you are to see false negatives (Type II errors). (This conundrum reminds me of the Uncertainty Principle (http://en.wikipedia.org/wiki/Uncertainty_principle).) We have the same problem with blacklists. (Whitelists counteract false positives.)

The true rates of error depend on testing accuracy and precision (http://en.wikipedia.org/wiki/Accuracy_and_precision), and on the frequency of true positives in the population of interest. These factors can be addressed with Bayes' theorem (http://en.wikipedia.org/wiki/Bayes%27_theorem), which is beyond the scope of this discussion.
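
Without going into the theory, a small Python sketch shows how Bayes' theorem combines these factors. The function name and the figures in the note below are invented for illustration; they are not measurements from any blacklist.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Probability that a blocked post really is spam (Bayes' theorem).

    sensitivity: fraction of spam posts the filter blocks
    specificity: fraction of good posts the filter lets through
    prevalence:  fraction of all posts that are spam
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)
```

With made-up figures of 99% sensitivity, 95% specificity and 10% of posts being spam, positive_predictive_value(0.99, 0.95, 0.10) comes to 0.6875, so roughly three in ten blocked posts would be good content. The same filter looks much better when spam is more common, which is why the frequency of true positives matters.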

In medical diagnostic testing, a common strategy is to screen with high sensitivity tests and then to verify positives with high specificity tests. This strategy has the benefit of reducing the cost of testing and minimizing the risk of false positives and false negatives. To follow the medical model, we would need a two-stage blacklist. The first blacklist would have high sensitivity and the second would have high specificity. A message that passes the first blacklist would be allowed to post. A message that fails the first blacklist would have to pass the second blacklist in order to post.
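
The two-stage idea can be sketched in a few lines of Python. This is only an illustration of the flow described above: both pattern lists are hypothetical placeholders, and may_post is a made-up name.

```python
import re

# Stage 1 (hypothetical): broad, high sensitivity. It flags most spam
# but also some legitimate posts.
SENSITIVE = re.compile(r"casino|poker|height:\s*\dpx", re.IGNORECASE)

# Stage 2 (hypothetical): narrow, high specificity. It rarely matches
# legitimate posts.
SPECIFIC = re.compile(r"height:\s*\dpx|\.5g6y\.info", re.IGNORECASE)


def may_post(text: str) -> bool:
    """Return True if the message is allowed to post."""
    if SENSITIVE.search(text) is None:
        return True  # passed the first blacklist, no further checks
    # Flagged by the first blacklist: must also pass the second one.
    return SPECIFIC.search(text) is None
```

Here a post such as "Charity poker night on Friday" is flagged by the first list but cleared by the second, while a post that hides links with a one-pixel height fails both and is blocked. Most messages pass the first list and never pay for the second check.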

In blacklists we can measure indirect cost as the number of elements (words, URLs, patterns) that must be compared to new content. The direct costs are system loading, user inconvenience, and maintenance. Our goal is to keep the cost as low as possible without missing new spammers.

We compromise by using both types of tests in one or more regex blacklists, which include spam text and spammer URLs that are not reliably associated with spam text (e.g. \.5g6y\.info). We can complement the high sensitivity filtering with a domain table for high specificity filtering.

To recap (more% indicates percentage of a finite resource):

more% blacklisted URLs => more specificity
more specificity => more false negatives
more false negatives => more permission for bad content
more% blacklisted words => more sensitivity
more sensitivity => more false positives
more false positives => more blocks to good content

One minimax (http://en.wikipedia.org/wiki/Minimax) strategy for a single blacklist:

Reduce the 'cost' by relying more on spam words that are associated with many spammer URLs
Reduce the number of false positives by tuning spam text patterns with regex (http://en.wikipedia.org/wiki/Regular_expression)
Reduce the number of false negatives by including spammer URLs that do not associate with spam text

Most blacklists depend heavily on banning URLs. Spammers have an easy time finding new URLs, which makes the blocking effort open-ended. (A Google search for free web hosting (http://www.google.com/search?q=%22free+web+hosting%22) produces a few million hits.)

I am developing a blacklist that uses banned text primarily and banned URLs where unavoidable. The downside is that it may block legitimate text, which interferes with openness. (If a user attempts to post blocked text, feedback is solicited. [17] (http://www.katrinahelp.info/wiki/index.php/Spam_Protection_Comment)) The upside is that the blacklist should require less intervention by administrators to block unidentified spammers.

--jwalling 01:42, 25 Jan 2006 (CET)