2005-10-01
Abstract
John Graham-Cumming reviews Ending Spam by Jonathan A. Zdziarski.
Copyright © 2005 Virus Bulletin
See Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification on Amazon
Title. Ending Spam
Author. Jonathan A. Zdziarski
Publisher. No Starch Press
ISBN. 1593270526
Ever since Paul Graham posted his renowned ‘A Plan for Spam’ web page, the web has been the publishing medium of choice for the hackers behind the annual Spam Conference at MIT.
Jonathan Zdziarski has done an adequate job of summarizing this collective web wisdom in his book Ending Spam. The book covers all the major thoughts of the open source Bayesian spam-filtering community, but is marred by the author’s strong biases and missing explanations. Despite those problems the book is accessible to any reader with a computer science background and is essential reading for anyone wanting to understand Bayesian spam filtering.
The book opens with a redundant chapter recounting the history of spam from 1978 through 2005. It is followed by the oddly titled Chapter 2, ‘Historical approaches to fighting spam’, which describes an almost random collection of old and new spam-fighting techniques but omits others: greylisting and fuzzy hashes (e.g. DCC) are not mentioned. The omission of fuzzy hashing is odd because the chapter includes a discussion of ‘collaborative filtering’.
Chapter 3 provides an overview of a statistical filter’s building blocks and introduces terminology that the author has popularized through his dspam project. There are two big disappointments here: first, there is no explanation of Bayes’ theorem (just a couple of paragraphs that give a general description), and second, the section on ‘understanding accuracy’ promotes the use of a single ‘accuracy’ percentage as a way of comparing spam filters. It’s a pity that the author neither discusses false positives and false negatives nor points out that users care much more about false positives than false negatives, and that a single percentage accuracy figure can disguise a false positive problem.
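To make the point about accuracy concrete, here is a minimal sketch in Python. The two filters and all of the message counts are invented for illustration; none of the figures come from the book or from any real test.

```python
def rates(ham_total, spam_total, false_positives, false_negatives):
    """Return (accuracy, false positive rate, false negative rate).

    A false positive is a legitimate message marked as spam;
    a false negative is a spam message that gets through.
    """
    total = ham_total + spam_total
    accuracy = (total - false_positives - false_negatives) / total
    fp_rate = false_positives / ham_total
    fn_rate = false_negatives / spam_total
    return accuracy, fp_rate, fn_rate


# Two imaginary filters, each tested on 1,000 legitimate messages
# and 9,000 spam messages (hypothetical numbers).
filter_a = rates(1000, 9000, false_positives=1, false_negatives=99)
filter_b = rates(1000, 9000, false_positives=50, false_negatives=50)

for name, (acc, fp, fn) in (("A", filter_a), ("B", filter_b)):
    print(f"filter {name}: accuracy={acc:.2%} "
          f"false positives={fp:.2%} false negatives={fn:.2%}")
```

Both imaginary filters make 100 mistakes in 10,000 messages and so report the same accuracy, yet filter B loses fifty times as much legitimate mail as filter A.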
It is also in Chapter 3 that the author’s open source axe to grind becomes obvious with the bizarre claim that ‘Most manufacturers are a bit concerned with the idea of deploying a box that learns on its own. Their customers will no longer need annual contracts for nightly updates [of rule sets] or as many software upgrades, which certainly puts them in a precarious financial position’. That’s probably news to the folks at Proofpoint (amongst others).
Chapter 4 describes in detail the operation of a statistical spam filter with a clearly worked example. In addition, the chapter explains the various mathematical techniques used in a number of filters (starting with Paul Graham’s original proposal and going through to the inverse chi-square test proposed by Gary Robinson).
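For readers who have not seen it, the combining step from Paul Graham’s ‘A Plan for Spam’ can be sketched in a few lines of Python. This is a reconstruction from Graham’s essay as generally understood, not the book’s worked example; the function name, the token probabilities and the assumption that each probability has already been clamped to the range 0.01 to 0.99 are illustrative.

```python
from math import prod


def graham_combine(token_probs, most_interesting=15):
    """Combine per-token spam probabilities into one message score.

    token_probs: spam probability of each token in the message,
    assumed to be already clamped to [0.01, 0.99] so that neither
    product below can reach zero.
    """
    # Keep the tokens whose probability is furthest from the neutral 0.5.
    chosen = sorted(token_probs, key=lambda p: abs(p - 0.5),
                    reverse=True)[:most_interesting]
    spam = prod(chosen)
    ham = prod(1.0 - p for p in chosen)
    return spam / (spam + ham)


# Hypothetical token probabilities for a short message.
score = graham_combine([0.99, 0.99, 0.2])
print(f"{score:.4f}")
```

A message whose tokens are all neutral (0.5) scores exactly 0.5, while a couple of strongly spammy tokens push the score very close to 1 even when a mildly innocent token is present.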
Chapter 5 points out that messages need to be decoded into a readable form for a statistical filter to work. It brushes very lightly over quoted-printable and base64 encoding without describing how they work, and talks about some HTML encodings used by spammers to disguise messages. There’s also a small, odd section entitled ‘Message actualization’ that reads like an implementation detail of dspam.
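As a brief illustration of what that decoding involves, the following sketch uses Python’s standard `quopri` and `base64` modules. The sample bodies are invented, and the sketch assumes the decoded text is UTF-8; a real filter would take the transfer encoding and character set from the message headers.

```python
import base64
import quopri

# Quoted-printable: '=XX' is a byte written in hex, and a trailing '='
# at the end of a line is a soft line break that joins the two lines.
qp_body = b"Caf=C3=A9 prices =\nslashed"

# Base64: every four characters encode three bytes.
b64_body = b"VmlhZ3Jh"

qp_text = quopri.decodestring(qp_body).decode("utf-8")
b64_text = base64.b64decode(b64_body).decode("utf-8")

print(qp_text)
print(b64_text)
```

Until this step has been carried out, a tokenizer sees only strings such as `VmlhZ3Jh`, which say nothing about the words the recipient will actually read.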
Chapter 6 talks about message tokenization with an interesting discussion of what constitutes a word and how, for example, words in the subject line of an email are treated differently from the same words appearing in the body. The inadequate section on ‘internationalization’ reveals the author’s anglophone-centric world view with the statement: ‘The issue of foreign languages will eventually require a solution’ – I suggest ignoring this bit.
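One simple way to treat subject-line words differently from body words is to tag them with a prefix so that they are counted as separate tokens. The sketch below assumes a deliberately crude definition of a word and a hypothetical `Subject*` prefix; it is not the book’s or dspam’s tokenizer.

```python
import re

# Hypothetical definition of a "word": letters, digits and apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def tokenize(subject, body):
    """Return the tokens of a message, tagging those from the subject."""
    tokens = ["Subject*" + w.lower() for w in WORD.findall(subject)]
    tokens += [w.lower() for w in WORD.findall(body)]
    return tokens


tokens = tokenize("FREE offer", "free money!!")
print(tokens)
```

Here ‘free’ in the subject and ‘free’ in the body become two distinct tokens, so a filter can learn a different spam probability for each. Note that the ASCII-only pattern ignores non-English text entirely, which is the internationalization problem mentioned above.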
Chapter 7 describes the tricks that spammers use to attempt to subvert spam filters. There’s an excellent discussion of why these tricks don’t work and the author busts through a few myths about statistical spam filtering with clear explanations and examples of actual spammer tricks.
Chapters 8 and 9 could have been omitted. Chapter 8 describes a number of database solutions and their relative merits with respect to spam filtering; Chapter 9 outlines some of the issues that a spam filter author faces when their filter is used in a large organization.
The chapters in Part III are the most lucid in the book. They draw heavily on the author’s previous writing and cover spam filter testing (Chapter 10), tokenization methods other than ‘split the message into words’ (Chapter 11), removing useless features from a message to improve accuracy (Chapter 13) and some examples of how Bayesian spam filters can collaborate (Chapter 14). Chapter 12 provides an interesting look at a non-Bayesian spam filtering technique using Hidden Markov models.
An appendix highlights five spam fighters: POPFile (for which I was interviewed), SpamProbe, TarProxy, dspam and CRM114.
Overall this is a book worth buying. If you want to know how Bayesian spam filters work then open the book at Chapter 3; if you already know how they work then jump straight to Chapter 10.