2006-01-01
Abstract
The 14th Text Retrieval Conference (TREC 2005) took place in November 2005. One of the highlights of the event was the TREC spam filter evaluation effort. Gordon Cormack has all the details.
Copyright © 2006 Virus Bulletin
The 14th Text Retrieval Conference (TREC 2005) took place in November 2005. One of the highlights of the event was the TREC spam filter evaluation effort, in which 53 spam filters developed by 17 organizations were tested on four separate email corpora totalling 318,482 messages. The purpose of the exercise was not to identify the best entry, but rather to provide a laboratory setting for controlled experiments. The results provide a helpful insight into how well spam filters work, and which techniques are worthy of further investigation.
The technique of using data compression models to classify messages proved particularly promising, as evidenced by the fine performance of filters submitted by Andrej Bratko and Bogdan Filipič of the Josef Stefan Institute in Slovenia. Other techniques of note are the toolkit approach of Bill Yerazunis' CRM114 group and the combination of weak and strong filters by Richard Segal of IBM.
Each spam filter was run in a controlled environment simulating personal spam filter use. A sequence of messages was presented to the filter, one message at a time, using a standard command-line interface. The filter was required to return two results for each message: a binary classification (spam or ham [not spam]), and a 'spamminess' score (a real number representing the estimated likelihood that the message is spam). After returning this pair of results, the filter was presented with the correct classification, thus simulating ideal user feedback. The software created for this purpose – the TREC Spam Filter Evaluation Tool Kit – is available for download under the Gnu General Public License (http://plg.uwaterloo.ca/~trlynam/spamjig/).
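To make the protocol concrete, here is a minimal Python sketch of the classify-then-train loop described above. The Filter interface, method names and message representation are illustrative assumptions for this article; the actual toolkit drives real filters through a standardized command-line interface rather than a Python API.

```python
# Minimal sketch of the incremental evaluation loop described above.
# The Filter class and message format are illustrative assumptions;
# the real toolkit talks to filters via a command-line protocol.
from dataclasses import dataclass

@dataclass
class Result:
    judgement: str   # binary classification: 'spam' or 'ham'
    score: float     # 'spamminess': higher means more likely spam

class Filter:
    """Stand-in for a spam filter under test."""
    def classify(self, message: str) -> Result:
        raise NotImplementedError
    def train(self, message: str, gold: str) -> None:
        """Ideal user feedback: the true class, revealed after classification."""
        raise NotImplementedError

def run_stream(filt: Filter, corpus):
    """Present messages one at a time; log (Result, gold) pairs for scoring."""
    log = []
    for message, gold in corpus:         # gold is 'spam' or 'ham'
        result = filt.classify(message)  # filter must answer before seeing gold
        filt.train(message, gold)        # then the correct class is revealed
        log.append((result, gold))
    return log
```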
The spam track departed from TREC tradition by testing with both public and private corpora. The public corpus tests were carried out by participants, while the private corpus tests were carried out by the corpus proprietors. This hybrid approach required participants both to run their filters, and to submit their implementations for evaluation by third parties.
The departure from tradition was occasioned by the privacy issues that make it difficult to create a realistic email corpus. It is a simple matter to capture all the email delivered to a recipient or set of recipients. However, acquiring their permission, and that of their correspondents, to publish the email is nearly impossible. This leaves us with a choice between using an artificial public collection and using a more realistic private collection. The use of both strategies allowed us to investigate this trade-off.
The three private corpora each consisted of all the email (both ham and spam) received by an individual over a specific period. The public corpus came from two sources: Enron email released during the course of the Federal Energy Regulatory Commission’s investigation, and recent spam from a public archive. The spam was altered carefully so as to appear to have been addressed to the same recipients and delivered to the same mail servers during the same period as the Enron email. Despite some trepidation that evidence of this forgery would be detected by one or more of the filters, the results indicate that this did not happen.
To form a test corpus, each message must be augmented with a gold standard representing the true classification of a message as ham or spam. The gold standard is used in simulating user feedback and in evaluating the filter's effectiveness. As reported previously (see VB, May 2005, p.S1), much effort was put into ensuring that the gold standard was sufficiently accurate and unbiased. Lack of user complaint is insufficient evidence that a message has been classified correctly. Similarly, we believe that deleting hard-to-classify messages from the corpus introduces unacceptable bias. In comparing the TREC results with others, one must consider that these and other evaluation errors may tend to overestimate filter performance.
The primary measures of classification performance are ham misclassification percentage (hm%) and spam misclassification percentage (sm%). A filter makes a trade-off between its performances on these two measures. It is an easy matter to reduce hm% at the expense of sm%, and vice versa. The relative importance of these two measures is the subject of some controversy, with the majority opinion being that reducing hm% is more important, but not at all costs with respect to increasing sm%. At TREC we attempted to sidestep the issue by reporting the logistic average (lam%) of the two scores, which rewards equally the same multiplicative factor in ham or spam misclassification odds. More formally:
lam% = logit⁻¹( ( logit(hm%) + logit(sm%) ) / 2 )
where
logit(x) = log(odds(x))
and
odds(x) = x / (100% - x)
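Rendered as code, the definition is a one-liner plus its helpers. The sketch below assumes the rates are given as percentages strictly between 0 and 100; note that the choice of logarithm base cancels out, since lam% amounts to the geometric mean of the two misclassification odds mapped back to a percentage.

```python
import math

def logit(x_pct: float) -> float:
    """logit of a percentage x: log of the odds x / (100% - x)."""
    return math.log(x_pct / (100.0 - x_pct))

def inv_logit_pct(y: float) -> float:
    """Inverse of logit, returned as a percentage."""
    odds = math.exp(y)
    return 100.0 * odds / (1.0 + odds)

def lam_pct(hm_pct: float, sm_pct: float) -> float:
    """Logistic average of ham and spam misclassification percentages."""
    return inv_logit_pct((logit(hm_pct) + logit(sm_pct)) / 2.0)

print(round(lam_pct(1.0, 1.0), 2))    # 1.0  -- equal rates pass through unchanged
print(round(lam_pct(0.1, 10.0), 2))   # 1.04 -- the same multiplicative factor in either
print(round(lam_pct(10.0, 0.1), 2))   # 1.04 -- set of odds is rewarded equally
```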
Another way to sidestep the trade-off issue is to use the spamminess score to plot a Receiver Operating Characteristic (ROC) curve that represents all (hm%, sm%) pairs that could be achieved by the filter by changing a threshold parameter. Figure 1 shows the ROC curve for the best filter from each organization, as tested on the public corpus. In general, higher curves indicate superior performance regardless of the trade-off between hm% and sm%, while curves that intersect indicate different relative performance depending on the relative importance of hm% and sm%. The solid curve at the top (ijsSPAM2full; Bratko's filter) shows sm% = 9.89% when hm% = 0.01%, sm% = 1.78% when hm% = 0.1%, and so on.
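The sketch below shows one way such (hm%, sm%) operating points can be generated from logged (spamminess, gold) pairs by sweeping a threshold. The function name and the convention that higher scores mean spammier are our assumptions, and ties between scores are not merged, which a production implementation would handle.

```python
def roc_points(scored):
    """(hm%, sm%) operating points obtained by sweeping a score threshold.

    scored: list of (spamminess, gold) pairs, gold in {'ham', 'spam'};
    a message is classified as spam when its score exceeds the threshold.
    """
    n_ham = sum(1 for _, g in scored if g == 'ham')
    n_spam = sum(1 for _, g in scored if g == 'spam')
    points = [(0.0, 100.0)]               # threshold above every score: all called ham
    ham_as_spam = spam_as_spam = 0
    for _, gold in sorted(scored, key=lambda sg: sg[0], reverse=True):
        if gold == 'ham':
            ham_as_spam += 1              # this ham now falls above the threshold
        else:
            spam_as_spam += 1
        points.append((100.0 * ham_as_spam / n_ham,
                       100.0 * (n_spam - spam_as_spam) / n_spam))
    return points
```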
A useful summary measure of performance is the area under the ROC curve, ROCA, a number between 0 and 1 that indicates overall performance. In addition to the geometric interpretation implied by its name, this area represents a probability: the probability that the filter will give a random spam message a higher spamminess score than a random ham message. TREC reports (1-ROCA) as a percentage, consistent with the other summary measures which measure error rates rather than success rates.
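The probabilistic reading of ROCA suggests a direct, if inefficient, way to compute (1-ROCA)%: count the ham/spam score pairs that are inverted. The brute-force sketch below is illustrative only; the same number can be obtained far more cheaply from a rank sum (the Wilcoxon-Mann-Whitney statistic).

```python
def one_minus_roca_pct(scored):
    """(1-ROCA)%: the chance that a random ham outscores a random spam.

    scored: (spamminess, gold) pairs as above; ties count as half an inversion.
    O(ham x spam) pairwise comparison, for illustration only.
    """
    ham = [s for s, g in scored if g == 'ham']
    spam = [s for s, g in scored if g == 'spam']
    inversions = sum(1.0 if h > s else 0.5 if h == s else 0.0
                     for h in ham for s in spam)
    return 100.0 * inversions / (len(ham) * len(spam))

# Hypothetical scores: one of the four ham/spam pairs is inverted (0.7 > 0.6).
sample = [(0.9, 'spam'), (0.4, 'ham'), (0.6, 'spam'), (0.7, 'ham')]
print(one_minus_roca_pct(sample))   # 25.0
```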
TREC 2005's spam evaluation used three summary measures of performance: lam%, (1-ROCA)%, and sm% at hm% = 0.1%. Each provides a reasonable estimate of overall filter performance; none definitively identifies the best filter.
The TREC spam evaluations generated a vast number of curves and statistics, which will appear in the TREC 2005 proceedings to be published early in 2006 (http://trec.nist.gov/pubs.html). We summarize the results with respect to the public corpus.
Table 1 associates each of the selected test runs (i.e. the best per organization) with its author. Only 12 of the filters were authored by official TREC 2005 participants; the other five were popular open-source spam filters, configured by the spam track organizers in consultation with their authors.
Run | Comment | Author
---|---|---
bogofilter | Bogofilter (open source) | David Relson (non-participant)
ijsSPAM2 | PPM-D compression model | Andrej Bratko (Josef Stefan Institute)
spamprobe | SpamProbe (open source) | Brian Burton (non-participant)
spamasas-b | Spamassassin Bayes filter only (open source) | Justin Mason (non-participant)
crmSPAM3 | CRM-114 (open source) | Bill Yerazunis (MERL)
621SPAM1 | Spam Guru | Richard Segal (IBM)
lbSPAM2 | dbacl (open source) | Laird Breyer
popfile | Popfile (open source) | John Graham-Cumming (non-participant)
dspam-toe | DSPAM (open source) | Jon Zdziarski (non-participant)
tamSPAM1 | SpamBayes (open source) | Tony Meyer
yorSPAM2 | | Jimmy Huang (York University)
indSPAM3 | | Indiana University
kidSPAM1 | | Beijing U. of Posts & Telecom.
dalSPAM4 | | Dalhousie University
pucSPAM2 | | Egidio Terra (PUC Brazil)
ICTSPAM2 | | Chinese Academy of Sciences
azeSPAM1 | | U. Paris-Sud
Table 1. The selected test runs and their authors.
Table 2 shows the three classification-based measures (hm%, sm%, and lam%) for each filter, ordered by lam%. Note that hm% and sm% give nearly opposite rankings, reflecting the strong negative correlation between them and their dependence on the threshold setting.
Run | hm% | sm% | lam%
---|---|---|---
bogofilter | 0.01 | 10.47 | 0.30 |
ijsSPAM2 | 0.23 | 0.95 | 0.47 |
spamprobe | 0.15 | 2.11 | 0.57 |
spamasas-b | 0.25 | 1.29 | 0.57 |
crmSPAM3 | 2.56 | 0.15 | 0.63 |
621SPAM1 | 2.38 | 0.20 | 0.69 |
lbSPAM2 | 0.51 | 0.93 | 0.69 |
popfile | 0.92 | 1.26 | 0.94 |
dspam-toe | 1.04 | 0.99 | 1.01 |
tamSPAM1 | 0.26 | 4.10 | 1.05 |
yorSPAM2 | 0.92 | 1.74 | 1.27 |
indSPAM3 | 1.09 | 7.66 | 2.93 |
kidSPAM1 | 0.91 | 9.40 | 2.99 |
dalSPAM4 | 2.69 | 4.50 | 3.49 |
pucSPAM2 | 3.35 | 5.00 | 4.10 |
ICTSPAM2 | 8.33 | 8.03 | 8.18 |
azeSPAM1 | 64.84 | 4.57 | 22.92 |
Table 2. The classification-based measures, ordered by lam%.
Table 3 shows the three summary measures: (1-ROCA)%, sm% at hm% = 0.1%, and lam%, together with the rank of each filter according to each measure. Note that while the rankings are not identical, they have a high positive correlation. The measures with respect to the other corpora vary somewhat, but give the same general impression.
Run | (1-ROCA)% | Rank | sm% @ hm% = 0.1% | Rank | lam% | Rank
---|---|---|---|---|---|---
ijsSPAM2 | 0.02 | 1 | 1.8 | 1 | 0.5 | 2 |
lbSPAM2 | 0.04 | 2 | 5.2 | 7 | 0.7 | 7 |
crmSPAM3 | 0.04 | 3 | 2.6 | 3 | 0.6 | 5 |
621SPAM1 | 0.04 | 4 | 3.6 | 6 | 0.7 | 6 |
bogofilter | 0.05 | 5 | 3.4 | 5 | 0.3 | 1 |
spamasas-b | 0.06 | 6 | 2.6 | 2 | 0.6 | 3 |
spamprobe | 0.06 | 7 | 2.8 | 4 | 0.6 | 4 |
tamSPAM1 | 0.16 | 8 | 6.9 | 8 | 1.1 | 10 |
popfile | 0.33 | 9 | 7.4 | 9 | 0.9 | 8 |
yorSPAM2 | 0.46 | 10 | 34.2 | 10 | 1.3 | 11 |
dspam-toe | 0.77 | 11 | 88.8 | 15 | 1.0 | 9 |
dalSPAM4 | 1.37 | 12 | 76.6 | 13 | 3.5 | 14 |
kidSPAM1 | 1.46 | 13 | 34.9 | 11 | 3.0 | 13 |
pucSPAM2 | 1.97 | 14 | 51.3 | 12 | 4.1 | 15 |
ICTSPAM2 | 2.64 | 15 | 79.5 | 14 | 8.2 | 16 |
indSPAM3 | 2.82 | 16 | 97.4 | 16 | 2.9 | 12 |
azeSPAM1 | 28.89 | 17 | 99.5 | 17 | 22.9 | 17 |
Table 3. The summary measures and the rank of each filter according to those measures.
The most startling observation is that character-based compression models perform outstandingly well for spam filtering. Commonly used open-source filters perform well, though neither as well nor as poorly as has been reported elsewhere. We have reason to believe that reports on the performance of other filters are similarly unreliable; only standardized evaluation will put them to the test.
The main result from TREC is the toolkit and methodology for filter evaluation, which may be used by anyone to perform further tests. The public corpus will be made available to all, subject to a usage agreement. The private corpora will remain in escrow so that new filters may be tested against them. Plans are already under way for TREC 2006, in which the same tests, as well as new ones, will be run on new filters and corpora. The new tests will include modelling of unreliable user feedback, the use of external resources, and other email processing applications.