2009-03-01
Abstract
Martijn Grooten reports on Virus Bulletin's first comparative anti-spam test.
Copyright © 2009 Virus Bulletin
Following months of discussion and debate, both internally and with industry experts and vendors, this month has seen the start of Virus Bulletin’s first comparative anti-spam test. At the time of writing, emails are flowing through our system and are being scrutinized by more than a dozen different products.
It was decided long ago that the first test would be a trial run – free of charge for vendors, but with the results anonymized for publication. If there is one thing that the trial has demonstrated, it is the need for such a dry run. In particular, we were somewhat overwhelmed by the level of interest in this first test, and our attempts to accommodate all of the products submitted, along with some hardware problems, pushed the start date of the trial back several times. However, we feel that the delay will be outweighed by the benefits once we start running the real tests, fully versed in the requirements of our system.
The delay does have one major consequence: this report lacks actual performance figures. These will be available very soon, however, and will be published at http://www.virusbtn.com/vbspam/.
A detailed description of our test set-up and the reasons behind our choices are included in articles published in the January and February issues of VB (see VB, January 2009, p.S1 and VB, February 2009, p.S1). The general outline is described below, along with some minor changes that have been made to our original plans.
This test uses Virus Bulletin’s live email stream: any email sent to any ‘@virusbtn.com’ address. This mail is received by a gateway Mail Transfer Agent (MTA) that accepts any email, regardless of the sender and intended recipient, as long as the latter’s address is on the virusbtn.com domain.
Upon receipt, the email is stored in a database that resides on the same server as the MTA. Immediately afterwards, the email is sent to all the filters participating in the test in random order. Finally, the email is sent to our regular email system and delivered to individuals’ inboxes.
Each product is thus exposed to exactly the same email stream, in real time. Apart from the sending IP address, the mail stream received by the filters is identical to that received by the gateway MTA; the filters also see an extra Received: header reflecting the hop through the gateway.
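As a rough illustration of this flow, the following Python sketch shows a gateway handler that stores an accepted message, relays it to every participating filter in a random order and finally passes it on to the regular mail system. The host addresses, database schema and function names are purely illustrative assumptions; the real gateway is built on qpsmtpd plug-ins, as described later in this article.

    import random
    import smtplib
    import sqlite3

    # Illustrative assumptions: one SMTP listener per participating filter, plus the
    # regular virusbtn.com mail system as the final hop. The IP addresses are made up.
    FILTER_HOSTS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
    INTERNAL_MAIL_HOST = "10.0.0.2"

    def handle_message(sender, recipient, raw_message, db_path="emails.db"):
        """Store an accepted message, then relay it to all filters in random order."""
        # 1. Store the original message in the database on the gateway server.
        with sqlite3.connect(db_path) as db:
            db.execute("CREATE TABLE IF NOT EXISTS emails "
                       "(sender TEXT, recipient TEXT, body TEXT)")
            db.execute("INSERT INTO emails (sender, recipient, body) VALUES (?, ?, ?)",
                       (sender, recipient, raw_message))

        # 2. Send a copy to every filter, in random order, so that no product
        #    consistently receives messages earlier than the others.
        hosts = FILTER_HOSTS[:]
        random.shuffle(hosts)
        for host in hosts:
            with smtplib.SMTP(host, 25) as smtp:
                smtp.sendmail(sender, [recipient], raw_message)

        # 3. Finally, deliver the message to the regular email system.
        with smtplib.SMTP(INTERNAL_MAIL_HOST, 25) as smtp:
            smtp.sendmail(sender, [recipient], raw_message)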
The products must then decide whether each email is ham or spam. How they do this is for the product developers to decide: they are free to use any information that is available to them. In particular, they are permitted to perform live checks on the Internet.
Finally, the filters are configured so that they forward the email to a ‘catch-all’ MTA. This MTA is configured so that it knows which filter corresponds to which (local) IP address. Moreover, it knows how the various filters mark emails: some classify emails by adding an extra header, while others block all spam and forward only the ham. In the latter case, if an email hasn’t reached the ‘catch-all’ MTA 60 minutes after its original receipt, it is considered to have been marked as spam. The ‘catch-all’ MTA stores the information in the same database as the original email, thus giving a large incidence matrix of all the emails received and the classification of each one by each of the filters in the test.
This last step differs from our original proposal, which required the filters to keep logs of their filtering. However, for most products the new method is more natural: in real-life deployments they run as gateway filters themselves and forward the filtered email to the company's mail server. Moreover, we are now able to measure the average time it takes for a filter to process a (ham) email, which is an interesting and useful performance metric.
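The sketch below illustrates, again in Python and with purely hypothetical names and header conventions, the two ways in which the 'catch-all' MTA derives a verdict: reading the header added by a tagging product, or treating a message as spam when a blocking product has not forwarded it within 60 minutes. The difference between the gateway's receipt time and the arrival time at the 'catch-all' MTA is also what yields the processing-time measurement mentioned above.

    import email
    import time

    # Illustrative assumptions: a mapping of local source IPs to product names, and a
    # per-product spam-marking convention (either a tagging header, or blocking spam).
    FILTER_BY_IP = {"10.0.0.11": "ProductA", "10.0.0.12": "ProductB"}
    SPAM_HEADER = {"ProductA": ("X-Spam-Flag", "YES")}  # ProductB blocks spam instead
    TIMEOUT_SECONDS = 60 * 60  # no copy within 60 minutes means 'marked as spam'

    def classify_forwarded_copy(source_ip, raw_message):
        """Return (product, verdict) for a copy arriving at the 'catch-all' MTA."""
        product = FILTER_BY_IP[source_ip]
        msg = email.message_from_string(raw_message)
        header = SPAM_HEADER.get(product)
        if header is not None:
            name, spam_value = header
            verdict = "spam" if msg.get(name, "") == spam_value else "ham"
        else:
            # Products that block spam outright only ever forward ham.
            verdict = "ham"
        return product, verdict

    def sweep_for_timeouts(pending, now=None):
        """Return the (message, product) pairs that have timed out and count as spam.

        'pending' maps (message_id, product) to the time the gateway received the
        message; the same timestamps also provide the processing-time metric.
        """
        now = now if now is not None else time.time()
        return [key for key, received_at in pending.items()
                if now - received_at > TIMEOUT_SECONDS]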
Once an email has been classified by all filters, we are able to set our ‘gold standard’ for it. If we are lucky, all filters will agree and we will assume that they are right. Using knowledge of our email behaviour we can also classify a lot of email automatically as spam – for instance, messages sent to non-existent addresses, or messages written in foreign alphabets not known to any VB employee. (It will be strictly forbidden for filters to use such VB-specific knowledge as an aid to filtering, and regular tests will be carried out to ensure that ‘cheating’ in this way is not taking place. Moreover, we will, of course, take regular samples of the ‘automatically classified’ spam to ensure that mistakes are not being made.) If none of these methods results in a gold standard, the email will be presented to the end-user in a web interface and they will make the final classification themselves.
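In outline, the decision procedure for the gold standard looks like this (a minimal Python sketch; the function and argument names are illustrative, and the 'ask the user' step stands in for the web interface):

    def gold_standard(verdicts, recipient_exists, known_alphabet, ask_user):
        """Determine the gold standard classification for a single message.

        'verdicts' is the list of ham/spam decisions from all filters; the other
        arguments stand in for VB-specific knowledge and the manual web interface.
        """
        # 1. Unanimity: if every filter agrees, assume they are right.
        if len(set(verdicts)) == 1:
            return verdicts[0]

        # 2. Automatic rules based on knowledge of VB's own email behaviour
        #    (off-limits to the filters themselves, and sampled regularly for errors).
        if not recipient_exists or not known_alphabet:
            return "spam"

        # 3. Otherwise, present the message to the recipient in the web interface.
        return ask_user()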
For the gateway MTA, we need an agent that meets the following requirements:
It must be very basic and not make the slightest attempt to prevent spam.
It must be easily extendable.
It must be open source.
We found in qpsmtpd (http://smtpd.develooper.com/) an SMTP daemon that meets all three requirements. The daemon is written in Perl and has a very basic core which can easily be extended by a variety of plug-ins; writing plug-ins is an easy task, even for someone with limited Perl knowledge. With regard to the third requirement (for the MTA to be open source), no judgement is made here on the quality and/or importance of open source MTAs in general. However, when third-party software is to play such a significant role in the test, it is important for the tester to be able to verify exactly what the code in use is doing and why.
The catch-all MTA runs a different instance of qpsmtpd, with different plug-ins.
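The only acceptance check the gateway performs concerns the recipient's domain; everything else is let through untouched. Expressed in Python (the real check is a qpsmtpd plug-in written in Perl), the policy amounts to no more than this:

    # The real check is a qpsmtpd plug-in written in Perl; this fragment merely
    # illustrates the accept-everything policy of the gateway.
    LOCAL_DOMAIN = "virusbtn.com"

    def accept_recipient(recipient_address):
        """Accept any message, from any sender, as long as the recipient is local."""
        return recipient_address.lower().endswith("@" + LOCAL_DOMAIN)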
At the time of writing, 17 products are participating in the trial, though we may see the addition of one or two more. The products can be divided into five categories:
Linux server products. For the trial, all Linux products have been installed on a clean SuSE 10 installation. Four Linux server products were submitted for the trial:
Avira MailGate Suite 3.0.0-14
BitDefender Security for Mail Servers v3
COMDOM Antispam 2.0.0-rc3
Kaspersky Anti-Spam 3.0.84.1
Windows Server products. Three Windows Server products were submitted. One of them, SPAMfighter Mail Gateway v3.0.2.1, runs as its own MTA, while the other two run as extensions to Microsoft Exchange. These are Sunbelt’s Ninja Email Security and Microsoft’s Forefront Security for Exchange - Beta 2.
Virtual products. Three vendors submitted products to be run inside VMware Server. In all three cases the products are sold both as VMware applications and as hardware appliances:
Messaging Architects’ M+Guardian 2008.3
SpamTitan 413
Symantec Brightmail Gateway 7.0
Hardware appliances. Three companies submitted a total of five hardware appliances:
Fortinet FortiMail
Fortinet FortiClient
McAfee IronMail
McAfee EWS
Sophos ES4000
Hosted solutions (Software as a Service). Finally, just one company, Webroot, submitted a hosted solution, in which the mail is both filtered and stored on the company’s servers.
I was pleased to find that the installation and configuration of most products was fairly straightforward. No company or ISP can do without spam filtering these days, and good configuration is essential to ensure that a filter neither lets through large amounts of spam nor, worse, blocks important emails. Most products were up and running within half an hour and, depending on my familiarity with the environment, some took considerably less time than that. Most products provide a web interface for configuration, while many also offer a more basic text interface.
Where possible, I have opted for the products’ default settings to be applied and have turned off all filtering apart from spam filtering. In particular, while many products are capable of detecting viruses in email, this feature has been turned off where possible. If a legitimate email arrives that contains a piece of malware (which is not entirely unlikely, given the nature of Virus Bulletin’s business), we do not expect the product to mark it as spam as this would be beyond the scope of our tests; however, given the importance of blocking malware, we will not penalize any product if the email is blocked. Experience has taught us that the number of such emails we receive every month is very small.
Observant readers will have noted that I have mentioned only 16 products so far. This is because we decided to include the open source SpamAssassin in the test alongside the commercial products.
Having been around since 1997, SpamAssassin is one of the oldest spam filters in existence. Although it usually makes use of Bayesian filtering based on the end-user’s feedback (something we will not be providing in our tests – see VB, February 2009, p.S1), this is not a requirement and the program will work straight ‘out of the box’. This is one of the reasons for its inclusion in the trial.
Another reason is that, because of its free availability, SpamAssassin provides an interesting benchmark for the various commercial products taking part in our trial. The commercial vendors will have to justify the prices they charge for their products, and performing better than a free product is one possible justification.
The version of SpamAssassin used in the trial is SpamAssassin 3.1.8, which comes pre-installed with SuSE 10. It has been used directly from the command line on both the email header and the email body, without any additional configuration.
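For readers who want to reproduce this kind of set-up, the following Python sketch shows one way of driving SpamAssassin from the command line and reading back its verdict; it assumes the default configuration, in which SpamAssassin adds an X-Spam-Flag: YES header to messages it considers spam.

    import subprocess

    def is_spam(raw_message):
        """Pipe a message through the spamassassin command and read back its verdict.

        With the default configuration, SpamAssassin writes the message back to
        stdout and adds an X-Spam-Flag: YES header when it considers it spam.
        """
        result = subprocess.run(["spamassassin"],
                                input=raw_message,
                                capture_output=True,
                                text=True,
                                check=True)
        return "X-Spam-Flag: YES" in result.stdout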
As mentioned previously, due to time constraints, we have not been able to include any actual figures in this article. These will be published in mid-March at http://www.virusbtn.com/vbspam/. I can, however, shed some light on how these numbers will be reported.
Consider a test in which 10,000 ham messages and 200,000 spam messages are sent through the system (for a 30-day testing period, this is equivalent to about 14 ham messages and 278 spam messages per hour) and in which two fictional products, AgainstSpam and ForHam, take part.
AgainstSpam, known for its effective spam-blocking, blocked 194,000 of the spam messages, but rather overzealously also classified 250 ham messages as spam. ForHam, on the other hand, is more cautious with its blocking and marked just 50 ham messages as spam, but only managed to block 188,000 spam messages.
The spam catch rate of AgainstSpam will then be:
194,000 / 200,000 x 100% = 97.0%
while that of ForHam is:
188,000 / 200,000 x 100% = 94.0%
The false positive rate for AgainstSpam is:
250 / 10,000 x 100% = 2.5%
while that of ForHam is:
50 / 10,000 x 100% = 0.5%
It should be noted here that many vendors publish their false positive rates as a ratio of the total amount of email. Measured in this way, ForHam has a false positive rate of 0.024% (or 1 in 4,200 messages) and AgainstSpam one of 0.12% (or 1 in 840 messages). Since all products will have been exposed to the same corpus, the two rates differ only by a fixed factor, so for comparison purposes it does not matter which of the two is used; in the comparative reviews both will be published.
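The short Python snippet below reproduces the arithmetic of this fictional example, computing the spam catch rate and both versions of the false positive rate.

    # Figures from the fictional example above.
    ham_total, spam_total = 10_000, 200_000
    products = {
        "AgainstSpam": {"spam_blocked": 194_000, "false_positives": 250},
        "ForHam": {"spam_blocked": 188_000, "false_positives": 50},
    }

    for name, p in products.items():
        catch_rate = p["spam_blocked"] / spam_total * 100           # % of spam blocked
        fp_rate_ham = p["false_positives"] / ham_total * 100        # % of ham misclassified
        fp_rate_all = p["false_positives"] / (ham_total + spam_total) * 100  # % of all mail
        print(f"{name}: catch rate {catch_rate:.1f}%, "
              f"false positives {fp_rate_ham:.1f}% of ham, {fp_rate_all:.3f}% of all mail")

    # AgainstSpam: catch rate 97.0%, false positives 2.5% of ham, 0.119% of all mail
    # ForHam: catch rate 94.0%, false positives 0.5% of ham, 0.024% of all mail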
Two things should be remarked on here. First, a different corpus (received by a different company, during a different time period, etc.) might have seen products perform a few percentage points better or worse; hence it is important to see our tests as comparative and always to consider the numbers in the context of the test and of the other products’ performance.
Secondly, in the above example, it is far from clear which product performs ‘better’: the one with the higher spam catch rate, or the one with the lower false positive rate. In the end, it will be for the customer to decide on which is the best performing product for them, depending on their particular requirements.
We will thus not be able simply to rank products according to their performance. Rather, we will provide vendors with awards that indicate that their product has performed better than a certain benchmark in both false positive rate and spam catch rate. The awards will be at three levels: silver, gold and platinum, with platinum being the hardest to achieve. The benchmarks themselves may change over the course of time, to reflect changes in our test, our mail stream and/or the global spam corpus. However, the benchmark will always be fixed prior to the start of each test.
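Purely by way of illustration, and with entirely hypothetical thresholds (the real benchmarks have not yet been fixed and will be announced before each test), the award logic can be thought of along these lines:

    # Hypothetical thresholds only; the real benchmarks will be fixed before each
    # test and may change over time.
    AWARD_LEVELS = [
        ("platinum", 99.0, 0.5),
        ("gold", 97.0, 1.0),
        ("silver", 95.0, 2.0),
    ]

    def award(catch_rate, fp_rate):
        """Return the highest award whose benchmarks are met on both measures, if any."""
        for level, min_catch_rate, max_fp_rate in AWARD_LEVELS:
            if catch_rate >= min_catch_rate and fp_rate <= max_fp_rate:
                return level
        return None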
We are very open about the fact that our test set-up, although real (as in not artificial), is not very realistic – a problem common to all anti-spam tests. A particular issue is that the spam filters receive mail from a gateway rather than directly from the senders. While one would still expect a filter to block most spam this way, many products can block spam even more effectively using information about the SMTP connection itself. In previous articles, I have already mentioned our plans to run a second test, parallel to the first, in which products are exposed to a mail stream coming directly from the Internet.
It will take some time to make the necessary preparations for this second test and as such it is unlikely to be run as part of the first real test, which is scheduled to take place in April (with publication of the results in May) – but we hope to have it in place in time for the following one. As always, comments and suggestions from vendors, researchers and end-users are welcome.
With the first test scheduled to run in April, the deadline for product submission will be towards the end of March; vendors and developers will be notified in due course of the exact deadline and the conditions for submitting a product. Any vendors interested in submitting products for review or finding out more should contact [email protected].