2009-02-01
Abstract
Martijn Grooten answers some of the common queries raised by vendors about the proposed test set-up for VB's upcoming anti-spam comparative testing.
Copyright © 2009 Virus Bulletin
Last month I outlined the proposed test setup for VB’s comparative anti-spam tests (see VB, January 2009, p.S1). Following the publication of the article we received a lot of feedback from vendors, researchers and customers alike. It is great to see so much interest in our tests, and even better to receive constructive comments and suggestions.
Of course, several queries have been raised about our proposals – this article answers three of the most commonly asked questions.
For customers who want to buy an anti-spam solution for their incoming email – generally embedded into a larger email suite – the choice is not simply one of comparing different vendors. They could choose a product that can be embedded into an existing mail server, or one that is a mail server in itself – in which case there is a further choice between products that come with their own hardware and products that need to be installed on an existing operating system. But there are also products where both email filtering and mail hosting take place at the vendor’s server; such products, labelled ‘Software as a Service’ (SaaS), are becoming increasingly popular.
Many vendors have asked whether our test will be able to accommodate SaaS products. The answer is yes – since the two major test criteria, the false positive rate and the false negative rate, can be measured for each of the product types mentioned above and can also be compared amongst them.
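To make these two metrics concrete, here is a minimal sketch, in Python, of how a false positive rate and a false negative rate could be computed from a set of classified messages. The function and data layout are our own illustration, not part of the test harness or of any product.

```python
# A minimal sketch (not VB's actual test harness) of how the two headline
# metrics could be computed from a corpus of classified messages.

def spam_filter_metrics(messages):
    """messages: list of (is_spam, flagged_as_spam) boolean pairs."""
    ham = [flagged for is_spam, flagged in messages if not is_spam]
    spam = [flagged for is_spam, flagged in messages if is_spam]

    # False positive rate: fraction of legitimate mail wrongly blocked.
    fp_rate = sum(ham) / len(ham) if ham else 0.0
    # False negative rate: fraction of spam that slipped through.
    fn_rate = (len(spam) - sum(spam)) / len(spam) if spam else 0.0
    return fp_rate, fn_rate

# Example: one ham message wrongly flagged, one spam message missed.
corpus = [(False, True), (False, False), (True, True), (True, False)]
print(spam_filter_metrics(corpus))  # (0.5, 0.5)
```

Because both rates are simple proportions over the messages a product actually received, they can be computed in the same way whether the filter runs on the customer's mail server, on dedicated hardware or at the vendor's premises.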
Of course, there are other metrics that describe a product’s performance, not all of which apply to all types of product. For instance, the average and maximum CPU usage of a product are important measures for those that need to be installed on the user’s machine, but are of little or no importance for products that provide their own hardware or are hosted externally. As a result, we aim to measure these aspects of performance in products for which they are relevant, but the measurements will not be part of the certification procedure.
One of the properties of spam is that it is indiscriminate; one of the properties of ham is that it is not. A classic example is that of pharmaceutical companies, whose staff might have legitimate reasons for sending and receiving email concerning body-part-enhancing products, but may find such email content blocked by spam filters. Many spam filters, however, are not indiscriminate either and can learn from feedback provided by the end-user. Some filters even rely solely on user feedback: by default, all email messages have a spam probability of 0.5, and by combining user feedback with, among other things, Bayesian and Markovian methods, the product will ‘learn’ which kinds of email are unwanted and should be filtered as spam.
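As a simplified, hypothetical illustration of the Bayesian side of such an approach (Markovian variants work on token sequences rather than individual tokens), the sketch below combines per-token spam probabilities learned from imagined user feedback, with tokens that have never received feedback defaulting to the neutral 0.5. The token table and numbers are invented for illustration only.

```python
from math import prod  # Python 3.8+

# Simplified illustration of the Bayesian combination described above: an
# untrained filter scores every message at the neutral 0.5, and user feedback
# gradually moves per-token probabilities away from that prior. The token
# table below is invented purely for illustration.

token_spam_prob = {
    'viagra': 0.98,   # seen mostly in messages the user reported as spam
    'meeting': 0.05,  # seen mostly in messages the user kept
}
NEUTRAL = 0.5  # score for tokens with no feedback yet

def spam_probability(message):
    probs = [token_spam_prob.get(tok, NEUTRAL) for tok in message.lower().split()]
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)

print(spam_probability('cheap viagra today'))    # ~0.98
print(spam_probability('project meeting notes')) # ~0.05
print(spam_probability('hello world'))           # 0.5: no feedback yet
```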
However, for a number of reasons, we have decided to test all products out-of-the-box using their default settings and not to provide filters with any user feedback.
Firstly, providing feedback would complicate our test setup. In the real world, feedback is delivered to a learning filter whenever the user reads their email, which is generally multiple times during the day. In our setup, the ‘gold standard’ classification will be decided upon by our end-users at their leisure (meaning they do not have to make classification decisions under pressure, thus minimizing mistakes), so our feedback would not be representative of a real-world situation.
Secondly, the performance of a learning filter as perceived by the user will not depend solely on its ability to learn from user feedback, but at least as much on the quality of the feedback given. If deleting a message is easier or less time-consuming than reporting it as spam, users might simply delete unwanted email from their inbox; messages that are wanted but do not need to be saved might be read in the junk mail folder but never retrieved from it; and the ‘mark as spam’ button might be used as a convenient way of unsubscribing from mailing lists. The quality of the feedback thus depends on the end-user’s understanding of how to provide it, as well as the ease with which they can do so. We do not currently believe we can test this in a fair and comparable way. Of course, we will continue to look for possible ways to include learning filters in our tests.
A wide range of anti-spam measures are based on the content of the email or the context in which it was sent, and most filters use a combination of such measures. However, many filters also take a more pro-active approach, where they try to frustrate the spammers, for instance by delaying their response to SMTP commands (‘tarpitting’) or by temporarily refusing email from unknown or unverifiable sources (‘greylisting’).
Such methods assume that legitimate senders will keep trying to get the message delivered, while many spammers will give up: apart from the fact that mail agents used by spammers are often badly configured, the spammers’ economic model is based on being able to deliver a large volume of messages in a short period of time and it will generally not be viable for them to keep trying.
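As a rough sketch of the greylisting idea, and not a description of any particular product’s implementation, the example below temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts only a retry made after a minimum delay. The timings and response strings are illustrative assumptions.

```python
import time

MIN_RETRY_DELAY = 300   # seconds a well-behaved MTA is expected to wait
first_seen = {}         # (client IP, sender, recipient) -> first attempt time

def greylist_decision(client_ip, mail_from, rcpt_to, now=None):
    """Return an illustrative SMTP response for a delivery attempt."""
    now = time.time() if now is None else now
    triplet = (client_ip, mail_from, rcpt_to)
    if triplet not in first_seen:
        # First attempt from an unknown triplet: temporary rejection.
        first_seen[triplet] = now
        return '450 4.7.1 Greylisted, please try again later'
    if now - first_seen[triplet] >= MIN_RETRY_DELAY:
        # The sender retried after a sensible delay, as a real MTA would.
        return '250 OK'
    return '450 4.7.1 Greylisted, please try again later'

# A spam cannon that never retries is never accepted; a legitimate server
# that retries a few minutes later is let through on its second attempt.
```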
From the receivers’ point of view, these methods are as effective as any other at stopping spam, but they have two major drawbacks. Firstly, greylisting could cause significant delays to the delivery of some legitimate email, which could be disadvantageous in a business environment. Secondly, any such pro-active anti-spam method could result in false positives that are impossible to trace, which, again, is undesirable for a business that wants to be able to view all incoming email, even messages initially classified as spam.
Such methods also pose a problem for the tester: the effectiveness of an anti-spam method can only be assessed if both the spam catch rate and the false positive rate can be measured. This is impossible with pro-active methods, since they ‘block’ email before it has been fully transmitted, leaving no message to classify. This is one of the reasons why we will not be able to test such methods with the setup that uses our own email stream.
We realize that this will be a problem for products that make extensive use of these methods, and as a compromise we are looking for ways to expose all products to the email stream sent to a spam trap, which is (almost) guaranteed to be spam only. Of course, this will not solve the problem of testing for false positives.
We will be running a trial test this month. During the trial it is possible (indeed probable) that the test configuration will be changed. The results, therefore, may not be representative of those that would have been derived from a real test. For this reason, we intend to publish the results of the trial without specifying which products achieved them.
The first real test will start towards the end of March; vendors and developers will be notified in due course of the deadline and conditions for submitting a product.
As always, we welcome comments, criticism and suggestions – and will continue to do so once the tests are up and running. Our goal is to run tests in which products are compared in a fair way, and which will produce results that are useful to end-users. Any suggestions for better ways in which our tests could achieve these goals will be given serious consideration (please email [email protected]).