2010-10-01
Abstract
Claudiu Musat and George Petre explain why spam feeds matter in the anti-spam field and discuss the importance of effective spam-gathering methods.
Copyright © 2010 Virus Bulletin
Spam feeds matter. The fact that this aspect of anti-spam technology has received little attention compared to filtering methods could be blamed on the fact that most vendors, after obtaining a detection rate considered satisfactory at the time, tend to believe that their own spam feed is a good representation of the total amount of spam in the wild. But when their false negative rates need to decrease tenfold in order for their product to remain competitive, obtaining a supply of niche spam is paramount. This is where different spam-gathering methods become important. And since just one false positive can have an enormous impact on a product’s reputation, making sure no newsletters or other ‘grey zone’ mail have permeated the spam feed is also important. Finally, establishing a method to measure the value of the spam feed might help create an environment where vendors exchange spam and everyone in the industry contributes to a relevant collection of spam.
Many, if not most, of today's spam filters rely heavily on the message content to make their filtering decisions. We are also among those who believe that the message body and headers contain the most relevant pieces of information upon which a reactive classification can be based, from IP blacklists to pattern matching techniques.
Numerous content-based filtering methods have been employed to ensure that anti-spam filters do not mistake ham for spam, and usually those distinctions are made at runtime based upon previous training conducted on pre-classified messages (both spam and legitimate). This follows the assumption that the spam and ham emails on which the filters are trained have previously been separated into non-overlapping sets.
This is an important and often overlooked weakness of content-based filters – they are completely reliant on the quality of the training data. If the filters are not trained to detect a specific type of message, whether directly or indirectly, odds are that they won’t detect any subsequent similar ones. Therefore, if a spam feed is not sufficiently diverse, any filter that relies on it will see its detection rate skewed downwards.
Similarly, if the training feed is polluted, odds are that the respective filter – whether it consists of a clustering algorithm, a neural network, a Bayesian network or any other content-based method – will perform poorly at runtime. Ensuring that pollution levels remain low in the training feed is paramount to the success of the entire filtering process.
In the subsequent sections we describe how we create and enhance a spam feed, how we eliminate known types of pollution, and how we evaluate it.
In 2002, when our anti-spam engine development started, most email servers didn’t integrate anti-spam engines. As a result of the poor anti-spam filter coverage, the spammers’ job was quite easy, and the variety of spam in existence was small.
Our first spam collection was composed of the corpus provided by spamarchive.org (a project that is now defunct), our personal email spam, our colleagues' 'donations' and a few emails from honeypots posted on our sites. At the time, this was sufficient to cover a significant part of the spam phenomenon, but it was just the beginning. Soon afterwards, the complexity of spam increased, and it started morphing as quickly as, or even faster than, malware. At the same time, the spamarchive.org project became obsolete, our colleagues' donations had become difficult to automate (since every one of them had different email clients, different types of forwarding settings and so on), and our honeypots were only collecting small quantities of spam, which in turn were rather homogeneous. This was the point at which we decided to create a department responsible for the development of spam honeypots.
Our first option was to deploy honeypots on Usenet groups. This was not an easy task, since it is difficult to post honeypot addresses in these locations while avoiding becoming a spammer yourself. This is also the reason why the first trial was not efficient: it was time-consuming to post messages that were relevant to each group. But even though we did not deploy many honeypots, the size of Usenet and the fact that Google indexed these groups were a good start. We didn't have many messages, but we started to get a wider variety. This was our first real-time spam flow.
During that period we also started to search for people who had tackled the same problem: collecting representative samples of all the spam types in the world. We found a lot of discussions on different forums, including Slashdot, regarding the best methodology to create a spam flow. Some people claim that if you want to receive a clean spam flow you need to put your honeypots in public places and wait for spam to arrive, because the spam gathered by this means is unsolicited – whereas if you subscribe to a site, it is no longer spam. However, we did not follow this recommendation, because we didn’t want to lose out on the valuable messages that form the ‘grey zone’ of spam.
In order to explain how we define the 'grey zone', we will describe a few examples of sites that, in our opinion, belong to this category. The first is an employment website. Users can sign up to receive a newsletter from the site; however, the site continues to send the newsletter even after the user has unsubscribed from it. It is a high-traffic site and it is impossible for an anti-spam product to block its emails, because many people are still interested in receiving the newsletter. On the other hand, it is very likely that the customer database of this kind of site is sold to third parties, which makes it a good site on which to place honeypots.
The second example is a category of sites which promise the user free prizes. Usually the items consist of the latest popular gadget on the market (e.g. an iPhone 4 or iPad). Of course, the route to the prize consists of many steps, including registration to participate in a lottery. Winning the prize is a long shot, but what you are certain to get after registering with such a site is an inbox full of spam.
The third example is related to websites with weak security. There are innumerable legitimate sites that use popular CMSs or scripts. When harvesters find a security hole in these scripts, they are able to appropriate the sites' databases. Email addresses harvested in this way fetch a high price on the black market. This is why we decided to place honeypots on a large variety of sites containing scripts with such weaknesses. Later, we developed methods to separate the legitimate newsletters sent by these sites from spam.
For a long time, open relay servers were responsible for sending a significant quantity of spam. However, after the most popular MTAs started to disable the open relay option by default, the importance of this vulnerability decreased significantly. This was also the reason why the ORDB (Open Relay Database) project came to an end. But we wanted to be sure that open relays could not still be used by spammers. To test this, we set up a QMail server that accepted every relay request, but we deleted the binary responsible for actual delivery. Within a short time we started to receive automated probe messages, addressed to free email accounts and containing our IP address. Because our IP had never been used as an MX, we concluded that people were still scanning for open relays. We delivered these probes manually and, as a result, started to receive large numbers of samples from one or two spam templates. Although the quantity was considerable, there were only a few spam campaigns and there was not sufficient variety to make the project interesting.
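For illustration, a trap of this kind can be sketched in a few lines of Python. The snippet below is our own minimal approximation (the original setup used QMail); it relies on the third-party aiosmtpd package, accepts every message, and writes it to disk instead of delivering it:

```python
# A minimal open-relay honeypot: accepts everything, delivers nothing.
# Requires the third-party 'aiosmtpd' package (pip install aiosmtpd).
import time
from aiosmtpd.controller import Controller

class CaptureHandler:
    async def handle_DATA(self, server, session, envelope):
        # Store the raw message instead of relaying it onward.
        fname = f"captured-{time.time():.0f}.eml"
        with open(fname, "wb") as f:
            f.write(envelope.content)
        print(f"captured mail from {envelope.mail_from} "
              f"to {envelope.rcpt_tos} -> {fname}")
        # Tell the client the relay 'succeeded' so probes look legitimate.
        return "250 Message accepted for delivery"

if __name__ == "__main__":
    # Port 2525 avoids needing root; a real trap would listen on port 25.
    controller = Controller(CaptureHandler(), hostname="0.0.0.0", port=2525)
    controller.start()
    input("Fake relay running; press Enter to stop.\n")
    controller.stop()
```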
After applying the previously mentioned methods, we started to receive higher quality spam. However, more was still needed. Our first idea was to extract links to images, or unusually long links, from the emails and follow them. This way, the spammers would know that the email had been read, and would continue to send more spam to that address. This approach led to an increase in our spam flow, just as we had expected. Clicking on unsubscribe links was also a good way to increase the spam flow.
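A minimal sketch of this link-following step is shown below; the 60-character length threshold is an assumption made for the example, not a value from our setup:

```python
# Extract probable tracking links from a captured message and visit them,
# signalling to the spammer that the mailbox is 'live'.
import re
import urllib.request
from email import message_from_binary_file
from email.policy import default

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def tracking_urls(path, min_len=60):
    with open(path, "rb") as f:
        msg = message_from_binary_file(f, policy=default)
    body = msg.get_body(preferencelist=("html", "plain"))
    text = body.get_content() if body else ""
    # Long URLs tend to encode per-recipient identifiers.
    return [u for u in URL_RE.findall(text) if len(u) >= min_len]

def visit(urls, timeout=10):
    for url in urls:
        try:
            urllib.request.urlopen(url, timeout=timeout).read(1024)
        except OSError:
            pass  # dead link; the visit itself is what matters
```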
The second plan was to identify, within the spam, advertised sites that offer newsletters or subscriptions. Because spammers reuse the same web templates many times, it was easy to create a crawler that automatically submitted new honeypot addresses to the advertised sites.
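A crawler of this kind can be approximated as follows; the form-matching patterns and field names are illustrative assumptions, since real templates vary:

```python
# Rough sketch: find a subscription form on an advertised page and submit
# a honeypot address to it. Because spammer templates repeat, a handful of
# form patterns covers many sites.
import re
import urllib.parse
import urllib.request

EMAIL_FIELD = re.compile(
    r'<input[^>]+name=["\']([^"\']*(?:email|mail)[^"\']*)["\']', re.I)
FORM_ACTION = re.compile(r'<form[^>]+action=["\']([^"\']+)["\']', re.I)

def subscribe(page_url, honeypot_address):
    html = urllib.request.urlopen(page_url, timeout=10).read().decode(
        "utf-8", errors="replace")
    field = EMAIL_FIELD.search(html)
    action = FORM_ACTION.search(html)
    if not (field and action):
        return False  # no recognizable subscription form on this page
    target = urllib.parse.urljoin(page_url, action.group(1))
    data = urllib.parse.urlencode({field.group(1): honeypot_address})
    urllib.request.urlopen(target, data=data.encode(), timeout=10)
    return True
```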
Another idea was to configure the MTAs on our mail servers permissively: accepting messages without a HELO/EHLO, setting up a catch-all account for each domain, storing messages for every domain, and so on.
After implementing the above methods the quality and quantity of our spam flow started to get close to what we were aiming for. But in the meantime our expectations increased as well, and new methods were needed.
We started to exchange emails with script kiddies on some IRC channels, but the results were not impressive. We concluded that script kiddies are not active in the spam world, and that the professionals have dedicated communication channels that are hard to discover and track.
One successful attempt to receive spam involved guessing unregistered but spammed domains. These domains fall into two categories: domains that were used in the past as mail servers but whose business has since closed, and mistyped mail server domains. While the first category is obvious, the second is based on the idea that when writing their email address in a public place, many people mistype the domain, but spammers' crawlers are not aware of this fact.
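The mistyped-domain idea boils down to generating plausible edit errors for popular mail domains and registering variants that already receive spam. A minimal sketch follows; the choice of edit operations is ours for the example:

```python
# Generate plausible mistypings of a mail domain using three standard
# typo models: character omission, doubling, and adjacent transposition.
def typo_variants(domain):
    name, dot, tld = domain.rpartition(".")
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:] + dot + tld)        # omission
        variants.add(name[:i] + name[i] + name[i:] + dot + tld)  # doubling
        if i + 1 < len(name):                                    # swap
            variants.add(name[:i] + name[i + 1] + name[i]
                         + name[i + 2:] + dot + tld)
    variants.discard(domain)  # a swap of equal letters reproduces the input
    return sorted(variants)

print(typo_variants("example.com")[:5])
```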
Since in many of the methods above legitimate emails – especially newsletters – are intertwined with spam, we present a method to extract newsletters from spam feeds, thus ensuring that those feeds will be suitable for training spam filters. The cleansing method relies on the differences between the incoming frequencies of spam and newsletters. While spam usually comes in bursts, also known as spam waves or campaigns, newsletters come in small, constant numbers and have a fairly constant periodicity.
Techniques that differentiate signals based on their periodicity are nothing new in signal processing, but they have, to the best of our knowledge, never been used in the anti-spam industry to separate newsletters from a given spam feed. The problem of lowering the newsletter frequency so that it will not be mistaken for spam is, however, relevant for email marketers. They long ago discovered that limiting the number of messages they send in a given time frame, and sending them apart from each other rather than in bursts, increases the chances of the messages passing through spam filters.
We analysed our incoming flow of spam messages for a period of more than one month. We had reasons to believe that within the millions of spam messages there were also newsletters coming in. Since our cleansing method only considered the differences between the sending patterns of various mails received, all information except the source of the message and the time it came in was discarded. Logs containing these sender/time-of-arrival pairs were thus the input of the system. Any further analysis would be source-oriented, so all the log entries were sorted by their source domain (e.g. coming from bitdefender.com).
Although in the case of spam these source domains could easily be forged, there is little doubt that where newsletters are concerned the stated source is the real one. Since we were only looking for a sending pattern for legitimate mail sources, the fact that spammers lie about the message source was not a deterrent – quite the contrary, since this fact only adds randomness to the patterns obtained for domains which appear to send spam.
We applied several heuristics in order to determine the subset of messages that were most likely to be newsletters (a sketch of the resulting filter follows the list):
They must arrive in a fairly constant number on each day they appear.
The temporal distance between different occurrences must be constant.
They must not exceed a maximum number of messages per day.
Heuristics such as ‘the newsletter must only come on a daily/weekly/monthly basis’ are also valid and complementary to the ones above.
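To make these heuristics concrete, the core of such a filter might look like the sketch below; the thresholds are invented for the example, as the article does not specify the actual values:

```python
# Flag source domains whose sending pattern looks like a newsletter:
# low day-to-day variance in volume, roughly constant gaps between
# sending days, and a modest daily cap. All thresholds are illustrative.
from collections import defaultdict
from statistics import mean, pstdev

def newsletter_domains(log, max_daily=500, rel_jitter=0.25):
    """log: iterable of (source_domain, date) pairs, one per message."""
    per_day = defaultdict(lambda: defaultdict(int))
    for domain, day in log:
        per_day[domain][day] += 1
    flagged = []
    for domain, days in per_day.items():
        counts = list(days.values())
        ordinals = sorted(d.toordinal() for d in days)
        gaps = [b - a for a, b in zip(ordinals, ordinals[1:])]
        if len(gaps) < 2:
            continue  # not enough history to judge periodicity
        stable_volume = pstdev(counts) <= rel_jitter * mean(counts)
        periodic = pstdev(gaps) <= rel_jitter * mean(gaps)
        capped = max(counts) <= max_daily
        if stable_volume and periodic and capped:
            flagged.append(domain)
    return flagged
```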
For the first runs we tested whether the emails whose log descriptions had been gathered as described above were actually ham, and in close to 80% of cases the prediction was correct. That is a huge number given that the messages were extracted from a spam feed.
One variation that we have implemented as a backup to this method is to sort the incoming spam not according to its alleged source (the 'from' field in an email is not necessarily the real one), but by the web domains contained in the message body.
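A minimal sketch of that grouping key, assuming the body text has already been extracted from each message:

```python
# Group messages by the web domains advertised in their bodies rather
# than by the (easily forged) 'From' domain.
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def body_domains(body_text):
    domains = set()
    for url in URL_RE.findall(body_text):
        host = urlparse(url).hostname
        if host:
            domains.add(host)
    return domains
```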
One of the constant problems we face is determining how relevant our feed is to real-world spam. We can measure the quantity of spam and the number of different waves; however it is difficult to estimate how many real-world spam waves we have in our corpus.
One of the ideas for testing our spam-gathering method was to set up internal comparative tests between our product and the main competitors' engines. This was quite difficult because different products run at different speeds, and we were interested in analysing the detection of the different products at a given time on the same messages. This limited our comparative tests to the speed of the slowest engine, and made it necessary to create a small but highly diverse spam flow. After we set up the test, we noticed that there were moments when the detection rates of all the products decreased, the drops in detection being generated by new spam waves. This was a promising sign, but we continued to look for others as well.
We continued by extracting IP addresses from the corpus and checking them against popular RBLs. We observed that most of the IP addresses were listed there, but we also discovered some that were not. This was proof not only that we had a relevant flow, but also that we had emails whose sources were not yet blacklisted by the popular RBL servers.
We repeated the process with the URLs from the spam, and obtained even better results, with a higher proportion of URLs that weren't listed on the popular URL blacklist servers.
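Checking an address against an RBL is a plain DNS lookup: the IP's octets are reversed and queried under the blacklist zone, and any answer means the IP is listed. A minimal sketch, using zen.spamhaus.org as a well-known example zone (not necessarily the one from our tests):

```python
# Query a DNS blacklist: reverse the octets and look them up under the
# RBL zone; an A record means 'listed', NXDOMAIN means 'not listed'.
import socket

def is_listed(ip, zone="zen.spamhaus.org"):
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

print(is_listed("127.0.0.2"))  # test address most RBLs list by design
```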
An important metric regarding a spam feed is the quantity of information it contains, which is not always proportional to the number of messages within the feed in a given period of time. We focused on the most important subset of a spam feed – the uncaught spam messages (or false negatives) taken from two feeds we use: our proprietary one and a benchmark feed used throughout the industry. A total of 21,648 spam emails came through our servers and eluded our filters during the experiment, 70.2% of which came from the proprietary feed.
To further compare the two feeds, we used a clustering algorithm to divide the message pool into clusters of similar messages, each corresponding to a spam campaign. A total of 3,254 clusters were obtained in the experiment, and these were divided into two sets, each comprising the clusters in which the majority of messages came from a given feed. From each set we further chose the subset of clusters containing almost no messages (at most 10%) from the competing feed, which we call 'almost exclusive campaigns'. From these we selected the campaigns that contained messages from only one feed – 'exclusive campaigns'. The results are shown in Figure 3.
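The division into majority, 'almost exclusive' and exclusive campaigns can be computed as sketched below, with the labels 'A' and 'B' standing in for the proprietary and benchmark feeds:

```python
# Given clusters of spam messages labelled by feed of origin, count the
# majority, 'almost exclusive' (<=10% from the other feed) and fully
# exclusive campaigns for each feed.
from collections import Counter

def campaign_stats(clusters, threshold=0.10):
    """clusters: list of lists of feed labels ('A' or 'B'), one per message."""
    stats = {feed: Counter() for feed in ("A", "B")}
    for cluster in clusters:
        counts = Counter(cluster)
        feed, majority = counts.most_common(1)[0]
        other = len(cluster) - majority
        stats[feed]["majority"] += 1
        if other <= threshold * len(cluster):
            stats[feed]["almost_exclusive"] += 1
        if other == 0:
            stats[feed]["exclusive"] += 1
    return stats
```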
The information above helps in determining whether a given corpus or feed could improve an anti-spam product's accuracy. For instance, just because roughly 28% of the false negatives belonged to the benchmark feed does not mean that adding that feed would improve the accuracy by a similar percentage. It is more likely that the improvement would be close to 18% – the percentage of campaigns in which said feed holds the majority. However, if the filters have a steep learning curve then the detection bonus would appear only for campaigns where the overwhelming majority of the messages are found only in that feed – which is close to 15%. That detection bonus could drop a further half a per cent if the spam campaigns are not extremely varied and a filter would only need a single representative message to train on in order to detect the entire campaign.
We have described how we create and enhance a spam feed, how we eliminate known types of pollution and how we evaluate a new feed's contribution. We believe this could be a step towards greater integration of different anti-spam solutions. None of us can filter spam we do not receive, and it would be in everyone's interest to create a mechanism whereby each vendor contributes an equal share to a common spam pool.