2007-03-01
Abstract
The effectiveness of content-based spam filters is directly related to the quality of the features used in the filter’s classification model. Vipul Sharma and Steve Lewis discuss how retiring features that have become ineffective can improve the filter's performance.
Copyright © 2007 Virus Bulletin
The effectiveness of content-based spam filters is directly related to the quality of the features used in the filter’s classification model. Features are the specific attributes examined by the spam filter [1] [3]. Highly effective filters may employ an extremely large number of such features (on the order of hundreds of thousands), which can consume a significant amount of both storage space and classification time.
In the ongoing battle between spammers and spam filter developers, new techniques and technologies are continually being introduced by both sides. This means that the number and importance of the features needed to classify spam accurately is subject to continual change. A given feature might be very important at one point in time, but become irrelevant after a few months as spam campaigns and their associated techniques change.
Regularly discarding features that have become ineffective (‘bad features’) benefits the spam filter through reduced classification time (reduced model training time and email delivery time), reduced storage requirements, increased spam detection accuracy and a reduced risk of overfitting the model. (Overfitting occurs when the model trains on a sample set that is skewed by samples that are not representative of real-world threats – while the filter’s performance against the training samples continues to improve, its performance against new, unseen samples worsens.)
This article reports the results of experimentation with continuous feature-selection methods in real-world spam filters.
In machine learning, features are the inherent representation of an instance (email messages in our case). To handle the continual introduction of new types of spam efficiently, it is important to add new features or attributes to the spam filter model. An equally important step in keeping classifiers efficient is to keep track of these attributes and to monitor their discriminative ability.
It is essential to keep good (highly discriminative) features to ensure ongoing classification accuracy. But it is also important to discard bad (irrelevant or ineffective) features for the following reasons:
- Bad attributes increase the error rate in classification, thus reducing the overall effectiveness of the filter.
- As more and more attributes are added, the complexity of the model grows, resulting in increased computation cost (classification time).
- There is a risk of overfitting the model, caused by redundant or useless attributes.
Being able to distinguish between good and bad features is essential for ensuring the long-term effectiveness of the model. The factors involved in differentiating between good and bad features are described below.
The logic behind any feature extraction in spam filtering is that the feature should occur frequently in spam messages and infrequently in ham messages (i.e. legitimate, non-spam emails), or vice versa. An ideal feature would occur only in spam or only in ham messages.
The methods used to evaluate the quality of features are extremely important to ensure effectiveness and low false positive rates.
One well-known example of a content-based spam filter is the open-source SpamAssassin (SA), which calculates the effectiveness of a feature using the S/O (spam/overall) metric. The S/O of a feature is the proportion of the feature’s total occurrences that were in spam messages (i.e. the number of times the feature occurs in spam messages divided by the total number of times the feature occurs). A feature with an S/O of 1.0 occurs only in spam messages, while a feature with an S/O of 0.0 occurs only in ham messages.
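The S/O calculation itself is simple. As a rough illustration (the function name and counts below are our own, not SpamAssassin’s actual implementation), a minimal Python sketch might look like this:

```python
# Minimal sketch of the S/O (spam/overall) metric described above.
# Counts are hypothetical; in a real filter they would come from the
# training corpora.

def s_over_o(spam_hits: int, ham_hits: int) -> float:
    """Fraction of a feature's occurrences that were in spam messages."""
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.0

# Example: a feature seen in 998 spam and 2 ham messages.
print(s_over_o(998, 2))   # 0.998 -> occurs almost exclusively in spam
print(s_over_o(0, 500))   # 0.0   -> occurs only in ham
```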
Measuring the quality of features based purely on their S/O value would bias the classification model towards ‘all spam’ features, since this metric will only select features that occur frequently in spam emails. It is important also to select features that are indicative of ham.
| Feature | Spam | Ham | S/O |
|---|---|---|---|
| Viagra | 92.1% | 7.9% | 0.921 |
| Buy Viagra | 99.8% | 0.2% | 0.998 |
| MSGID_RANDY* | 82% | 18% | 0.82 |
| \/i@gr@@** | 100% | 0% | 1.0 |
| visit online | 50% | 50% | 0.5 |
| X_NO_RULES_FIRED*** | 20% | 80% | 0.2 |
Table 1. Features and their S/O values. *MSGID_RANDY is a SpamAssassin rule that checks for patterns in the headers of spam messages. **\/i@gr@@ is a common obfuscation of the drug trade name ‘Viagra’. ***X_NO_RULES_FIRED occurs when no rule or meta rule fires, and is indicative of ham messages.
Table 1 compares the effectiveness of features based on their S/O values. The second column of the table reports the percentage of messages that are spam when a given feature is present. The table shows that the feature ‘visit online’ has a higher S/O value than the feature ‘X_NO_RULES_FIRED’, since the former is seen in more spam messages than the latter (50% as compared to 20%). However, rating these features purely by their relative S/O values ignores the fact that ‘visit online’ is present equally in both spam and ham messages, hence it is of no use in discriminating between the two types.
On the other hand, the feature ‘X_NO_RULES_FIRED’ is found significantly more often in ham messages than in spam messages, hence it is a good feature. Using a metric like S/O alone will not select such a feature and the final model will have a higher false positive rate as a result.
To address this aspect of feature selection, we benchmarked several statistical feature selection techniques; one of them is discussed in the next section.
Information Gain (IG) is a widely used method of feature selection in machine learning. The goal of IG is to reduce the overall size of the feature space (i.e. dimensionality reduction). In this way, IG is essentially used as a preprocessing stage prior to training.
IG measures the change in entropy (or randomness) of the model due to a given feature [5] [6]. (A model is more predictive if it is less random. If the randomness decreases due to a feature, it is believed to be a good feature, and if the randomness increases, then the feature is considered to be a bad one.)
Generally, for a training set S that consists of positive and negative examples of some target concept (such as spam/ham), the information gain of an attribute A that can take values from values(A) is given by:
$$\mathit{IG}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where $S_v = \{\, s \in S \mid A(s) = v \,\}$ is the subset of S for which attribute A has value v.
For a given training set S, the entropy is defined as:
$$\mathrm{Entropy}(S) = -\sum_{i \in \mathit{Class}} p_i \log_2 p_i$$

where $p_i$ is the proportion of S belonging to class i.
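For binary (present/absent) features over a spam/ham corpus, the two formulas above reduce to a few lines of code. The Python sketch below uses a hypothetical corpus and our own function names; it reproduces the intuition behind Table 1, giving ‘visit online’ an IG of zero and a ham-indicative feature such as ‘X_NO_RULES_FIRED’ a positive IG:

```python
import math

def entropy(class_counts):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the classes in S."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count:
            p = count / total
            ent -= p * math.log2(p)
    return ent

def information_gain(spam_with, ham_with, spam_total, ham_total):
    """IG of a binary (present/absent) feature over a spam/ham corpus."""
    total = spam_total + ham_total
    # Entropy of the full training set S.
    base = entropy([spam_total, ham_total])
    # Split S by feature value: messages with and without the feature.
    with_counts = [spam_with, ham_with]
    without_counts = [spam_total - spam_with, ham_total - ham_with]
    remainder = 0.0
    for subset in (with_counts, without_counts):
        remainder += (sum(subset) / total) * entropy(subset)
    return base - remainder

# Hypothetical corpus of 10,000 spam and 10,000 ham messages:
# a feature firing equally in both classes has IG = 0 (a bad feature);
# a feature firing mostly in ham has IG > 0 (a good feature).
print(information_gain(500, 500, 10_000, 10_000))   # 0.0
print(information_gain(200, 800, 10_000, 10_000))   # > 0
```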
In our ongoing investigations of information gain as a method of feature selection, an IG threshold was chosen to produce the best accuracy on the training corpora (the set of spam and ham messages used to train the learning-based spam filter). Features that had an IG below the threshold were discarded.
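As a rough sketch of this pruning step (reusing information_gain() from the previous example; the threshold value and the feature_stats structure are illustrative assumptions, not the exact values used in our filter):

```python
# Threshold tuned empirically for best accuracy on the training corpora;
# the value below is purely illustrative.
IG_THRESHOLD = 0.001

def prune_features(feature_stats, spam_total, ham_total):
    """Keep only features whose IG meets the threshold.

    feature_stats maps feature name -> (spam_hits, ham_hits).
    """
    return {
        name: hits
        for name, hits in feature_stats.items()
        if information_gain(hits[0], hits[1],
                            spam_total, ham_total) >= IG_THRESHOLD
    }
```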
We observed a performance boost of around 8% after discarding the features with IG below the threshold. In addition, some rules were optimized once we understood the tricks spammers were using to bypass them, and many others were optimized for improved runtime performance.
We also noticed that after removing bad features, the error rate on the training data was reduced – meaning that we were producing better models that resulted in greater anti-spam effectiveness. Employing this process on a regular basis ensures that the feature set is cleaned of ineffective features, thereby ensuring a high level of effectiveness over time.
Regular feature extraction is required to keep spam filters functioning at the highest levels of effectiveness, but this can also result in an ever-increasing feature set and accompanying increases in processing time.
IG has been shown to be an effective method for measuring the quality of features and determining those which should be discarded. Using our technique, we were able to improve our spam filter’s performance substantially and increase its accuracy. The use of feature selection also decreases the risk of overfitting as the filter no longer trains itself on bad features.
[1] SpamAssassin. http://www.spamassassin.org/.