Detecting spam pictures using statistical features

Sándor Antal VirusBuster

The problem we want to solve is to detect spam messages which contain essential information in an attached picture.

Unfortunately, spammers nowadays usually vary the pictures randomly (e.g. by adding little dots or lines), so the images in two instances of the same spam differ. Their aim is to prevent the spam pictures from being detected by hash-based methods. Our goal was to eliminate the problems caused by this trick and to develop a fast method that is less sensitive to small differences between pictures than hash-based methods are.

The methods we have developed and use are:

- calculating statistical parameters of the image file (size, average, standard deviation, etc.) without rendering the image;
- smoothing the image using different IF methods (for example Gaussian blur or various types of granulation filters) to remove disturbances such as random dots;
- calculating global parameters of the image (e.g. brightness, contrast);
- using these parameters in a hash function which yields similar hash values for similar pictures, so that if the hash values of two pictures differ only slightly, the pictures are the same or almost the same with respect to these parameters;
- treating these parameters as spam/ham features and applying the Bayesian method.

This means that it is enough to train the filter on only a few spam instances (maybe only one) and, unless the pictures are varied significantly, it can detect the modified variations as well.
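
The byte-level statistics and the tolerant hash described above could be sketched roughly as follows. This is a minimal illustration only, not the hash function actually used in this work: the function names, the bucket widths (`size_step`, `value_step`) and the `tolerance` threshold are all assumptions chosen for the example, and the smoothing and Bayesian steps are not shown.

```python
import math


def byte_stats(data):
    """Size, mean and standard deviation of the raw file bytes (no image decoding)."""
    n = len(data)
    if n == 0:
        return 0, 0.0, 0.0
    mean = sum(data) / n
    variance = sum((b - mean) ** 2 for b in data) / n
    return n, mean, math.sqrt(variance)


def coarse_signature(data, size_step=512, value_step=4.0):
    """Quantize each statistic into buckets (bucket widths are assumed values).

    Small random changes to the file usually leave every statistic in the
    same bucket or move it to a neighbouring one.
    """
    size, mean, std = byte_stats(data)
    return (size // size_step, int(mean // value_step), int(std // value_step))


def signature_distance(a, b):
    """Largest per-component bucket difference between two signatures."""
    return max(abs(x - y) for x, y in zip(a, b))


def looks_like_known_spam(data, known_signatures, tolerance=1):
    """True if the file's signature is within `tolerance` buckets of a known one."""
    sig = coarse_signature(data)
    return any(signature_distance(sig, k) <= tolerance for k in known_signatures)
```

With a scheme of this kind, a single stored signature can match many slightly altered copies of the same file, whereas an exact hash would change with every modified byte. The bucket widths trade robustness against false matches and would need tuning on real data.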