Automatically detecting spam at the cloud level using text fingerprints

2012-06-01

Marius Nicolae Tibeica

Bitdefender, Romania

Adrian Toma

Bitdefender, Romania

Editor: Helen Martin

Abstract

With content-based anti-spam technologies decreasing in efficiency, Marius Tibeica and Adrian Toma propose a fingerprinting algorithm that maps similar text inputs to similar signatures.

Table of contents

1. Introduction

2. The fingerprint

2.1 The basic fingerprint
2.2 The zoom in
2.3 The zoom out
2.4 Choosing the hash function
2.5 Example of fingerprint generation
2.6 Zoom levels on spam & legitimate emails

Comparing fingerprints

Detection and FP rates

Further study and limitations

Due to increases in spam volume, as well as language diversity, content-based anti-spam technologies have decreased in efficiency. Alternative methods of similarity/outbreak detection are much needed, and by taking advantage of technological advances in the cloud infrastructure, we can reduce the impact on clients’ resources.

To address the similarity problem, we propose a fingerprinting algorithm that maps similar text inputs to similar signatures. There are two steps: the first involves creating an element of the fingerprint from each word or group of words, chosen by certain heuristics. The size of the text on which the fingerprint is created is very important: too little information can generate false positives, and too much information can make the matching process costly. Our approach is either to zoom in (increasing the number of fingerprint elements each word generates) if the text is too short, or zoom out (gradually reducing the length by eliminating certain groups) if the text is too long. We have tested the method using a clean stream of spam to train a matching filter with the Levenshtein distance as an indicator of similarity.

1. Introduction

Spammers constantly adapt their techniques in order to avoid detection filters. Signature-based anti-spam filters require frequent updates in order to remain effective, especially given the speed with which spam changes. Bayesian filters need constant training and can also miss spam with malicious attachments. IP address blocking is also problematic, as most spammers now rapidly change their IP addresses. Furthermore, a legitimate server that has been compromised for a short period of time cannot be blacklisted, as it also sends legitimate email messages. Spammers try to decrease the efficiency of URI blacklisting by registering a large number of domains or by using URL-shortening services. The need for an automatic similarity/outbreak detection method is clear.

The increasing popularity of portable devices and recent technological advances in the cloud infrastructure make moving processing away from the client an obvious choice – both to reduce the impact on the client’s resources and to significantly decrease update times. This shift in perspective calls for the use of a reliable fingerprint generation algorithm.

2. The fingerprint

There is an existing algorithm that generates fingerprints: context-triggered piecewise hashing (also known as fuzzy hashing) [1]. Unfortunately, on text of small dimensions, the length of the signature generated by the algorithm is too small and is unusable. This represents a big portion of spam messages, and these are also the hardest to detect using other content-based filters.

Our approach to creating a fingerprint for text is to focus on the actual words contained in the message, as this gives a good separation of entities in most languages. By generating a character from each word we obtain a basic fingerprint, but this has several limitations, which we address as follows.

2.1 The basic fingerprint

The creation of the basic fingerprint involves several steps:

The input string is separated into different entities by delimiters [2]. These entities can be considered words.
Entities larger than a certain threshold value can be further separated.
We apply a hash function to each entity.
A base64-encoded value of the six least significant bits of the hash is appended to the final fingerprint.

This will produce a fingerprint with a length equal to the number of entities found. If a fingerprint with a length within a certain range is needed, further processing is required.

2.2 The zoom in

Too little information from a fingerprint can generate false positives. To avoid this we can increase precision by gradually increasing the number of encoded values each hash appends to the final fingerprint. The number of encoded values represents the zoom level.

In this case, a 30-bit hash can offer up to five levels of zooming (for a 64-letter alphabet), and the possibility to increase the length of the fingerprints up to five times.

2.3 The zoom out

Too much information can make the matching process costly, especially when using a time-consuming [3] edit distance. To decrease the length of the fingerprint, we try to eliminate some of the entities in a way that gives two similar texts similar fingerprints:

To avoid losing too much information, we create new hashes from groups of entities.
For an X zoom-out level, a base64-encoded value of the six least significant bits of the hash is appended to the final fingerprint if hash % X = 0.

There is no way of finding the length of a fingerprint with a certain zoom-out level without calculating it, so zooming in will be done gradually until an acceptable length is found.

2.4 Choosing the hash function

We checked several hash functions to see which offered the least number of collisions on words from emails in various languages. The best choice was RSHash.

Hash function	Collisions 32 bits	Collisions 30 bits
RSHash	0	4
BKDRHash	1	6
SDBMHash	2	7
OneAtATimeHash	2	6
APHash	4	6
FNVHash	7	10
FNV1aHash	7	10
JSHash	266	277
DJBHash	266	268
DEKHash	435	720
PJWHash	1687	1687
ELFHash	1687	1687
BPHash	61907	70909

Table 1. Analysis of hash functions on over 122,000 words from emails in various languages.

2.5 Example of fingerprint generation

Table 2 and Table 3 show how fingerprints with different zoom levels are generated from the text:

‘High end designer watch and handbag replica sale. Compare our price on a handful of our high end replicas!’

Entities	Hash in hex	Basic fingerprint	2x zoom in fingerprint	4x zoom in fingerprint
high	25c4f948	I	EI	lE5I
end	260c1435	1	M1	mMU1
designer	84f5afb	7	P7	IPa7
watch	34f5dc75	1	11	01c1
and	2367c3d9	Z	nZ	jnDZ
handbag	1aa88b79	4	o5	aoL5
replica	33381eca	K	4K	z4eK
sale	e96c2eb	r	Wr	OWCr
compare	1c947587	H	UH	cU1H
our	24b80bd8	Y	4Y	k4LY
price	3b54d80d	N	UN	7UYN
on	1777af4f	P	3P	X3vP
a	61	h	Ah	AAAh
handful	380be94e	O	LO	4LpO
of	1777af47	H	3H	X3vH
our	24b80bd8	Y	4Y	k4LY
high	3f155a68	o	Vo	/Vao
end	260c1435	1	M1	mMU1
replicas	ad4c229	p	Up	KUCp

Table 2. Basic fingerprint and zooming in example.

Entity groups	Hash in hex	1/2x zoom out	1/3x zoom out	1/4x zoom out	1/5x zoom out	1/6x zoom out
high end designer	54206878	4		4	4
end designer watch	63514ba5		1		1
designer watch and	60acfb49
watch and handbag	73062bc7		H
and handbag replica	71486e1c	c		c
handbag replica sale	5c776d2e	u	u			u
replica sale compare	5e63573c	8		8	8
sale compare our	4fe3444a	K
compare our price	7ca1596c	s		s
our price on	77849334	0	0	0	0	0
price on a	52cc87bd				9
on a handful	4f8398fe	+
a handful of	4f8398f6	2
handful of our	743ba46d
of our high	7b451587		H
our high end	89d97a75
high end replicas	6ff630c6	G	G			G

Table 3. Zooming out example.

The basic and zoom-in fingerprints are generated from the same hashes, with the following results:

Basic fingerprint: I171Z5KrHYNPhOHYo1p
Fingerprint with 2x zoom in: EIM1P711nZo54KWrUH 4YUN3PAhLO3H4YVoM1Up
Fingerprint with 4x zoom in: lE5ImMU1IPa701c1jnDZaoL5z4eKOWCrcU1Hk4LY7 UYNX3vPAAAh4LpOX3vHk4LY/VaomMU1KUCp

The zoom-out fingerprints are:

1/2 x 4cu8Ks0+2G
1/3 x lHu0HG
1/4 x 4c8s0
1/5 x 4l809
1/6 x u0G

2.6 Zoom levels on spam & legitimate emails

We analysed the spam flux and legitimate email messages over the course of two weeks. Setting a desired fingerprint length of 127 to 256 characters, we obtained the following results:

Figure 1. Zoom levels on spam emails.

The cumulative results are:

Legitimate emails zoom in: 21.84%
Legitimate emails no zoom: 19.88%
Legitimate emails zoom out: 57.23%
Legitimate emails no suitable text: 0.99%
Spam emails zoom in: 36.26%
Spam emails no zoom: 20.82%
Spam emails zoom out: 42.15%
Spam emails no suitable text: 0.72%

Comparing fingerprints

Two fingerprints can be compared to determine whether the texts from which they were derived are similar.

Because the method of creating a fingerprint differs with each zoom level, only those with an identical zoom level can be compared. The examination looks at the zoom level and computes a Levenshtein distance, which then is scaled to produce a match score. For two fingerprints, f₁ and f₂, the score is:

By choosing a threshold T, in the range from 0 through 1, (1 meaning that a perfect match is required), we can say that the two fingerprints match if T ≤ S (f₁, f₂).

Detection and FP rates

We took a continuous stream of spam (15 hours, 865,000 emails) and divided it into 10-minute intervals. For a certain interval, we trained the filter with all the emails from the previous intervals and found a detection rate with a similarity threshold of 0.75 for both fuzzy hashing (with variable block size) and the proposed fingerprinting algorithm. The results are presented in Figure 2.

Figure 2. Detection rates on spam emails.

We then trained the filters with all the spam emails and ran a check on a corpus of 500,000 legitimate emails and newsletters. No false positives were registered.

The fingerprinting technology was also used in between official tests in VBSpam comparative testing and the zero false-positive rate was confirmed.

Further study and limitations

The fingerprint is based on content, especially words. As long as an email message has no words (including emails that only contain images or URLs) a fingerprint cannot be generated.

Bibliography

[1] Levenshtein, V. I. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 4, 163, 845–848, 1965.

[2] The delimiters that we chose are: {‘ ’, ‘\n’, ‘\t’, ‘\r’, ‘\0’, ‘.’, ‘,’, ‘:’, ‘;’, ‘(‘,’)’, ‘{‘,’}’, ‘[‘,’]’, ‘\\’, ‘/’, ‘^’, ‘\”’, ‘!’, ‘?’, ‘`’, ‘\’’, ‘+’, ‘*’, ‘^’, ‘$’, ‘|’, ‘?’, ‘”’}

[3] The Levenshtein edit distance [1] is found in O(mn) time (where m and n are the length of the measured strings).

[4] Kornblum, J. Identifying almost identical files using context triggered piecewise hashing. DFRWS conference 2006.

Latest articles:

Nexus Android banking botnet – compromising C&C panels and dissecting mobile AppInjects

Aditya Sood & Rohit Bansal provide details of a security vulnerability in the Nexus Android botnet C&C panel that was exploited to compromise the C&C panel in order to gather threat intelligence, and present a model of mobile AppInjects.

Cryptojacking on the fly: TeamTNT using NVIDIA drivers to mine cryptocurrency

TeamTNT is known for attacking insecure and vulnerable Kubernetes deployments in order to infiltrate organizations’ dedicated environments and transform them into attack launchpads. In this article Aditya Sood presents a new module introduced by…

Collector-stealer: a Russian origin credential and information extractor

Collector-stealer, a piece of malware of Russian origin, is heavily used on the Internet to exfiltrate sensitive data from end-user systems and store it in its C&C panels. In this article, researchers Aditya K Sood and Rohit Chaturvedi present a 360…

Fighting Fire with Fire

In 1989, Joe Wells encountered his first virus: Jerusalem. He disassembled the virus, and from that moment onward, was intrigued by the properties of these small pieces of self-replicating code. Joe Wells was an expert on computer viruses, was partly…

Run your malicious VBA macros anywhere!

Kurt Natvig wanted to understand whether it’s possible to recompile VBA macros to another language, which could then easily be ‘run’ on any gateway, thus revealing a sample’s true nature in a safe manner. In this article he explains how he recompiled…

Bulletin Archive