2008-12-01
Abstract
Roel Schouwenberg highlights one potential consequence of the attention currently being devoted to dynamic testing: the possibility that the increased focus on dynamic testing may inspire malware authors to devote more attention to circumventing products’ protection capabilities, rather than just their detection abilities.
Copyright © 2008 Virus Bulletin
Anyone who is remotely involved in the anti-malware industry will know that testing is a hot topic – the subject has received a lot of attention lately, particularly following the formation of AMTSO, the Anti-Malware Testing Standards Organization, earlier this year.
This article is not intended to be the nth paper describing how testing should be performed. However, it will highlight one potential consequence of the attention currently being devoted to dynamic testing: the possibility that the increased focus on dynamic testing may inspire malware authors to devote more attention to circumventing products’ protection capabilities, rather than just their detection abilities.
This article will use a number of examples and scenarios to evaluate the risk associated with dynamic testing. It will also put forward a number of suggestions for ways in which testers can mitigate the risk.
Before evaluating the risks associated with dynamic testing we will first have a look at a number of other testing methods.
Static testing is the most straightforward type of testing: an on-demand scan is run on a collection of malware. In order to produce meaningful results, any respectable static test these days must use a collection of malware containing thousands of samples, while the test sets used by testing bodies AV-Test and AV-Comparatives usually contain hundreds of thousands of samples and in some cases more than a million samples.
Since the test collections are so large, the results contain little (if any) useful detail that malware authors can use to help them fine-tune their creations.
Another thing to bear in mind when conducting static tests is the way in which the test collection is broken down. Certain less credible tests seem to have been performed on a big pile of unsorted malware samples – and although the tester may have enumerated the results of different sub-sets internally, this would still be considered bad testing practice.
Other, more credible tests make proper differentiation in their published test results. They sort their test collections into sub-sets – such as viruses, worms and trojans – and publish results for each sub-set. Some tests go even further and attempt to differentiate, for instance, between backdoors and spyware trojans.
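To illustrate, the sketch below shows how such per-sub-set detection rates might be tallied by a test harness. The scanner command line ('scancli') and the directory layout are hypothetical placeholders, not any particular product's interface; a real harness drives each vendor's own on-demand scanner and its own exit-code conventions.

```python
# A minimal sketch of tallying per-sub-set detection rates in a static test.
# The "scancli" command and the directory layout are hypothetical placeholders.
import subprocess
from pathlib import Path

SUB_SETS = ["viruses", "worms", "trojans"]   # one directory per sub-set

def is_detected(sample: Path) -> bool:
    """Assume a non-zero exit code from the scanner means a detection."""
    result = subprocess.run(["scancli", "--scan", str(sample)],
                            capture_output=True)
    return result.returncode != 0

def detection_rates(collection_root: Path) -> dict:
    rates = {}
    for sub_set in SUB_SETS:
        samples = [p for p in (collection_root / sub_set).iterdir() if p.is_file()]
        detected = sum(is_detected(s) for s in samples)
        rates[sub_set] = detected / len(samples) if samples else 0.0
    return rates

if __name__ == "__main__":
    for sub_set, rate in detection_rates(Path("collection")).items():
        print(f"{sub_set}: {rate:.1%} detected")
```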
Although this breakdown provides a greater level of detail, it still does not provide much useful information for the malware author. The only possible risk from static tests comes from those that look at how well products detect polymorphic/metamorphic malware. Such tests may highlight weaknesses in certain products.
However, polymorphic/metamorphic malware is inherently more difficult to detect than static malware – and a malware author who needs the results of a test to find this out is most likely not competent enough to write such malware anyway. Having said that, someone who is not competent enough to write a polymorphic piece of malware can now go to the underground and either buy such a piece of code or advertise for someone to create it.
Although response time testing is carried out only rarely these days, it is still worth looking at. It was most popular during the era of the big email worm outbreaks – NetSky, Bagle, Mydoom, Sober and Sobig are classic examples of that period.
In contrast to static testing, the sample sets used for response time testing are very small. One type of response time testing measures the overall performance of AV vendors based on a larger, though still relatively small, test set [1] – this type doesn’t show the results for individual samples. The second type measures detection on a per-sample basis [2].
Such specific, per-sample results could be a clear aid to malware authors. One can speculate that the published response time test results may have led certain malware authors to change their approach to make detection of their malware harder. One example is that of W32/Sober.K [3], which appended random garbage at the end of its file during installation in a deliberate attempt to slow down AV detection. It is quite possible that the author introduced this functionality after being unhappy with the response time results of earlier Sober variants.
Today’s response time testing results are published in a more generic fashion with little reference to specific samples. The risk of malware authors gaining too much information is therefore very low.
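To make the distinction concrete, the sketch below contrasts per-sample response times with the aggregate, per-vendor figures that this more generic reporting style boils them down to. The vendor names, sample names and timings are all invented for illustration.

```python
# A minimal sketch contrasting per-sample response times with a single
# aggregate figure per vendor. All names and timings are invented.
from statistics import mean

# hours from a sample's first appearance until each product detected it
response_hours = {
    ("Worm.A", "VendorX"): 2.0,
    ("Worm.A", "VendorY"): 11.5,
    ("Worm.B", "VendorX"): 5.2,
    ("Worm.B", "VendorY"): 1.0,
}

def per_vendor_average(vendor: str) -> float:
    """The 'generic' reporting style: one aggregate figure, no per-sample detail."""
    return mean(h for (sample, v), h in response_hours.items() if v == vendor)

if __name__ == "__main__":
    for vendor in ("VendorX", "VendorY"):
        print(f"{vendor}: {per_vendor_average(vendor):.1f} h on average")
```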
In retrospective testing, an out-of-date product is tested against up-to-date malware samples – but, other than its purpose (to investigate the product’s ability to detect unknown samples) and the age of the updates, retrospective testing is no different from static testing.
The level of risk that retrospective tests introduce is very low, just like regular static tests.
In dynamic testing, malware samples are introduced into the test system with the intent to execute them. Only a small number of samples can be used in these tests because the process of executing each one is very time consuming. AMTSO has published a document which explains the idea behind dynamic testing in greater detail [4].
Ideally, the samples are introduced onto the test system in the ‘right’ way – for instance via drive-by download. Even when automated, this is a very time-consuming task, and the fact that virtual machines need to be avoided in order to obtain valid test results does not help the matter.
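As a rough illustration of why the process is so time consuming, the sketch below outlines the loop a dynamic test harness has to run for every single sample. Every function here is a hypothetical placeholder for infrastructure – reimaging real hardware, delivering the sample via a drive-by download, checking the resulting state of the system – that takes far longer than an on-demand scan.

```python
# A minimal sketch of a dynamic test loop. Each function is a hypothetical
# placeholder for real test infrastructure, not an actual API.

SAMPLES = ["sample-001", "sample-002"]   # dozens at most, as noted in the text

def restore_clean_image() -> None:
    """Placeholder: reimage real hardware (virtual machines are avoided)."""
    pass

def introduce_sample(sample: str) -> None:
    """Placeholder: deliver the sample the 'right' way, e.g. via drive-by download."""
    pass

def execute_sample(sample: str) -> None:
    """Placeholder: let the threat attempt to run on the test system."""
    pass

def system_is_clean() -> bool:
    """Placeholder: inspect the system to see whether the product's protection held."""
    return True

def run_dynamic_test(samples):
    """Protection, not just detection, is what gets measured per sample."""
    results = {}
    for sample in samples:
        restore_clean_image()
        introduce_sample(sample)
        execute_sample(sample)
        results[sample] = system_is_clean()
    return results

if __name__ == "__main__":
    for sample, protected in run_dynamic_test(SAMPLES).items():
        print(f"{sample}: {'protected' if protected else 'compromised'}")
```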
The number of samples included in tests is currently in the dozens. In time, with better hardware and optimized processes, we can expect the number of samples being used to reach the hundreds. However, the risk to which the industry is exposed with the introduction of dynamic testing is far greater than that associated with any of the other popular testing methods.
A large part of the industry is spending a lot of time on educating people about (new) proactive technologies, and AMTSO has put a lot of work into compiling a best practices paper for dynamic testing [1]. The collective message is that focusing on products’ detection rates alone is outdated, and that testers now need to consider products’ protective measures as well.
There’s little doubt that the malware authors are also listening.
Some five years ago, Symantec incorporated a protection feature called ‘anti-worm’ into its Norton products. This was a behavioural system used to catch email worms proactively. Symantec expected to have to update the technology no more than six months after its introduction in order for it to remain effective against evolving malware. However, the developers have never had to update it [5]. Malware authors either did not know about the technology, didn’t bother to circumvent it, or were unable to.
In May 2006, Kaspersky Lab launched its version 6 product line which featured a new-generation behaviour blocker. Six months later a patch had to be released for the behaviour blocker because new variants of the LdPinch trojan family were bypassing the technology that had initially been capable of catching them.
What was the difference between the two behaviour blockers? ‘Anti-worm’ was introduced in the era of big malware epidemics, driven mostly by fame-seeking authors. By 2006 the majority of malware out there was being driven by profit, including LdPinch. Another thing to bear in mind is that Kaspersky Lab is a Russian company and LdPinch was a Russian creation, targeted mainly at the Russian market.
However, there could be a third explanation for the difference (though it should be noted that the author of this article is by no means a marketing expert): it would seem that Kaspersky Lab invested more effort into the marketing of its behaviour blocker than Symantec did at the time it introduced ‘anti-worm’, thus malware authors were more aware of the Kaspersky product and made the effort to circumvent it.
Multi-scanners are another interesting demonstration of malware authors using legitimate services to gain information for their own purposes. Online multi-scanners enjoy great popularity these days, with JottiScan [6] and VirusTotal [7] being the most popular.
These websites provide a service whereby the user can upload a file and have it scanned by a whole range of scanners to see what the various products detect it as (and if they detect it at all).
These services also enjoy some popularity with malware authors who use them to test their latest creations and see whether they successfully avoid detection. While some anti-malware vendors provide the multi-scanner sites with their most recent scanners and ask the maintainer to use the most paranoid settings to detect as much malware as possible, there are also vendors who don’t offer their most recent scanner to these sites. They may also ask the maintainer to use lower than maximum settings, so the product will detect less malware than it is capable of in real-world use [8], thus not revealing its true detection capabilities.
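For illustration only, the sketch below shows the basic aggregation idea behind such services: one submitted file, many engines, one verdict per engine. The engine names, command lines and exit-code convention are hypothetical and do not describe how Jotti or VirusTotal actually invoke the products.

```python
# A minimal sketch of the multi-scanner idea: one submitted file is handed to
# a set of command-line scanners and each verdict is collected. The engine
# names and command lines are hypothetical placeholders.
import subprocess

SCANNERS = {
    "EngineA": ["engine_a_cli", "--scan"],
    "EngineB": ["engine_b_cli", "--paranoid", "--scan"],  # most aggressive settings
}

def multiscan(path: str) -> dict:
    verdicts = {}
    for name, cmd in SCANNERS.items():
        result = subprocess.run(cmd + [path], capture_output=True, text=True)
        # Convention assumed here: non-zero exit means a detection whose name
        # is printed on stdout; zero means the file was considered clean.
        verdicts[name] = result.stdout.strip() if result.returncode else "clean"
    return verdicts

if __name__ == "__main__":
    for engine, verdict in multiscan("submitted_sample.bin").items():
        print(f"{engine}: {verdict}")
```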
These days malware that bypasses protection features is by no means rare. However, the vast majority of malware authors still focus solely on bypassing detection mechanisms.
Some anti-malware products have little in the way of protection measures, and bypassing their detection capabilities still means bypassing the entire product. Therefore it’s not so strange to see Win32 PE malware samples that are obfuscated in such a way that de-obfuscation takes roughly two minutes on a Core 2 Duo running at 2,500 MHz. However, the same malware samples can be detected proactively using behaviour blocker technology from two years ago [9].
There’s little doubt that the current noise regarding protection technologies and dynamic testing is causing some malware authors to pay extra attention to these types of technology. With the current buzz surrounding the topic, it’s likely that the interest of more malware authors will be piqued.
There are a number of scenarios likely to occur: firstly, it’s likely that new groups will form in the underground that will focus on providing the means for malware to circumvent protection technologies. Secondly, there may be a new market for improved multi-scanners which will test products’ protection technologies as well as their detection abilities.
The big problem with malware bypassing protection technologies is the matter of fixing the holes the malware authors are exploiting. While vendors can ship a new signature update in hours or even minutes, fixing holes takes much longer – we’re talking about a response time of week(s) rather than days, let alone hours.
Revealing per-sample test result details is a much more dangerous idea with dynamic testing than it is with response time testing. While there is a low-to-moderate risk in revealing too much detail with response time testing, the risk is very high with dynamic testing.
Having a limited sample set for testing means that the samples tested need to be very relevant. If testers are going to publicize the results for each such important sample, including how individual products perform against them, then this is extremely valuable information for the malware authors. It will show them against which products they need to improve their creations.
Virus Bulletin is not yet publishing dynamic testing results, but plans to use information from its (upcoming) prevalence table to pick samples for testing. While the testers will start out just by mentioning malware families, they may end up disclosing specific malware names as well [10].
AV-Comparatives is also not yet publishing dynamic testing results, but intends to publish the names of the samples being used for its future dynamic tests.
As pointed out, this approach should be avoided. AV-Test, which is already performing dynamic tests, takes a better approach. Magazines are prohibited from disclosing the malware names or hashes of files that were used in the test. However, AV-Test will share the hashes or samples with the AV vendors that participated in the test [10].
Though slightly less transparent for end-users, this approach is by far preferable in terms of risk mitigation, while still allowing any vendor to notify the tester if it finds that non-relevant samples have been used in the test set.
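As a simple illustration of that approach, the sketch below computes file hashes for a dynamic test set so that they can be shared privately with participating vendors instead of being published. The directory name is illustrative, and SHA-256 is used merely as an example of a suitable hash.

```python
# A minimal sketch of the hash-sharing approach: rather than publishing sample
# names, the tester computes cryptographic hashes of the files used and shares
# them privately with participating vendors. Paths and hash choice are illustrative.
import hashlib
from pathlib import Path

def sample_hashes(sample_dir: Path) -> dict:
    """Map each sample file name to its SHA-256 digest for private disclosure."""
    hashes = {}
    for sample in sorted(sample_dir.iterdir()):
        if sample.is_file():
            hashes[sample.name] = hashlib.sha256(sample.read_bytes()).hexdigest()
    return hashes

if __name__ == "__main__":
    for name, digest in sample_hashes(Path("dynamic_test_set")).items():
        print(f"{digest}  {name}")
```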
The adoption of dynamic testing brings with it some new challenges. Now, more than ever, testing can have a real influence on malware authors and their actions.
Security vendors need to take care that in their quest for education they do not lose sight of what really is important: the protection of users.
Care must be taken to avoid a situation in which education speeds up malware evolution and causes more problems than solutions. AMTSO’s first rule of the fundamental principles of testing document states that testing must not endanger the public [4].
The attention on protection technologies and dynamic testing is inevitably leading to increased awareness on all sides, including that of malware authors. It will be up to the industry as a whole, possibly in the form of AMTSO, to minimize the risk and ensure that testers do not reveal too much detail in their public test results.
[1] AV-Test response time test 2005. http://voices.washingtonpost.com/securityfix/2005/12/ranking_response_times_for_ant.html.
[2] AV-Test W32/Sober variant response time results. http://www.pcmag.com/article2/0,2817,1813927,00.asp.
[3] F-Secure W32/Sober.K write-up. http://www.f-secure.com/v-descs/sober_k.shtml.