Posted by Martijn Grooten on Jun 22, 2017
Researchers at Cisco have published a paper (PDF) describing how it may be possible to use machine learning to distinguish malware command-and-control (C&C) traffic using TLS from regular enterprise traffic, and to classify malware families based on their encrypted C&C traffic.
The need for malware to communicate with its operators, so that it can receive instructions and exfiltrate information from infected systems, is a weak point: this activity is hard to hide from security products scanning network traffic. For this reason, the trend among malware of using SSL/TLS – the protocol over which a significant portion of today's web and email traffic is sent – is an understandable one.
A good encryption protocol makes encrypted content indistinguishable from random noise, but while TLS uses top-class encryption standards, it cannot avoid the use of metadata that can give away some essential details of the communication.
Even if one ignores the remote IP address and the domain sent in the certificate, both of which can help detect a known malware family, TLS includes explicit metadata, such as the cipher suites and TLS extensions offered and used, as well as more implicit metadata, such as the length and frequency of the packets and the variation seen in them.
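To make the distinction between explicit and implicit metadata concrete, here is a minimal Python sketch of how such observations might be turned into a numeric feature vector. It is an illustration only, not the feature set used in the Cisco paper: the `TlsFlow` record, the `extract_features` function and the two short "tracked" lists are hypothetical names and choices made for this example, and it assumes the handshake fields and packet sizes/timestamps have already been parsed out of a capture by some other tool.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

# Assumption: a small, arbitrary vocabulary of IANA code points to track.
# A real system would track many more values than these.
TRACKED_CIPHER_SUITES = [0x0004, 0x0005, 0x002F, 0x0035, 0xC02F]
TRACKED_EXTENSIONS = [0, 10, 11, 35]  # server_name, supported_groups, ec_point_formats, session_ticket


@dataclass
class TlsFlow:
    """Hypothetical record of what a passive observer sees for one TLS flow."""
    offered_cipher_suites: List[int]  # explicit metadata from the ClientHello
    offered_extensions: List[int]     # explicit metadata from the ClientHello
    packet_lengths: List[int]         # implicit metadata: sizes in bytes
    packet_times: List[float]         # implicit metadata: arrival times in seconds


def extract_features(flow: TlsFlow) -> List[float]:
    """Build a fixed-length feature vector without touching any plaintext."""
    features: List[float] = []

    # Explicit metadata: one 0/1 indicator per tracked cipher suite and extension.
    features += [1.0 if cs in flow.offered_cipher_suites else 0.0
                 for cs in TRACKED_CIPHER_SUITES]
    features += [1.0 if ext in flow.offered_extensions else 0.0
                 for ext in TRACKED_EXTENSIONS]

    # Implicit metadata: average packet length and how much it varies.
    lengths = flow.packet_lengths
    features.append(float(mean(lengths)) if lengths else 0.0)
    features.append(float(pstdev(lengths)) if lengths else 0.0)

    # Implicit metadata: gaps between packets, as a rough measure of frequency.
    gaps = [later - earlier
            for earlier, later in zip(flow.packet_times, flow.packet_times[1:])]
    features.append(float(mean(gaps)) if gaps else 0.0)
    features.append(float(pstdev(gaps)) if gaps else 0.0)

    return features
```

Calling `extract_features` on a `TlsFlow` yields 13 numbers here (five cipher-suite indicators, four extension indicators, and four length/timing statistics), none of which requires decrypting the traffic.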
The Cisco researchers trained their machine-learning classifier using a combination of malicious TLS traffic and legitimate enterprise TLS traffic. The classifier was able to identify the TLS traffic of most malware families with high accuracy – even that of families that had not been present in the training set.
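The general idea of training on labelled flows and then scoring unseen ones can be sketched in a few lines of Python. Everything below is an assumption made for illustration: logistic regression is used purely as a simple stand-in (this post does not say which classifier the researchers used), the two features are invented, and the data is synthetic and deliberately easy to separate, so the resulting accuracy says nothing about real-world performance.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit weights and bias with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss with respect to the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return 1 (flag as malicious) or 0 (treat as benign)."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def make_toy_flows(rng, n):
    """Synthetic data, NOT real traffic. Invented features:
    x[0] = 1.0 if the client offers a legacy cipher suite,
    x[1] = scaled count of TLS extensions offered (0..1)."""
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:
            X.append([1.0, rng.uniform(0.0, 0.4)])  # toy "malicious" flow
            y.append(1)
        else:
            X.append([0.0, rng.uniform(0.6, 1.0)])  # toy "benign" flow
            y.append(0)
    return X, y


def run_toy_experiment(seed=0):
    """Train on one synthetic sample and return accuracy on a held-out one."""
    rng = random.Random(seed)
    X_train, y_train = make_toy_flows(rng, 100)
    X_test, y_test = make_toy_flows(rng, 100)
    w, b = train_logistic(X_train, y_train)
    correct = sum(predict(w, b, x) == label for x, label in zip(X_test, y_test))
    return correct / len(y_test)
```

In practice the feature vectors would come from observed metadata rather than a random generator, and the held-out set would ideally include malware families absent from training, which is the harder case the researchers report on.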
The research is very much a work-in-progress and, as befits a good research paper, its authors openly admit the limitations of their work. For instance, the malware was run in Windows XP-based sandboxes, which could have helped the detection: malware often inherits TLS properties from the operating system in which it runs. At the same time, malware is most likely to live on older operating systems, making this set-up not too different from a real-world scenario.
It is also important to note that the classifier was not able to say anything about the content of the traffic; it would thus be useless as part of a data-loss prevention system. TLS, especially its most recent versions, is one of the strongest Internet protocols, and the fact that it properly protects content is a very good thing, even if it can be frustrating for malware analysis and detection.
At Virus Bulletin, we have repeatedly shown how malicious web traffic can be blocked by security products. Organizations using a web security gateway will have to decide whether to have it inspect TLS-encrypted web traffic as well. While I think that, in most scenarios, inspecting the traffic is a compromise worth making, this research shows that one may be able to block malware's ability to connect to its operators without being able to decrypt the traffic.