2012-07-01
Abstract
Although the PDF language was not designed to allow arbitrary code execution, implementation and design flaws in popular reader applications make it possible for criminals to infect machines via PDF documents. Didier Stevens explains how this is possible.
Copyright © 2012 Virus Bulletin
The Portable Document Format (PDF) is still a very popular vector with cybercriminals for infecting as many Windows machines as they can. Although the PDF language was not designed to allow arbitrary code execution, implementation and design flaws in popular reader applications make it possible for criminals to infect machines via PDF documents. Let us explore how this is possible.
The PDF file format is composed of objects that define how pages should be rendered by reader applications such as the ubiquitous Adobe Reader. These objects are logically organized in a hierarchical tree structure. We have a catalog object at the root, and find page objects lower in the tree structure. These page objects refer to other objects to define text and images to be drawn upon an empty page.
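The object structure described above can be made visible with a few lines of code. The following Python sketch lists the indirect objects in a PDF byte string and reports the /Type of each one. The function name list_objects and the sample document are made up for this illustration, and the regular expressions are a deliberate simplification: a real parser also has to deal with object streams, incremental updates and malformed input.

```python
import re

# Matches an indirect object header such as "12 0 obj" and captures the body
# up to the next "endobj". This is a simplification, not a full PDF parser.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def list_objects(pdf_bytes: bytes) -> list[tuple[int, int, str]]:
    """Return (object number, generation, /Type name or '') for each object."""
    result = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number, generation, body = match.groups()
        type_match = TYPE_RE.search(body)
        type_name = type_match.group(1).decode("ascii") if type_match else ""
        result.append((int(number), int(generation), type_name))
    return result


# Hypothetical minimal document: a catalog at the root, a page tree node,
# one page and an empty content stream.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>\nendobj\n"
    b"4 0 obj\n<< /Length 0 >>\nstream\n\nendstream\nendobj\n"
)

if __name__ == "__main__":
    for number, generation, type_name in list_objects(SAMPLE):
        print(number, generation, type_name or "(no /Type)")
```

In the sample, object 1 is the catalog at the root of the tree, and the page object (object 3) refers to object 4 for the content to be drawn on the page.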
Here, we come across the first example of how malware authors can tailor PDF documents to attack PCs. PDF readers like Adobe Reader need to support a large number of image formats that can be included in PDF documents. This support requires a huge code base that inevitably contains programming errors. In 2009, Adobe had to release new versions of Reader to fix bugs in the JBIG2 rendering algorithms. JBIG2 is an image compression standard supported by Adobe Reader – but Adobe’s JBIG2 decompression algorithms were found to contain buffer overflows. Malware authors discovered how to craft a specially designed JBIG2 image that would cause a buffer overflow in the decompression algorithm.
Exploit developers love to discover buffer overflows because they can often lead to arbitrary code execution. The overflows they look for are those that eventually give them control of the EIP (Extended Instruction Pointer). The EIP is a crucial register in Intel x86 microprocessors because it points to the next instruction to be executed. When exploit developers can control the value of the EIP register via a buffer overflow, they can control which instructions will be executed, and thus achieve arbitrary code execution.
But controlling the address to which the EIP points is only one element of an exploit. The malware author must also place the instructions he wants executed in memory at that address. In most malicious exploits, these instructions are shellcode that will ultimately download and execute malware. Including the shellcode in the data that triggers the vulnerability is often difficult or even impossible because of the specifics of the vulnerability, but malware authors have found a quick and dirty solution: the JavaScript heap spray.
The PDF language supports a couple of programming languages, one of which is JavaScript. PDF readers like Adobe Reader include a JavaScript interpreter. When JavaScript code is embedded inside a PDF document, it will be executed depending on the type of action that is defined. One such action is the opening of the PDF document – meaning that the PDF reader will execute the embedded JavaScript code when the PDF document is opened. This in itself is not a security issue, as the JavaScript implementation in PDF readers like Adobe Reader is sandboxed. Programs written in this JavaScript version cannot access or modify resources of the underlying operating system such as files and registry entries. JavaScript support in PDF documents is designed to augment the rendering of those documents – for example by calculating totals in order forms – and is designed to prevent alteration of system resources. This means, for example, that malware authors cannot write a JavaScript program to drop a trojan.
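Because embedded JavaScript and the actions that launch it are declared with well-known names in the PDF language, a first triage of a suspect file can be as simple as counting those names. The Python sketch below does this for /JS, /JavaScript, /OpenAction and /AA (additional actions). The function name count_keywords and the sample bytes are illustrative assumptions. The approach is also easy to evade: names can be written with #xx hexadecimal escapes and objects can be hidden inside compressed streams, neither of which this sketch handles.

```python
import re

# Names that indicate embedded JavaScript or automatically triggered actions.
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA"]


def count_keywords(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each keyword as a complete PDF name."""
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead rejects longer names that merely start with the keyword.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts


# Hypothetical fragment: a catalog whose open action runs a harmless script.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (var total = 1 + 1;) >>\nendobj\n"
)

if __name__ == "__main__":
    for name, count in count_keywords(SAMPLE).items():
        print(name, count)
```

A non-zero count is not proof of malice, since legitimate forms use JavaScript too. It only indicates that the document deserves a closer look.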
But malware authors can use JavaScript to plant the necessary shellcode for their exploit. They achieve this with heap spraying: the script creates a string that contains the shellcode preceded by a NOP sled (a long sequence of NOP instructions), and then creates a large number of copies of this string. The JavaScript interpreter allocates its string values on the heap, the memory management structure a process uses for dynamically allocated data, so creating a large number of copies effectively fills the heap with shellcode. (This is likened to spraying shellcode into the heap, hence the term ‘heap spray’.) The NOP sled means that the attacker does not need to know the exact address of the shellcode: if the EIP lands anywhere inside a sled, execution slides through the NOPs into the shellcode that follows.
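From a defender's point of view, the pattern just described leaves recognizable traces in script text that has been extracted from a document. The Python sketch below checks for three of them: a %u-escaped string passed to unescape, a variable concatenated with itself, and a loop. These three signals and the names heap_spray_indicators and looks_like_heap_spray are assumptions chosen for illustration, not a description of how any anti-virus engine works, and obfuscated scripts will not match.

```python
import re

# unescape("...%uXXXX...") is a common way to embed raw 16-bit values.
UNESCAPE_RE = re.compile(r"""unescape\s*\(\s*["'][^"']*%u[0-9a-fA-F]{4}""")
# "s += s" doubles a string, a cheap way to grow a very large block.
SELF_CONCAT_RE = re.compile(r"\b(\w+)\s*\+=\s*\1\b")
LOOP_RE = re.compile(r"\b(for|while)\s*\(")


def heap_spray_indicators(script: str) -> dict[str, bool]:
    """Report which heuristic signals are present in the script text."""
    return {
        "unescape_with_u_escapes": bool(UNESCAPE_RE.search(script)),
        "self_concatenation": bool(SELF_CONCAT_RE.search(script)),
        "loop": bool(LOOP_RE.search(script)),
    }


def looks_like_heap_spray(script: str) -> bool:
    """Flag the script only when all three signals occur together."""
    return all(heap_spray_indicators(script).values())


if __name__ == "__main__":
    # A harmless order-form style script: it loops, but shows no other signal.
    benign = "var total = 0; for (var i = 0; i < 3; i++) { total += i; }"
    print(heap_spray_indicators(benign))
    print(looks_like_heap_spray(benign))
```

Requiring all three signals together keeps ordinary form scripts, which often contain loops, from being flagged.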
Finding a vulnerability (like the JBIG2 vulnerability) in the PDF language parser is an important step towards achieving arbitrary code execution, but there is another popular tactic: finding a vulnerability in the JavaScript implementation. A well-known example is the util.printf vulnerability. Exploit developers discovered that they can take control of the EIP register by calling util.printf with a very long numerical argument (Adobe released a new version of Reader to address this vulnerability in 2008). An exploit for util.printf first uses JavaScript code to perform a heap spray, then uses JavaScript to trigger the vulnerability in util.printf.
Malicious PDF documents found in the wild follow two major exploit avenues: a JavaScript heap spray followed by the triggering of a vulnerability in the PDF language implementation, or a JavaScript heap spray followed by the triggering of a vulnerability in the JavaScript language implementation.
As JavaScript heap sprays are so often found in malicious PDF documents, disabling JavaScript support in your PDF reader is often recommended as a mitigating action. Disabling JavaScript support in Adobe Reader means that JavaScript code embedded in PDF files is not executed. Remember that this course of action does not prevent PDF language exploits, but since they often rely on JavaScript heap sprays to plant shellcode, they ultimately fail when JavaScript support is disabled.
JavaScript is not only an essential tool for malware authors developing PDF exploits, but it is also crucial for the operation of exploit kits. Exploit kits are sets of programs running on a web server that are designed to automatically infect clients. When a user is directed to a web server hosting an exploit kit, the exploit kit will serve the client with malicious PDF files, Flash files, Java files etc., all containing exploits specifically tailored to infect the machine of the unsuspecting user. The exploit kit serves many exploits to the client in the hope that at least one will be successful and take control of the targeted machine. PDF documents with embedded JavaScript code are particularly well suited for use in exploit kits, because they offer two important advantages: versatility and stealthiness.
A PDF document with embedded JavaScript code is a versatile tool for an exploit kit because it can serve many exploits inside the same PDF document and activate the one that is most likely to be successful. Adobe’s JavaScript implementation comes with a function to check the version of Adobe Reader: app.viewerVersion. This function returns the version number of the reader that has opened the PDF document and is executing the embedded JavaScript code. By using the result of this function, authors of malicious PDFs can design their JavaScript code to include several exploits and select the best one with a JavaScript 'if' statement. For example, if the version of Adobe Reader is 8.1.2, the JavaScript code for the util.printf exploit will be launched, but if the version of Adobe Reader is 8.1.3, then the JavaScript code for the Collab.getIcon exploit will be launched. Launching the JavaScript code for the util.printf exploit with version 8.1.3 or later is pointless, because the util.printf vulnerability was patched with the release of version 8.1.3.
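The same version logic is useful to defenders who need to know which machines are still exposed. The Python sketch below compares a reader version against the release that fixed util.printf (8.1.3, as stated above). The function names are made up for this example, and treating every earlier version as affected is a simplifying assumption. The sketch works on dotted version strings as they might appear in a software inventory; it makes no claim about the exact value that app.viewerVersion returns.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '8.1.2' into (8, 1, 2)."""
    return tuple(int(part) for part in text.split("."))


# Release that patched util.printf, according to the article.
UTIL_PRINTF_FIXED_IN = parse_version("8.1.3")


def is_exposed_to_util_printf(reader_version: str) -> bool:
    """Assume every version older than the fixed release is affected."""
    return parse_version(reader_version) < UTIL_PRINTF_FIXED_IN


if __name__ == "__main__":
    for version in ("8.1.2", "8.1.3", "9.0"):
        print(version, is_exposed_to_util_printf(version))
```

Comparing tuples of integers rather than strings avoids a classic mistake: as strings, '8.1.10' sorts before '8.1.3', although it is the later release.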
Malicious PDFs produced by exploit kits do not rely on app.viewerVersion alone to determine which exploit to launch. Many features in Adobe Reader are implemented via plug-ins. These plug-ins are actually DLLs that are loaded into the Adobe Reader process whenever the functionality they implement is required. JavaScript in Adobe Reader is implemented with the ECMA Script plug-in (file Escript.api). Malicious PDFs can retrieve the version number of the loaded ECMA Script plug-in by enumerating the plug-in array app.plugIns and reading the version property of the plug-in whose name property is ‘EScript’.
This versatility not only allows authors of malicious PDFs to tailor their JavaScript code to launch the most appropriate exploit for the version of Adobe Reader their file is running in, but it even allows them to target different readers with the same PDF document, provided the targeted readers support embedded JavaScript. For example, assume a malware author wants to target both Adobe Reader and Foxit Reader with the same malicious PDF. Both readers had a vulnerability in the util.printf method, but the details of the exploit for each are quite different. An exploit for Adobe’s util.printf implementation does not work for Foxit’s util.printf implementation, and vice versa. Hence the malware author needs to write JavaScript code to determine which reader opened his malicious PDF document and to launch the appropriate exploit (provided the version is vulnerable).
One method to determine which reader the JavaScript code is running in is to use a property or method that is only declared in one reader, and not in the other. For example, the Net.SOAP.wireDump property is declared in Adobe Reader, but not in Foxit Reader. When this property is accessed from JavaScript code running in Foxit Reader, an exception will be thrown, while with Adobe Reader, a boolean value will be returned. When an exception is thrown, it interrupts the running JavaScript code, but this can be prevented by catching the exception with a JavaScript try-catch statement. So, by inserting the Net.SOAP.wireDump expression inside a JavaScript try-catch statement and catching the exception, it is possible to determine which reader the JavaScript code is running in, and launch the appropriate exploit.
Exploit kit developers want to prevent anti-virus programs from detecting their exploits, so they develop kits that serve ever-changing exploits. Malicious PDF documents with embedded JavaScript code are particularly suited for this, as JavaScript can be used to obfuscate the code in an infinite number of ways. This is especially the case if exploit developers limit their malicious PDF documents to JavaScript exploits, because then all malicious code can be obfuscated.
JavaScript obfuscation is a vast subject. New techniques appear all the time, making the task of anti-virus engine developers difficult. And with JavaScript code embedded in PDF documents, there are even more obfuscation possibilities. One popular way to obfuscate JavaScript code is to split it up into different parts. Inside a PDF document there are several ways to split up JavaScript code and store the different parts. PDF document annotations are often used to split up embedded JavaScript code. Annotations allow a user of a PDF reader to annotate the document he is reading. Annotations can be text, but also text highlights and other symbols. Annotations can be made invisible, so that a user can view the original document without annotations.
Invisible annotations are used by authors of malicious PDFs to store partial JavaScript code. These snippets of code are retrieved from JavaScript code with the getAnnots method, recombined with string concatenation and then executed via the eval function. The string concatenation code is often convoluted to add to the overall obfuscation of the JavaScript code.
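The fingerprinting and obfuscation building blocks discussed so far (version checks, plug-in enumeration, annotation access and dynamic evaluation) can serve as indicators when analysing script text that has been extracted from a document. The Python sketch below searches for them by name. The indicator labels and the function find_indicators are illustrative assumptions, and the annotation pattern assumes the getAnnot and getAnnots method names of the Acrobat JavaScript API. A script that builds these names at run time through string concatenation will not be caught.

```python
import re

# Indicator label -> pattern searched for in the extracted script text.
INDICATORS = {
    "version check": r"\bapp\s*\.\s*viewerVersion\b",
    "plug-in enumeration": r"\bapp\s*\.\s*plugIns\b",
    "annotation access": r"\bgetAnnots?\b",
    "dynamic evaluation": r"\beval\s*\(",
}


def find_indicators(script: str) -> list[str]:
    """Return the labels of all indicators present, in declaration order."""
    return [
        label
        for label, pattern in INDICATORS.items()
        if re.search(pattern, script)
    ]


if __name__ == "__main__":
    # Harmless example: a form that merely checks the viewer version.
    sample = 'if (app.viewerVersion >= 9) { app.alert("Please update."); }'
    print(find_indicators(sample))
```

None of these calls is malicious in itself. It is the combination, for example annotation access feeding straight into eval, that is worth flagging for manual analysis.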
One last obfuscation technique that deserves a mention is encryption. PDF documents can be encrypted for two reasons: for digital rights management and for confidentiality. When a PDF document is encrypted, the structure of the document remains unchanged – the structure is not encrypted, but the content is. This means that objects and their properties remain unencrypted, while the strings and streams stored inside objects (the actual content) are encrypted. PDF documents are encrypted with a key derived (amongst other elements) from a user password and a hashed owner password. The hashed user and owner passwords are stored inside the PDF document. If the user password is empty, the key can be derived completely from elements stored inside the PDF document, and thus the user does not need to provide a password to view the PDF document. In other words, a PDF document that is protected with an owner password only (for DRM reasons, like disabling printing) is ‘obfuscated’ by the encryption, but can be decrypted (hence viewed) without a password. Anti-virus products that need to ‘deobfuscate’ such PDF documents must therefore be able to decrypt them.
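To make the key derivation concrete, here is a Python sketch of the procedure that the PDF specification describes for revisions 2 to 4 of the standard security handler (RC4 and 128-bit AES), as far as I understand it. The function and parameter names are my own. The caller is assumed to have already read the /O entry, the /P permissions value, the first element of the file /ID and the key length from the document. Later revisions use a different scheme that is not covered here.

```python
import hashlib
import struct

# 32-byte padding string defined for the standard security handler.
PASSWORD_PADDING = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def derive_file_key(user_password: bytes, owner_entry: bytes,
                    permissions: int, first_file_id: bytes,
                    revision: int, key_length_bytes: int,
                    encrypt_metadata: bool = True) -> bytes:
    """Derive the file encryption key from the password and document data."""
    # Pad or truncate the password to exactly 32 bytes.
    padded = (user_password + PASSWORD_PADDING)[:32]
    digest = hashlib.md5()
    digest.update(padded)
    digest.update(owner_entry)  # hashed owner password (/O entry)
    # /P is a signed 32-bit value; hash it as 4 little-endian bytes.
    digest.update(struct.pack("<I", permissions & 0xFFFFFFFF))
    digest.update(first_file_id)
    if revision >= 4 and not encrypt_metadata:
        digest.update(b"\xff\xff\xff\xff")
    key = digest.digest()
    if revision >= 3:
        # Revision 3 and later re-hash the truncated key 50 times.
        for _ in range(50):
            key = hashlib.md5(key[:key_length_bytes]).digest()
    return key[:key_length_bytes]


if __name__ == "__main__":
    # Placeholder inputs; real values come from the document being analysed.
    key = derive_file_key(b"", bytes(32), -44, bytes(16), 3, 16)
    print(len(key))
```

With an empty user password, every input to the function comes from the document itself, which is why such a file can be decrypted by any reader, or by any scanner, without prompting the user.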
Malicious PDF documents are used on a large scale to infect Windows PCs. This trend started several years ago, with mass mailings of malicious PDF documents, and is likely to continue for several years to come because of the versatility and stealthiness that PDF documents offer to exploit kit developers.