2010-11-01
Abstract
Last year, a series of articles described some tricks that might become common in the future, along with some countermeasures. In this final article of the series we look at anti-unpacking by anti-emulating.
Copyright © 2010 Virus Bulletin
New anti-unpacking tricks continue to be developed as older ones are constantly being defeated. Last year, a series of articles described some tricks that might become common in the future, along with some countermeasures [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14].
In this final article of the series we look at anti-unpacking by anti-emulating.
Unless stated otherwise, all of the techniques described here were discovered and developed by the author.
On Windows XP and later versions (but only on 32-bit platforms), if the CPU supports the SYSEXIT instruction, then on return from an INT 2Eh system call Windows will leave in the EDX register the address of the next instruction to execute.
Example code looks like this:
        ;any value will work
        ;but requires user32.dll loaded
        or eax, -1
        int 2eh
    l1: cmp edx, offset l1
        jne being_debugged
The reason for this is obscure. This disassembly of the kernel's return path shows more:
        test d [esp+4], 1 ;ring check
    l1: jne l2 ;taken if ring 3
        ...
    l2: iretd ;return to caller
    l3: test b [esp+9], 1 ;check T flag
        jne l2 ;use iret if set
        pop edx ;edx=eip
        add esp, 4 ;discard cs
        and b [esp+1], -3 ;clear I flag
        popfd ;load flags
        pop ecx ;ecx=esp
        sti
        sysexit ;fast return to caller
In this disassembly, there is no reference to either l1 or l3. However, what cannot be seen here is that code exists elsewhere in the kernel, which checks for CPU support for the SYSEXIT instruction. If such support exists, then the kernel adjusts the value at l1+1 such that the branch reaches l3 instead of l2.
When the patch is active, the only way to reach l2 is if the T flag is set. In all other cases, the faster SYSEXIT instruction is used instead of the IRETD instruction. As a side effect of that change, the EDX value always contains the EIP value on return.
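The exit logic described above can be summarised in a few lines. The following Python sketch is a model only, not kernel code: the function name and its three parameters are hypothetical, and it assumes that the branch at l1 has either been patched (SYSEXIT supported) or left alone.

```python
def return_path(from_ring3: bool, branch_patched: bool, trap_flag: bool):
    """Model the exit path of the listing.

    Returns (instruction, edx_holds_eip). All names are illustrative.
    """
    if not from_ring3:
        # The ring check falls through to the code elided as '...'.
        return ("other", False)
    if not branch_patched:
        # Unpatched branch at l1 goes straight to l2.
        return ("iretd", False)
    if trap_flag:
        # Patched branch reaches l3, which falls back to l2 if T is set.
        return ("iretd", False)
    # l3 path: 'pop edx' loads the return EIP before SYSEXIT executes.
    return ("sysexit", True)
```

Under this model, a debugger or emulator that single-steps the INT 2Eh (T flag set), or that always emulates the IRETD path, would not reproduce the EDX side effect that the example code checks.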
Interestingly, Windows 2000 contains similar code, as we can see in this disassembly:
        ;check for SYSEXIT support
        ;internal flag, not CPUID value
        test d [xxxxxxxx], 1000
        je l1 ;taken if not supported
        test d [esp+4], 1 ;ring check
        je l1 ;taken if ring 0
        ...
        pop edx ;edx=eip
        add esp, 8 ;discard cs, eflags
        pop ecx ;ecx=esp
        sti
        sysexit ;fast return to caller
    l1: iretd ;return to caller
Here, a variable is checked instead of using an altered branch. It has poorer performance, but it avoids the in-memory patch. However, the code that queries the CPU capabilities does not contain any code to enable this feature. As a result, the SYSEXIT path is never reached.
There is an additional unexpected behaviour in the 32-bit version of Windows Vista and later Windows versions. If the value in the EAX register exceeds the size of the standard service table, then Windows will call through the ntdll KiUserCallbackDispatcher() function, which in turn calls through the PEB->KernelCallbackTable table. The index that is used depends on the Windows version. For Windows Vista, the index is currently 0x4c, and for Windows 7, the index is currently 0x4a. These values could change in the future, but it is trivial to fill a table that can support any value. This technique could be used to redirect execution in an obfuscated manner for those platforms.
Example code looks like this:
        call GetVersion
        cmp al, 6
        jb l1 ;not Vista+
        push offset l2
        call GetModuleHandleA
        push offset l3
        push eax
        call GetProcAddress
        xchg ecx, eax
        jecxz l1 ;not supported
        push eax
        push esp
        push -1 ;GetCurrentProcess()
        call ecx
        pop ecx
        loop l1 ;taken if not WOW64
        mov eax, fs:[ecx+30h]
        mov d [eax+2ch], offset l4
        int 2eh
        jmp being_debugged
    l1: ...
    l2: db "kernel32", 0
    l3: db "IsWow64Process", 0
    l4: dd 4ah dup (0)
        dd offset l1 ;Windows 7
        dd 0
        dd offset l1 ;Vista
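The remark that a table can be filled to support any index can be illustrated with a short sketch. The Python below only models the data layout: it builds a table of 32-bit pointers in which every slot holds the same handler address, so the lookup succeeds whichever index a given Windows version uses. The function names, the table size of 0x100 entries and the handler address are assumptions made for the illustration.

```python
import struct

def build_callback_table(handler: int, entries: int = 0x100) -> bytes:
    """Pack `entries` little-endian 32-bit pointers, all equal to `handler`."""
    return struct.pack("<%dI" % entries, *([handler] * entries))

def lookup(table: bytes, index: int) -> int:
    """Read the 32-bit pointer stored in slot `index` of the table."""
    return struct.unpack_from("<I", table, index * 4)[0]

# Hypothetical handler address, used only for the demonstration.
HANDLER = 0x00401000
table = build_callback_table(HANDLER)
```

With such a table, `lookup(table, 0x4a)` and `lookup(table, 0x4c)` both yield the handler address, as would any other index below the chosen table size.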
The operand-size override (0x66) can be used on instructions that transfer control. The result is that the EIP register is truncated to a 16-bit value. Execution resumes (if possible) from the resulting address.
Example code looks like this:
        xor ebx, ebx
        push 40h
        mov eax, esp
        push 3000h
        push esp
        push ebx
        push eax
        push -1 ;GetCurrentProcess()
        call NtAllocateVirtualMemory
        xchg ecx, eax
        db 66h
        jecxz l1
    l1: ...
In this example, execution continues at the address (l1&0xffff). This technique works with all types of branch – the 7x form and the 0f xx form.
Example code looks like this:
        xor ebx, ebx
        push 40h
        mov eax, esp
        push 3000h
        push esp
        push ebx
        push eax
        push -1 ;GetCurrentProcess()
        call NtAllocateVirtualMemory
        test eax, eax
        db 66h
        je l1
    l1: ...
In this example, execution continues at the address (l1&0xffff). This technique also works with relative calls and relative jumps.
Example code looks like this:
        xor ebx, ebx
        push 40h
        mov eax, esp
        push 3000h
        push esp
        push ebx
        push eax
        push -1 ;GetCurrentProcess()
        call NtAllocateVirtualMemory
        call l1 ;determine eip
    l1: pop ax ;discard low 16 bits
        call small l2
    l2: ...
    l3: ...
As with the previous example, execution continues at a truncated address, in this case (l2&0xffff), the target of the 16-bit call. However, unlike the previous example, this one can return to l3, with a balanced stack, simply by executing a 32-bit RET instruction.
Note the explicit mention of a ‘32-bit RET instruction’. This is important because the technique also works with all types of return (near and far).
Example code looks like this:
        xor ebx, ebx
        push 40h
        mov eax, esp
        push 3000h
        push esp
        push ebx
        push eax
        push -1 ;GetCurrentProcess()
        call NtAllocateVirtualMemory
        push small offset l1
        db 66h
        ret
    l1: ...
As in the last example, execution continues at the address (l1&0xffff). Finally, this technique also works with the IRET instruction.
Example code looks like this:
        xor ebx, ebx
        push 40h
        mov eax, esp
        push 3000h
        push esp
        push ebx
        push eax
        push -1 ;GetCurrentProcess()
        call NtAllocateVirtualMemory
        pushfw
        push small cs
        push small offset l1
        iretw
    l1: ...
As with the previous example, execution continues at the address (l1&0xffff).
Since this is a most uncommon use of the operand-size override, it is possible that some emulators will not support it.
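For an emulator author, the behaviour comes down to two small rules, which the following Python sketch tries to capture. It is a model under stated assumptions, not a description of any particular emulator: the function names are hypothetical, `rel` is taken to be the already-decoded signed displacement, and the addresses used in the usage note are made up.

```python
import struct

def branch_target(next_eip: int, rel: int, override66: bool) -> int:
    """Target of a relative branch, call or jump in 32-bit code.

    With the 0x66 prefix the resulting EIP is truncated to 16 bits.
    """
    target = (next_eip + rel) & 0xFFFFFFFF
    return target & 0xFFFF if override66 else target

def call_small_then_ret32(l1: int, ret_addr: int):
    """Model 'call l1 / pop ax / call small ...' followed by a 32-bit RET.

    Returns (eip_after_ret, bytes_left_on_stack).
    Index 0 of the bytearray is the top of the stack.
    """
    stack = bytearray()
    stack[0:0] = struct.pack("<I", l1)                 # call l1 pushes 32 bits
    del stack[0:2]                                     # pop ax drops the low 16
    stack[0:0] = struct.pack("<H", ret_addr & 0xFFFF)  # call small pushes 16 bits
    (eip,) = struct.unpack_from("<I", stack, 0)        # 32-bit RET pops 4 bytes
    del stack[0:4]
    return eip, len(stack)
```

For instance, `branch_target(0x00401005, 0, True)` gives 0x1005, while `call_small_then_ret32(0x00401020, 0x00401026)` gives (0x00401026, 0): the high half left behind by POP AX and the 16-bit return address pushed by the call combine into a full 32-bit address, provided both lie in the same 64KB region.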
The CPU supports the running of multiple tasks. Each of those tasks has access to various resources such as the CPU registers and the FPU. However, when a task switch occurs, the CPU saves only the CPU registers and none of the FPU state. Instead, the CPU sets the ‘TS’ (Task Switched) bit in a control register, which signifies that a task switch has occurred. Whenever the CPU encounters an FPU, MMX, or SSE instruction (with a few exceptions), it checks the state of that bit. If the bit is set, then the CPU checks the state of the ‘MP’ (Monitor Coprocessor) bit. This bit is under software control. If it is also set, then the CPU raises an ‘NM’ (Device Not Available, also known as No Math coprocessor) exception. The software-based task manager intercepts that exception and saves the state of the FPU, MMX and SSE environment prior to clearing the TS bit to avoid a redundant save.
The reason the MP bit exists is to avoid the relatively large overhead of saving the FPU state in the event that it is entirely unnecessary because a task did not use the FPU at all. There is also the possibility that several related tasks might share the FPU. In such a case, the task manager can also clear the MP bit to avoid a redundant save.
The task-switching behaviour can be exploited as an anti-emulation trick. Specifically, a process can execute an FPU instruction, thus causing the NM exception to be raised and the FPU state to be saved. The task manager will clear the TS bit in response to this event, and potentially clear the MP bit too. After some time passes and other tasks are executed, the task manager will set the MP bit again if it was cleared, and the processor will set the TS bit again. This cycle will continue until eventually the process resumes execution. At that time, the two bits should be set. A process can detect this cycle.
Example code looks like this:
        wait ;raise NM
    l1: smsw ax
        and al, 0ah
        cmp al, 0ah
        je l1 ;wait while TS and MP
    l2: smsw ax
        test al, 2 ;wait for MP
        je l2
        test al, 8 ;check for TS
        je being_debugged
This technique is used by Waledac. However, it does not work on the 64-bit versions of Windows. Specifically, the loop at l2 never exits, because the MP bit is never set again for the process.
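The two polling loops in the example above can be replayed offline against a list of machine status word readings, which makes the three possible outcomes easier to see. The Python sketch below is a model only: the function name, the return strings and the sample readings are hypothetical, and 'stuck' stands for a loop that would spin forever on real hardware.

```python
MP = 0x02  # MP bit, as tested by 'test al, 2'
TS = 0x08  # TS bit, as tested by 'test al, 8'

def detect(samples):
    """Replay the l1 and l2 loops over successive SMSW readings."""
    it = iter(samples)
    # l1: keep polling while both TS and MP are still set.
    for msw in it:
        if msw & (TS | MP) != (TS | MP):
            break
    else:
        return "stuck"
    # l2: poll until MP is set again, then test TS in the same reading.
    for msw in it:
        if msw & MP:
            return "clean" if msw & TS else "debugged"
    return "stuck"
```

A sequence such as `[0x0A, 0x00, 0x00, 0x0A]` models the expected cycle and yields 'clean'; `[0x0A, 0x00, 0x02]` yields 'debugged'; and a sequence in which MP never returns, such as `[0x0A, 0x08, 0x08]`, yields 'stuck', which corresponds to the behaviour described for the 64-bit versions of Windows.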
There are some common methods in shellcode for finding the value of the EIP register using instructions that contain no bytes with a value of zero. One of those methods uses an FPU instruction.
Example code looks like this:
    l1: fldz
        fnstenv [esp-0ch]
        pop eax
    l2: ...
When l2 is reached, the value in the EAX register will be the address of l1. Thus, given the following code, it seems reasonable to assume that the branch at l3 will never be taken:
    l1: fldz
        fnstenv [esp-0ch]
        pop eax
    l2: cmp eax, offset l1
    l3: jne being_debugged
However, this is an invalid assumption. In VirtualPC, single-stepping over the fldz instruction results in a completely different value in the EAX register. The cause is unknown at the time of writing, but the value appears to be a constant (0x74b036). This means that the code could be altered in a very subtle way.
Example code looks like this:
        org 74b035h
    l1: fldz
        fnstenv b [esp-0ch]
        pop eax
        dec b [eax+(offset l2-offset l1)-1]
        mov eax, offset l3+01000000h
    l2: mov ecx, offset being_debugged
        jmp eax
    l3: ;...
If the code executes freely, then execution continues from l3. However, single-stepping over the fldz instruction causes the ‘mov ecx’ instruction to become a ‘mov eax’ instruction, thus causing execution to resume from being_debugged.
That is a very subtle anti-debugging trick indeed.
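Both halves of the trick can be checked with a small model. The Python sketch below first shows why storing the 28-byte FPU environment at esp-0ch leaves the FPU instruction pointer (offset 0ch in the 32-bit protected-mode layout) at [esp], and then applies the DEC to a byte image of the listing for the two possible EAX values. The byte encodings are one plausible assembly of the listing and are an assumption, as are the function names and the address chosen for being_debugged.

```python
import struct

BASE = 0x74B035               # 'org' value from the listing, address of l1
BEING_DEBUGGED = 0x0074C000   # hypothetical address, for illustration only
L2_OFF, L3_OFF = 15, 22       # offsets of l2 and l3 in the assumed encoding
L3 = BASE + L3_OFF

def pop_after_fnstenv(fip: int) -> int:
    """Store a 28-byte environment at esp-0ch, then pop a dword from [esp]."""
    esp, stack, env = 0x40, bytearray(0x80), bytearray(28)
    struct.pack_into("<I", env, 0x0C, fip)   # FPU instruction pointer field
    stack[esp - 0x0C:esp - 0x0C + 28] = env
    return struct.unpack_from("<I", stack, esp)[0]

CODE = (bytes.fromhex("D9EE")                            # fldz
        + bytes.fromhex("D97424F4")                      # fnstenv [esp-0ch]
        + bytes.fromhex("58")                            # pop eax
        + bytes([0xFE, 0x48, L2_OFF - 1])                # dec b [eax+(l2-l1)-1]
        + b"\xB8" + struct.pack("<I", L3 + 0x01000000)   # mov eax, l3+01000000h
        + b"\xB9" + struct.pack("<I", BEING_DEBUGGED)    # l2: mov ecx, imm32
        + b"\xFF\xE0")                                   # jmp eax

def jump_target(eax_after_pop: int) -> int:
    """Apply the DEC for the given EAX, then follow the two MOVs to JMP EAX."""
    image = bytearray(CODE)
    idx = eax_after_pop + (L2_OFF - 1) - BASE
    image[idx] = (image[idx] - 1) & 0xFF
    eax = struct.unpack_from("<I", image, 11)[0]   # immediate of 'mov eax'
    if image[L2_OFF] == 0xB8:                      # 'mov ecx' became 'mov eax'
        eax = struct.unpack_from("<I", image, L2_OFF + 1)[0]
    return eax
```

In this model, `jump_target(BASE)` returns the address of l3, because the DEC removes the 01000000h bias from the immediate, whereas `jump_target(BASE + 1)` returns the address of being_debugged, because the DEC lands on the opcode byte at l2 and turns B9 into B8.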
The text of this paper was produced without reference to any Microsoft source code or personnel.