In addition to main memory, processors have a small amount of local storage called registers. Registers are extremely fast: reading from memory is many times slower than reading from a register. Programmers in high-level languages are often unaware of (or unconcerned about) the existence of registers, but in assembly (and hence in reversing) registers become highly visible and important. Most of the processor's instructions manipulate registers, and intermediate results are often held in them temporarily.
For the eax, ebx, ecx, and edx registers, subsections may be used. For example, the least significant 2 bytes of eax can be treated as a 16-bit register called ax. The least significant byte of ax can be used as a single 8-bit register called al, while the most significant byte of ax can be used as a single 8-bit register called ah. These names refer to the same physical register. When a two-byte quantity is placed into dx, the update affects the value of dh, dl, and edx. These sub-registers are mainly hold-overs from older, 16-bit versions of the instruction set. However, they are sometimes convenient when dealing with data that is smaller than 32 bits.
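The aliasing described above can be sketched in Python by treating a register as a plain integer and deriving each sub-register with masks and shifts. This is only a model for illustration: the helper names (`ax`, `al`, `ah`, `write_ax`) are made up here and are not part of any assembler or debugger.

```python
# Toy model: a 32-bit register is an integer, sub-registers are bit fields of it.

def ax(eax):
    """Low 16 bits of eax."""
    return eax & 0xFFFF

def al(eax):
    """Low 8 bits of eax (the low byte of ax)."""
    return eax & 0xFF

def ah(eax):
    """Bits 8..15 of eax (the high byte of ax)."""
    return (eax >> 8) & 0xFF

def write_ax(eax, value):
    """Writing ax replaces only the low 16 bits; the upper 16 bits are kept."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)

eax = 0x12345678
print(hex(ax(eax)), hex(ah(eax)), hex(al(eax)))  # 0x5678 0x56 0x78

eax = write_ax(eax, 0xBEEF)
print(hex(eax))                                  # 0x1234beef
```

After the write to ax, al and ah read back as 0xEF and 0xBE, which mirrors the statement that all of these names refer to the same physical register.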
Registers are also expensive to build, so every processor has only a limited number of them. The number of registers is fixed by the architecture, and each register has a name defined by the x86 instruction set. Below is a list of the general purpose registers (GPRs) available in 32-bit x86 processors, followed in the last row by two special-purpose registers, the instruction pointer EIP and the flags register EFLAGS:
Register name | Register name |
---|---|
EAX | ESI |
EBX | EDI |
ECX | ESP |
EDX | EBP |
EIP | EFLAGS |
The registers EAX, ECX, and EDX are considered volatile (caller-saved): when a function is called, their values are not preserved across the call.
The return value of a function is stored in the EAX register.
Instructions are operations performed by the CPU.
Each assembly language instruction consists of an opcode and zero or more operands. The opcode names the operation that is executed by the CPU, and the operands are the data or memory locations used to execute that instruction.
An x86 instruction can have zero to three operands. Operands are separated by commas (,). In the Intel syntax used throughout this text, for instructions with two operands, the first (lefthand) operand is the destination operand, and the second (righthand) operand is the source operand (that is, destination<-source).
Operands can be immediate (i.e., constant expressions that evaluate to a value), register (a value in one of the processor's registers), or memory (a value stored in memory).
The terms used to describe sizes in the x86 architecture are:
byte
: 8 bits
word
: 2 bytes
dword
: 4 bytes (stands for "double word")
Let's use the notation below to describe the instructions
opcode operand_1, operand_2
(instruction with 2 operands)
operand_1 - destination operand
operand_2 - source operand
opcode operand_1
(instruction with a single operand)
operand_1 - destination operand
These are the most common instructions in x86
mov
mov operand_1, operand_2
movsx
movsx operand_1, operand_2
movzx
movzx operand_1, operand_2
lea (load effective address)
lea operand_1, operand_2
pop
pop operand_1
push
push operand_1
sub
sub operand_1, operand_2
add
add operand_1, operand_2
div
div operand_1
mul
mul operand_1
dec
dec operand_1
inc
inc operand_1
and
and operand_1, operand_2
or
or operand_1, operand_2
not
not operand_1
xor
xor operand_1, operand_2
SHL
SHL operand_1, count
SHR
SHR operand_1, count
Note for SHL/SHR:
These instructions shift the bits of their first operand left and right respectively, padding the resulting empty bit positions with zeros. The operand can be shifted by up to 31 places. The number of bits to shift is specified by the second operand, which can be either an 8-bit constant or the register CL. In either case, shift counts greater than 31 are taken modulo 32.
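The shift behaviour in the note above can be modelled in a few lines of Python. This is a sketch for 32-bit operands only; `shl32` and `shr32` are illustrative names, and CPU flags are not modelled.

```python
MASK32 = 0xFFFFFFFF

def shl32(value, count):
    """Model of SHL on a 32-bit operand."""
    count &= 31                        # only the low 5 bits of the count are used
    return (value << count) & MASK32   # bits shifted past bit 31 are discarded

def shr32(value, count):
    """Model of SHR (logical shift right) on a 32-bit operand."""
    count &= 31
    return (value & MASK32) >> count   # zeros enter from the left

print(hex(shl32(0x80000001, 1)))   # 0x2
print(hex(shr32(0x80000001, 1)))   # 0x40000000
print(hex(shl32(1, 33)))           # 0x2, because 33 modulo 32 is 1
```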
cmp
cmp operand_1, operand_2
jmp
jmp operand_1
test
test operand_1, operand_2
nop
nop
When we were loading a 32-bit register, the assembler could infer that the region of memory we were referring to was 4 bytes wide. When we were storing the value of a one byte register to memory, the assembler could infer that we wanted the address to refer to a single byte in memory.
However, in some cases the size of a referred-to memory region is ambiguous. Consider the instruction mov [ebx], 2.
Should this instruction move the value 2 into the single byte at address EBX? Perhaps it should move the 32-bit integer representation of 2 into the 4 bytes starting at address EBX. Since either is a valid interpretation, the assembler must be explicitly directed as to which is correct. The size directives BYTE, WORD, and DWORD serve this purpose, indicating sizes of 1, 2, and 4 bytes respectively.
mov BYTE [ebx], 2 ; Move 2 into the single byte at the address stored in EBX.
mov WORD [ebx], 2 ; Move the 16-bit integer representation of 2 into the 2 bytes starting at the address in EBX.
mov DWORD [ebx], 2 ; Move the 32-bit integer representation of 2 into the 4 bytes starting at the address in EBX.
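To see why the size matters, the three stores can be imitated in Python with the standard `struct` module, writing the value 2 into a 4-byte buffer that is pre-filled with 0xff so that untouched bytes stay visible. The helper names `store` and `demo` are made up for this sketch, and offset 0 of the buffer stands in for the address held in EBX.

```python
import struct

# Little-endian unsigned formats for 1, 2 and 4 bytes (BYTE, WORD, DWORD).
FORMATS = {1: "<B", 2: "<H", 4: "<I"}

def store(memory, address, value, size):
    """Write `value` into `memory` at `address` using exactly `size` bytes."""
    struct.pack_into(FORMATS[size], memory, address, value)

def demo(size):
    memory = bytearray(b"\xff" * 4)   # pre-fill so untouched bytes are visible
    store(memory, 0, 2, size)
    return memory.hex()

print(demo(1))  # 02ffffff  (BYTE:  one byte changed)
print(demo(2))  # 0200ffff  (WORD:  two bytes changed)
print(demo(4))  # 02000000  (DWORD: four bytes changed)
```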
Let's understand the difference between the two instructions below
mov eax, [ebp - 4]
In the above instruction, 4 is subtracted from the value of ebp, and the brackets indicate that the result is treated as an address; the value stored at that address is loaded into eax.
lea eax, [ebp - 4]
Here, 4 is subtracted from the value of ebp and the result itself is stored in eax. This instruction only calculates the address and stores it in the destination register; it does not access memory.
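The difference can be made concrete with a toy model in which registers are a dictionary and memory maps addresses to 32-bit values. Everything here is an assumption for illustration: the function names, the value 0x1000 for ebp, and the value 42 stored at ebp - 4.

```python
def mov_load(regs, memory, dst, base, offset):
    """mov dst, [base + offset]: compute the address, then read memory there."""
    address = regs[base] + offset
    regs[dst] = memory[address]

def lea(regs, dst, base, offset):
    """lea dst, [base + offset]: compute the address and stop (no memory access)."""
    regs[dst] = regs[base] + offset

regs = {"eax": 0, "ebp": 0x1000}
memory = {0x0FFC: 42}            # hypothetical local variable at ebp - 4

mov_load(regs, memory, "eax", "ebp", -4)
print(regs["eax"])               # 42

lea(regs, "eax", "ebp", -4)
print(hex(regs["eax"]))          # 0xffc
```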
Endianness is the order in which the bytes of a multi-byte value are stored in memory.
There are two ways of doing it: big endian and little endian.
BIG ENDIAN
Storing data starting from the most significant byte
address = 0x1000 value = 0x10203040
0x1000 : 0x10
0x1001 : 0x20
0x1002 : 0x30
0x1003 : 0x40
LITTLE ENDIAN
Storing data starting from the least significant byte
address = 0x1000 value = 0x10203040
0x1000 : 0x40
0x1001 : 0x30
0x1002 : 0x20
0x1003 : 0x10
The endianness of a computer is specific to its architecture.
The Intel x86 architecture is little endian.
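The two layouts can be reproduced with Python's built-in `int.to_bytes`. The `layout` helper and the base address 0x1000 are illustrative only; `sys.byteorder` reports the byte order of whatever machine runs the script, so no particular output is claimed for that line.

```python
import sys

value = 0x10203040

big = value.to_bytes(4, "big")         # most significant byte first
little = value.to_bytes(4, "little")   # least significant byte first

def layout(data, base=0x1000):
    """Pair each byte with the (hypothetical) address it would occupy."""
    return [(hex(base + i), hex(b)) for i, b in enumerate(data)]

print(big.hex())       # 10203040
print(little.hex())    # 40302010
print(layout(little))  # [('0x1000', '0x40'), ('0x1001', '0x30'), ('0x1002', '0x20'), ('0x1003', '0x10')]
print(sys.byteorder)   # byte order of the host running this script
```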
BITS 32 ;1
extern printf ;2
section .rodata ;3
hello_world: db "Hello, world!", 10, 0 ;4
section .text ;5
global main ;6
main: ;7
push ebp ;8
mov ebp, esp ;9
push hello_world ;10
call printf ;11
add esp, 4 ;12
mov eax, 0 ;13
mov esp, ebp ;14
pop ebp ;15
ret ;16
Any characters after ';' are considered a comment and are never processed: they are simply discarded.
In line 1, we declare that this is a 32-bit assembly program (remember that a 64-bit version also exists; 32-bit programs are not compatible with 64-bit ones, so it is good to declare upfront which version the program targets).
In line 2, we declare the external function printf so that we can use it in the program. If you are familiar with the extern keyword in C, it is more or less the same. If not, you can think of extern as loosely similar to the #include directive in C: before using a library function such as printf or scanf in C, we include a header file such as stdio.h, which gives the compiler the declarations of those functions. Similarly, by using extern here, we tell the assembler that the function printf is not defined in this file and is present elsewhere. If you wish to use any other library function, you can declare it in the same way as printf and call it just like in C.
In line 3, we inform the assembler that the following lines declare the contents of the rodata section of the binary. The rodata section consists of global variables and constants that cannot be modified during execution (rodata stands for "Read Only Data"). Typically, most strings hardcoded into the program are placed in the rodata section. All of this was done for you by the compiler when compiling C programs, so you did not have to worry about how it is performed.
Here, the rodata section in the example program has only 1 member, named "hello_world". By looking at the declaration, we can infer that it is a string, since it is enclosed in double quotes. There are 3 directives to declare values in nasm: db (declare byte), dw (declare word), and dd (declare dword).
The same directives are used for declaring arrays of values of the above types as well: there is no distinction. This is apparent from line 4: we are declaring a string, which is an array of characters, or bytes. Thus, hello_world is actually the address of an array of characters, which together form the string "Hello, world!". You can also specify characters by their ASCII values instead of typing them out, as shown in the declaration of hello_world. Notice that at the end of the string there are two values, 10 and 0. These are the ASCII values of the newline ('\n' in C) and the null character ('\0' in C). Thus, line 4 is roughly equivalent to the following C code:
char *hello_world = "Hello, world!\n";
Similar to line 3, line 5 declares that the text section starts here. The text section contains the program code, i.e. all the functions that you wrote. In the following line, line 6, the symbol main is declared as global. This is necessary so that gcc can find the function main when linking the object file to generate the executable. If it is missing, gcc will fail to generate the executable file.
In line 7, the function main is defined. This is how functions are defined in nasm assembly code: the name of the function followed by ':'. This main function is the same as the C function main, which is the first function that gets executed. If you carefully analyse the lines that follow (lines 8 to 16), you can see that the assembly code is divided into 3 groups of instructions. The first group is the function prologue, the set of instructions executed at the start of the function; the second group contains the body of the function; and the final group is the function epilogue, the set of instructions executed at the end of the function just before it returns to the caller. In order to understand them, we need to learn about the program stack and how it is used during execution.
Building the program is a two-step process. In the first step, we use nasm to convert the assembly source into an object file, a file that gcc understands.
say file_name is hello_world.asm
nasm -f elf hello_world.asm
In the above command, we ask nasm to generate an ELF object file. ELF is the standard file format used on Linux-based operating systems. The above command will generate a file called "hello_world.o". Now, we use gcc to convert "hello_world.o" into an executable file as follows:
gcc -m32 hello_world.o -o hello_world.out
To run the executable we do
./hello_world.out
Calling Conventions in x86
syscall number | return value | arg1 | arg2 | arg3 | arg4 | arg5 | arg6 |
---|---|---|---|---|---|---|---|
eax | eax | ebx | ecx | edx | esi | edi | ebp |
syscall info to read and write a string
syscall name | syscall number | arg1 | arg2 | arg3 | arg4 | arg5 | arg6 |
---|---|---|---|---|---|---|---|
read | 3 | unsigned int fd | char *buf | size_t count | - | - | - |
write | 4 | unsigned int fd | const char *buf | size_t count | - | - | - |
Here fd is the file descriptor, buf is a pointer to the string, and count is the length of the string in bytes.
Under normal conditions, every UNIX program has three file streams opened for it when it starts up: one for input, one for output, and one for printing error messages.
0 is assigned to stdin.
1 is assigned to stdout.
2 is assigned to stderr.
BITS 32
section .data
msg db 'Hello, world!',10,0
len equ $ - msg ; gives length of the string msg
section .text
global main
main:
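mov edx,len ; arg3: number of bytes to write
mov ecx,msg ; arg2: address of the buffer
mov ebx,1 ; arg1: file descriptor 1 (stdout)
mov eax,4 ; syscall number 4 (write)
int 0x80 ; invoke the kernel
mov eax,1 ; syscall number 1 (exit)
int 0x80 ; exit; ebx still holds 1, so the exit status is 1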
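For comparison, the same write can be issued from Python with `os.write`, which is a thin wrapper over the write system call and takes the same file descriptor and buffer. This is only an analogy to the assembly above, not a translation of it: `write_message` is a made-up helper, and the message is sent without the trailing 0 byte, which write does not need because the length is passed explicitly.

```python
import os

def write_message(fd, message):
    """Call write(fd, buf, count) and return the number of bytes written."""
    return os.write(fd, message)

msg = b"Hello, world!\n"
written = write_message(1, msg)   # file descriptor 1 is stdout
```

The return value plays the role of eax after the syscall: it holds the number of bytes that were actually written.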
In this example we'll be using functions from the standard C library.
Before a function is called, its arguments are pushed onto the stack in reverse order.
BITS 32
extern printf
section .data
hello_world: db "Hello, world!", 10, 0
section .text
global main
main:
push ebp
mov ebp, esp
push hello_world
call printf
add esp, 4
mov eax, 0
mov esp, ebp
pop ebp
ret
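The argument passing used around call printf can be sketched with a toy stack in Python. This model is an assumption-laden simplification: the `Stack` class, the starting esp of 0x2000, and the sample arguments are invented for illustration, and the return address that call itself pushes is left out.

```python
class Stack:
    """Toy 32-bit stack: esp drops by 4 on every push."""

    def __init__(self, esp=0x2000):
        self.esp = esp
        self.memory = {}

    def push(self, value):
        self.esp -= 4
        self.memory[self.esp] = value

def call_with_args(stack, args):
    """Push args right to left, read them back as the callee would, then clean up."""
    for arg in reversed(args):
        stack.push(arg)
    # Now [esp] holds the first argument, [esp + 4] the second, and so on.
    seen = [stack.memory[stack.esp + 4 * i] for i in range(len(args))]
    stack.esp += 4 * len(args)   # caller cleanup, like add esp, 4 for one argument
    return seen

stack = Stack()
print(call_with_args(stack, ["fmt", 1, 2]))  # ['fmt', 1, 2]
print(hex(stack.esp))                        # 0x2000
```

Pushing in reverse order is what makes the first argument end up at the lowest address, closest to the top of the stack.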
Non-stripped binaries have symbol and debugging information built into them, whereas stripped binaries have had this information removed. It is not necessary for execution, so removing it reduces the size of the binary.
These are the basic gdb commands
disas main (disassembles main)
set disassembly-flavor intel
r (runs the program)
b main (sets a breakpoint at first instruction of main)
ni (executes the next instruction)
c (continue execution until next breakpoint)
i r (info registers)
p $register (prints out the value in the register)
x/NFU (examine memory)
N - number
F - format
U - Unit
To get more info about a command, type help <command> inside gdb
Let's understand these commands with examples
After downloading the above file, go to the directory where the file is present and type chmod +x crackme in the terminal to make the file executable.
To run a file we type ./<file_name>, in this case ./crackme
First, load the file into gdb by typing gdb filename in your terminal; here it is gdb crackme. For this, you have to be in the directory where the file is located. Here my filename is crackme. Once the file is loaded, type info functions to see all the functions defined in the file.
We have the main function defined. Let's see the disassembly of the main function by typing disassemble main. We get something like this
The disassembly of the main function is in AT&T format by default; we'll change that to Intel format by typing set disassembly-flavor intel
Now let's see the disassembly of the main function again by typing disassemble main. It will be like this.
On observing the disassembly of the main function, we can see that there are 4 compare instructions. Let's see what is actually happening at those compare instructions by putting breakpoints at those 4 locations. To set a breakpoint, type b *address. The corresponding addresses are to the left of each instruction (something like 0xaaaaaaaa).
Now let's run the program by giving some random input (say 'hello'). Type run hello to run the program with 'hello' as the input.
Now the program is halted at the 1st breakpoint. Now type disas main.
The small arrow indicates the next instruction to be executed. The comparison is between al and 0x62. Let's see what is in al. To see what is present in al, type print $al.
It gives us 104 (0x68 in hex). It is nothing but the ASCII value of the 1st byte of our input (hello), which is 'h'. It is being compared with 0x62 (98), the ASCII value of the letter 'b'. Let's manually see what happens next. Type ni to proceed to the next instruction. As our input doesn't match 0x62, the jump instruction jne 0x8048464 is taken and program control jumps to the instruction at 0x8048464. (Type x/i $eip whenever you want to see which instruction is going to be executed next and where program control is right now, or just type disas main.)
0x08048464 <+96> : mov DWORD PTR [esp],0x8048558
0x0804846b <+103>: call 0x8048310 <puts@plt>
0x08048470 <+108>: mov DWORD PTR [esp],0xffffffff
0x08048477 <+115>: call 0x8048330 <exit@plt>
These 4 instructions are executed: "You Lose" is printed to the screen and the program exits.
Now we have concluded that our 1st letter must be 'b'. Similarly, there are 3 more compare statements.
Let's run the program again by giving 'b' as the 1st letter in our input (say 'bell'). Type r bell.
Now, as the 1st letter of our input is 'b', the jump jne 0x8048464 is not taken and execution proceeds towards the next breakpoint.
Type c to continue and then see where program control is.
Now our instruction pointer is at the 2nd breakpoint. Here let's see what is in al. Type print $al.
We get 101 (0x65), that is the ASCII value of 'e' which is the 2nd letter of our input. It is being compared with 0x6c (108), ASCII value of 'l'. So our 2nd letter has got to be 'l'.
Similarly, there are 2 more compare statements which take the 3rd byte and the 4th byte of our input and check them with 0x61(97 ASCII value for 'a') and 0x68(104 ASCII value for 'h') respectively.
So our input must be "blah".
Now leave gdb by typing quit (or q). Then run the file by giving 'blah' as the input. Type ./crackme blah to run the file with 'blah' as input.
The challenge is complete. "You win"
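The check recovered from the four compare instructions can be summarised in Python. This is a reconstruction from the constants seen in the walkthrough (0x62, 0x6c, 0x61, 0x68), not the binary's actual source; the function name `check` is made up, and how the real program treats inputs longer than four bytes is not known from the steps shown here.

```python
# Constants taken from the four compare instructions, in order.
EXPECTED = bytes([0x62, 0x6C, 0x61, 0x68])

def check(candidate):
    """Return True if the first four bytes match the compared constants."""
    data = candidate.encode("ascii")
    if len(data) < len(EXPECTED):
        return False
    return all(data[i] == EXPECTED[i] for i in range(len(EXPECTED)))

print(EXPECTED.decode("ascii"))  # blah
print(check("hello"))            # False (fails at the 1st byte: 'h' != 'b')
print(check("bell"))             # False (fails at the 2nd byte: 'e' != 'l')
print(check("blah"))             # True
```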
Since it is a stripped binary, this binary doesn't contain user-defined function names.
For this type of binary, we proceed as follows:
We type the command info files inside gdb, search for the .text section, and then disassemble the whole .text section.
Inside the .text section we try to find the main function, since the main function usually prompts for or takes input using one of the functions puts/printf/scanf/fgets.
If the binary only takes command line arguments, like the above crackme, we search for the puts/printf calls at the end of the function that are responsible for printing the success/failure message, such as you win/you lose.
Once we find the main function, we proceed the same way as we do for a non-stripped binary.