<style>
h2 { counter-increment: h2; }
h2:before { content: counter(h2) ". " }
.markdown-body { font-family: -apple-system, BlinkMacSystemFont, "SFNS Display", "Roboto", "Helvetica Neue", Helvetica, Arial, sans-serif !important; }
</style>

# CS356: Compiling and Disassembling

This note collects the commands presented in class to compile a C program to binary and then disassemble it. It also shows you how to compile C code to x86 assembly, and x86 assembly code to binary. For the adventurous, it even gives step-by-step instructions on how to edit a binary executable *directly*.

### A simple program

Start the class VM, open your favorite editor and save this code to `program.c`:

```C
#include <stdio.h>
#include <stdbool.h>

bool is_valid(int code) {
    return code == 42;
}

int main() {
    int c;
    printf("Enter the right code: ");
    scanf("%d", &c);

    if (is_valid(c)) {
        printf("Code is right.\n");
        return 0;
    } else {
        printf("Code is wrong.\n");
        return 1;
    }
}
```

Note that `program.c` is just a text file. The program reads an integer from the user and checks whether the input equals `42`.

You can compile this program with:

```
$ gcc -Wall -Wextra -Og -no-pie program.c -o program
```

This produces the binary file `program`. You can run it with:

```
$ ./program
```

A screenshot:

![](https://i.imgur.com/V4Z2xOF.png)

### Disassembling

Let's say that you don't have the source code of this program, because a friend sent you just the binary file `program`, asking you to figure out the correct input value.

Difficult? Not really. The binary file contains all of the CPU instructions of the program: among them, there must be some that check the validity of the input value.

You can look at the instructions in the binary program by disassembling:

```
$ objdump -d program
```

The output is very long:

```
...
0000000000401142 <is_valid>:
  401142:  83 ff 2a              cmpl   $0x2a,%edi
  401145:  0f 94 c0              sete   %al
  401148:  c3                    retq

0000000000401149 <main>:
  401149:  48 83 ec 18           subq   $0x18,%rsp
  40114d:  48 8d 3d b0 0e 00 00  leaq   0xeb0(%rip),%rdi        # 402004 <_IO_stdin_used+0x4>
  401154:  b8 00 00 00 00        movl   $0x0,%eax
  401159:  e8 e2 fe ff ff        callq  401040 <printf@plt>
  40115e:  48 8d 74 24 0c        leaq   0xc(%rsp),%rsi
  401163:  48 8d 3d b1 0e 00 00  leaq   0xeb1(%rip),%rdi        # 40201b <_IO_stdin_used+0x1b>
  40116a:  b8 00 00 00 00        movl   $0x0,%eax
  40116f:  e8 dc fe ff ff        callq  401050 <__isoc99_scanf@plt>
  401174:  8b 7c 24 0c           movl   0xc(%rsp),%edi
  401178:  e8 c5 ff ff ff        callq  401142 <is_valid>
  40117d:  84 c0                 testb  %al,%al
  40117f:  74 16                 je     401197 <main+0x4e>
  401181:  48 8d 3d 96 0e 00 00  leaq   0xe96(%rip),%rdi        # 40201e <_IO_stdin_used+0x1e>
  401188:  e8 a3 fe ff ff        callq  401030 <puts@plt>
  40118d:  b8 00 00 00 00        movl   $0x0,%eax
  401192:  48 83 c4 18           addq   $0x18,%rsp
  401196:  c3                    retq
  401197:  48 8d 3d 8f 0e 00 00  leaq   0xe8f(%rip),%rdi        # 40202d <_IO_stdin_used+0x2d>
  40119e:  e8 8d fe ff ff        callq  401030 <puts@plt>
  4011a3:  b8 01 00 00 00        movl   $0x1,%eax
  4011a8:  eb e8                 jmp    401192 <main+0x49>
  4011aa:  66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
...
```

**What the output means.** A lot of this probably doesn't make any sense to you. Don't worry! It will, by the end of the class.

Note that each row starts with a memory address, like `401142`:

```
0000000000401142 <is_valid>:
  401142:  83 ff 2a              cmpl   $0x2a,%edi
  401145:  0f 94 c0              sete   %al
  401148:  c3                    retq
```

This means that, when running this program, `0x401142` will be the memory address of the first instruction of the function `is_valid`, which is `cmpl $0x2a,%edi`. This instruction occupies 3 bytes in the binary program; its encoding, `83 ff 2a`, appears on the same line, right after the memory address. The instruction compares `%edi` with `0x2a`, and `2a` is the last byte of its encoding.
If you look inside the `main` function, you can see that at some point there is a function call to `is_valid`:

```
  401174:  8b 7c 24 0c           movl   0xc(%rsp),%edi
  401178:  e8 c5 ff ff ff        callq  401142 <is_valid>
  40117d:  84 c0                 testb  %al,%al
  40117f:  74 16                 je     401197 <main+0x4e>
```

The instruction `callq 401142` makes the CPU (1) save a return address on the stack, and (2) jump to the memory address `0x401142`, which is the beginning of `is_valid`.

You may wonder where the encoding `e8 c5 ff ff ff` of `callq 401142` comes from, since it doesn't seem to include the address `0x401142`. Well, the first byte `0xe8` is the opcode of a *relative* call; the remaining four bytes (`c5 ff ff ff`) encode the displacement, that is, how far to jump relative to the address of the next instruction. We need to reverse the bytes of this 4-byte word because x86 processors are little-endian. That gives `0xffffffc5`, a negative number (because its most significant bit is `1`; it's `-59` in decimal). If we add this negative number to the next instruction address (the address `0x40117d`, where the `testb %al,%al` instruction is stored), we get `0x401142`, the address of `is_valid`:

```
  0x  40117d +                        4198781 +
  0xffffffc5 =    or, in decimal:         -59 =
------------                        ---------
  0x00401142                          4198722  -> 0x401142 in hex
```

**Finding 42.** Given all this mess, can you find `42` within the function `is_valid`? Well, it's there but in hex:

```
0000000000401142 <is_valid>:
  401142:  83 ff 2a              cmpl   $0x2a,%edi
  401145:  0f 94 c0              sete   %al
  401148:  c3                    retq
```

At `0x401142`, `cmpl $0x2a,%edi` compares the value in register `%edi` with the constant `0x2a`, which is $2 \times 16 + 10 = 42$ in decimal.

Don't worry about the rest of the instructions (for now).

### A note on compilation options

I used the following options:
- `-Wall -Wextra` enables a broad set of GCC warnings
- `-Og` applies only optimizations that don't interfere with debugging, which gives assembly that is easier to read
- `-no-pie` generates a non-position-independent binary file, with hardcoded memory addresses

### Compiling to assembly instead of binary

To do this, use the `-S` switch.
For example,

```
gcc -Wall -Wextra -Og -no-pie -S program.c
```

will produce `program.s`, a text file containing the following assembly code:

```
    .file   "program.c"
    .text
    .globl  is_valid
    .type   is_valid, @function
is_valid:
.LFB11:
    .cfi_startproc
    cmpl    $42, %edi
    sete    %al
    ret
    .cfi_endproc
.LFE11:
    .size   is_valid, .-is_valid
    .section    .rodata.str1.1,"aMS",@progbits,1
.LC0:
    .string "Enter the right code: "
.LC1:
    .string "%d"
.LC2:
    .string "Code is right."
.LC3:
    .string "Code is wrong."
    .text
    .globl  main
    .type   main, @function
main:
.LFB12:
    .cfi_startproc
    subq    $24, %rsp
    .cfi_def_cfa_offset 32
    leaq    .LC0(%rip), %rdi
    movl    $0, %eax
    call    printf@PLT
    leaq    12(%rsp), %rsi
    leaq    .LC1(%rip), %rdi
    movl    $0, %eax
    call    __isoc99_scanf@PLT
    movl    12(%rsp), %edi
    call    is_valid
    testb   %al, %al
    je      .L3
    leaq    .LC2(%rip), %rdi
    call    puts@PLT
    movl    $0, %eax
.L2:
    addq    $24, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 8
    ret
.L3:
    .cfi_restore_state
    leaq    .LC3(%rip), %rdi
    call    puts@PLT
    movl    $1, %eax
    jmp     .L2
    .cfi_endproc
.LFE12:
    .size   main, .-main
    .ident  "GCC: (Debian 8.3.0-6) 8.3.0"
    .section    .note.GNU-stack,"",@progbits
```

Can you find `42` again? It's inside `is_valid`, in the instruction `cmpl $42, %edi` (this time, written in base 10 instead of base 16).

### Modifying the assembly code and compiling to binary

Let's say that we want the program to check for the input `23` instead of `42`. We can modify the assembly code by replacing the line

```
cmpl $42, %edi
```

under the `is_valid` label of `program.s` with

```
cmpl $23, %edi
```

Then, we compile (that is, assemble and link) the assembly code to binary with:

```
gcc -no-pie program.s -o program
```

And run it with `./program`, as usual. Now, it expects the input `23`:

![](https://i.imgur.com/uOpD7T8.png)

### Modifying the binary code directly

Let's say that we'd like to change the expected value again, from `23` to `32`. This time, we want to modify the binary executable `program` directly, instead of recompiling from the source files `program.c` or `program.s` like before.
First, we need to know what to modify. We disassemble `program` again with `objdump -d program` and we notice, inside `is_valid`, the instruction:

```
  401142:  83 ff 17              cmpl   $0x17,%edi
```

Notice that `cmpl $0x17,%edi` is using the hex constant `0x17`, which is `23` in decimal. We turn our attention to the binary encoding of this instruction, which is `83 ff 17`. We take an educated guess and figure out that the `17` at the end must be the hex encoding of the constant used in the comparison. We'd like to change it to `0x20` (`32` in decimal).

That's easy. We open the **binary** file with emacs:

```
emacs program
```

It will look like garbage, so we switch to hexl mode: press `ALT + x`, type `hexl-mode`, and press `ENTER`.

![](https://i.imgur.com/8rJLRL7.png)

![](https://i.imgur.com/Gkt0qpg.png)

Now we can search for the binary encoding of the instruction we want to modify, by pressing `CONTROL + s` and typing `83ff`:

![](https://i.imgur.com/xP4DnKP.png)

Then we press the left/right arrows to move over the byte `0x17`. To change it, we press `ALT + x`, type `hexl-insert-hex-char`, press `ENTER`, type `20` and press `ENTER`. The byte now reads `0x20`:

![](https://i.imgur.com/TpZ4K8w.png)

Then we save and exit with `CONTROL + x`, `CONTROL + c`, and type `y` twice.

If we start the binary again with `./program`, it will expect `32` instead of `23` or `42`. Success!

![](https://i.imgur.com/kJArvel.png)