<style>
h2 { counter-increment: h2; }
h2:before { content: counter(h2) ". " }
.markdown-body { font-family: -apple-system, BlinkMacSystemFont, "SFNS Display", "Roboto", "Helvetica Neue", Helvetica, Arial, sans-serif !important; }
</style>
# CS356: Compiling and Disassembling
This note collects the commands presented in class to compile a C program to binary and then disassemble it. It also shows you how to compile C code to x86 assembly, and x86 assembly code to binary. For the adventurous, it even gives step-by-step instructions on how to edit a binary executable *directly*.
### A simple program
Start the class VM, open your favorite editor and save this code to `program.c`:
```C
#include <stdio.h>
#include <stdbool.h>

bool is_valid(int code) {
    return code == 42;
}

int main() {
    int c;
    printf("Enter the right code: ");
    scanf("%d", &c);
    if (is_valid(c)) {
        printf("Code is right.\n");
        return 0;
    } else {
        printf("Code is wrong.\n");
        return 1;
    }
}
```
Note that `program.c` is just a text file. The program reads an integer from the user and checks whether the input equals `42`.
You can compile this program with:
```
$ gcc -Wall -Wextra -Og -no-pie program.c -o program
```
This produces the binary file `program`. You can run it with:
```
$ ./program
```
A screenshot:
![](https://i.imgur.com/V4Z2xOF.png)
### Disassembling
Let's say that you don't have the source code of this program, because a friend sent you just the binary file `program`, asking you to figure out the correct input value.
Difficult? Not really. The binary file contains all of the CPU instructions of the program: among them, there must be some that check the validity of the input value.
You can look at the instructions in the binary program by disassembling:
```
$ objdump -d program
```
The output is very long:
```
...
0000000000401142 <is_valid>:
401142: 83 ff 2a cmpl $0x2a,%edi
401145: 0f 94 c0 sete %al
401148: c3 retq
0000000000401149 <main>:
401149: 48 83 ec 18 subq $0x18,%rsp
40114d: 48 8d 3d b0 0e 00 00 leaq 0xeb0(%rip),%rdi # 402004 <_IO_stdin_used+0x4>
401154: b8 00 00 00 00 movl $0x0,%eax
401159: e8 e2 fe ff ff callq 401040 <printf@plt>
40115e: 48 8d 74 24 0c leaq 0xc(%rsp),%rsi
401163: 48 8d 3d b1 0e 00 00 leaq 0xeb1(%rip),%rdi # 40201b <_IO_stdin_used+0x1b>
40116a: b8 00 00 00 00 movl $0x0,%eax
40116f: e8 dc fe ff ff callq 401050 <__isoc99_scanf@plt>
401174: 8b 7c 24 0c movl 0xc(%rsp),%edi
401178: e8 c5 ff ff ff callq 401142 <is_valid>
40117d: 84 c0 testb %al,%al
40117f: 74 16 je 401197 <main+0x4e>
401181: 48 8d 3d 96 0e 00 00 leaq 0xe96(%rip),%rdi # 40201e <_IO_stdin_used+0x1e>
401188: e8 a3 fe ff ff callq 401030 <puts@plt>
40118d: b8 00 00 00 00 movl $0x0,%eax
401192: 48 83 c4 18 addq $0x18,%rsp
401196: c3 retq
401197: 48 8d 3d 8f 0e 00 00 leaq 0xe8f(%rip),%rdi # 40202d <_IO_stdin_used+0x2d>
40119e: e8 8d fe ff ff callq 401030 <puts@plt>
4011a3: b8 01 00 00 00 movl $0x1,%eax
4011a8: eb e8 jmp 401192 <main+0x49>
4011aa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
...
```
**What the output means.** A lot of this probably doesn't make any sense to you. Don't worry! It will, by the end of the class. Note that each row starts with a memory address, like `401142`:
```
0000000000401142 <is_valid>:
401142: 83 ff 2a cmpl $0x2a,%edi
401145: 0f 94 c0 sete %al
401148: c3 retq
```
This means that, when the program runs, `0x401142` is the memory address of the first instruction of the function `is_valid`, which is `cmpl $0x2a,%edi`. The bytes right after the address, `83 ff 2a`, are the encoding of this instruction, which occupies 3 bytes in the binary program. The instruction compares `%edi` with `0x2a`, and `2a` is the last byte of its encoding.
If you look inside the `main` function, you can see that at some point there is a function call to `is_valid`:
```
401174: 8b 7c 24 0c movl 0xc(%rsp),%edi
401178: e8 c5 ff ff ff callq 401142 <is_valid>
40117d: 84 c0 testb %al,%al
40117f: 74 16 je 401197 <main+0x4e>
```
The instruction `callq 401142` makes the CPU (1) save a return address on the stack, and (2) jump to the memory address `0x401142`, which is the beginning of `is_valid`.
You may wonder where the encoding `e8 c5 ff ff ff` of `callq 401142` comes from, since it doesn't seem to include the address `0x401142`. Well, the first byte `0xe8` is the opcode of a *relative* call; the rest (`c5 ff ff ff`) encodes the offset of the call target, relative to the address of the next instruction. We need to reverse the bytes of this 4-byte word because x86 processors are little-endian. That gives `0xffffffc5`, a negative number in two's complement (because its most significant bit is `1`; it's `-59` in decimal). If we add this negative number to the address of the next instruction (`0x40117d`, where the `testb %al,%al` instruction is stored), we get `0x401142`, the address of `is_valid`:
```
  0x  40117d +                        4198781 +
  0xffffffc5 =    or, in decimal:         -59 =
------------                        ---------
  0x00401142                          4198722  -> 0x401142 in hex
```
**Finding 42.** Given all this mess, can you find `42` within the function `is_valid`?
Well, it's there but in hex:
```
0000000000401142 <is_valid>:
401142: 83 ff 2a cmpl $0x2a,%edi
401145: 0f 94 c0 sete %al
401148: c3 retq
```
At `0x401142`, `cmpl $0x2a,%edi` compares the value in register `%edi` with the constant `0x2a`, which is $2 \times 16 + 10 = 42$ in decimal.
Don't worry about the rest of the instructions (for now).
### A note on compilation options
I used the following options:
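- `-Wall -Wextra` enables most of GCC's warnings
- `-Og` enables only the optimizations that don't interfere with debugging, which gives assembly that is easier to read
- `-no-pie` generates a non-position-independent executable, with fixed (hardcoded) memory addresses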
### Compiling to assembly instead of binary
To do this, use the `-S` switch. For example,
```
gcc -Wall -Wextra -Og -no-pie -S program.c
```
will produce `program.s`, a text file containing the following assembly code:
```
.file "program.c"
.text
.globl is_valid
.type is_valid, @function
is_valid:
.LFB11:
.cfi_startproc
cmpl $42, %edi
sete %al
ret
.cfi_endproc
.LFE11:
.size is_valid, .-is_valid
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "Enter the right code: "
.LC1:
.string "%d"
.LC2:
.string "Code is right."
.LC3:
.string "Code is wrong."
.text
.globl main
.type main, @function
main:
.LFB12:
.cfi_startproc
subq $24, %rsp
.cfi_def_cfa_offset 32
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf@PLT
leaq 12(%rsp), %rsi
leaq .LC1(%rip), %rdi
movl $0, %eax
call __isoc99_scanf@PLT
movl 12(%rsp), %edi
call is_valid
testb %al, %al
je .L3
leaq .LC2(%rip), %rdi
call puts@PLT
movl $0, %eax
.L2:
addq $24, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L3:
.cfi_restore_state
leaq .LC3(%rip), %rdi
call puts@PLT
movl $1, %eax
jmp .L2
.cfi_endproc
.LFE12:
.size main, .-main
.ident "GCC: (Debian 8.3.0-6) 8.3.0"
.section .note.GNU-stack,"",@progbits
```
Can you find `42` again? It's inside `is_valid`, in the instruction `cmpl $42, %edi` (this time, written in base 10 instead of base 16).
### Modifying the assembly code and compiling to binary
Let's say that we want our program to check for the input `23` instead of `42`. We can modify the assembly code by replacing the line
```
cmpl $42, %edi
```
under the label `is_valid:` in `program.s` with
```
cmpl $23, %edi
```
Then, we assemble and link the assembly code into a binary with:
```
gcc -no-pie program.s -o program
```
And run it with `./program`, as usual. Now, it expects the input `23`:
![](https://i.imgur.com/uOpD7T8.png)
### Modifying the binary code directly
Let's say that we'd like to change the expected value again, from `23` to `32`. This time, we want to modify the binary executable `program` directly, instead of recompiling from the source files `program.c` or `program.s` like before.
First, we need to know what to modify. We disassemble `program` again with `objdump -d program` and we notice, inside `is_valid`, the instruction:
```
401142: 83 ff 17 cmpl $0x17,%edi
```
Notice that `cmpl $0x17,%edi` is using the hex constant `0x17`, which is `23` in decimal. We turn our attention to the binary encoding of this instruction, which is `83 ff 17`.
Since `17` is the last byte of the encoding, we take an educated guess: it must be the hex encoding of the constant used in the comparison. We'd like to change it to `0x20` (`32` in decimal).
That's easy. We open the **binary** file with emacs:
```
emacs program
```
It will look like garbage. So, we switch to "hexl mode": press `ALT + x`, type `hexl-mode`, press `ENTER`.
![](https://i.imgur.com/8rJLRL7.png)
![](https://i.imgur.com/Gkt0qpg.png)
Now we can search for the binary encoding of the instruction we want to modify, by pressing `CONTROL + s` and typing `83ff`:
![](https://i.imgur.com/xP4DnKP.png)
Then we press the left/right arrows to move over the byte `0x17`. To change it, we press `ALT + x`, type `hexl-insert-hex-char`, press `ENTER`, type `20` and press `ENTER`. The byte now reads `0x20`:
![](https://i.imgur.com/TpZ4K8w.png)
Then we save and exit with `CONTROL + x`, `CONTROL + c`, type `y` twice. If we start the binary again with `./program`, it will expect `32` instead of `23` or `42`. Success!
![](https://i.imgur.com/kJArvel.png)