# Booting 4-6
Group2
---
### Transition to 64-bit mode
#### 張友維、楊竣喆、陳建嘉
----
### Overview of the Transition to 64-bit Mode
- Transitioning the CPU from protected mode to long mode
- Checking CPU support for long mode and SSE
- Initializing early page tables and enabling paging
----
### The 32-bit entry point
----
#### 1. Source Files
- Within `arch/x86/boot/compressed`, there are two entry files: `head_32.S` and `head_64.S`.
- `head_64.S` is the relevant one for the x86_64 architecture.
----
#### 2. Makefile Configuration
- In the Makefile, the target `vmlinux-objs-y` selects the appropriate file (head_32.o or head_64.o) based on `$(BITS)`.
- `$(BITS)` is determined in `arch/x86/Makefile` based on kernel configuration (`CONFIG_X86_32`).
```makefile
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
	$(obj)/string.o $(obj)/cmdline.o \
	$(obj)/piggy.o $(obj)/cpuflags.o
```
----
### Reload the segments if needed
----
#### 1. Segment Setup
- Segmentation organizes code and data into separate segments in x86 assembly.
- Special section attributes and directives mark the entry code, such as `.head.text` and `.code32`.
- The `KEEP_SEGMENTS` flag determines whether the segment registers need to be reloaded.
----
#### 2. Flag Checking
- The `KEEP_SEGMENTS` flag in the kernel setup header dictates segment register behavior (see the sketch below).
- If the flag is unset, the segment registers are reloaded with known-good values.
- This ensures uniform behavior regardless of bootloader variations.
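A minimal C sketch of this check, mirroring the `testb $KEEP_SEGMENTS, BP_loadflags(%esi)` test in `head_64.S` (`KEEP_SEGMENTS` is bit 6 of `loadflags` in the setup header; the helper name is hypothetical):
```clike
#include <stdint.h>

#define KEEP_SEGMENTS (1 << 6)  /* bit 6 of loadflags in the setup header */

/* Hypothetical helper: decide whether startup_32 must reload
 * %ds/%es/%ss with a known-good data segment selector. */
static int need_segment_reload(uint8_t loadflags)
{
        return !(loadflags & KEEP_SEGMENTS);
}
```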
----
#### 3. Calculating Kernel Load Address
- The kernel (decompressor) code is compiled to run at address 0.
- Computing the actual load address uses a classic pattern:
- `call` a nearby label and pop the return address off the stack.
- The difference between the popped return address and the label's link-time address gives the kernel's real physical load address.
----
### Stack setup and CPU Verification
----
#### Get the real address of `boot_stack_end`
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S)
```x86asm=
movl $boot_stack_end, %eax
addl %ebp, %eax
movl %eax, %esp
```
- `ebp`: real address of the `startup_32` label
- `eax`: address of `boot_stack_end` as it was linked at `0x0`
- `ebp` + `eax` → real address of `boot_stack_end`
note: We need the real address of the `startup_32` label plus the link-time offset of `boot_stack_end` to compute the address the stack pointer should point to.
----
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S)
```x86asm=
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
```
note: `boot_stack_end` resides in the `.bss` section.
----
#### Verify the CPU supports long mode and SSE
```x86asm=
call verify_cpu
testl %eax, %eax
jnz no_longmode
```
- `verify_cpu`: returns 0 if the CPU supports long mode
- `no_longmode`: halts the CPU
note: After the stack pointer is set up, the `verify_cpu` function checks whether the CPU supports long mode and SSE (Streaming SIMD Extensions).
----
[arch/x86/kernel/verify_cpu.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/kernel/verify_cpu.S)
- Check whether the CPU supports long mode
```x86asm=
movl $0x1,%eax # Does the cpu have what it takes
cpuid
andl $REQUIRED_MASK0,%edx
xorl $REQUIRED_MASK0,%edx
jnz .Lverify_cpu_no_longmode
movl $0x80000000,%eax # See if extended cpuid is implemented
cpuid
cmpl $0x80000001,%eax
jb .Lverify_cpu_no_longmode # no extended cpuid
movl $0x80000001,%eax # Does the cpu have what it takes
cpuid
andl $REQUIRED_MASK1,%edx
xorl $REQUIRED_MASK1,%edx
jnz .Lverify_cpu_no_longmode
```
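Outside the kernel, the same check can be made from user space; here is a minimal sketch using GCC's `<cpuid.h>` helpers (the kernel's `REQUIRED_MASK1` includes the long-mode bit, bit 29 of `edx` from leaf `0x80000001`):
```clike
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* Make sure the extended leaf 0x80000001 exists at all. */
        if (__get_cpuid_max(0x80000000, 0) < 0x80000001)
                return 1;

        __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
        printf("long mode: %s\n", (edx & (1u << 29)) ? "yes" : "no");
        return 0;
}
```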
note: cpu_check
----
[arch/x86/kernel/verify_cpu.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/kernel/verify_cpu.S)
- Check whether the CPU supports SSE
```x86asm=
movl $1,%eax
cpuid
andl $SSE_MASK,%edx
cmpl $SSE_MASK,%edx
je .Lverify_cpu_sse_ok
test %di,%di
jz .Lverify_cpu_no_longmode # only try to force SSE on AMD
movl $MSR_K7_HWCR,%ecx
rdmsr
btr $15,%eax # enable SSE
wrmsr
xor %di,%di # don't loop
jmp .Lverify_cpu_sse_test # try again
```
note: sse_test
----
### Calculate the relocation address
note: If everything is OK, `verify_cpu` returns 0, and we can now calculate the relocation address.
----
#### The default kernel base address
- `CONFIG_PHYSICAL_START`: default physical base address of the Linux kernel
- The default value of `CONFIG_PHYSICAL_START` is `0x1000000` (16 MB)
----
#### Why does the kernel need to be relocatable?
- If the Linux kernel crashes, `kdump` loads a `rescue kernel`, which must be configured to load from a different address than the normal kernel
----
#### Set the kernel to be relocatable
→ `CONFIG_RELOCATABLE`=y
```text
This builds a kernel image that retains relocation
information so it can be loaded someplace besides the
default 1MB.
Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the
address it has been loaded at and the compile time physical
address (CONFIG_PHYSICAL_START) is used as the minimum location.
```
----
### Section Attributes and Position-Independent Code
----
A special section attribute is set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S) before `startup_32`
```x86asm=
__HEAD
.code32
ENTRY(startup_32)
```
note: Diving into the assembly code, we find that a special section attribute is set in this file before the `startup_32` entry point.
----
```clike
#define __HEAD .section ".head.text","ax"
```
- `__HEAD` is a macro defined in [include/linux/init.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/init.h)
- the code that follows is placed in the `.head.text` section
- flag `a`: this section is allocatable
- flag `x`: this section contains instructions executable by the CPU
note: Placing the code in its own executable section is only part of the story; next we look at what actually lets this code run when loaded at a different address.
----
#### How is the Linux kernel able to boot from a different address?
→ The decompressor is compiled as `position independent code` (PIC)
note: Position-independent code is not tied to a specific load address; it uses offsets relative to the program counter instead of absolute addresses.
----
[arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/Makefile)
```makefile
KBUILD_CFLAGS += -fno-strict-aliasing -fPIC
```
----
An address is obtained by adding the address field of the instruction to the value of the `program counter`
----
#### Calculate an address where we can relocate the kernel for decompression
→ depending on `CONFIG_RELOCATABLE`
```x86asm
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jge 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
```
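The same computation in C, as a minimal sketch (function name hypothetical): round the load address up to a multiple of `kernel_alignment` and enforce `LOAD_PHYSICAL_ADDR` as the floor:
```clike
/* Hypothetical helper mirroring the assembly above. */
static unsigned long relocation_base(unsigned long load_addr,
                                     unsigned long kernel_alignment,
                                     unsigned long load_physical_addr)
{
        /* Align load_addr up to a multiple of kernel_alignment. */
        unsigned long base = (load_addr + kernel_alignment - 1)
                             & ~(kernel_alignment - 1);

        /* Never relocate below the compile-time minimum. */
        return base >= load_physical_addr ? base : load_physical_addr;
}
```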
note: The candidate address must be aligned to the kernel alignment; if the aligned result is below `LOAD_PHYSICAL_ADDR`, `LOAD_PHYSICAL_ADDR` is used instead.
----
#### CONFIG_RELOCATABLE
- align the load address to a multiple of 2 MB (`CONFIG_PHYSICAL_ALIGN`)
- compare it with the result of the `LOAD_PHYSICAL_ADDR` macro
```clike=
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
+ (CONFIG_PHYSICAL_ALIGN - 1)) \
& ~(CONFIG_PHYSICAL_ALIGN - 1))
```
[arch/x86/include/asm/boot.h](https://github.com/torvalds/linux/blob/v4.16/arch/x86/include/asm/boot.h)
note: `LOAD_PHYSICAL_ADDR` simply expands to `CONFIG_PHYSICAL_START` aligned up to `CONFIG_PHYSICAL_ALIGN`; it represents the minimum physical address where the kernel will be loaded. With the defaults (`CONFIG_PHYSICAL_START = 0x1000000`, `CONFIG_PHYSICAL_ALIGN = 0x200000`), it evaluates to `0x1000000`.
----
#### Move compressed kernel image to the end of the decompression buffer
```x86asm=
1:
movl BP_init_size(%esi), %eax
subl $_end, %eax
addl %eax, %ebx
```
- `BP_init_size` : the larger of the compressed and uncompressed `vmlinux` sizes
----
### Preparation before entering long mode
----
#### Update the Global Descriptor Table with 64-bit segments
```x86asm=
addl %ebp, gdt+2(%ebp)
lgdt gdt(%ebp)
```
- adjust the base address of the Global Descriptor Table to the address where the kernel was actually loaded
- load the Global Descriptor Table with the `lgdt` instruction
----
#### GDT Definition
```x86asm=
.data
...
gdt:
.word gdt_end - gdt
.long gdt
.word 0
.quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
.quad 0x0080890000000000 /* TS descriptor */
.quad 0x0000000000000000 /* TS continued */
gdt_end:
```
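Each `.quad` packs a segment descriptor's base, limit, and flags into 64 bits. A small self-contained C program (not kernel code) that unpacks `__KERNEL_CS` shows why this descriptor denotes a 64-bit code segment:
```clike
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t d = 0x00af9a000000ffffULL;                /* __KERNEL_CS */
        uint32_t limit = (d & 0xffff) | ((d >> 32) & 0xf0000);
        uint32_t base  = ((d >> 16) & 0xffffff) | (((d >> 56) & 0xff) << 24);
        unsigned type  = (d >> 40) & 0xf;   /* 0xa: execute/read code */
        unsigned lbit  = (d >> 53) & 1;     /* L = 1: 64-bit code segment */

        printf("base=%#x limit=%#x type=%#x L=%u\n", base, limit, type, lbit);
        return 0;
}
```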
----
#### Enable Physical Address Extension
```x86asm=
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
movl %eax, %cr4
```
→ put the value of the `cr4` register into `eax`, set the PAE bit (bit 5), and load it back into `cr4`
----
### Long Mode
----
#### 64-bit mode provides the following features:
- 8 new general-purpose registers, r8 through r15
- All general-purpose registers are 64 bits wide
- A 64-bit instruction pointer: RIP
- 64-bit addresses and operands
- RIP-relative addressing
----
#### Long mode is an extension of the legacy protected mode. It consists of two sub-modes:
- 64-bit mode
  - allows the processor to operate in a fully 64-bit environment
- compatibility mode
  - ensures backward compatibility with legacy 32-bit software
----
#### To switch into 64-bit mode we need to do the following things:
- Enable PAE
- Build page tables and load the address of the top level page table into the cr3 register
- Enable EFER.LME
- Enable paging
----
### Early Page Table Initialization
----
#### The Linux kernel uses ```4-level``` paging, and we generally build 6 page tables:
- One ```PML4``` (Page Map Level 4 table) with one entry
- One ```PDP``` (Page Directory Pointer table) with four entries
- Four Page Directory tables with a total of 2048 entries
----
#### Clear the buffer for the page tables in memory
```x86asm=
leal pgtable(%ebx), %edi
xorl %eax, %eax
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl
```
----
#### [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S)
- ```pgtable``` is defined here
```
.section ".pgtable","a",@nobits
.balign 4096
pgtable:
.fill BOOT_PGT_SIZE, 1, 0
```
----
#### The size of the buffer, ```BOOT_PGT_SIZE```, depends on the ```CONFIG_RANDOMIZE_BASE``` and ```CONFIG_X86_VERBOSE_BOOTUP``` kernel configuration options:
```
# ifdef CONFIG_RANDOMIZE_BASE
#  ifdef CONFIG_X86_VERBOSE_BOOTUP
#   define BOOT_PGT_SIZE (19*4096)
#  else /* !CONFIG_X86_VERBOSE_BOOTUP */
#   define BOOT_PGT_SIZE (17*4096)
#  endif
# else /* !CONFIG_RANDOMIZE_BASE */
#  define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
# endif
```
----
#### Build the top-level page table: ```PML4```
```x86asm=
leal pgtable + 0(%ebx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
```
- `0x1007` = address of the next table (`pgtable + 0x1000`) plus the flags `PRESENT | RW | USER`
----
#### Build four ```Page Directory``` entries in the ```Page Directory Pointer``` table
```x86asm=
leal pgtable + 0x1000(%ebx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
addl $0x00001000, %eax
addl $8, %edi
decl %ecx
jnz 1b
```
----
#### Build the ```2048``` page table entries with ```2-MByte``` pages
```x86asm=
leal pgtable + 0x2000(%ebx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
addl $0x00200000, %eax
addl $8, %edi
decl %ecx
jnz 1b
```
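Putting the three steps together, here is a hedged C model of the whole early page table layout (standalone sketch, not kernel code): `0x7` in the `0x1007` entries is `PRESENT | RW | USER`, and `0x183` is `PRESENT | RW | PSE (2 MB page) | GLOBAL`:
```clike
#include <stdint.h>
#include <string.h>

#define PGT_ENTRY_FLAGS 0x7ULL    /* PRESENT | RW | USER         */
#define PGT_2M_FLAGS    0x183ULL  /* PRESENT | RW | PSE | GLOBAL */

/* Model of the six early tables: PML4, PDP, and four page
 * directories, each 4096 bytes (512 8-byte entries). */
void build_early_pgt(uint64_t pgtable[6 * 512], uint64_t phys_base)
{
        int i;

        memset(pgtable, 0, 6 * 512 * sizeof(uint64_t));

        /* One PML4 entry pointing at the PDP table. */
        pgtable[0] = (phys_base + 0x1000) | PGT_ENTRY_FLAGS;

        /* Four PDP entries pointing at the four page directories. */
        for (i = 0; i < 4; i++)
                pgtable[512 + i] = (phys_base + 0x2000 + i * 0x1000)
                                   | PGT_ENTRY_FLAGS;

        /* 2048 directory entries identity-mapping 2 MiB pages. */
        for (i = 0; i < 2048; i++)
                pgtable[1024 + i] = (uint64_t)i * 0x200000 | PGT_2M_FLAGS;
}
```
2048 entries × 2 MiB = 4 GiB of identity-mapped memory.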
----
#### Put the address of the top-level page table, ```PML4```, into the ```cr3``` control register:
```x86asm=
leal pgtable(%ebx), %eax
movl %eax, %cr3
```
----
### The Transition to 64-bit Mode
----
#### Set the ```EFER.LME``` flag in the MSR at address ```0xC0000080``` (```MSR_EFER```)
```x86asm=
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
wrmsr
```
----
#### Push the kernel code segment selector onto the stack
```x86asm=
pushl $__KERNEL_CS
```
#### Put the address of the ```startup_64``` routine in ```eax```
```x86asm=
leal startup_64(%ebp), %eax
```
#### Push ```eax``` to the stack and enable paging
```x86asm=
pushl %eax
movl $(X86_CR0_PG | X86_CR0_PE), %eax
movl %eax, %cr0
```
#### Execute the lret instruction:
```x86asm=
lret
```
→ `lret` pops the return address and the `__KERNEL_CS` selector we pushed, so execution continues at `startup_64` with the 64-bit code segment
----
#### After all of these steps we're finally in 64-bit mode:
```x86asm=
.code64
.org 0x200
ENTRY(startup_64)
....
....
....
```
----
### Takeaway Question
* What is the significance of the `KEEP_SEGMENTS` flag in the Linux boot protocol, and what is its impact on segment register initialization during the boot process?
(A). It determines the boot protocol version.
(B). It controls the loading of segment registers during boot.
\(C). It sets up heap memory allocation.
----
### Takeaway Question
* Is the Linux kernel able to be booted from a different address?
(A). No, it is always loaded at a specific address.
(B). Yes, but the address cannot be obtained from an offset and the program counter.
\(C). Yes, the decompressor is compiled as position independent code.
----
### Takeaway Question
* To switch into 64-bit mode, which of the following is NOT one of the things we should do?
(A). Set CS.L = 0 and CS.D = 1.
(B). Build page tables and load the address of the top level page table into the cr3 register.
(C). Enable EFER.LME.
(D). Enable PAE.
---
### Kernel decompression
#### 林愉修
----
### Introduction
----
`startup_32` → `startup_64`
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/head_64.S#L208)
```x86asm
pushl $__KERNEL_CS
leal startup_64(%ebp), %eax
...
...
...
pushl %eax
...
...
...
lret
```
- `ebp`: the physical address of `startup_32`.
note: In the previous part, we've already covered the jump from `startup_32` to `startup_64`. The main purpose of the `startup_64` function is to decompress the compressed kernel image and jump right to it.
----
### Preparing to Decompress the Kernel
----
### Preparing to Decompress the Kernel
#### 1. Set up the segment registers
----
Why again?
- We have loaded a new `GDT`
- and the CPU has transitioned to a new mode (`64-bit` mode).
note: Remember we update the `Global Descriptor Table` with 64-bit segments in the previous part.
----
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/head_64.S#L253)
```x86asm
.code64
.org 0x200
ENTRY(startup_64)
xorl %eax, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %ss
movl %eax, %fs
movl %eax, %gs
```
All segment registers besides the `cs` register are now reset in long mode
----
### Preparing to Decompress the Kernel
#### 2. Calculate the relocation address
----
Why again?
- The bootloader may have used the `64-bit boot protocol`, in which case `startup_32` was never executed
note: We did this calculation before in the `startup_32` function, but we need to do it again because the bootloader may have used the `64-bit boot protocol`, in which case `startup_32` was never executed.
----
```x86asm
#ifdef CONFIG_RELOCATABLE
leaq startup_32(%rip), %rbp
movl BP_kernel_alignment(%rsi), %eax
decl %eax
addq %rax, %rbp
notq %rax
andq %rax, %rbp
cmpq $LOAD_PHYSICAL_ADDR, %rbp
jge 1f
#endif
movq $LOAD_PHYSICAL_ADDR, %rbp
1:
movl BP_init_size(%rsi), %ebx
subl $_end, %ebx
addq %rbp, %rbx
```
- `rsi`: pointer to the `boot_params` structure
- `rbp`: start address of the decompressed kernel
- `rbx`: address the kernel code will be relocated to for decompression
note: Because we're now in long mode, so in the first line, we could just use `rip relative addressing` to get the physical address of `startup_32`.
----
![image](https://hackmd.io/_uploads/SygaZBfm0p.png)
----
### Preparing to Decompress the Kernel
#### 3. Set up `rsp`, the flags register and the `GDT`
----
Set up stack pointer
```x86asm
leaq boot_stack_end(%rbx), %rsp
...
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
```
----
Set up the Global Descriptor Table
```x86asm
leaq gdt(%rip), %rax
movq %rax, gdt64+2(%rip)
lgdt gdt64(%rip)
...
.data
gdt64:
.word gdt_end - gdt
.long 0
.word 0
.quad 0
gdt:
.word gdt_end - gdt
.long gdt
.word 0
.quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
.quad 0x0080890000000000 /* TS descriptor */
.quad 0x0000000000000000 /* TS continued */
gdt_end:
```
- to overwrite the 32-bit-specific values with ones valid under the 64-bit boot protocol
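For reference, `gdt64` has the layout `lgdt` expects in long mode: a 16-bit limit followed by a 64-bit base, sketched below as a packed C struct (the type name is ours, not the kernel's):
```clike
#include <stdint.h>

/* Layout of the 10-byte lgdt operand in long mode. */
struct __attribute__((packed)) gdt_ptr64 {
        uint16_t limit;  /* GDT limit (size) */
        uint64_t base;   /* linear address of the GDT itself */
};
```
The `movq %rax, gdt64+2(%rip)` above writes the runtime address of `gdt` into exactly this `base` field.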
----
Zero EFLAGS
```x86asm
pushq $0
popfq
```
note: Zeroing the EFLAGS register is done to ensure a clean state of the processor's flags during the kernel boot process. This helps prevent any unintended side effects or undefined behavior by resetting the flags to a known state before executing the decompression process.
----
### Preparing to Decompress the Kernel
#### 4. Copy the compressed kernel to the relocation address
note: Since the stack is now correct, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel
----
Copy the compressed kernel to the end of our buffer, where decompression in place becomes safe
```x86asm
pushq %rsi
leaq (_bss-8)(%rip), %rsi
leaq (_bss-8)(%rbx), %rdi
movq $_bss, %rcx
shrq $3, %rcx
std
rep movsq
cld
popq %rsi
```
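In C terms, the sequence above amounts to a backward word copy (a hedged sketch; the helper name is hypothetical): `rsi`/`rdi` point at the last word below `_bss` in the old and new locations, `rcx` holds `_bss/8` words, and `std` makes `rep movsq` walk backward, which is what keeps the overlapping ranges safe:
```clike
#include <stdint.h>

/* Hypothetical helper: copy nwords 8-byte words, walking backward
 * from the last word of each image, as std; rep movsq does above. */
static void copy_kernel_backward(uint64_t *dst_last,
                                 const uint64_t *src_last,
                                 unsigned long nwords)
{
        while (nwords--)
                *dst_last-- = *src_last--;
}
```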
----
[arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/vmlinux.lds.S)
![image](https://hackmd.io/_uploads/rJjXjMQ0p.png =50%x)
----
![image](https://hackmd.io/_uploads/SygaZBfm0p.png)
----
### Preparing to Decompress the Kernel
#### 5. Jump to the relocated address of the `.text` section
----
Jump to the relocated address
```x86asm
leaq relocated(%rbx), %rax
jmp *%rax
...
.text
relocated:
/* decompress the kernel */
```
----
### The final touches before kernel decompression
----
### The final touches before kernel decompression
#### 1. Initialize the `.bss` section
----
Clear BSS (stack is currently empty)
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/head_64.S#L510)
```x86asm
.text
relocated:
xorl %eax, %eax
leaq _bss(%rip), %rdi
leaq _ebss(%rip), %rcx
subq %rdi, %rcx
shrq $3, %rcx
rep stosq
```
----
### The final touches before kernel decompression
#### 2. Call the `extract_kernel` function
----
```x86asm
pushq %rsi
movq %rsi, %rdi
leaq boot_heap(%rip), %rsi
leaq input_data(%rip), %rdx
movl $z_input_len, %ecx
movq %rbp, %r8
movq $z_output_len, %r9
call extract_kernel
popq %rsi
```
[arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/misc.c#L339)
```clike
asmlinkage __visible void *extract_kernel(
void *rmode, memptr heap,
unsigned char *input_data,
unsigned long input_len,
unsigned char *output,
unsigned long output_len)
```
- All arguments are passed in registers (`rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9`, in that order), as per the [System V ABI](https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf); compare with the `mov`/`lea` sequence above.
----
Arguments of `extract_kernel`
1. `rmode`: a pointer to the `boot_params` structure
1. `heap`: a pointer to boot_heap
1. `input_data`*: a pointer to the start of the compressed kernel
1. `input_len`*: the size of the compressed kernel
1. `output`: the start address of the decompressed kernel
1. `output_len`*: the size of the decompressed kernel
\*generated by [arch/x86/boot/compressed/mkpiggy.c](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/mkpiggy.c)
----
### Kernel decompression
----
### Kernel decompression
#### 1. Video/Console initialization
note: The `extract_kernel` function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know if we started in real mode or if a bootloader was used, or whether the bootloader used the `32` or `64-bit` boot protocol.
----
### Kernel decompression
#### 2. Initialize heap pointers
----
Initialize heap pointers
[arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/misc.c#L339)
```clike
free_mem_ptr = heap;
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
```
`heap` is the second parameter of the `extract_kernel` function.
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/head_64.S)
```x86asm
leaq boot_heap(%rip), %rsi
...
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
```
----
### Kernel decompression
#### 3. Call the `choose_random_location` function
----
```clike
choose_random_location(
(unsigned long)input_data, input_len,
(unsigned long *)&output,
max(output_len, kernel_total_size),
&virt_addr);
```
defined in [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c#L842)
- choose a memory location to write the decompressed kernel to
- if `kASLR` is enabled
- security reasons
note: The `choose_random_location` function chooses a memory location to write the decompressed kernel to. The Linux kernel supports `kASLR`, which allows decompressing the kernel to a random address for security reasons. If `kASLR` isn't enabled, we just use the `output` parameter that we passed into `extract_kernel` as the start address of the decompressed kernel.
----
### Kernel decompression
#### 4. Check that the random address is correctly aligned
note: After getting an address for the kernel image, we need to check that the random address is correctly aligned and, in general, not wrong.
----
### Kernel decompression
#### 5. Call the `__decompress` function
note: Now, we call the `__decompress` function to decompress the kernel.
----
[arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/misc.c#L417)
```clike
debug_putstr("\nDecompressing Linux... ");
__decompress(input_data, input_len, NULL,
NULL, output, output_len, NULL, error);
```
- The implementation of the `__decompress` function depends on which decompression algorithm was chosen during kernel compilation
----
### Kernel decompression
#### 6. Call the `parse_elf` function
note: Now that the kernel has been decompressed, two more functions are called: `parse_elf` and `handle_relocations`. The main point of these functions is to move the decompressed kernel image to its correct place in memory, because decompression is done in place and the kernel still needs to be moved to the correct address.
----
Call the `parse_elf` function
- Move loadable segments to the `output` address we got from the `choose_random_location`
1. Check the ELF signature
2. If valid, go through all the program headers and copy every loadable segment to the output buffer at its correctly aligned (2 MB) address
note: As we already know, the kernel image is an ELF executable. The main goal of the `parse_elf` function is to move loadable segments to the correct address, which is the `output` address we got from the `choose_random_location` function.
----
### Kernel decompression
#### 7. Call the `handle_relocations` function
note: The next step after the `parse_elf` function is the call to the `handle_relocations` function. The implementation of this function depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option: if it is enabled, this function adjusts addresses in the kernel image. It is also only called if the `CONFIG_RANDOMIZE_BASE` configuration option was enabled during kernel configuration.
----
Call the `handle_relocations` function
- if `CONFIG_X86_NEED_RELOCS` is enabled
- update the relocation table which is at the end of the kernel image
```clike
static void handle_relocations(void *output,
unsigned long output_len,
unsigned long virt_addr)
{
...
delta = min_addr - LOAD_PHYSICAL_ADDR;
...
for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
...
*(uint32_t *) ptr += delta;
}
...
}
```
note: This function subtracts the value of LOAD_PHYSICAL_ADDR from the value of the base load address of the kernel and thus we obtain the difference between where the kernel was linked to load and where it was actually loaded. After this we can relocate the kernel since we know the actual address where the kernel was loaded, the address where it was linked to run and the relocation table which is at the end of the kernel image.
----
### Kernel decompression
#### 8. Return from `extract_kernel` and jump to kernel
----
[arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/misc.c#L422)
```clike
... void *extract_kernel(...)
{
...
return output;
}
```
[arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.17/arch/x86/boot/compressed/head_64.S#L538)
```x86asm
relocated:
call extract_kernel
...
jmp *%rax
```
note: The address of the kernel will be in the `rax` register and we jump to it. That's all. Now we are in the kernel!
----
### Takeaway Question
* Why is the `parse_elf` function called during the Linux kernel decompression process?
(A). To parse the compressed kernel image.
(B). To move loadable segments to the correct address.
\(C). To handle kernel relocations.
---
### Kernel load address randomization
#### 王禹博、高鈺鴻
----
### Introduction
Why randomization?
- For security: it hinders exploitation of memory corruption vulnerabilities
Enabled by:
`CONFIG_RANDOMIZE_BASE`
note: Although there is a predefined load address determined by a kernel configuration for the entry point of the Linux kernel, this load address can also be configured to be a random value. The reason why we randomize the load address of the kernel is for security purposes. The randomization can help to prevent exploitations of memory corruption vulnerabilities. By doing so, it becomes more difficult for an attacker to predict the memory addresses of specific functions or data. So, if we want to randomize the load address of the kernel, a special kernel configuration option should be enabled, which is shown here.
----
### Page Table Initialization
note: So now we know why the kernel decompressor looks for a random memory range to decompress and load the kernel. However, before the kernel decompression, the **page tables** should be initialized.
----
If the bootloader booted with the 16/32-bit boot protocol
→ We already have page tables
If the kernel decompressor selects a memory range which is valid only in a 64-bit context
→ We must build new identity mapped page tables
note: In fact, the page tables were initialized before the transition to 64-bit mode. So, if the bootloader used the 16-bit or 32-bit boot protocol, we already have page tables. The reason we have to initialize page tables again here is that the kernel decompressor may select a memory range which is only valid in 64-bit mode. Therefore, we need to build new **identity mapped page tables**.
----
Called by the `extract_kernel` function in `arch/x86/boot/compressed/misc.c`
```clike=
void choose_random_location(unsigned long input,
unsigned long input_size,
unsigned long *output,
unsigned long output_size,
unsigned long *virt_addr)
```
1. Page table initialization
2. Avoid using reserved memory ranges
3. Physical address randomization
4. Virtual address randomization
note: The randomization begins with the call to this **choose_random_location** function. The first parameter, **input**, is a pointer to the compressed kernel image, while the second parameter is the size of the compressed kernel. The third and fourth parameters are the address of the decompressed kernel image and its length. The last parameter is the virtual address of the kernel load address. In general, this function handles everything related to the randomization of the load address: page table initialization, avoiding reserved memory ranges, physical address randomization, and virtual address randomization.
----
* Check the nokaslr option
```cpp=
if (cmdline_find_option_bool("nokaslr")) {
warn("KASLR disabled: 'nokaslr' on cmdline.");
return;
}
```
note: The **choose_random_location** function starts by checking the kernel command line option shown here. If this option is set, the function returns and the kernel load address remains unrandomized.
----
```clike=
initialize_identity_maps() // Called by choose_random_location function
struct x86_mapping_info {
void *(*alloc_pgt_page)(void *);
void *context;
unsigned long page_flag;
unsigned long offset;
bool direct_gbpages;
unsigned long kernpg_flag;
};
```
* Build new identity mapped page tables
* Initialize the memory for a new page table entry
note: After checking the previous option, the first job of the **choose_random_location** function is to initialize the page tables, and it first calls this **initialize_identity_maps** function. This function initializes the structure shown here, which provides information about the memory mapping. The first field of this structure, a callback function, checks whether there is enough memory space for a new page table and allocates it, while the second field, **context**, is used to track the allocated page tables. Of the remaining fields, two are page flags, the boolean value indicates whether huge pages are supported, and the **offset** field is the offset between the kernel's virtual and physical addresses.
----
### Avoiding Reserved Memory Ranges
note: After the page tables are initialized, the next step is to choose a random memory location to extract the kernel image to. However, certain memory regions are already used by other things, such as the kernel command line, and these regions must not be chosen. Therefore, there are mechanisms to avoid them.
----
```cpp=
mem_avoid_init(input, input_size, *output);
```
```cpp=
// Unsafe memory regions will be collected in an array
struct mem_vector {
unsigned long long start;
unsigned long long size;
};
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
```
* Store information of the reserved memory regions into an array
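Candidate slots are later tested against every entry of this array; the kernel's overlap check looks essentially like this (sketch based on `mem_overlaps` in kaslr.c):
```clike
#include <stdbool.h>

/* Redeclared here so the sketch is self-contained. */
struct mem_vector {
        unsigned long long start;
        unsigned long long size;
};

/* Two half-open ranges [start, start+size) overlap unless one
 * ends before the other begins. */
static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
{
        if (one->start + one->size <= two->start)
                return false;
        if (one->start >= two->start + two->size)
                return false;
        return true;
}
```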
note: To address this issue, there is a **mem_avoid_init** function. It is called after the page table initialization, and its main goal is to store information about reserved memory regions, with descriptors given by an enumeration shown on the next slide. All information related to each memory region, such as its start address and size, is stored in an array.
----
```clike=
enum mem_avoid_index {
MEM_AVOID_ZO_RANGE = 0,
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
MEM_AVOID_MAX,
};
```
```cpp=
void add_identity_map(unsigned long start, unsigned long size)
```
* Build the identity mapped page tables
note: There are many different types of reserved memory regions, but the **mem_avoid_init** function does the same thing for every element of this enumeration. After storing the information, it also calls the **add_identity_map** function to build identity-mapped pages for each reserved memory region. The parameters of this function are the start address and the size of the memory region, both of which were stored in the array previously.
----
### Physical address randomization
----
```clike=
min_addr = min(*output, 512UL << 20);
```
```clike=
random_addr = find_random_phys_addr(min_addr, output_size);
```
```clike=
static unsigned long find_random_phys_addr(unsigned long minimum,
unsigned long image_size)
{
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
if (process_efi_entries(minimum, image_size))
return slots_fetch_random();
process_e820_entries(minimum, image_size);
return slots_fetch_random();
}
```
* Get a random physical address to decompress the kernel to
note:
1. The address should be within the first 512 megabytes. This limit was selected to avoid unknown things in lower memory.
2. The primary objective of the process_efi_entries function is to find all suitable memory ranges in fully accessible memory to load the kernel. If the kernel is compiled and run on a system without EFI support, we continue to search for such memory regions in the e820 regions.
3. The slots_fetch_random function selects a random memory range from the slot_areas array.
----
### Virtual address randomization
----
```clike=
random_addr = find_random_phys_addr(min_addr, output_size);
if (*output != random_addr) {
add_identity_map(random_addr, output_size);
*output = random_addr;
}
```
* Generate identity mapped pages
note: After selecting the random physical address, it will first generate identity mapped pages for the region.
----
```clike=
if (IS_ENABLED(CONFIG_X86_64))
random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
*virt_addr = random_addr;
```
* Get the virtual address for the decompressed kernel
note: After randomizing the physical address, we can also randomize the virtual address on the x86_64 architecture. The find_random_virt_addr function calculates the number of virtual memory ranges that can hold the kernel image and picks one at random. Now both the base physical address (*output) and the virtual address (*virt_addr) of the decompressed kernel have been randomized.
----
### Takeaway Question
* What is the reason to randomize the kernel load address?
(A). Save memory space.
(B). For security purposes.
\(C). Speed up the kernel decompression.
{"description":"內容…","contributors":"[{\"id\":\"d0a95804-f8e3-4791-a4c7-01ee95ed51b5\",\"add\":4740,\"del\":784},{\"id\":\"6458897a-3219-408b-abd3-1a4d48ac3ccb\",\"add\":7783,\"del\":3086},{\"id\":\"f733f481-9a4d-407d-a5cf-1fded18ef487\",\"add\":16707,\"del\":4060},{\"id\":\"ba0bbb2e-ab00-45ba-9c19-d5b404326b78\",\"add\":9068,\"del\":1944},{\"id\":\"2670ce92-e094-4d6b-a795-539b86bf66cd\",\"add\":2639,\"del\":622},{\"id\":\"2bc85ef8-bf00-4b4d-904a-cac1b0df4079\",\"add\":4115,\"del\":423}]","title":"Booting 4 - 6"}