Group2
head_32.S and head_64.S
head_64.S is relevant for the x86_64 architecture.
vmlinux-objs-y selects the appropriate file (head_32.o or head_64.o) based on $(BITS).
$(BITS) is determined in arch/x86/Makefile based on the kernel configuration (CONFIG_X86_32).

vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
	$(obj)/string.o $(obj)/cmdline.o \
	$(obj)/piggy.o $(obj)/cpuflags.o
.head.text and .code32
The KEEP_SEGMENTS flag in the kernel setup header determines whether the segment registers need to be reloaded.
arch/x86/boot/compressed/head_64.S
movl $boot_stack_end, %eax
addl %ebp, %eax
movl %eax, %esp
ebp: the real address of the startup_32 label
eax: the address of boot_stack_end as it was linked at 0x0
sum of ebp and eax → the real address of boot_stack_end
arch/x86/boot/compressed/head_64.S
.bss
.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
call verify_cpu
testl %eax, %eax
jnz no_longmode

verify_cpu: returns 0 if the CPU supports long mode
no_longmode: halt

movl $0x1,%eax			# Does the cpu have what it takes
cpuid
andl $REQUIRED_MASK0,%edx
xorl $REQUIRED_MASK0,%edx
jnz .Lverify_cpu_no_longmode

movl $0x80000000,%eax		# See if extended cpuid is implemented
cpuid
cmpl $0x80000001,%eax
jb .Lverify_cpu_no_longmode	# no extended cpuid

movl $0x80000001,%eax		# Does the cpu have what it takes
cpuid
andl $REQUIRED_MASK1,%edx
xorl $REQUIRED_MASK1,%edx
jnz .Lverify_cpu_no_longmode

movl $1,%eax
cpuid
andl $SSE_MASK,%edx
cmpl $SSE_MASK,%edx
je .Lverify_cpu_sse_ok
test %di,%di
jz .Lverify_cpu_no_longmode	# only try to force SSE on AMD
movl $MSR_K7_HWCR,%ecx
rdmsr
btr $15,%eax			# enable SSE
wrmsr
xor %di,%di			# don't loop
jmp .Lverify_cpu_sse_test	# try again
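The core of the long-mode check can be reproduced from user space with the same CPUID leaves. This is only an illustrative sketch using GCC's <cpuid.h> helper, not the kernel's verify_cpu code, which additionally checks PAE, SSE and the other required feature masks:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Is the extended CPUID leaf 0x80000001 implemented at all? */
	if (!__get_cpuid(0x80000000, &eax, &ebx, &ecx, &edx) || eax < 0x80000001) {
		puts("no extended cpuid -> no long mode");
		return 1;
	}

	/* CPUID.80000001h:EDX bit 29 is the Long Mode (LM) flag. */
	__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
	printf("long mode %ssupported\n", (edx & (1u << 29)) ? "" : "not ");
	return 0;
}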
CONFIG_PHYSICAL_START: the default base address of the Linux kernel
The value of CONFIG_PHYSICAL_START is 0x1000000 (16 MB)
A rescue kernel for kdump, which is configured to load from a different address → CONFIG_RELOCATABLE=y
This builds a kernel image that retains relocation
information so it can be loaded someplace besides the
default 1MB.
Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the
address it has been loaded at and the compile time physical
address (CONFIG_PHYSICAL_START) is used as the minimum location.
A special section attribute is set in arch/x86/boot/compressed/head_64.S before startup_32

	__HEAD
	.code32
ENTRY(startup_32)

#define __HEAD .section ".head.text","ax"

__HEAD is a macro defined in include/linux/init.h
.head.text is the name of the section; the "ax" flags indicate that the following code contains executable instructions
a: this section is allocatable
x: this section can be executed by the CPU
→ Compile the decompressor as position independent code (PIC)
arch/x86/boot/compressed/Makefile
KBUILD_CFLAGS += -fno-strict-aliasing -fPIC
An address is obtained by adding the address field of the instruction to the value of the program counter
→ The relocation address depends on CONFIG_RELOCATABLE
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jge 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
LOAD_PHYSICAL_ADDR macro
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
			     + (CONFIG_PHYSICAL_ALIGN - 1)) \
			    & ~(CONFIG_PHYSICAL_ALIGN - 1))
1:
	movl BP_init_size(%esi), %eax
	subl $_end, %eax
	addl %eax, %ebx
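In C terms, the computation above looks roughly like the sketch below. kernel_alignment and init_size come from the boot_params setup header, and end_of_decompressor stands for the linker symbol _end; the helper names and the example LOAD_PHYSICAL_ADDR value are illustrative, not kernel code:

#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000u   /* example only: 16 MB, a common default */

/* Round addr up to the next multiple of align (align is a power of two). */
static uint32_t align_up(uint32_t addr, uint32_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

static uint32_t relocation_target(uint32_t ebp, uint32_t kernel_alignment,
				  uint32_t init_size, uint32_t end_of_decompressor)
{
	uint32_t ebx = align_up(ebp, kernel_alignment);  /* CONFIG_RELOCATABLE path */

	if (ebx < LOAD_PHYSICAL_ADDR)                    /* never below the minimum */
		ebx = LOAD_PHYSICAL_ADDR;

	return ebx + (init_size - end_of_decompressor);  /* room for in-place decompression */
}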
BP_init_size: the larger of the compressed and uncompressed vmlinux sizes

addl %ebp, gdt+2(%ebp)
lgdt gdt(%ebp)
adjust the base address of the Global Descriptor Table to the address where we actually loaded the kernel
load the Global Descriptor Table with the lgdt instruction
.data
...
gdt:
	.word gdt_end - gdt
	.long gdt
	.word 0
	.quad 0x00cf9a000000ffff	/* __KERNEL32_CS */
	.quad 0x00af9a000000ffff	/* __KERNEL_CS */
	.quad 0x00cf92000000ffff	/* __KERNEL_DS */
	.quad 0x0080890000000000	/* TS descriptor */
	.quad 0x0000000000000000	/* TS continued */
gdt_end:
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
movl %eax, %cr4
→ Put the value of the cr4 register into eax, set the 5th bit (PAE) and load it back into cr4
8 new general purpose registers from r8 to r15
All general purpose registers are 64-bit now
A 64-bit instruction pointer - RIP
64-Bit Addresses and Operands
RIP Relative Addressing
Long mode consists of two sub-modes: 64-bit mode and compatibility mode.
Enable PAE
Build page tables and load the address of the top level page table into the cr3 register
Enable EFER.LME
Enable paging
4-level paging, and we generally build 6 page tables:
One PML4 (Page Map Level 4) table with one entry
One PDP (Page Directory Pointer) table with four entries
Four Page Directory tables with a total of 2048 entries (2048 × 2 MB = 4 GB of identity-mapped memory)
leal pgtable(%ebx), %edi
xorl %eax, %eax
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl
pgtable is defined here:
.section ".pgtable","a",@nobits
.balign 4096
pgtable:
.fill BOOT_PGT_SIZE, 1, 0
BOOT_PGT_SIZE depends on the CONFIG_X86_VERBOSE_BOOTUP kernel configuration option:
# ifdef CONFIG_RANDOMIZE_BASE
#  ifdef CONFIG_X86_VERBOSE_BOOTUP
#   define BOOT_PGT_SIZE (19*4096)
#  else /* !CONFIG_X86_VERBOSE_BOOTUP */
#   define BOOT_PGT_SIZE (17*4096)
#  endif
# else /* !CONFIG_RANDOMIZE_BASE */
#  define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
# endif
Build the top level page table - PML4 -
leal pgtable + 0(%ebx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
Build 4 Page Directory entries in the Page Directory Pointer table:
leal pgtable + 0x1000(%ebx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
addl $0x00001000, %eax
addl $8, %edi
decl %ecx
jnz 1b
Build 2048 page table entries with 2-MByte pages:
leal pgtable + 0x2000(%ebx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
addl $0x00200000, %eax
addl $8, %edi
decl %ecx
jnz 1b
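Put in C, the three loops above fill a 24 KiB pgtable buffer roughly as follows. This is an illustrative sketch that mirrors, rather than reproduces, the assembly, assuming the buffer is page-aligned and later loaded into cr3:

#include <stdint.h>
#include <string.h>

#define PTRS_PER_TABLE 512

/* 6 page tables: 1 PML4 + 1 PDPT + 4 PDs, 4 KiB each (24 KiB total). */
static uint64_t pgtable[6 * PTRS_PER_TABLE] __attribute__((aligned(4096)));

static void build_identity_map(void)
{
	uint64_t base = (uint64_t)(uintptr_t)pgtable;
	uint64_t *pml4 = pgtable;                      /* offset 0x0000 */
	uint64_t *pdpt = pgtable + PTRS_PER_TABLE;     /* offset 0x1000 */
	uint64_t *pd   = pgtable + 2 * PTRS_PER_TABLE; /* offset 0x2000 */
	int i;

	memset(pgtable, 0, sizeof(pgtable));

	/* One PML4 entry pointing at the PDPT (flags 0x7: present, writable, user). */
	pml4[0] = (base + 0x1000) | 0x7;

	/* Four PDPT entries, one per Page Directory. */
	for (i = 0; i < 4; i++)
		pdpt[i] = (base + 0x2000 + i * 0x1000) | 0x7;

	/* 2048 PD entries of 2 MB each: 2048 * 2 MB = 4 GB identity-mapped. */
	for (i = 0; i < 2048; i++)
		pd[i] = ((uint64_t)i * 0x200000) | 0x183;  /* present, writable, 2M page */
}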
Put the address of the top level page table - PML4 - into the cr3 control register:
leal pgtable(%ebx), %eax
movl %eax, %cr3
Set the EFER.LME flag in the MSR at 0xC0000080:
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
wrmsr
Push __KERNEL_CS onto the stack:
pushl $__KERNEL_CS
Put the address of the startup_64 routine in eax:
leal startup_64(%ebp), %eax
Push eax onto the stack and enable paging:
pushl %eax
movl $(X86_CR0_PG | X86_CR0_PE), %eax
movl %eax, %cr0
Jump to startup_64 with lret:
lret
.code64
.org 0x200
ENTRY(startup_64)
....
....
....
What is the significance of the KEEP_SEGMENTS flag in the Linux boot protocol, and what is its impact on segment register initialization during the boot process?
(a) It determines the boot protocol version
(b) It controls the loading of segment registers during boot
(c) It sets up heap memory allocation
startup_32 → startup_64
arch/x86/boot/compressed/head_64.S
pushl $__KERNEL_CS
leal startup_64(%ebp), %eax
...
...
...
pushl %eax
...
...
...
lret
ebp: the physical address of startup_32.
Why again? (the GDT and segment registers were already set up before entering 64-bit mode).
arch/x86/boot/compressed/head_64.S
.code64
.org 0x200
ENTRY(startup_64)
xorl %eax, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %ss
movl %eax, %fs
movl %eax, %gs
All segment registers besides the cs
register are now reset in long mode
Why again?
The bootloader may use the 64-bit boot protocol now, and startup_32 is then never executed

#ifdef CONFIG_RELOCATABLE
leaq startup_32(%rip), %rbp
movl BP_kernel_alignment(%rsi), %eax
decl %eax
addq %rax, %rbp
notq %rax
andq %rax, %rbp
cmpq $LOAD_PHYSICAL_ADDR, %rbp
jge 1f
#endif
movq $LOAD_PHYSICAL_ADDR, %rbp
1:
movl BP_init_size(%rsi), %ebx
subl $_end, %ebx
addq %rbp, %rbx
rsi: pointer to the boot_params table
rbp: the decompressed kernel's start address
rbx: the address where the kernel code will be relocated to for decompression
sp, flags register and GDT
Set up the stack pointer
leaq boot_stack_end(%rbx), %rsp
...
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
Set up the Global Descriptor Table
leaq gdt(%rip), %rax
movq %rax, gdt64+2(%rip)
lgdt gdt64(%rip)
...
.data
gdt64:
.word gdt_end - gdt
.long 0
.word 0
.quad 0
gdt:
.word gdt_end - gdt
.long gdt
.word 0
.quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
.quad 0x0080890000000000 /* TS descriptor */
.quad 0x0000000000000000 /* TS continued */
gdt_end:
Zero EFLAGS
pushq $0
popfq
Copy the compressed kernel to the end of our buffer where decompression in place becomes safe
pushq %rsi
leaq (_bss-8)(%rip), %rsi
leaq (_bss-8)(%rbx), %rdi
movq $_bss, %rcx
shrq $3, %rcx
std
rep movsq
cld
popq %rsi
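The std / rep movsq sequence is simply a backwards memory copy: starting at the last qword and walking toward the first is what makes an in-place move to an overlapping, higher destination safe. A small C illustration of the idea (not the kernel's code):

#include <stddef.h>
#include <stdint.h>

/* Copy n 8-byte words from src to dst, walking backwards.
 * Safe even when dst overlaps src, as long as dst >= src. */
static void copy_backwards(uint64_t *dst, const uint64_t *src, size_t n)
{
	while (n--)
		dst[n] = src[n];   /* last word first, like std + rep movsq */
}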
.text section
Jump to the relocated address
leaq relocated(%rbx), %rax
jmp *%rax
...
.text
relocated:
/* decompress the kernel */
.bss section
Clear BSS (stack is currently empty)
arch/x86/boot/compressed/head_64.S
.text
relocated:
xorl %eax, %eax
leaq _bss(%rip), %rdi
leaq _ebss(%rip), %rcx
subq %rdi, %rcx
shrq $3, %rcx
rep stosq
Call the extract_kernel function
pushq %rsi
movq %rsi, %rdi
leaq boot_heap(%rip), %rsi
leaq input_data(%rip), %rdx
movl $z_input_len, %ecx
movq %rbp, %r8
movq $z_output_len, %r9
call extract_kernel
popq %rsi
arch/x86/boot/compressed/misc.c
asmlinkage __visible void *extract_kernel(
void *rmode, memptr heap,
unsigned char *input_data,
unsigned long input_len,
unsigned char *output,
unsigned long output_len)
Arguments of extract_kernel
rmode: a pointer to the boot_params structure
heap: a pointer to boot_heap
input_data*: a pointer to the start of the compressed kernel
input_len*: the size of the compressed kernel
output: the start address of the decompressed kernel
output_len*: the size of the decompressed kernel
* generated by arch/x86/boot/compressed/mkpiggy.c
Initialize heap pointers
arch/x86/boot/compressed/misc.c
free_mem_ptr = heap;
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
heap
is the second parameter of the extract_kernel
function.
arch/x86/boot/compressed/head_64.S
leaq boot_heap(%rip), %rsi
...
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
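free_mem_ptr and free_mem_end_ptr simply delimit the region the decompression code may hand out memory from; conceptually it is a bump allocator with no free(). A minimal sketch of that idea (the decompressor's real allocator works on these same pointers but differs in detail):

#include <stddef.h>

static unsigned long free_mem_ptr;      /* set to boot_heap by extract_kernel */
static unsigned long free_mem_end_ptr;  /* boot_heap + BOOT_HEAP_SIZE */

/* Hand out size bytes from the boot heap. */
static void *bump_alloc(size_t size)
{
	unsigned long p = (free_mem_ptr + 3) & ~3UL;  /* 4-byte align */

	if (p + size > free_mem_end_ptr)
		return NULL;                          /* out of boot heap */

	free_mem_ptr = p + size;
	return (void *)p;
}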
Call the choose_random_location function
choose_random_location(
(unsigned long)input_data, input_len,
(unsigned long *)&output,
max(output_len, kernel_total_size),
&virt_addr);
defined in arch/x86/boot/compressed/kaslr.c
Randomization happens only if kASLR is enabled
Call the __decompress function
arch/x86/boot/compressed/misc.c
debug_putstr("\nDecompressing Linux... ");
__decompress(input_data, input_len, NULL,
NULL, output, output_len, NULL, error);
The __decompress function depends on what decompression algorithm was chosen during kernel compilation
parse_elf function
Call the parse_elf function
It moves the kernel to the output address we got from the choose_random_location function
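In outline, parse_elf validates the ELF header of the just-decompressed image and then moves each loadable segment to its place relative to the (possibly randomized) output address. A simplified sketch, 64-bit only, with error handling, the non-relocatable case and the example constants being assumptions rather than kernel code:

#include <elf.h>
#include <string.h>

#define LOAD_PHYSICAL_ADDR 0x1000000UL  /* example only; the real value is a config option */
#define MAX_PHDRS 64                    /* illustrative bound for the sketch */

static void parse_elf_sketch(unsigned char *output)
{
	Elf64_Ehdr ehdr;
	Elf64_Phdr phdrs[MAX_PHDRS];
	int i;

	memcpy(&ehdr, output, sizeof(ehdr));
	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 || ehdr.e_phnum > MAX_PHDRS)
		return;                          /* not an ELF image we can handle */

	/* Copy the program headers first: the memmoves below may overwrite them. */
	memcpy(phdrs, output + ehdr.e_phoff, ehdr.e_phnum * sizeof(Elf64_Phdr));

	for (i = 0; i < ehdr.e_phnum; i++) {
		if (phdrs[i].p_type != PT_LOAD)
			continue;

		/* Place the segment at its paddr, rebased onto output. */
		memmove(output + (phdrs[i].p_paddr - LOAD_PHYSICAL_ADDR),
			output + phdrs[i].p_offset,
			phdrs[i].p_filesz);
	}
}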
handle_relocations function
Call the handle_relocations function
It only does something if CONFIG_X86_NEED_RELOCS is enabled
static void handle_relocations(void *output,
unsigned long output_len,
unsigned long virt_addr)
{
...
delta = min_addr - LOAD_PHYSICAL_ADDR
...
for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
...
*(uint32_t *) ptr += delta;
}
...
}
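Conceptually, the build appends a 0-terminated table of link-time addresses to the kernel image, and each referenced 32-bit slot is patched by delta, the difference between the address the kernel was linked for and where it actually ended up. The sketch below only illustrates that idea; the names and the address translation are assumptions, not the kernel's handle_relocations:

#include <stdint.h>

static void apply_relocations_sketch(const int32_t *table, uint8_t *image,
				     uint32_t link_base, uint32_t delta)
{
	/* table points at the last entry of the 0-terminated relocation table,
	 * which lives at the end of the image; walk it backwards. */
	for (const int32_t *reloc = table; *reloc; reloc--) {
		/* Translate the recorded link-time address into the loaded image. */
		uint32_t *slot = (uint32_t *)(image + ((uint32_t)*reloc - link_base));
		*slot += delta;
	}
}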
Return from extract_kernel and jump to the kernel
arch/x86/boot/compressed/misc.c
...
void *extract_kernel(...)
{
...
return output;
}
arch/x86/boot/compressed/head_64.S
relocated:
call extract_kernel
...
jmp *%rax
Why is the parse_elf function called during the Linux kernel decompression process?
Why randomization?
Security reasons - to make exploitation of memory corruption vulnerabilities harder
Enabled by:
CONFIG_RANDOMIZE_BASE
If the bootloader booted with the 16/32-bit boot protocol
→ We already have page tables
If the kernel decompressor selects a memory range which is valid only in a 64-bit context
→ Build new identity mapped page tables
Called by extract_kernel
function from arch/x86/boot/compressed/misc.c
void choose_random_location(unsigned long input,
			    unsigned long input_size,
			    unsigned long *output,
			    unsigned long output_size,
			    unsigned long *virt_addr)
if (cmdline_find_option_bool("nokaslr")) {
	warn("KASLR disabled: 'nokaslr' on cmdline.");
	return;
}
initialize_identity_maps()  // Called by the choose_random_location function

struct x86_mapping_info {
	void *(*alloc_pgt_page)(void *);
	void *context;
	unsigned long page_flag;
	unsigned long offset;
	bool direct_gbpages;
	unsigned long kernpg_flag;
};
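The alloc_pgt_page callback is how the generic identity-mapping code asks for a fresh, zeroed page-table page. A hedged sketch of such a callback handing out pages from a static pool; the kernel's version keeps its pool in a small structure reached through the context pointer, and all names below are illustrative:

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define PGT_POOL_PAGES 32

/* Static pool the callback carves page-table pages out of. */
static unsigned char pgt_pool[PGT_POOL_PAGES * PAGE_SIZE]
	__attribute__((aligned(PAGE_SIZE)));
static size_t pgt_pool_used;

/* Matches the alloc_pgt_page slot in struct x86_mapping_info:
 * return one zeroed, page-aligned page or NULL when the pool is empty. */
static void *alloc_pgt_page_sketch(void *context)
{
	void *page;

	(void)context;  /* a real implementation would keep the pool here */

	if (pgt_pool_used >= PGT_POOL_PAGES)
		return NULL;

	page = pgt_pool + pgt_pool_used * PAGE_SIZE;
	pgt_pool_used++;
	memset(page, 0, PAGE_SIZE);
	return page;
}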
mem_avoid_init(input, input_size, *output);
// Unsafe memory regions will be collected in an array
struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

static struct mem_vector mem_avoid[MEM_AVOID_MAX];
enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};
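The mem_avoid[] entries are later used to reject candidate load addresses that would overlap something important: the decompressor itself, the initrd, the command line, boot_params, and user-supplied memmap regions. The test itself is plain interval arithmetic; a sketch that mirrors the check the kernel performs:

#include <stdbool.h>

struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

/* Two half-open ranges [start, start + size) overlap unless one ends
 * before the other begins. */
static bool ranges_overlap(const struct mem_vector *a, const struct mem_vector *b)
{
	if (a->start + a->size <= b->start)
		return false;
	if (b->start + b->size <= a->start)
		return false;
	return true;
}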
void add_identity_map(unsigned long start, unsigned long size)
min_addr = min(*output, 512UL << 20);
random_addr = find_random_phys_addr(min_addr, output_size);
static unsigned long find_random_phys_addr(unsigned long minimum,
					   unsigned long image_size)
{
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

	if (process_efi_entries(minimum, image_size))
		return slots_fetch_random();

	process_e820_entries(minimum, image_size);
	return slots_fetch_random();
}
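Each suitable memory region is cut into CONFIG_PHYSICAL_ALIGN-sized slots large enough to hold the image, and one slot is then drawn at random (the kernel's kaslr_get_random_long mixes sources such as RDRAND, RDTSC and the i8254 timer). A simplified sketch of the slot arithmetic for a single region; the real code accumulates slots across all e820/EFI regions before drawing the random number:

/* How many aligned candidate positions for an image of image_size bytes
 * fit in [region_start, region_start + region_size)? */
static unsigned long count_slots(unsigned long region_start,
				 unsigned long region_size,
				 unsigned long image_size,
				 unsigned long align)
{
	unsigned long first = (region_start + align - 1) & ~(align - 1);
	unsigned long last_ok;

	if (region_size < image_size ||
	    first + image_size > region_start + region_size)
		return 0;

	last_ok = region_start + region_size - image_size;
	return (last_ok - first) / align + 1;
}

/* Address of the n-th slot (n comes from the random source). */
static unsigned long slot_address(unsigned long region_start,
				  unsigned long align, unsigned long n)
{
	unsigned long first = (region_start + align - 1) & ~(align - 1);

	return first + n * align;
}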
random_addr = find_random_phys_addr(min_addr, output_size);

if (*output != random_addr) {
	add_identity_map(random_addr, output_size);
	*output = random_addr;
}
if (IS_ENABLED(CONFIG_X86_64))
	random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
*virt_addr = random_addr;