Memory Addressing in x64 Linux

contributed by < Tao Chiu >

tags: linux x64 Memory Addressing

IA-32e System Overview

x86 Long Mode (IA-32e)

  • Extends the general-purpose registers to 64 bits and adds 8 more registers each for integer (R8-R15) and SSE (XMM8-XMM15) use.
  • Does not support hardware task switching.
  • The size of the physical address range is implementation-specific and is indicated by CPUID.80000008H:EAX[bits 7-0] (3.3.1).
  • A logical processor is in IA-32e mode whenever CR0.PG = 1 and IA32_EFER.LME = 1. This fact is reported in IA32_EFER.LMA[bit 10]. Software cannot set this bit directly; it is always the logical-AND of CR0.PG and IA32_EFER.LME.
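The two facts above (the address width reported by CPUID and the way IA32_EFER.LMA is derived) can be sketched in C. This is a minimal sketch that works on raw register values passed in as integers rather than executing CPUID or RDMSR; the function names are hypothetical, and the linear-address width in EAX[15:8] is an addition not mentioned in the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG   (1u << 31)   /* CR0.PG, bit 31 */
#define EFER_LME (1ull << 8)  /* IA32_EFER.LME, bit 8 */

/* CPUID.80000008H:EAX[7:0]: number of physical-address bits. */
unsigned phys_addr_bits(uint32_t eax)
{
    return eax & 0xFFu;
}

/* CPUID.80000008H:EAX[15:8]: number of linear-address bits. */
unsigned linear_addr_bits(uint32_t eax)
{
    return (eax >> 8) & 0xFFu;
}

/* IA32_EFER.LMA is read-only: it is the logical AND of CR0.PG and LME. */
bool ia32e_mode_active(uint32_t cr0, uint64_t efer)
{
    return (cr0 & CR0_PG) != 0 && (efer & EFER_LME) != 0;
}
```

For example, an assumed EAX value of 0x3028 decodes to 40 physical and 48 linear address bits, which matches the QEMU CPU listed in the test environment later in this note.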

IA-32e Segmentation

  • The processor treats the segment base of CS, DS, ES, SS as zero, creating a linear address that is equal to the effective address.
  • Only FS and GS still contribute a base to memory addressing. These segment registers (which hold the segment base) can be used as additional base registers in linear address calculations.
  • The processor does not perform segment limit checks at runtime in 64-bit mode (3.2.4).
  • In 64-bit mode, memory accesses using FS-segment and GS-segment overrides are not checked for a runtime limit nor subjected to attribute-checking. Normal segment loads into FS and GS load a standard 32-bit base value in the hidden portion of the segment register. The base address bits above the standard 32 bits are cleared to 0 to allow consistency for implementations that use less than 64 bits (3.4.4).
  • There are several ways to set the FS/GS base, mentioned here.

Segment Selectors and Descriptors Table

  • Segment selector:
    Every segment register has a “visible” part and a “hidden” part. (The hidden part is sometimes referred to as a “descriptor cache” or a “shadow register.”) Note that it is the responsibility of software to reload the segment registers when the descriptor tables are modified.
  • Segment Descriptor Table: The base address in GDTR is extended to 64 bits.
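As a rough illustration of the visible part of a segment register, the sketch below splits a 16-bit selector into its architectural fields (index in bits 15:3, table indicator in bit 2, RPL in bits 1:0). The struct and function names are hypothetical, and the selector values in the usage note are only examples.

```c
#include <stdint.h>

struct selector_fields {
    unsigned index; /* bits 15:3, entry number in the descriptor table */
    unsigned ti;    /* bit 2, 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1:0, requested privilege level */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = (unsigned)sel >> 3;
    f.ti    = ((unsigned)sel >> 2) & 1u;
    f.rpl   = (unsigned)sel & 3u;
    return f;
}

/* Byte offset of the referenced 8-byte slot from the table base. */
unsigned selector_table_offset(uint16_t sel)
{
    return ((unsigned)sel >> 3) * 8u;
}
```

For example, the selector 0x33 decodes to index 6 in the GDT with RPL 3, and 0x10 decodes to index 2 in the GDT with RPL 0.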

Segment Descriptors

1. Code- and Data-Segment Descriptor Types
  • Code segment:
    • L-bit (64-bit code segment): Each code-segment descriptor provides an L bit, which selects per code segment whether the processor executes 64-bit code or legacy 32-bit code. When not in IA-32e mode, or for non-code segments, bit 21 is reserved and should always be set to 0.
    • D-bit (default operation size): If IA-32e mode is active (IA32_EFER.LMA = 1):
      • If the L-bit (CS.L) = 0, the processor runs in compatibility mode (as if in IA-32). In this case, the D-bit (CS.D) selects the default size for data and addresses: 16-bit if CS.D = 0, 32-bit otherwise.
      • If CS.L = 1, the only valid setting is CS.D = 0.
    • Conforming: A transfer of execution into a more-privileged conforming segment allows execution to continue at the current privilege level. A transfer into a nonconforming segment at a different privilege level results in a general-protection exception (#GP), unless a call gate or task gate is used.
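The L-bit and D-bit rules above can be sketched as a small decoder. It assumes the descriptor is passed as one 64-bit value, so bit 21 and bit 22 of the upper doubleword become bits 53 and 54; the enum and function names are hypothetical, and the descriptor values in the usage note are illustrative flat-segment encodings.

```c
#include <stdint.h>

#define DESC_L (1ull << 53) /* L bit: bit 21 of the upper doubleword */
#define DESC_D (1ull << 54) /* D bit: bit 22 of the upper doubleword */

enum cs_mode {
    CS_COMPAT_16BIT, /* CS.L = 0, CS.D = 0 */
    CS_COMPAT_32BIT, /* CS.L = 0, CS.D = 1 */
    CS_64BIT,        /* CS.L = 1, CS.D = 0 */
    CS_RESERVED      /* CS.L = 1, CS.D = 1 is not a valid setting */
};

/* Mode selected by a code-segment descriptor while IA32_EFER.LMA = 1. */
enum cs_mode cs_mode_in_ia32e(uint64_t desc)
{
    int l = (desc & DESC_L) != 0;
    int d = (desc & DESC_D) != 0;

    if (l)
        return d ? CS_RESERVED : CS_64BIT;
    return d ? CS_COMPAT_32BIT : CS_COMPAT_16BIT;
}
```

For example, 0x00af9b000000ffff (L = 1, D = 0) selects 64-bit mode, while 0x00cf9b000000ffff (L = 0, D = 1) selects compatibility mode with 32-bit defaults.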

2. System Descriptor Types: In IA-32e mode, system descriptors such as call gates, interrupt (trap) gates, and TSS descriptors are extended to 16 bytes. The Type field of a system descriptor defines which kind of descriptor it is, as described below:

  • Call Gates:
  • Interrupt Gates(IDT Gate):
  • Task Gates: In 64-bit mode, task switching is not supported, but TSS descriptors still exist. An attempt to trigger hardware task switching results in a general-protection exception (#GP).

IA-32e Paging

Intel processors support four paging modes: basic 32-bit paging, PAE paging, 4-level paging, and 5-level paging.

Control Bits

  • CR0.WP, bit[16]: If CR0.WP = 0, supervisor-mode write accesses are allowed to linear addresses with read-only access rights.
  • CR0.PG, bit[31]: Enables paging.
  • CR4.PAE, bit[5]: Determines paging mode together with LME in IA32_EFER MSR.
  • CR4.PSE, bit[4]: Enables 4-MByte pages for 32-bit paging.
  • CR4.PGE, bit[7]: If CR4.PGE = 1, specified translations may be shared across address spaces.
  • CR4.LA57, bit[12]: Determines whether 4-level or 5-level paging is used for IA-32e paging. A #GP is raised if software tries to change this bit while IA-32e mode is active (IA32_EFER.LMA = 1, which implies CR0.PG = 1).
  • CR4.PCIDE, bit[17]: Enables process-context identifiers (PCIDs) for 4-level and 5-level paging. PCIDs allow a logical processor to cache information for multiple linear-address spaces.
  • CR4.SMEP, bit[20]: If CR4.SMEP = 1, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode.
  • CR4.SMAP, bit[21]: If CR4.SMAP = 1, software operating in supervisor mode cannot access data at linear addresses that are accessible in user mode.
  • CR4.PKE, bit[22]: Allows each linear address to be associated with a protection key.
  • CR4.CET, bit[23]: If CR4.CET = 1, certain memory accesses are identified as shadow-stack accesses.
  • CR4.PKS, bit[24]: Protection keys for supervisor-mode pages.
  • LME in IA32_EFER MSR, bit[8]: Determines paging mode together with CR4.PAE.
  • NXE in IA32_EFER MSR, bit[11]: Enables non-executable (NX) pages.
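  • AC in EFLAGS, bit[18]: If CR4.SMAP = 1, supervisor-mode data accesses to user-mode pages are allowed only when the access is explicit and EFLAGS.AC = 1.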
  • Enable 4-level Paging:
    CR3 and the corresponding paging structures should be properly initialized before enabling paging. The page-fault handler should also be installed in the IDT; otherwise, any paging-related exception may escalate to a triple fault and reset the processor.

4 & 5 Level Paging

  • 4-Level Paging: A logical processor uses 4-level paging if CR0.PG = 1, CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 0. 4-level paging translates 48-bit linear addresses (9*4 + 12) to physical addresses of up to 52 bits, so at most 256 TBytes of linear-address space may be accessed at any given time.
  • 5-Level Paging: A logical processor uses 5-level paging if CR0.PG = 1, CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 1. 5-level paging translates 57-bit linear addresses (9*5 + 12) to physical addresses of up to 52 bits.
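The way the control bits combine into a paging mode can be sketched as a decision function. This is a minimal sketch over raw CR0, CR4, and IA32_EFER values supplied by the caller; the enum and function names are hypothetical.

```c
#include <stdint.h>

#define CR0_PG   (1u << 31)  /* CR0.PG, bit 31 */
#define CR4_PAE  (1u << 5)   /* CR4.PAE, bit 5 */
#define CR4_LA57 (1u << 12)  /* CR4.LA57, bit 12 */
#define EFER_LME (1ull << 8) /* IA32_EFER.LME, bit 8 */

enum paging_mode {
    PAGING_NONE,
    PAGING_32BIT,
    PAGING_PAE,
    PAGING_4LEVEL,
    PAGING_5LEVEL
};

enum paging_mode select_paging_mode(uint32_t cr0, uint32_t cr4, uint64_t efer)
{
    if (!(cr0 & CR0_PG))
        return PAGING_NONE;   /* paging disabled */
    if (!(cr4 & CR4_PAE))
        return PAGING_32BIT;  /* LME = 1 is not a valid combination here */
    if (!(efer & EFER_LME))
        return PAGING_PAE;
    return (cr4 & CR4_LA57) ? PAGING_5LEVEL : PAGING_4LEVEL;
}
```

For example, CR0 = 0x80000000, CR4 = 0x20, and IA32_EFER = 0x100 select 4-level paging; additionally setting CR4 bit 12 (CR4 = 0x1020) selects 5-level paging.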

Overview

CR3

CR3 locates the first paging structure, the PML4 table (4-level paging) or the PML5 table (5-level paging). How the low bits of CR3 are used depends on whether process-context identifiers (PCIDs) have been enabled by setting CR4.PCIDE:

  • CR4.PCIDE = 0: bits 3 (PWT) and 4 (PCD) select the memory type used to access the top-level table; the other bits in 11:0 are ignored.
  • CR4.PCIDE = 1: bits 11:0 hold the current PCID.

In both cases, the bits from 12 up to the physical-address width hold the physical address of the 4-KByte aligned top-level table.

Paging-Structure

  • XD-bit: If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the region controlled by this entry); otherwise, reserved (must be 0).
  • Prot. Key: Protection key. If CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights.
  • PS-bit (7): Page size. If the entry maps a 1-GB or 2-MB page, this bit must be 1. Otherwise, the entry references a next-level page table.
  • PAT-bit: Indirectly determines the memory type used to access the 4-KB page controlled by this entry.
  • G-bit: Global; if CR4.PGE = 1, determines whether the translation is global.
  • D-bit: Dirty; indicates whether software has written to the page referenced by this entry.
  • A-bit: Accessed; indicates whether software has accessed the page referenced by this entry.
  • PCD-bit and PWT-bit: Page-level cache disable and page-level write-through; indirectly determine the memory type used to access either a next-level page table or a page.
  • U/S-bit: User/supervisor; if 0, user-mode accesses are not allowed to the region controlled by this entry.
  • R/W-bit: Read/write; if 0, writes may not be allowed to the region controlled by this entry.
  • P-bit: Present; must be 1 to map a page or reference a page table.
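The flags listed above can be tested with plain bit masks. The sketch below uses the architectural bit positions (P = 0, R/W = 1, U/S = 2, PWT = 3, PCD = 4, A = 5, D = 6, PS = 7, G = 8, protection key = 62:59, XD = 63); the macro and helper names are hypothetical, and the helpers inspect a single entry only, whereas the real check considers every entry controlling the translation.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P   (1ull << 0)  /* present */
#define PTE_RW  (1ull << 1)  /* read/write */
#define PTE_US  (1ull << 2)  /* user/supervisor */
#define PTE_PWT (1ull << 3)  /* page-level write-through */
#define PTE_PCD (1ull << 4)  /* page-level cache disable */
#define PTE_A   (1ull << 5)  /* accessed */
#define PTE_D   (1ull << 6)  /* dirty */
#define PTE_PS  (1ull << 7)  /* page size (PAT in a 4-KB page entry) */
#define PTE_G   (1ull << 8)  /* global */
#define PTE_XD  (1ull << 63) /* execute-disable */

/* True if this single entry is present, writable, and user-accessible. */
bool pte_user_writable(uint64_t entry)
{
    uint64_t need = PTE_P | PTE_RW | PTE_US;
    return (entry & need) == need;
}

/* XD only takes effect when IA32_EFER.NXE = 1. */
bool pte_no_exec(uint64_t entry, bool nxe)
{
    return nxe && (entry & PTE_XD) != 0;
}

/* Protection key, bits 62:59 of an entry that maps a page. */
unsigned pte_protection_key(uint64_t entry)
{
    return (unsigned)((entry >> 59) & 0xFu);
}
```

For example, the entry value 0x8000000000000067 has P, R/W, U/S, A, D, and XD set, so it is user-writable and, with NXE enabled, not executable.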

Translations

For a page table at each level:

  • It is 4 KB in size.
  • It has 512 entries, each pointing to either a next-level page table or a final page (1 GB, 2 MB, or 4 KB). The entry is selected by a specific portion of the linear address being translated.
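Because each table has 512 entries, each level consumes 9 bits of the linear address, with the low 12 bits left as the offset into a 4-KB page. The following sketch extracts those fields; the function names are hypothetical, and the PML5 index is only meaningful when 5-level paging is in use.

```c
#include <stdint.h>

/* Each index is 9 bits wide (512 entries per table). */
unsigned pml5_index(uint64_t va) { return (unsigned)((va >> 48) & 0x1FF); }
unsigned pml4_index(uint64_t va) { return (unsigned)((va >> 39) & 0x1FF); }
unsigned pdpt_index(uint64_t va) { return (unsigned)((va >> 30) & 0x1FF); }
unsigned pd_index(uint64_t va)   { return (unsigned)((va >> 21) & 0x1FF); }
unsigned pt_index(uint64_t va)   { return (unsigned)((va >> 12) & 0x1FF); }

/* Offset within the final page, depending on the page size. */
uint64_t offset_4k(uint64_t va) { return va & 0xFFFull; }      /* 12 bits */
uint64_t offset_2m(uint64_t va) { return va & 0x1FFFFFull; }   /* 21 bits */
uint64_t offset_1g(uint64_t va) { return va & 0x3FFFFFFFull; } /* 30 bits */
```

For example, the linear address 0xFFFF800000000000 selects PML4 entry 256 and entry 0 at every lower level.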

Access Rights

Supervisor-Mode Accesses

  • Data reads from user-mode pages:
    if (CR4.SMAP == 0)
        return MPK_access_granted(addr);
    else {
        if (EFLAGS.AC == 1) // access is explicit
            return MPK_access_granted(addr);
        else
            return ACCESS_DENIED;
    }
    
  • Data writes to supervisor-mode addresses:
    if (CR0.WP == 0)
        return MPK_access_granted(addr);
    else {
        if (rw_flag_on_every_controlling_pgtbl(addr) == 1)
            return MPK_access_granted(addr);
        else
            return ACCESS_DENIED;
    }
    
  • Data writes to user-mode addresses:
    if (CR0.WP == 0) {
        if (CR4.SMAP == 0)
            return MPK_access_granted(addr);
        else {
            if (EFLAGS.AC == 1) // access is explicit
                return MPK_access_granted(addr);
            else
                return ACCESS_DENIED;
        }
    } else {
        // With CR0.WP = 1, the R/W flag must be 1 in every controlling entry.
        if (rw_flag_on_every_controlling_pgtbl(addr) == 0)
            return ACCESS_DENIED;
        if (CR4.SMAP == 0)
            return MPK_access_granted(addr);
        else {
            if (EFLAGS.AC == 1) // access is explicit
                return MPK_access_granted(addr);
            else
                return ACCESS_DENIED;
        }
    }
    
  • Instruction fetches from supervisor-mode addresses:
    if (IA32_EFER.NXE == 0)
        return ACCESS_GRANTED;
    else {
        if (xd_on_every_controlling_pgtbl(addr) == 0)
            return ACCESS_GRANTED;
        else
            return ACCESS_DENIED;
    }
    
  • Instruction fetches from user-mode addresses:
    if (CR4.SMEP == 0) {
        if (IA32_EFER.NXE == 0)
            return ACCESS_GRANTED;
        else {
            if (xd_on_every_controlling_pgtbl(addr) == 0)
                return ACCESS_GRANTED;
            else
                return ACCESS_DENIED;
        }
    } else
        return ACCESS_DENIED;
    

User-Mode Accesses

  • Data reads:
    if (is_user_mode_addr(addr))
        return MPK_access_granted(addr);
    else
        return ACCESS_DENIED;
    
  • Data writes:
    if (is_user_mode_addr(addr)) {
        if (rw_flag_on_every_controlling_pgtbl(addr) == 1)
            return MPK_access_granted(addr);
        else
            return ACCESS_DENIED;
    } else
        return ACCESS_DENIED;
    
  • Instruction fetches:
    if (is_user_mode_addr(addr)) {
        if (IA32_EFER.NXE == 0)
            return ACCESS_GRANTED;
        else {
            if (xd_on_every_controlling_pgtbl(addr) == 0)
                return ACCESS_GRANTED;
            else
                return ACCESS_DENIED;
        }
    } else
        return ACCESS_DENIED;
    

Note: A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see Section 4.10). These structures may include information about access rights. The processor may enforce access rights based on the TLBs and paging-structure caches instead of on the paging structures in memory. See section 4.10.4.2 for more information on invalidating TLBs.

Protection Keys

Protection keys are not covered here at this time. Please refer to the manual or my seminar presentation (Oct. 2019) for more information.

Cache Control

The manual uses the term "memory type" for the type of caching applied to a memory access. The memory type is jointly controlled by bits in the paging structures, the memory-type range registers (MTRRs), and, if supported, a 64-bit MSR called the page attribute table (IA32_PAT). All processors that support 4-level or 5-level paging also support the PAT, so cache control without PAT support is skipped here.

Methods of Caching Available

MTRR (Phasing out in Linux)

Encoding in MTRR         Memory Type
0x00                     UC (Uncacheable)
0x01                     WC (Write Combining)
0x04                     WT (Write-Through)
0x05                     WP (Write-Protected)
0x06                     WB (Write-Back)
0x02, 0x03, 0x07-0xFF    Reserved

MTRRs control caching of selected regions of physical memory. They fall into two categories: 11 fixed-range MTRRs cover the first 1 MB of physical addresses (0x0 to 0xFFFFF), while m variable-range MTRRs can cover any physical address range, where m is reported by the VCNT field of the IA32_MTRRCAP register.


To define the memory type for a region of physical addresses, a pair of IA32_MTRR_PHYSBASEn and IA32_MTRR_PHYSMASKn registers is used. The PhysBase and PhysMask fields define the boundaries of the region; an address lies within the region when

Address_Within_Range & PhysMask = PhysBase & PhysMask

The Type field then encodes the memory type of that region.
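The range test above translates directly into code. This is a minimal sketch that, for simplicity, treats PhysBase and PhysMask as full byte addresses rather than register fields starting at bit 12; both function names are hypothetical, and the mask helper assumes a power-of-two region size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address_Within_Range & PhysMask == PhysBase & PhysMask */
bool mtrr_range_matches(uint64_t addr, uint64_t phys_base, uint64_t phys_mask)
{
    return (addr & phys_mask) == (phys_base & phys_mask);
}

/*
 * Mask for a naturally aligned region of `size` bytes (a power of two),
 * limited to `phys_bits` physical-address bits (fewer than 64).
 */
uint64_t mtrr_mask_for_size(uint64_t size, unsigned phys_bits)
{
    uint64_t addr_limit = (1ull << phys_bits) - 1;
    return ~(size - 1) & addr_limit;
}
```

For example, with 40 physical-address bits, a 256-MB region based at 0xC0000000 uses the mask 0xFFF0000000: the address 0xC8000000 falls inside the region and 0xD0000000 does not.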

PAT

Encoding                 Memory Type
0x00                     UC (Uncacheable)
0x01                     WC (Write Combining)
0x04                     WT (Write-Through)
0x05                     WP (Write-Protected)
0x06                     WB (Write-Back)
0x07                     UC- (Uncached)
0x02, 0x03, 0x08-0xFF    Reserved

The PAT is a companion feature to the MTRRs: the MTRRs map memory types to regions of the physical address space, whereas the PAT maps memory types to pages within the linear address space. The PAT is more flexible than the MTRRs in that the hardware does not limit the number of regions that can be given an attribute.

  • Memory types should not alias: avoid mapping the same physical page with different memory types.
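How a page's PAT, PCD, and PWT bits pick a memory type out of IA32_PAT can be sketched as follows. The three bits form a 3-bit index (PAT is the high bit) that selects one of eight byte-wide entries in the MSR. The function names are hypothetical; the constant is the architectural power-on value of IA32_PAT, which an operating system is free to reprogram.

```c
#include <stdint.h>

/* Power-on value of IA32_PAT: WB, WT, UC-, UC, repeated twice. */
#define IA32_PAT_RESET 0x0007040600070406ull

/* Index into the PAT formed from the page's PAT, PCD, and PWT bits. */
unsigned pat_index(unsigned pat, unsigned pcd, unsigned pwt)
{
    return ((pat & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}

/* Memory-type encoding (low 3 bits of the selected byte of IA32_PAT). */
unsigned pat_memory_type(uint64_t ia32_pat, unsigned index)
{
    return (unsigned)((ia32_pat >> (8 * (index & 7u))) & 0x7u);
}
```

With the power-on value, a page with PAT = PCD = PWT = 0 is WB (0x06), PWT = 1 alone gives WT (0x04), PCD = 1 alone gives UC- (0x07), and PCD = PWT = 1 gives UC (0x00).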

Process Context Identifier (PCID)

Processors may cache data from the paging structures to accelerate address translation. For each entry cached in a TLB or a paging-structure cache, the processor may associate the current PCID with the translation information, so that it can tell whether the entry belongs to the current address space.

The PCID is a 12-bit identifier determined as follows:

if (CR4.PCIDE == 0)
    PCID = 0x000;
else
    PCID = CR3 & 0xFFF;

IA-32e with Linux (5.8.7)

Test Environment:

  • Hardware:
    • CPU: QEMU Virtual CPU version 2.0.0 (40 bits physical, 48 bits virtual)
    • Memory: 256MB
  • Kernel Configurations:
    • DEBUG_KERNEL
    • DEBUG_INFO
    • X86_5LEVEL
    • PREEMPT

Setting up qemu:
building
qemu option lists

Booting

Memory Layout

maps
x86-specific

Fast System Calls in 64-Bit Mode

Reference