ccs100203
# Note of kernel-scheduler-internals
###### tags: `linux2021q3`

## monolithic vs micro kernel
> p8

- A microkernel reduces the amount of code that runs in the kernel.

|          | monolithic | micro |
| -------- | -------- | -------- |
| Bugs | fewer | more |
| recover from bugs | hard (reboot) | easy (re-execute) |
| portable | hard | easy |
| maintain | hard | easy |
| test a program | hard<br>(recompile whole system & reboot) | easy (recompile this program) |
| performance | better | worse (heavy use of IPC)<br>(context switch & system call) |
| modularity | hard | easy |

---
## process vs thread
> p10

- Threads share the same address space, while processes do not.
- The kernel does not distinguish between them; it handles both as tasks.

### PID & TGID
> p11
> PID: process ID
> TGID: thread group ID

- single-threaded process
    - PID == TGID
- multi-threaded process
    - Each thread in one group has the same TGID and a unique PID.
    - thread group leader: PID == TGID
- `getpid()`: return ==TGID==
- `gettid()`: return ==PID==

---
## container_of
> further reading: [Linux 核心原始程式碼巨集: container_of](https://hackmd.io/@sysprog/linux-macro-containerof)

```c
/* Uses GNU C extensions: a statement expression and arithmetic on void *. */
#define container_of(ptr, type, member)                 \
    ({                                                  \
        void *__mptr = (void *)(ptr);                   \
        ((type *)(__mptr - offsetof(type, member)));    \
    })

// An alias that is used everywhere
#define list_entry(ptr, type, member) container_of(ptr, type, member)
```

Given a pointer to a member, it returns the start of the enclosing struct.
So we can reach the data we want in the structure without storing a data pointer in the list node. ==aaaaaaaamazed==

usage:
One data structure can contain many different `list_head`s. In other words, it can be on several linked lists at the same time without consuming extra space for duplicate data.

- OFFSETOF

```c
// File include/linux/stddef.h
#define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)
```

TYPE is the struct we are considering and MEMBER is the name of the field. What it does is:
1. Take the address 0, the first in the address space of the process
2. Cast it to a TYPE pointer
3. Dereference the pointer and access the MEMBER field (no memory is actually read; only the field's address is computed)
4. Take the address of the field and cast it to a size; it is now an offset, no longer an address

---
## Tasks lifetime
> p24

### Zombie Process
A zombie process is a process that has terminated, but whose process descriptor and entry in the pid hash table are still present in memory and accessible (for example, by `ps -aux`).

A task's resources are not deallocated immediately because the parent process may want to access some of this information, most likely the exit status, or may want to synchronize with the child process termination via `wait()` or `waitpid()`.

Zombie processes are impossible to kill externally: they can't receive signals as they no longer exist, so a wait by the parent is the only way to free the memory occupied by the zombie's data structures. The ancestor process (init) has a routine that waits periodically to reap possible zombie processes.

### Task States
```cpp
/* Used in tsk->state: */
#define TASK_RUNNING            0x0000
#define TASK_INTERRUPTIBLE      0x0001
#define TASK_UNINTERRUPTIBLE    0x0002
#define __TASK_STOPPED          0x0004
#define __TASK_TRACED           0x0008
/* Used in tsk->exit_state: */
#define EXIT_DEAD               0x0010
#define EXIT_ZOMBIE             0x0020
#define EXIT_TRACE              (EXIT_ZOMBIE | EXIT_DEAD)
```

## Response time and throughput cannot both be optimized

---
## The Linux scheduler
> p30

### Scheduler Concept
**In fact, all scheduling on Linux is preemptive.**

![](https://i.imgur.com/yG6hx6Z.png)

| Scheduling classes | Scheduling policies |
| -------- | -------- |
| stop sched class | |
| dl sched class | SCHED_DEADLINE |
| rt sched class | SCHED_FIFO<br>SCHED_RR |
| fair sched class | SCHED_NORMAL<br>SCHED_BATCH<br>SCHED_IDLE |
| idle sched class | |

- SCHED_NORMAL
  SCHED_NORMAL is the **default policy** that is used for regular tasks and uses **CFS** (the Completely Fair Scheduler, implemented in fair.c).
- SCHED_BATCH
  SCHED_BATCH is similar to SCHED_NORMAL but it will **preempt less frequently**, so every process will run longer.
  For this reason, it is **more suited for non-interactive workloads**, typically on **servers**.

### Terminology
- affinity

### Commands
#### Process Priority
```shell
ps -eo pid,rtprio,time,comm
ps -el
```
These report a snapshot of the current processes, including the real-time priority, which ranges from 0 to 99.

- **nice**

----
### O(1) Scheduler
![](https://i.imgur.com/HdwuCRw.png)

If a process does not complete its full timeslice before it is preempted, it goes back in the ready queue. If it does run to the end of the timeslice, it is placed in the expired queue instead.

All scheduling takes place from the active queues. The highest-priority queue is chosen; if there are multiple tasks in that queue, they are scheduled in round-robin fashion. This continues until the active queue structure is empty. When that happens, the active and expired queues change places, and execution (scheduling) continues.

#### Pros
- O(1)
  It executes in constant time (O(1)) under all circumstances.
- work stealing
  The O(1) scheduler keeps a global scheduler that can rebalance per-CPU queues. If a CPU is idle, the O(1) scheduler takes a process from another CPU.

#### Cons
- different effect from nice value
  With the O(1) scheduler, calling `nice()` to increment the nice level of a task has different effects depending on the initial value.
- bad interactivity with real-time tasks
  This approach caused another major problem because SCHED_FIFO **is not starvation proof**.
  **Tasks in the O(1) scheduler had to wait until all of the other tasks, in all of the active runqueues at all of the higher priority levels, exhausted their timeslices.**
  A task is also marked as interactive depending on its dynamic priority and its nice value.

### Rotating Staircase DeadLine
The **"multi-level"** queues correspond to **different levels of time quanta** on the CPU. The **highest queue has the shortest quanta** on the CPU, with each subsequent queue having longer quanta.
With several queues, processes placed onto these queues will run for the timeslice specified by their queue. **The number of queues and the range of time quanta will vary**, but this structure allows processes to be classified into groups based on their needs.

Using this model, we can **prioritize interactive and I/O-bound processes by introducing priorities, preemption, and feedback across the queues.**

### Completely Fair Scheduler (CFS)
#### virtual runtime
- `t` × `weight`, where `t` is the time the task actually ran and `weight` is a scaling factor based on its priority (i.e. nice value)
- a monotonically increasing value
- with the default weight, a task accumulates 1 ms of virtual runtime for each elapsed millisecond during which it runs
- CFS chooses the task with the lowest virtual runtime, for fairness.

#### target time
CFS starts with a target time for **how long it should take to make one complete round-robin** through the runnable threads. The value of 6 ms used in the examples is the default for uniprocessor systems.

#### Running the next task
Being as fair as possible with all the tasks means keeping **all the tasks' vruntimes as close as possible** to each other. Following this logic, the **task** that **deserves** more than anyone to be **executed** next is the one **with the smallest vruntime**.

#### Runqueue & minimum virtual runtime
The **minimum virtual runtime** is **the virtual runtime of the most deserving active task** (in the run queue). Every time a task is chosen for execution, the minimum is updated.

**When new tasks are added** to the runqueue, their virtual runtime is updated to keep things fair.

**When a sleeping task is woken up,** the scheduler checks that its virtual runtime is at least equal to the current minimum.

When a new task is created via **fork()** and inserted into the runqueue, it **inherits the virtual runtime from the parent**: this **prevents** the exploit where a task can **take control of the CPU by continuously forking itself**.
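The selection and bookkeeping rules above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the kernel's implementation: `struct task`, `NICE_0_WEIGHT`, `pick_next`, `account`, `place_on_wakeup` and `inherit_on_fork` are hypothetical names, the scaling factor is assumed to be `NICE_0_WEIGHT / weight`, and a linear scan stands in for whatever ordered structure a real scheduler would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed reference weight of a default-priority task (hypothetical constant). */
#define NICE_0_WEIGHT 1024u

/* Hypothetical, simplified task descriptor; not the kernel's task_struct. */
struct task {
    const char *name;
    uint64_t vruntime; /* virtual runtime in nanoseconds */
    unsigned weight;   /* larger weight = higher priority */
    int runnable;      /* nonzero if the task is on the runqueue */
};

/* Pick the runnable task with the smallest vruntime (NULL if none). */
static struct task *pick_next(struct task *tasks, size_t n)
{
    struct task *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        if (!best || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    }
    return best;
}

/* Charge delta_ns of real run time, scaled so heavier tasks age more slowly. */
static void account(struct task *t, uint64_t delta_ns)
{
    assert(t->weight > 0);
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* On wakeup, make sure the task is not behind the current minimum. */
static void place_on_wakeup(struct task *t, uint64_t min_vruntime)
{
    if (t->vruntime < min_vruntime)
        t->vruntime = min_vruntime;
    t->runnable = 1;
}

/* On fork, the child starts from the parent's vruntime. */
static void inherit_on_fork(struct task *child, const struct task *parent)
{
    child->vruntime = parent->vruntime;
    child->runnable = 1;
}
```

Because a woken or forked task never starts below the values already on the runqueue, it cannot jump ahead of every other task just by sleeping or forking repeatedly.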
#### Pros
- Nice Value
    - handles nice values better than the previous scheduler
    - increasing the nice value by one has the same effect regardless of the starting value
    - the nice value is treated as a weight, so the CPU proportion is determined only by the relative difference in nice values
- The preemption time is no longer fixed as in the O(1) scheduler; it is variable.

#### Cons
Too many context switches cause more overhead.

### Multiprocessing
#### Load balancing
**The load of a task** becomes **a combination of its weight and its average CPU utilization**, and the load of the core is the sum of the loads of its tasks. Since the CPU utilization of a task can vary, its load is **constantly updated**.

#### Migration
*Non-uniform memory access* (NUMA) makes the cost of migrating a task to another CPU vary.

### Energy-Aware Scheduling (EAS)
**CFS will always put a new task on an idle CPU if available (promoting throughput)**. However, this is not always the most energy-efficient decision.

Because evaluating all possible options would impact performance, **EAS simply evaluates the CPU the task last ran on and the CPU chosen by a simple heuristic** which finds where the task best fits.

#### Arm big.LITTLE architecture
The name "big.LITTLE" refers to the two CPU/core types, big and LITTLE. The **big cores are more powerful but consume more power**, whereas the **LITTLE cores are less powerful but also consume less power**.

*Asymmetric multi-core* (AMC) adds complexity to the scheduling problem.

#### Dynamic Voltage and Frequency Scaling (DVFS)
It became possible to change a CPU's frequency dynamically, either through the BIOS or the operating system. For example, DVFS techniques adjust the frequency of the LITTLE cores from 200 MHz up to 1.4 GHz.

DVFS adds complexity to the scheduling problem.
#### Summary
In order to have both good performance and power efficiency, a scheduler could take into account:
- what type of core the task(s) run on (big or LITTLE),
- whether it is worth migrating a task between cores and between core types,
- and whether it is worth running the cores at full speed and voltage or if some cores could be throttled to save power at a slight performance hit.

The existing CFS has a throughput-based policy, which does not account for energy usage.

---
## Ftrace
Ftrace uses a **ring buffer** to store all the events that happen at runtime.

### Function tracing vs. event tracing
#### Function tracing
- uses the code instrumentation mechanism of gcc, enabled by compiling with the `-pg` flag
- **dynamic profiling**: toggled at runtime in the binary executable, without the need to recompile the code.
- gcc adds extra NOP ("No OPeration") assembly instructions at the beginning of every function, so that it will be possible to change these NOPs into something else, if needed.
- advantages:
    - tracing at runtime, **zero overhead** when it is disabled
    - **filter what is being traced.** We could dynamically activate tracing only on functions from a single subsystem, or on one function alone.

#### Event tracing
- less efficient than function tracing. It uses tracepoints placed directly in the C code, which makes it static.
- Since the tracepoints are compiled into the code, **adding or removing one requires recompiling the kernel.**

### Interfacing
Creates **a sort of shared memory between the user and the kernel**. This is done through the **procfs** filesystem, which is found in **/proc**.

The approach used by Linux is more straightforward: the information is (mostly) in human-readable form, **so you simply read the files in /proc and parse the results as strings.** By doing so, **no dedicated syscall is needed**, except, of course, `open()` and `read()` to interact with the filesystem.
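As a minimal sketch of "read the file and parse it as a string", the hypothetical helper below (`proc_status_field` is an illustrative name, not an existing API) looks up one `Key:` line in `/proc/<pid>/status`. It assumes a Linux system with procfs mounted at `/proc` and the usual `Key:<tab>value` line layout of that file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the value of `key` from /proc/<pid>/status into `out`.
 * `pid` is a numeric string or "self". Returns 0 on success, -1 on failure. */
static int proc_status_field(const char *pid, const char *key,
                             char *out, size_t out_len)
{
    char path[64], line[256];
    size_t key_len = strlen(key);
    int found = -1;
    FILE *fp;

    assert(out != NULL && out_len > 0);

    int n = snprintf(path, sizeof(path), "/proc/%s/status", pid);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fp = fopen(path, "r");
    if (!fp)
        return -1;

    while (fgets(line, sizeof(line), fp)) {
        /* Lines are assumed to look like "Tgid:\t1234\n". */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            const char *val = line + key_len + 1;

            val += strspn(val, " \t");       /* skip separator whitespace */
            snprintf(out, out_len, "%s", val);
            out[strcspn(out, "\n")] = '\0';  /* drop the trailing newline */
            found = 0;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

Calling `proc_status_field("self", "Tgid", buf, sizeof(buf))` and then the same with `"Pid"` should give equal values in a single-threaded program, which matches the PID/TGID rule noted earlier.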
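**On Linux, when commands such as ps, top or pgrep are invoked, they internally query the procfs filesystem.** You could always do the same operation manually with something like `cat /proc/1337/info_that_you_need | grep specific_info`, but it would be tedious: this is why utilities like ps are convenient front-ends for the user.

---
## Questions
- quiz15
    1. After I filled in the options and ran the program, the `"No second ELF header found.\n"` message on line 65 always appeared. I think this is because the program cannot find the so-called real ELF, that is, the payload embedded with `cat`. I am not sure whether I ran it incorrectly or the environment is wrong (kernel version 5.4.0-73), but this part does not seem to be affected by the chosen answers. I also tried `objcopy`, without success so far.
    2. For option AAA, I think `newelf` needs a type cast when doing addition or subtraction, as in `size - (intptr_t) newelf - 6` on line 63 above. I am still investigating why addition only produces a warning while subtraction produces an error. Should this be looked up in the C language specification?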
