AArch64 scheduling in Linux

###### tags: `Linux` `底層` `ARM` # AArch64 scheduling in Linux Linux version: 5.8 # schedule * `kernel\sched\core.c` ```c asmlinkage __visible void __sched schedule(void) { struct task_struct *tsk = current; sched_submit_work(tsk); do { preempt_disable(); __schedule(false); sched_preempt_enable_no_resched(); } while (need_resched()); sched_update_worker(tsk); } EXPORT_SYMBOL(schedule); ``` * `__schedule`: ```c /* * __schedule() is the main scheduler function. * * The main means of driving the scheduler and thus entering this function are: * * 1. Explicit blocking: mutex, semaphore, waitqueue, etc. * * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return * paths. For example, see arch/x86/entry_64.S. * * To drive preemption between tasks, the scheduler sets the flag in timer * interrupt handler scheduler_tick(). * * 3. Wakeups don't really cause entry into schedule(). They add a * task to the run-queue and that's it. * * Now, if the new task added to the run-queue preempts the current * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets * called on the nearest possible occasion: * * - If the kernel is preemptible (CONFIG_PREEMPTION=y): * * - in syscall or exception context, at the next outmost * preempt_enable(). (this might be as soon as the wake_up()'s * spin_unlock()!) * * - in IRQ context, return from interrupt-handler to * preemptible context * * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set) * then at the next: * * - cond_resched() call * - explicit schedule() call * - return from syscall or exception to user-space * - return from interrupt-handler to user-space * * WARNING: must be called with preemption disabled! */ static void __sched notrace __schedule(bool preempt) { struct task_struct *prev, *next; unsigned long *switch_count; unsigned long prev_state; struct rq_flags rf; struct rq *rq; int cpu; cpu = smp_processor_id(); rq = cpu_rq(cpu); prev = rq->curr; schedule_debug(prev, preempt); if (sched_feat(HRTICK)) hrtick_clear(rq); local_irq_disable(); rcu_note_context_switch(preempt); /* * Make sure that signal_pending_state()->signal_pending() below * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) * done by the caller to avoid the race with signal_wake_up(): * * __set_current_state(@state) signal_wake_up() * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING) * wake_up_state(p, state) * LOCK rq->lock LOCK p->pi_state * smp_mb__after_spinlock() smp_mb__after_spinlock() * if (signal_pending_state()) if (p->state & @state) * * Also, the membarrier system call requires a full memory barrier * after coming from user-space, before storing to rq->curr. */ rq_lock(rq, &rf); smp_mb__after_spinlock(); /* Promote REQ to ACT */ rq->clock_update_flags <<= 1; update_rq_clock(rq); switch_count = &prev->nivcsw; /* * We must load prev->state once (task_struct::state is volatile), such * that: * * - we form a control dependency vs deactivate_task() below. * - ptrace_{,un}freeze_traced() can change ->state underneath us. */ prev_state = prev->state; if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else { prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && !(prev->flags & PF_FROZEN); if (prev->sched_contributes_to_load) rq->nr_uninterruptible++; /* * __schedule() ttwu() * prev_state = prev->state; if (p->on_rq && ...) * if (prev_state) goto out; * p->on_rq = 0; smp_acquire__after_ctrl_dep(); * p->state = TASK_WAKING * * Where __schedule() and ttwu() have matching control dependencies. * * After this, schedule() must not care about p->state any more. */ deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); if (prev->in_iowait) { atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } } switch_count = &prev->nvcsw; } next = pick_next_task(rq, prev, &rf); clear_tsk_need_resched(prev); clear_preempt_need_resched(); if (likely(prev != next)) { rq->nr_switches++; /* * RCU users of rcu_dereference(rq->curr) may not see * changes to task_struct made by pick_next_task(). */ RCU_INIT_POINTER(rq->curr, next); /* * The membarrier system call requires each architecture * to have a full memory barrier after updating * rq->curr, before returning to user-space. * * Here are the schemes providing that barrier on the * various architectures: * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC. * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC. * - finish_lock_switch() for weakly-ordered * architectures where spin_unlock is a full barrier, * - switch_to() for arm64 (weakly-ordered, spin_unlock * is a RELEASE barrier), */ ++*switch_count; psi_sched_switch(prev, next, !task_on_rq_queued(prev)); trace_sched_switch(preempt, prev, next); /* Also unlocks the rq: */ rq = context_switch(rq, prev, next, &rf); } else { rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); rq_unlock_irq(rq, &rf); } balance_callback(rq); } ``` * 註解講述呼叫 `schedule` 的時機, 其中第二點為在 interrupt return 或 userspace return 的過程會呼叫到, 並提到了要驅動 scheduler, 要設定 timer, 使其定期呼叫 `scheduler_tick()` 去計時, 一段時間過後就設定 `TIF_NEED_RESCHED` 使其在第二點所述過程中被重新排程 # IRQ handler * `arch/arm64/kernel/entry.S` * `el1_irq` ```c SYM_CODE_START_LOCAL_NOALIGN(el1_irq) kernel_entry 1 gic_prio_irq_setup pmr=x20, tmp=x1 enable_da_f ... irq_handler #ifdef CONFIG_PREEMPTION ldr x24, [tsk, #TSK_TI_PREEMPT] // get preempt count alternative_if ARM64_HAS_IRQ_PRIO_MASKING /* * DA_F were cleared at start of handling. If anything is set in DAIF, * we come back from an NMI, so skip preemption */ mrs x0, daif orr x24, x24, x0 alternative_else_nop_endif cbnz x24, 1f // preempt count != 0 || NMI return path bl arm64_preempt_schedule_irq // irq en/disable is done inside 1: #endif ... kernel_exit 1 SYM_CODE_END(el1_irq) ``` * 可以看到在結束 irq_handler 後, 在 `kernel_exit` 之前 (內部會呼叫 eret 從 exception context 返回), 有一段判斷是否要呼叫 `arm64_preempt_schedule_irq` 的邏輯 # arm64_preempt_schedule_irq * `arch/arm64/kernel/process.c`: ```c asmlinkage void __sched arm64_preempt_schedule_irq(void) { lockdep_assert_irqs_disabled(); /* * Preempting a task from an IRQ means we leave copies of PSTATE * on the stack. cpufeature's enable calls may modify PSTATE, but * resuming one of these preempted tasks would undo those changes. * * Only allow a task to be preempted once cpufeatures have been * enabled. */ if (system_capabilities_finalized()) preempt_schedule_irq(); } ``` # preempt_schedule_irq * `kernel/sched/core.c`: ```c /* * This is the entry point to schedule() from kernel preemption * off of irq context. * Note, that this is called and return with irqs disabled. This will * protect us against recursive calling from irq. */ asmlinkage __visible void __sched preempt_schedule_irq(void) { enum ctx_state prev_state; /* Catch callers which need to be fixed */ BUG_ON(preempt_count() || !irqs_disabled()); prev_state = exception_enter(); do { preempt_disable(); local_irq_enable(); __schedule(true); local_irq_disable(); sched_preempt_enable_no_resched(); } while (need_resched()); exception_exit(prev_state); } ``` * 與 `schedule` 類似, 幾個差別: * 在執行 `__schedule` 時啟用中斷, 由於進入此函數時, 中斷是關閉的, 要另外打開 * 多了 `exception_enter`、`exception_exit` 的部分