2024q1 Homework6 (integration)

Writing Linux kernel modules

Note that the name of the .c file must match the .o name listed in obj-m in the Makefile; otherwise make fails with the following error:

~/g/m/hello-1 ❯❯❯ ls
Makefile  module.c
~/g/m/hello-1 ❯❯❯ make
make -C /lib/modules/`uname -r`/build M=/home/idoleat/tmp/module/hello-1 modules
make[1]: Entering directory '/usr/lib/modules/6.8.9-arch1-1/build'
make[3]: *** No rule to make target '/home/idoleat/tmp/module/hello-1/hello-1.o', needed by '/home/idoleat/tmp/module/hello-1/'.  Stop.
make[2]: *** [/usr/lib/modules/6.8.9-arch1-1/build/Makefile:1921: /home/idoleat/tmp/module/hello-1] Error 2
make[1]: *** [Makefile:240: __sub-make] Error 2
make[1]: Leaving directory '/usr/lib/modules/6.8.9-arch1-1/build'
make: *** [Makefile:8: all] Error 2

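The reason is that the kernel build system derives the source file (.c) from the name of the object file (.o). With this naming convention no extra rule is needed to locate the source file, which avoids unnecessary complexity and errors in a build system that is already complicated enough.

Note that when building several modules, each target object file must be appended to obj-m with +=; using := (simple assignment) overwrites the earlier value.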

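How kernel modules work

Linux kernel modules live in kernel virtual memory. Because the backing physical pages do not need to be contiguous, the memory is allocated with vmalloc() at insmod time. MMIO mappings are placed in kernel virtual memory as well. Note that this virtual area is comparatively small on 32-bit machines.

What happens if the kernel uses a user-space virtual address directly?
A few things to keep in mind about virtual memory: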

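  1. The physical memory behind a virtual address may not have been swapped in yet, or may not even have been allocated
  2. The physical addresses behind virtual addresses are not necessarily contiguous
  3. What happens on a context switch? Does the kernel know that the virtual address only applies to another page?
  4. The assignment description says: The address to which a pointer from that space points and the address in the kernel address space may have different values.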

Observability is one of the Linux kernel's strengths, so all of this can be studied with tracing tools:

  • ftrace
    echo 1 > /sys/kernel/tracing/events/vmalloc/enable
    echo 1 > /sys/kernel/tracing/events/module/module_load/enable
  • /proc/softirqs and /proc/interrupts
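    • Hard to read on multi-core machines; one could submit a patch and discuss a suitable improvement with the developers, or write a small tool that parses the output into a more readable form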
  • RTLA
  • drgn
  • eBPF
  • SystemTap

Modify simrupt implementation

  • Replace tasklet APIs with workqueue APIs
  • Replace deprecated context detection macro
    • in_irq, in_softirq and in_interrupt are deprecated

in_softirq and in_interrupt are deprecated because they give a false positive while bottom halves are disabled with local_bh_disable() (mentioned in Rusty's Unreliable Guide To Hacking The Linux Kernel). Commit 7c47889 also added a note (since removed) that says the same:

Note: due to the BH disabled confusion: in_softirq(),in_interrupt() really should not be used in new code.
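
The false positive can be reproduced with plain arithmetic on the counter. The following is a user-space sketch, not kernel code: the bit layout is the one from include/linux/preempt.h for non-RT kernels, while model_preempt_count and every model_ function are made-up stand-ins for the per-CPU counter and the real helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit layout as in include/linux/preempt.h (non-RT): the softirq count
 * occupies the 8 bits above the 8 preemption-count bits. */
#define PREEMPT_BITS            8
#define SOFTIRQ_BITS            8
#define SOFTIRQ_SHIFT           PREEMPT_BITS
#define SOFTIRQ_OFFSET          (1UL << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK            (((1UL << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)

/* Hypothetical stand-in for the per-CPU preempt_count. */
unsigned long model_preempt_count;

/* Disabling bottom halves adds SOFTIRQ_DISABLE_OFFSET to the counter. */
void model_local_bh_disable(void)
{
    model_preempt_count += SOFTIRQ_DISABLE_OFFSET;
}

void model_local_bh_enable(void)
{
    assert((model_preempt_count & SOFTIRQ_MASK) >= SOFTIRQ_DISABLE_OFFSET);
    model_preempt_count -= SOFTIRQ_DISABLE_OFFSET;
}

/* Actually running a softirq handler adds SOFTIRQ_OFFSET instead. */
void model_softirq_enter(void)
{
    model_preempt_count += SOFTIRQ_OFFSET;
}

void model_softirq_exit(void)
{
    assert((model_preempt_count & SOFTIRQ_OFFSET) != 0);
    model_preempt_count -= SOFTIRQ_OFFSET;
}

/* Deprecated check: true for any non-zero softirq count. */
bool model_in_softirq(void)
{
    return (model_preempt_count & SOFTIRQ_MASK) != 0;
}

/* Preferred check: true only while a handler is being served. */
bool model_in_serving_softirq(void)
{
    return (model_preempt_count & SOFTIRQ_MASK & SOFTIRQ_OFFSET) != 0;
}
```

After model_local_bh_disable() in task context, model_in_softirq() is true although no softirq is running, while model_in_serving_softirq() stays false. That is the confusion the note warns about.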

When to disable soft interrupts on the local CPU? What's the difference between

Trace the behavior of simrupt

drgn tools/workqueue/wq_dump.py


Questions on hold

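If consecutive work items in CMWQ all use the same function, can the function keep running across them instead of returning and being called again? (Is it certain that it returns and is called again?)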

An MT wq could provide only one execution context per CPU while an ST wq one for the whole system. Work items had to compete for those very limited execution contexts leading to various problems including proneness to deadlocks around the single execution context.

A work item can be executed in either a thread or the BH (softirq) context

Does execution context refer to the function itself?
Or does it mean the conditions or environment the function runs in?
Contexts seen so far: interrupt context, process context, atomic context

/*
 * Macros to retrieve the current execution context:
 *
 * in_nmi()		- We're in NMI context
 * in_hardirq()		- We're in hard IRQ context
 * in_serving_softirq()	- We're in softirq context
 * in_task()		- We're in task context
 */
#define in_nmi()		(nmi_count())
#define in_hardirq()		(hardirq_count())
#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)
#ifdef CONFIG_PREEMPT_RT
# define in_task()		(!((preempt_count() & (NMI_MASK | HARDIRQ_MASK)) | in_serving_softirq()))
#else
# define in_task()		(!(preempt_count() & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
#endif

/*
 * The following macros are deprecated and should not be used in new code:
 * in_irq()       - Obsolete version of in_hardirq()
 * in_softirq()   - We have BH disabled, or are processing softirqs
 * in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
 */
#define in_irq()		(hardirq_count())
#define in_softirq()		(softirq_count())
#define in_interrupt()		(irq_count())

...
   
/*
 * Are we running in atomic context?  WARNING: this macro cannot
 * always detect atomic context; in particular, it cannot know about
 * held spinlocks in non-preemptible kernels.  Thus it should not be
 * used in the general case to determine whether sleeping is possible.
 * Do not use in_atomic() in driver code.
 */
#define in_atomic()	(preempt_count() != 0)

/*
 * Check whether we were atomic before we did preempt_disable():
 * (used by the scheduler)
 */
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)

int myintarray[2];
module_param_array(myintarray, int, NULL, 0); /* not interested in count */

short myshortarray[4];
int count;
module_param_array(myshortarray, short, &count, 0); /* put count into "count" variable */

What does count mean when one array passes a count pointer and the other passes NULL?
Try it.
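
Before trying it on a real module, the role of the third argument can be pictured in user space. This is only a sketch under assumptions: parse_int_array is a hypothetical helper that splits a string such as 1,2,3 the way an array parameter value might be split. It is not the kernel's parser, which lives in kernel/params.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical model of an array parameter: parse a comma-separated list
 * into elems[0..max-1]. If num is non-NULL the number of parsed elements
 * is reported through it; with NULL the count is kept in a throwaway
 * local, mirroring the "not interested in count" case.
 * Returns 0 on success, -1 on malformed input or too many elements.
 */
int parse_int_array(const char *val, int *elems, unsigned int max,
                    unsigned int *num)
{
    unsigned int temp_num;
    unsigned int *n = num ? num : &temp_num;

    assert(val != NULL && elems != NULL);
    *n = 0;
    while (*val != '\0') {
        char *end;
        long v;

        if (*n == max)
            return -1;          /* more values than the array holds */
        v = strtol(val, &end, 0);
        if (end == val)
            return -1;          /* not a number */
        elems[(*n)++] = (int)v;
        if (*end == ',')
            val = end + 1;      /* skip the separator */
        else if (*end == '\0')
            val = end;
        else
            return -1;          /* unexpected character */
    }
    return *n > 0 ? 0 : -1;
}
```

In this model the array is filled the same way in both cases; the only difference is whether the caller can later find out how many elements were actually supplied.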

register_chrdev() occupies a whole range of minor numbers:

static inline int register_chrdev(unsigned int major, const char *name,
				  const struct file_operations *fops)
{
    /* extern int __register_chrdev(unsigned int major,
     * unsigned int baseminor, unsigned int count, const char *name,
     * const struct file_operations *fops)
     */
    return __register_chrdev(major, 0, 256, name, fops);
}

It takes up 256 of them?
Why? Is there documentation for this?
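
A partial answer, as far as I understand it: register_chrdev() is the older registration interface and claims minors 0 to 255 of the major, which matches the 8-bit minor field of the old 16-bit device numbers. The sketch below is user-space arithmetic only. The MAJOR/MINOR/MKDEV macros follow include/linux/kdev_t.h, while LEGACY_BASEMINOR, LEGACY_COUNT and in_legacy_range are names invented here for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Same encoding as include/linux/kdev_t.h: 12-bit major, 20-bit minor. */
#define MINORBITS       20
#define MINORMASK       ((1U << MINORBITS) - 1)
#define MAJOR(dev)      ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)      ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi)   (((ma) << MINORBITS) | (mi))

/* The baseminor and count that register_chrdev() passes on. */
#define LEGACY_BASEMINOR 0U
#define LEGACY_COUNT     256U

static_assert(LEGACY_COUNT - 1 == 0xFF,
              "the legacy range covers exactly the 8-bit minors");

/* Hypothetical helper: is dev inside the range claimed for major? */
bool in_legacy_range(unsigned int dev, unsigned int major)
{
    return MAJOR(dev) == major &&
           MINOR(dev) - LEGACY_BASEMINOR < LEGACY_COUNT;
}
```

Minor 255 of the registered major falls inside the claimed range and minor 256 does not, even though the 20-bit minor field could address far more devices.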

Avoiding Linuxisms:
https://people.freebsd.org/~olivierd/porters-handbook/dads-avoiding-linuxisms.html

Do not use /proc if there are any other ways of getting the information. For example, setprogname(argv[0]) in main() and then getprogname(3) to know the executable name.

What happens if the kernel holds a pointer to a piece of memory in a user process's segment and dereferences it after a context switch?

perf: interrupt took too long (2668 > 2500), lowering kernel.perf_event_max_sample_rate to 74700

/proc/modules vs /sys/module

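Seeing the same concept come up in two seemingly unrelated places greatly deepens understanding and retention. Two pieces of material from different sources have the same effect, for example something read in the course material versus xxx in the wild. Discussing news, projects, or odd ideas with others on forums and in chat groups is how you run into xxx in the wild.
So subscribing to LKML is important as well, or at least to LWN or Phoronix. Just don't spend too much time on them: early on you cannot yet see the whole picture or the key points, so even a deep dive amounts to skimming.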

local_bh_disable(), the big softirq lock
https://www.youtube.com/watch?v=rmv40f5K8AI
The live demo in it is pretty comprehensive

The spinlock example does not show any contender actually being blocked by the lock and spinning for a while.
Might as well write one.
A spinlock-in-userspace incident in 2020
https://news.ycombinator.com/item?id=21959692
https://www.phoronix.com/news/Linux-2020-Scheduler-Bugs-Stadia
https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/
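
The contention that the kernel spinlock example does not show can at least be made visible in user space. Below is a minimal sketch, assuming C11 atomics and POSIX threads are available. It is not the kernel's spinlock, and every demo_ name is made up for illustration. Two threads fight over one lock, and each failed acquisition attempt is counted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define ITERATIONS 100000

atomic_flag demo_lock = ATOMIC_FLAG_INIT;
atomic_ulong demo_spins;        /* failed attempts to take the lock */
unsigned long demo_counter;     /* shared data protected by demo_lock */

void demo_spin_lock(void)
{
    /* test-and-set returns the old value: true means someone holds it */
    while (atomic_flag_test_and_set_explicit(&demo_lock,
                                             memory_order_acquire))
        atomic_fetch_add_explicit(&demo_spins, 1, memory_order_relaxed);
}

void demo_spin_unlock(void)
{
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

void *demo_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        demo_spin_lock();
        demo_counter++;         /* critical section */
        demo_spin_unlock();
    }
    return NULL;
}

/* Run two contending workers and return the final counter value. */
unsigned long demo_run(void)
{
    pthread_t a, b;
    int err;

    demo_counter = 0;
    atomic_store(&demo_spins, 0);
    err = pthread_create(&a, NULL, demo_worker, NULL);
    assert(err == 0);
    err = pthread_create(&b, NULL, demo_worker, NULL);
    assert(err == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    (void)err;
    return demo_counter;
}
```

demo_run() should return 2 * ITERATIONS, and demo_spins then holds how often a thread found the lock taken. That number varies from run to run and can be small on a lightly loaded machine. Depending on the toolchain, linking may need -pthread.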

Need to run through all of the example code.

Try calling local_irq_disable without the matching enable. lockdep tracks disable/enable, but does it stop unpaired use?

work stealing

What's the difference between the following functions?

void disable_irq(unsigned int irq)

  • disable an irq and wait for completion
    Parameters: unsigned int irq: Interrupt to disable

    Disable the selected interrupt line. Enables and Disables are nested. This function waits for any pending IRQ handlers for this interrupt to complete before returning. If you use this function while holding a resource the IRQ handler may need you will deadlock.
    Can only be called from preemptible code as it might sleep when an interrupt thread is associated to irq.

bool disable_hardirq(unsigned int irq)

  • disables an irq and waits for hardirq completion
    Parameters: unsigned int irq: Interrupt to disable

    Disable the selected interrupt line. Enables and Disables are nested. This function waits for any pending hard IRQ handlers for this interrupt to complete before returning. If you use this function while holding a resource the hard IRQ handler may need you will deadlock.
    When used to optimistically disable an interrupt from atomic context the return value must be checked.

    Return:
    false if a threaded handler is active.

    This function may be called - with care - from IRQ context.
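
Both descriptions say that enables and disables are nested. A rough user-space model of that nesting, under the assumption that a per-line depth counter decides when the line is really masked: struct model_irq_line and both functions are hypothetical, and the model leaves out the waiting for running handlers entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one interrupt line, not the kernel's irq_desc. */
struct model_irq_line {
    unsigned int depth;     /* number of outstanding disables */
    bool masked;            /* true while the line is shut off */
};

void model_disable_irq(struct model_irq_line *line)
{
    if (line->depth++ == 0)
        line->masked = true;    /* only the first disable masks the line */
}

void model_enable_irq(struct model_irq_line *line)
{
    assert(line->depth > 0);    /* an unbalanced enable is a bug */
    if (--line->depth == 0)
        line->masked = false;   /* only the last enable unmasks it */
}
```

Two disables followed by one enable leave the line masked; it takes the second enable to let interrupts through again.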

local_irq_save()/local_irq_restore()
Defined in include/linux/irqflags.h

  • These routines disable hard interrupts on the local CPU, and restore them. They are reentrant; saving the previous state in their one unsigned long flags argument. If you know that interrupts are enabled, you can simply use local_irq_disable() and local_irq_enable().
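
Why the saved flags matter shows up as soon as critical sections nest. The sketch below is a user-space model with invented model_ names. The real local_irq_save() is a macro that writes into its flags argument, whereas the model returns the saved state to stay plain C.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the local CPU's interrupt-enable flag. */
bool model_irqs_enabled = true;

void model_local_irq_disable(void) { model_irqs_enabled = false; }
void model_local_irq_enable(void)  { model_irqs_enabled = true; }

/* Remember the previous state, then disable. */
unsigned long model_local_irq_save(void)
{
    unsigned long flags = model_irqs_enabled ? 1UL : 0UL;

    model_irqs_enabled = false;
    return flags;
}

/* Put back whatever state was saved, enabled or not. */
void model_local_irq_restore(unsigned long flags)
{
    model_irqs_enabled = (flags != 0);
}

/* An inner critical section that is safe to call from any state. */
void inner_section_save_restore(void)
{
    unsigned long flags = model_local_irq_save();

    assert(!model_irqs_enabled);    /* critical work would go here */
    model_local_irq_restore(flags);
}

/* An inner critical section that assumes interrupts were enabled. */
void inner_section_disable_enable(void)
{
    model_local_irq_disable();
    assert(!model_irqs_enabled);    /* critical work would go here */
    model_local_irq_enable();       /* enables unconditionally */
}
```

Called with interrupts already disabled, inner_section_save_restore() leaves them disabled, while inner_section_disable_enable() turns them back on in the middle of the caller's critical section.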

local_bh_disable()/local_bh_enable()
Defined in include/linux/bottom_half.h

  • These routines disable soft interrupts on the local CPU, and restore them. They are reentrant; if soft interrupts were disabled before, they will still be disabled after this pair of functions has been called. They prevent softirqs and tasklets from running on the current CPU.
  • atomics on ARMv5 call local_bh_disable before the operation and enable again afterwards

local_irq_disable()/local_irq_enable()

  • shuts off local interrupt delivery without saving the state
  • Use the local variant whenever possible? Not sure when disable_irq, which shuts the line off globally, would be used
  • Go look at how other people's drivers do it

Details explained here
There are better approaches than disabling interrupts for improving real-time capability

Preventing preemption using interrupt disabling

Sent a patch that removes a redundant parenthesis from the generic IRQ documentation
https://lore.kernel.org/lkml/20240619160057.128208-1-idoleat@taiker.tw/T/#u

Applied by Jon Corbet.

Understanding Linux Interrupt Subsystem - Priya Dixit, Samsung Semiconductor India Research
https://www.youtube.com/watch?v=LOCsN3V1ECE

in __init: request_irq to register handler to irq number
in __exit: free_irq to unregister from the irq number