# Linux Kernel Deferred Work Note

Traditionally, interrupt handling is split into two halves:

- top half: receives the hardware interrupt
  - handles the immediate, time-critical part of interrupt processing
  - runs with some (or all) interrupt levels disabled
  - is often time-critical and deals directly with the hardware
  - does not run in process context and cannot block
- bottom half (the main focus of this note)
  - handles the less time-critical part of interrupt processing
  - Types of bottom halves
    - Softirq
    - Tasklets
    - Workqueues

## softirq

- statically allocated
- fixed number of softirq vectors: 10 are currently defined (`NR_SOFTIRQS`); in newer kernels `BLOCK_IOPOLL_SOFTIRQ` is named `IRQ_POLL_SOFTIRQ`, which is why `/proc/softirqs` below shows `IRQ_POLL`

```c
enum {
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	BLOCK_IOPOLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,

	NR_SOFTIRQS
};
```

- [ ] [IRQs: the Hard, the Soft, the Threaded and the Preemptible](http://she-devel.com/Chaiken_ELCE2016.pdf)

![image](https://hackmd.io/_uploads/rJ9TzvY8R.png)

Per-CPU softirq counts can be checked with `/proc/softirqs`:

```
# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:         34          0          0          0
       TIMER:        494      26428      52205      54076
      NET_TX:          0          0          0          0
      NET_RX:          0          0          0          0
       BLOCK:          0          0          0          0
    IRQ_POLL:          0          0          0          0
     TASKLET:         68          1          0          0
       SCHED:       3099      27294     100455      62819
     HRTIMER:          0          0          0          0
         RCU:       3799      10153      81440      45881
```

### Example in kernel - Network Packet Processing

- When the network interface card receives a packet, it triggers a hardware interrupt. The handling is split into two parts: the urgent top half and the deferrable bottom half (using softirqs)
- The deferred processing is handled by the `NET_RX_SOFTIRQ` softirq
- Top half (interrupt handler, simplified)

```c
static irqreturn_t nic_interrupt_handler(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct sk_buff *skb;

	/* Acknowledge the interrupt to the NIC */
	ack_interrupt(dev);

	/* Read the packet into an skb (socket buffer) */
	skb = dev_alloc_skb(dev->mtu + NET_IP_ALIGN);
	if (!skb)
		return IRQ_HANDLED;	/* Drop packet if no memory */

	skb_reserve(skb, NET_IP_ALIGN);
	nic_read_packet(dev, skb);

	/* Hand the packet to the stack; this schedules the softirq */
	netif_rx(skb);

	return IRQ_HANDLED;
}
```

  + `netif_rx`: schedules the softirq for further processing of the packet (--> `netif_rx_internal`)
  + The call chain eventually reaches `__napi_schedule` (--> `____napi_schedule`), which raises the softirq

```c
static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	/* add this device's napi to the per-CPU poll list */
	list_add_tail(&napi->poll_list, &sd->poll_list);
	...
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
```

- Bottom half (softirq handler) *linux/net/core/dev.c*

```c
static int __init net_dev_init(void)
{
	...
	open_softirq(NET_RX_SOFTIRQ, net_rx_action);
	...
}

static __latent_entropy void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	LIST_HEAD(list);
	LIST_HEAD(repoll);
	...
start:
	sd->in_net_rx_action = true;
	local_irq_disable();
	list_splice_init(&sd->poll_list, &list);
	local_irq_enable();

	for (;;) {
		struct napi_struct *n;
		...
		/* walk every device's napi until all of them are processed */
		if (list_empty(&list)) {
			if (list_empty(&repoll)) {
				sd->in_net_rx_action = false;
				barrier();
				if (!list_empty(&sd->poll_list))
					goto start;
				if (!sd_has_rps_ipi_waiting(sd))
					goto end;
			}
			break;
		}

		n = list_first_entry(&list, struct napi_struct, poll_list);
		budget -= napi_poll(n, &repoll);
		...
	}

	local_irq_disable();
	/* put what is left back on sd->poll_list */
	list_splice_tail_init(&sd->poll_list, &list);
	list_splice_tail(&repoll, &list);
	list_splice(&list, &sd->poll_list);
	...
end:;
}
```

  + `net_rx_action` is the handler of the `NET_RX_SOFTIRQ` softirq. It takes every device's `napi` off `sd->poll_list` and polls them one by one
  + If the time limit or the budget is exceeded, the `napi` is put back on `sd->poll_list` and waits for the next `NET_RX_SOFTIRQ` round

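Before moving on to tasklets, the general softirq pattern used above can be summarized in a short sketch. This is illustrative only, not code from any driver: `MY_SOFTIRQ`, `my_softirq_handler` and `my_device_isr` are hypothetical names, and because the softirq vectors are fixed in the enum above, real code cannot add a new one at runtime; it either owns one of the existing vectors or falls back to tasklets/workqueues.

```c
#include <linux/interrupt.h>

/*
 * Bottom half: runs in softirq context, so it must not sleep.
 * MY_SOFTIRQ is hypothetical; real code reuses an entry of the fixed enum.
 */
static void my_softirq_handler(struct softirq_action *h)
{
	/* deferred, non-urgent part of the interrupt work goes here */
}

static int __init my_subsystem_init(void)
{
	/* register the handler for the vector, as net_dev_init() does */
	open_softirq(MY_SOFTIRQ, my_softirq_handler);
	return 0;
}

/* Top half: acknowledge the hardware, then defer the rest */
static irqreturn_t my_device_isr(int irq, void *dev_id)
{
	/* ... urgent, device-specific work ... */

	/* mark the softirq pending; it runs at irq_exit() or in ksoftirqd */
	raise_softirq(MY_SOFTIRQ);
	return IRQ_HANDLED;
}
```
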
## tasklets

- built on top of the softirq mechanism
- ***Because tasklets are implemented on top of softirqs, they are softirqs.***
- used far more often in the kernel than raw softirqs
- run in software interrupt context
- support two priorities: two per-CPU vectors are initialized, `tasklet_hi_vec` (served by `HI_SOFTIRQ`) and `tasklet_vec` (served by `TASKLET_SOFTIRQ`)

![image](https://hackmd.io/_uploads/S1yY5_-IR.png)

- create tasklets (a minimal usage sketch follows the kernel example below)
  + `DECLARE_TASKLET(name, func, data)`
  + `void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data)`
- schedule tasklets
  + `void tasklet_schedule(struct tasklet_struct *t);`
  + `void tasklet_hi_schedule(struct tasklet_struct *t);`
- [ ] tasklet schedule flow

![image](https://hackmd.io/_uploads/r1tbM3jIA.png)

### Example in kernel - ASPEED Cryptographic Engine

- structure definition *linux/drivers/crypto/aspeed/aspeed-rsss.h*

```c
typedef int (*aspeed_rsss_fn_t)(struct aspeed_rsss_dev *);

struct aspeed_engine_rsa {
	struct tasklet_struct done_task;

	/* callback func */
	aspeed_rsss_fn_t resume;
};

struct aspeed_rsss_dev {
	struct aspeed_engine_rsa rsa_engine;
};
```

- initialize the tasklet *linux/drivers/crypto/aspeed/aspeed-rsss-rsa.c*

```c
int aspeed_rsss_rsa_init(struct aspeed_rsss_dev *rsss_dev)
{
	tasklet_init(&rsa_engine->done_task, aspeed_rsa_done_task,
		     (unsigned long)rsss_dev);
}
```

- schedule the tasklet *linux/drivers/crypto/aspeed/aspeed-rsss.c*

```c
static irqreturn_t aspeed_rsss_irq(int irq, void *dev)
{
	struct aspeed_rsss_dev *rsss_dev = (struct aspeed_rsss_dev *)dev;
	struct aspeed_engine_rsa *rsa_engine = &rsss_dev->rsa_engine;
	...
	if (rsa_engine->flags & CRYPTO_FLAGS_BUSY)
		tasklet_schedule(&rsa_engine->done_task);
	else
		dev_err(rsss_dev->dev, "RSA no active requests.\n");
	...
}
```

- done task *linux/drivers/crypto/aspeed/aspeed-rsss-rsa.c*

```c
/* the resume callback is assigned to aspeed_rsa_transfer */
static int aspeed_rsa_trigger(struct aspeed_rsss_dev *rsss_dev)
{
	...
	rsa_engine->resume = aspeed_rsa_transfer;
	...
}
```

```c
/* Process the completed cryptographic operation */
static void aspeed_rsa_done_task(unsigned long data)
{
	struct aspeed_rsss_dev *rsss_dev = (struct aspeed_rsss_dev *)data;
	struct aspeed_engine_rsa *rsa_engine = &rsss_dev->rsa_engine;

	(void)rsa_engine->resume(rsss_dev);
}
```

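Tying the tasklet APIs above together, here is a minimal, self-contained sketch of the same pattern the ASPEED driver uses. The names `my_dev`, `my_tasklet_fn`, `my_irq_handler` and `my_probe` are hypothetical, and it assumes the classic `tasklet_init()`/`unsigned long data` API shown above (newer kernels prefer `tasklet_setup()` with a `struct tasklet_struct *` callback).

```c
#include <linux/interrupt.h>

struct my_dev {
	struct tasklet_struct done_task;
	/* ... device state ... */
};

/* Bottom half: runs in softirq context, must not sleep */
static void my_tasklet_fn(unsigned long data)
{
	struct my_dev *dev = (struct my_dev *)data;

	/* deferred completion work for dev goes here */
}

/* Top half: acknowledge the hardware, then defer the rest */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
	struct my_dev *dev = dev_id;

	tasklet_schedule(&dev->done_task);
	return IRQ_HANDLED;
}

static int my_probe(struct my_dev *dev)
{
	tasklet_init(&dev->done_task, my_tasklet_fn, (unsigned long)dev);
	/* request_irq(..., my_irq_handler, ..., dev) would follow here */
	return 0;
}
```

On teardown, `tasklet_kill()` should be called so the tasklet is guaranteed not to be scheduled or running when the device data is freed.
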
## workqueue

- runs in process context: work functions are executed by kernel worker threads (`worker_thread`), so they may sleep and do not have to be atomic
- a work item is described by `struct work_struct`

```c
struct work_struct {
	atomic_long_t data;
	struct list_head entry;
	work_func_t func;
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};
```

  + func: the function that the worker thread will run
  + data: used internally by the workqueue core (flags and pool information); drivers usually recover their own context with `container_of()`, as in the ftgmac100 example below
- create work
  * static creation

```c
#define DECLARE_WORK(n, f)						\
	struct work_struct n = __WORK_INITIALIZER(n, f)
```

  + n: name of the `work_struct` variable
  + f: work function
  * runtime creation

```c
#define INIT_WORK(_work, _func)						\
	__INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)				\
	do {								\
		__init_work((_work), _onstack);				\
		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
		INIT_LIST_HEAD(&(_work)->entry);			\
		(_work)->func = (_func);				\
	} while (0)

/* delayed version */
#define INIT_DELAYED_WORK(_work, _func)					\
	__INIT_DELAYED_WORK(_work, _func, 0)

#define __INIT_DELAYED_WORK(_work, _func, _tflags)			\
	do {								\
		INIT_WORK(&(_work)->work, (_func));			\
		__init_timer(&(_work)->timer,				\
			     delayed_work_timer_fn,			\
			     (_tflags) | TIMER_IRQSAFE);		\
	} while (0)
```

  + _work: the work item (`work_struct` / `delayed_work`)
  + _func: work function
- create a workqueue
  + `create_workqueue(name)`: creates a `worker_thread` on every CPU
  + `create_singlethread_workqueue(name)`: creates a single `worker_thread` on one CPU only
  + Linux also provides default system workqueues
    + `system_wq`
    + `system_highpri_wq`
- schedule work (see the sketch after the kernel example below)
  + `bool schedule_work(struct work_struct *work);`
  + `static inline bool schedule_delayed_work(struct delayed_work *dwork, unsigned long delay)`
- After a work item is created, it can be queued on a specific workqueue with `queue_work` or `queue_delayed_work`

```c
static inline bool queue_work(struct workqueue_struct *wq,
			      struct work_struct *work)
{
	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
```

  + `queue_work` -> `queue_work_on` -> `__queue_work`
- Summary table

| Feature | `INIT_WORK` | `INIT_DELAYED_WORK` |
| -------------------- | --------------------- | -------------------------- |
| **Execution Timing** | As soon as scheduled | After a specified delay |
| **Queue Function** | `schedule_work()` | `schedule_delayed_work()` |
| **Purpose** | Immediate tasks | Deferred or periodic tasks |
| **Struct Used** | `struct work_struct` | `struct delayed_work` |

### Example in kernel - Ethernet MAC driver

- structure

```c
struct ftgmac100 {
	struct work_struct reset_task;
};
```

- initialize the work item *linux/drivers/net/ethernet/faraday/ftgmac100.c*

```c
static int ftgmac100_probe(struct platform_device *pdev)
{
	struct ftgmac100 *priv;
	...
	/* setup private data */
	priv = netdev_priv(netdev);
	priv->netdev = netdev;
	priv->dev = &pdev->dev;

	INIT_WORK(&priv->reset_task, ftgmac100_reset_task);
}
```

- schedule the work *linux/drivers/net/ethernet/faraday/ftgmac100.c*

```c
static irqreturn_t ftgmac100_interrupt(int irq, void *dev_id)
{
	struct net_device *netdev = dev_id;
	struct ftgmac100 *priv = netdev_priv(netdev);
	...
	/* AHB error -> Reset the chip */
	if (status & FTGMAC100_INT_AHB_ERR) {
		if (net_ratelimit())
			netdev_warn(netdev, "AHB bus error ! Resetting chip.\n");
		iowrite32(0, priv->base + FTGMAC100_OFFSET_IER);
		schedule_work(&priv->reset_task);
		return IRQ_HANDLED;
	}
	...
}
```

- reset task *linux/drivers/net/ethernet/faraday/ftgmac100.c*

```c
static void ftgmac100_reset_task(struct work_struct *work)
{
	struct ftgmac100 *priv = container_of(work, struct ftgmac100,
					      reset_task);

	ftgmac100_reset(priv);
}
```

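Pulling the workqueue pieces together, here is a minimal, self-contained sketch of the same pattern the ftgmac100 driver uses: it queues on the default system workqueue via `schedule_work()`. The names `my_dev`, `my_work_fn`, `my_irq_handler`, `my_probe` and `my_remove` are hypothetical.

```c
#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct my_dev {
	struct work_struct recovery_work;
	/* ... device state ... */
};

/* Runs in process context (a kernel worker thread), so it may sleep */
static void my_work_fn(struct work_struct *work)
{
	struct my_dev *dev = container_of(work, struct my_dev, recovery_work);

	/* slow recovery work: sleeping locks, GFP_KERNEL allocations, ... */
}

/* Top half: cannot sleep, so it only queues the work item */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
	struct my_dev *dev = dev_id;

	schedule_work(&dev->recovery_work);
	return IRQ_HANDLED;
}

static int my_probe(struct my_dev *dev)
{
	INIT_WORK(&dev->recovery_work, my_work_fn);
	return 0;
}

static void my_remove(struct my_dev *dev)
{
	/* wait for any queued/running instance to finish before freeing dev */
	cancel_work_sync(&dev->recovery_work);
}
```
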
## Conclusion

| Context | Can Sleep? | Typical Use | Example API |
| ------------ | ---------- | ------------------------------------------------------------------------------ | ---------------------------------- |
| Hard IRQ | No | Immediate interrupt work | IRQ handler (`request_irq`) |
| Softirq | No | Deferred, fast, atomic | `open_softirq` |
| Tasklet | No | Even more deferred, serialized | `tasklet_init`, `tasklet_schedule` |
| Workqueue | Yes | Work that cannot run in IRQ context but does not need its own dedicated kernel thread | `schedule_work` |
| Threaded IRQ | Yes | Deferred, may sleep | `devm_request_threaded_irq` |
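The threaded IRQ row is the only mechanism in the table not shown earlier, so here is a minimal sketch of that pattern. The names `my_hardirq`, `my_irq_thread` and `my_probe` are hypothetical; the hard handler does only the atomic part and returns `IRQ_WAKE_THREAD`, and the thread function then runs in process context where it may sleep.

```c
#include <linux/interrupt.h>

/* Hard IRQ part: atomic, cannot sleep; just quiesce the device */
static irqreturn_t my_hardirq(int irq, void *dev_id)
{
	/* e.g. mask/acknowledge the interrupt source */
	return IRQ_WAKE_THREAD;	/* hand off to the threaded handler */
}

/* Threaded part: runs in a dedicated kernel thread, may sleep */
static irqreturn_t my_irq_thread(int irq, void *dev_id)
{
	/* slow handling: I2C/SPI transfers, mutexes, memory allocation, ... */
	return IRQ_HANDLED;
}

static int my_probe(struct device *dev, int irq, void *priv)
{
	/* devm_*: the IRQ is released automatically when the device goes away */
	return devm_request_threaded_irq(dev, irq, my_hardirq, my_irq_thread,
					 IRQF_ONESHOT, "my_device", priv);
}
```
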