Sterownik ethernetu

# Sterownik ethernetu Krótki wstęp na [bugzilla.kernel.org](https://bugzilla.kernel.org/show_bug.cgi?id=102171). TLDR: [ta funkcja](https://code.woboq.org/linux/linux/drivers/net/ethernet/atheros/alx/main.c.html#alx_intr_handle_misc) czasem zaspamowuje logi (`fatal interrupt 0x4001607, resetting`), i jak się spróbuje wyłączyć ten sterownik, to system zamiera proces po procesie. Winny jest pewnie sprzęt, ale świrująca karta sieciowa powinna wywalać co najwyżej swój sterownik, a nie stos sieciowy, czy całe jądro. Co robić jak żyć? Na szczęście jest [Kdump](https://www.kernel.org/doc/html/latest/admin-guide/kdump/kdump.html) i [crash](https://people.redhat.com/~anderson/) (dostępny pakiet na redhatach i gentoo), żeby obejrzeć, co się dzieje. ## Co mówi nam `crash` Pierwszy zrzut pamięci zrobiłem wykonując `modprobe -r alx` (usunięcie modułu) i doprowadzając tym do jakiegoś `OOPS`, a że ustawiłem ```sh CONFIG_PANIC_ON_OOPS=y CONFIG_PANIC_ON_OOPS_VALUE=1 ``` (można to zrobić równie dobrze w linii poleceń jądra), to wysypało się. uruchamiamy: ```sh $ crash vmlinux alx-rmmod.core KERNEL: vmlinux DUMPFILE: alx-rmmod.core CPUS: 4 DATE: Sat Apr 11 12:38:00 2020 UPTIME: 19:08:47 LOAD AVERAGE: 2.51, 2.80, 2.51 TASKS: 240 NODENAME: arusekk RELEASE: 5.6.2-gentoo VERSION: #4 SMP Wed Apr 8 15:40:31 CEST 2020 MACHINE: x86_64 (2494 Mhz) MEMORY: 7.7 GB PANIC: "Oops: 0000 [#1] SMP" (check log for details) PID: 2418 COMMAND: "kworker/3:4" TASK: ffff888255d62d80 [THREAD_INFO: ffff888255d62d80] CPU: 3 STATE: TASK_RUNNING (PANIC) ``` Możemy sprawdzić, na co czekało `modprobe -r alx` (pid 6884 poznałem wpisując ps) ``` crash> bt 6884 PID: 6884 TASK: ffff888264a91080 CPU: 0 COMMAND: "modprobe" #0 [ffffc90001c6fc40] __schedule at ffffffff8191a8c4 #1 [ffffc90001c6fcc0] schedule at ffffffff8191ad25 #2 [ffffc90001c6fcd8] schedule_timeout at ffffffff8191df34 #3 [ffffc90001c6fd50] wait_for_completion at ffffffff8191bdbe #4 [ffffc90001c6fda8] rcu_barrier at ffffffff811096b0 #5 [ffffc90001c6fdc0] netdev_run_todo at ffffffff81830358 #6 [ffffc90001c6fe20] alx_remove at ffffffffa01a4b59 [alx] #7 [ffffc90001c6fe38] pci_device_remove at ffffffff814ca041 #8 [ffffc90001c6fe58] device_release_driver_internal at ffffffff816f6d5c #9 [ffffc90001c6fe78] driver_detach at ffffffff816f6e8a ##10 [ffffc90001c6fe90] bus_remove_driver at ffffffff816f5c6a ##11 [ffffc90001c6fea0] pci_unregister_driver at ffffffff814c873f ##12 [ffffc90001c6fed0] __x64_sys_delete_module at ffffffff81129dea ##13 [ffffc90001c6ff38] do_syscall_64 at ffffffff81001b21 ##14 [ffffc90001c6ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a00068 RIP: 00007f3376509f57 RSP: 00007ffd5b3e7e48 RFLAGS: 00000206 RAX: ffffffffffffffda RBX: 0000555d882d5b70 RCX: 00007f3376509f57 RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000555d882d5bd8 RBP: 0000555d882d5b70 R8: 0000000000000000 R9: 0000000000000000 R10: fffffffffffffcf9 R11: 0000000000000206 R12: 0000555d882d5bd8 R13: 0000000000000000 R14: 0000000000000000 R15: 0000555d882d5b70 ORIG_RAX: 00000000000000b0 CS: 0033 SS: 002b ``` No fajnie, to teraz sprawdźmy, skąd się wziął popłoch: ``` crash> bt -p PID: 2418 TASK: ffff888255d62d80 CPU: 3 COMMAND: "kworker/3:4" #0 [ffffc90002697bf8] machine_kexec at ffffffff81040103 #1 [ffffc90002697c40] __crash_kexec at ffffffff8112f008 #2 [ffffc90002697d08] crash_kexec at ffffffff8112fc80 #3 [ffffc90002697d18] oops_end at ffffffff8101514e #4 [ffffc90002697d38] no_context at ffffffff81047f43 #5 [ffffc90002697da0] page_fault at ffffffff81a00ecf [exception RIP: alx_reinit_rings+39] RIP: ffffffffa01a6177 RSP: ffffc90002697e58 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888262589880 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888262589880 RBP: ffff888262589880 R8: ffff888266000240 R9: ffff888266000300 R10: 0000000000000000 R11: ffffffff822241c8 R12: ffff8882674e1600 R13: 0000000000000000 R14: ffff8882674e5400 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #6 [ffffc90002697e60] alx_activate at ffffffffa01a62a9 [alx] #7 [ffffc90002697e70] alx_reset at ffffffffa01a66d5 [alx] #8 [ffffc90002697e80] process_one_work at ffffffff810be9b7 #9 [ffffc90002697eb8] worker_thread at ffffffff810bf3d5 ##10 [ffffc90002697f10] kthread at ffffffff810c5c5a ##11 [ffffc90002697f50] ret_from_fork at ffffffff81a0015f ``` Jak widać, jest to wątek jądra, jakiś `worker_thread` obsługujący reset urządzenia. Przekonajmy się, co to za kod się wyłożył. ```c crash> add-symbol-file /usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx/alx.ko 0xffffffffa01a4000 add symbol table from file "/usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx/alx.ko" at .text_addr = 0xffffffffa01a4000 Reading symbols from /usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx/alx.ko...done. crash> disassemble /m alx_reinit_rings Dump of assembler code for function alx_reinit_rings: 451 static void alx_init_ring_ptrs(struct alx_priv *alx) 452 { 453 struct alx_hw *hw = &alx->hw; 454 u32 addr_hi = ((u64)alx->descmem.dma) >> 32; 455 struct alx_napi *np; 456 int i; 457 458 for (i = 0; i < alx->num_napi; i++) { 0xffffffffa01a6165 <+21>: test %eax,%eax 0xffffffffa01a6167 <+23>: jle 0xffffffffa01a6221 <alx_reinit_rings+209> 0xffffffffa01a616d <+29>: xor %edx,%edx 0xffffffffa01a6211 <+193>: add $0x1,%rdx 0xffffffffa01a6215 <+197>: cmp %edx,0x278(%rbp) 0xffffffffa01a621b <+203>: jg 0xffffffffa01a616f <alx_reinit_rings+31> 459 np = alx->qnapi[i]; 0xffffffffa01a616f <+31>: mov 0x230(%rbp,%rdx,8),%rax 460 if (np->txq) { 0xffffffffa01a6177 <+39>: mov 0x190(%rax),%rcx ``` Rejestr `rax` (wskazujący na np) zawiera zero, stąd też nie możemy odczytać, co w nim jest. ## Prawdziwa przyczyna Wnioskuję łatwo, że rejestr `rbp` wskazuje na `struct alx_priv` ```c crash> ptype (struct alx_priv*)0xffff888262589880 type = struct alx_priv { struct net_device *dev; struct alx_hw hw; int num_vec; struct { dma_addr_t dma; void *virt; unsigned int size; } descmem; struct alx_napi *qnapi[8]; int num_txq; int num_rxq; int num_napi; spinlock_t irq_lock; u32 int_mask; unsigned int tx_ringsz; unsigned int rx_ringsz; unsigned int rxbuf_size; struct work_struct link_check_wk; struct work_struct reset_wk; u16 msg_enable; spinlock_t stats_lock; } * crash> p ((struct alx_priv*)0xffff888262589880)->qnapi $1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0} ``` No to widzimy, że ktoś wyczyścił tablicę `qnapi`, kiedy my jeszcze wykonujemy `alx_activate`. I w tym problem, że w ogóle je wykonujemy. ## Kod sterownika ```sh $ cd /usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx ``` Interesuje nas tu tylko `main.c`. Zajmijmy się póki co wątkiem przerwania (tym, który dokonał dereferencji zera). Ze stosu wywołań widać, że wykonuje się w kontekście zadania `alx_reset`. Jedyne odwołanie do `alx_reset` to przypisanie go do obsługi zadania `reset_wk`, ustawianego w `alx_probe`. ```c=1830 INIT_WORK(&alx->reset_wk, alx_reset); ``` Ta jest kolejkowana przez ```c=212 static void alx_schedule_reset(struct alx_priv *alx) { schedule_work(&alx->reset_wk); } ``` Wywołaną przez ```c=331 static bool alx_intr_handle_misc(struct alx_priv *alx, u32 intr) { struct alx_hw *hw = &alx->hw; if (intr & ALX_ISR_FATAL) { netif_warn(alx, hw, alx->dev, "fatal interrupt 0x%x, resetting\n", intr); alx_schedule_reset(alx); return true; } ``` w kontekście przerwania sprzętowego. Teraz przejdźmy do wątku `modprobe -r alx`. Wyzerował on `qnapi`, które jest zerowane tylko w `alx_free_napis`, które wykonywane jest tylko przez `alx_stop` i podczas błędów inicjalizacji. ```c=1856 static void alx_remove(struct pci_dev *pdev) { struct alx_priv *alx = pci_get_drvdata(pdev); struct alx_hw *hw = &alx->hw; cancel_work_sync(&alx->link_check_wk); cancel_work_sync(&alx->reset_wk); /* restore permanent mac address */ alx_set_macaddr(hw, hw->perm_addr); unregister_netdev(alx->dev); /* dalej nie dochodzi */ ``` Tu pojawiają się kolejne pliki z kodu źródłowego jądra: `net/core/dev.c` ```c=9985 void unregister_netdev(struct net_device *dev) { rtnl_lock(); unregister_netdevice(dev); rtnl_unlock(); } ``` Tu wisi proces czekając na nigdy: ```c=9571 /* The sequence is: * * rtnl_lock(); * ... * register_netdevice(x1); * register_netdevice(x2); * ... * unregister_netdevice(y1); * unregister_netdevice(y2); * ... * rtnl_unlock(); * free_netdev(y1); * free_netdev(y2); * * We are invoked by rtnl_unlock(). * This allows us to deal with problems: * 1) We can delete sysfs objects which invoke hotplug * without deadlocking with linkwatch via keventd. * 2) Since we run with the RTNL semaphore not held, we can sleep * safely in order to wait for the netdev refcnt to drop to zero. * * We must not return until all unregister events added during * the interval the lock was held have been completed. */ void netdev_run_todo(void) { struct list_head list; /* Snapshot list, allow later requests */ list_replace_init(&net_todo_list, &list); __rtnl_unlock(); /* Wait for rcu callbacks to finish before next phase */ if (!list_empty(&list)) rcu_barrier(); /* <==== */ ``` Więc interesuje nas poprzednia funkcja (z `include/linux/netdevice.h`) ```c=2713 static inline void unregister_netdevice(struct net_device *dev) { unregister_netdevice_queue(dev, NULL); } ``` znów `net/core/dev.c` ```c=9940 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head) { ASSERT_RTNL(); if (head) { list_move_tail(&dev->unreg_list, head); } else { rollback_registered(dev); /* Finish processing unregister after unlock */ net_set_todo(dev); } } ``` dalej: ```c=8850 static void rollback_registered(struct net_device *dev) { LIST_HEAD(single); list_add(&dev->unreg_list, &single); rollback_registered_many(&single); list_del(&single); } ``` czyli ```c=8759 static void rollback_registered_many(struct list_head *head) { struct net_device *dev, *tmp; LIST_HEAD(close_head); BUG_ON(dev_boot_phase); ASSERT_RTNL(); list_for_each_entry_safe(dev, tmp, head, unreg_list) { /* Some devices call without registering * for initialization unwind. Remove those * devices and proceed with the remaining. */ if (dev->reg_state == NETREG_UNINITIALIZED) { pr_debug("unregister_netdevice: device %s/%p never was registered\n", dev->name, dev); WARN_ON(1); list_del(&dev->unreg_list); continue; } dev->dismantle = true; BUG_ON(dev->reg_state != NETREG_REGISTERED); } /* If device is running, close it first. */ list_for_each_entry(dev, head, unreg_list) list_add_tail(&dev->close_list, &close_head); dev_close_many(&close_head, true); /* dalej nieistotne */ ``` co z kolei okazuje się wykonywać `ndo_stop()` ```c=1483 static void __dev_close_many(struct list_head *head) { struct net_device *dev; ASSERT_RTNL(); might_sleep(); list_for_each_entry(dev, head, close_list) { /* Temporarily disable netpoll until the interface is down */ netpoll_poll_disable(dev); call_netdevice_notifiers(NETDEV_GOING_DOWN, dev); clear_bit(__LINK_STATE_START, &dev->state); /* Synchronize to scheduled poll. We cannot touch poll list, it * can be even on different cpu. So just clear netif_running(). * * dev->stop() will invoke napi_disable() on all of it's * napi_struct instances on this device. */ smp_mb__after_atomic(); /* Commit netif_running(). */ } dev_deactivate_many(head); list_for_each_entry(dev, head, close_list) { const struct net_device_ops *ops = dev->netdev_ops; /* * Call the device specific close. This cannot fail. * Only if device is UP * * We allow it to be called even after a DETACH hot-plug * event. */ if (ops->ndo_stop) ops->ndo_stop(dev); /* <===== TUTAJ */ dev->flags &= ~IFF_UP; netpoll_poll_enable(dev); } } static void __dev_close(struct net_device *dev) { LIST_HEAD(single); list_add(&dev->close_list, &single); __dev_close_many(&single); list_del(&single); } void dev_close_many(struct list_head *head, bool unlink) { struct net_device *dev, *tmp; /* Remove the devices that don't need to be closed */ list_for_each_entry_safe(dev, tmp, head, close_list) if (!(dev->flags & IFF_UP)) list_del_init(&dev->close_list); __dev_close_many(head); list_for_each_entry_safe(dev, tmp, head, close_list) { rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL); call_netdevice_notifiers(NETDEV_DOWN, dev); if (unlink) list_del_init(&dev->close_list); } } ``` Za to nasze `ndo_stop` to `alx_stop` (powrót do drivers/net/ethernet/atheros/alx/main.c): ```c=1673 static const struct net_device_ops alx_netdev_ops = { .ndo_open = alx_open, .ndo_stop = alx_stop, ``` ```c=1340 static int alx_stop(struct net_device *netdev) { __alx_stop(netdev_priv(netdev)); return 0; } ``` ```c=1250 static void __alx_stop(struct alx_priv *alx) { alx_halt(alx); alx_free_irq(alx); alx_free_rings(alx); alx_free_napis(alx); /* <=== to on dealokuje rzeczy */ } ``` Uff. Gdzie zatem jest błąd? W pewnym sensie w `alx_remove`, w pewnym w `alx_(schedule_)reset`. Przypomnijmy kod `alx_remove`: ```c=1856 static void alx_remove(struct pci_dev *pdev) { struct alx_priv *alx = pci_get_drvdata(pdev); struct alx_hw *hw = &alx->hw; cancel_work_sync(&alx->link_check_wk); cancel_work_sync(&alx->reset_wk); /* restore permanent mac address */ alx_set_macaddr(hw, hw->perm_addr); unregister_netdev(alx->dev); /* dalej nie dochodzi */ ``` Otóż programista prawdobodobnie nie był świadomy, że `cancel_work_sync` nie chroni nas przed przyszłym `schedule_work`. ```c=212 static void alx_schedule_reset(struct alx_priv *alx) { schedule_work(&alx->reset_wk); } ``` Należałoby wprowadzić jakieś pole, w którym zapisywalibyśmy, że sterownik jest w trakcie wyłączania. [FreeBSD](http://bxr.su/FreeBSD/sys/dev/alc/if_alc.c#alc_int_task) rozwiązuje ten problem nieco inaczej. ## Rzeczywista przyczyna spamu `fatal int fatal interrupt 0x4001607, resetting` To jest trudniejsze, nie pokażę, dlaczego, bo nie wiem. ```sh $ crash vmlinux alx-sysrq.core KERNEL: vmlinux DUMPFILE: alx-sysrq.core CPUS: 4 DATE: Mon Apr 13 21:07:33 2020 UPTIME: 10:09:46 LOAD AVERAGE: 1.89, 3.07, 3.54 TASKS: 111 NODENAME: arusekk RELEASE: 5.6.2-gentoo VERSION: #4 SMP Wed Apr 8 15:40:31 CEST 2020 MACHINE: x86_64 (2494 Mhz) MEMORY: 7.7 GB PANIC: "Kernel panic - not syncing: sysrq triggered crash" PID: 0 COMMAND: "swapper/0" TASK: ffffffff8220f4c0 (1 of 4) [THREAD_INFO: ffffffff8220f4c0] CPU: 0 STATE: TASK_RUNNING (PANIC) ``` ```c crash> add-symbol-file /usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx/alx.ko 0xffffffffa001a000 add symbol table from file "/usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx/alx.ko" at .text_addr = 0xffffffffa001a000 Reading symbols from /usr/src/linux-5.6.2-gentoo/drivers/net/ethernet/atheros/alx/alx.ko...done. ``` Wygląda na to, że urządzenie zgłasza co chwilę NMI, że jest w złym stanie, a wszystkie rejestry ma wypełnione jedynkami. Jednak z jakiegoś powodu automatyczny reset zaprogramowany w module nie pomaga. ## Błędy (?) znalezione przy okazji Być może jest to jakiś błąd jądra, bo `cp /proc/vmcore /mnt/dane/krasze/alx-rmmod` nie działa, read zawodzi z ENOMEM (dziwne, bo czyta tylko po 4KiB na raz do pliku). Napisałem na szybko program, który robi `mem = mmap(NULL, size, PROT_READ, MAP_SHARED, vmcore_fd, 0)` i `write(STDOUT_FILENO, mem, size)`, zadziałał. Inny błąd, który znalazłem (akurat robiąc CTFa i próbując uruchomić Netscape 0.9 z 1994), to że moduł `ia32_aout` mapuje kod binarek ZMAGIC (format a.out) bez prawa do wykonania. Jak znajdę, o co chodzi w którejś z tych spraw, i naprawię, to może zgłoszę pierwszego patcha do jądra linuksa :)