One may find that their Linux machine always has one CPU under full load. In a system monitor such as top or htop, a thread called kworker takes up nearly 100% of a CPU.
This problem mostly happens on laptops. On my MSI GS60 6QD, with kernel Linux 6.8.7-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 17 Apr 2024 15:20:28 +0000 x86_64, the top two CPU consumers listed in top are kworker/0:1+kacpid and irq/9-acpi. kworker/0:1+kacpid constantly shows high CPU usage, and irq/9-acpi constantly takes around 15% as well. This drains the battery fast and drives the CPU fans hard. It happens right from boot.
kworker refers to the worker threads, managed in worker pools, that consume work items from workqueues (CMWQ, the Concurrency Managed Workqueue). Subsystems and drivers can create and queue work items through the workqueue API functions as they see fit. The numbers after kworker/ denote the CPU core (in this case, core 0) and the specific worker id. These numbers are determined when create_worker() is called (defined in kernel/workqueue.c). kacpid stands for Kernel ACPI Daemon; in the thread name it identifies the workqueue whose work items the worker is handling. ACPI is a standard for power management and hardware configuration. kacpid is responsible for ACPI-related tasks such as managing power events and handling ACPI interrupts.
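To make the naming scheme concrete, here is a minimal sketch that splits a kworker thread name into its parts. The pattern is an assumption based on my reading of kernel/workqueue.c: per-CPU workers are named kworker/<cpu>:<id>, unbound ones kworker/u<pool>:<id>, an optional H marks a high-priority pool, and a trailing +name or -name gives the workqueue being served now or most recently. The function name parse_kworker is made up for illustration.

```python
import re

# Assumed naming scheme (see the lead-in): "kworker/<cpu>:<id>" for per-CPU
# workers, "kworker/u<pool>:<id>" for unbound ones, an optional "H" for
# high-priority pools, then "+name" or "-name" for the workqueue.
KWORKER_RE = re.compile(
    r"^kworker/(?P<unbound>u?)(?P<pool>\d+):(?P<id>\d+)(?P<highpri>H?)"
    r"(?:(?P<state>[+-])(?P<wq>.+))?$"
)


def parse_kworker(comm):
    """Split a kworker thread name into its parts, or return None."""
    m = KWORKER_RE.match(comm)
    if m is None:
        return None
    return {
        # Unbound workers are not tied to one CPU, so report None for them.
        "bound_cpu": None if m["unbound"] else int(m["pool"]),
        "worker_id": int(m["id"]),
        "high_priority": bool(m["highpri"]),
        # "+" is taken to mean the worker is executing a work item right now.
        "busy": m["state"] == "+",
        "workqueue": m["wq"],
    }


print(parse_kworker("kworker/0:1+kacpid"))
```

For the thread in question this reports CPU 0, worker id 1, and the kacpid workqueue, which matches the reading given above.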
irq/9-acpi is the kernel thread that handles interrupt requests on IRQ line 9 for ACPI. On x86, IRQ 9 sits on the slave PIC and is used for ACPI interrupts on Intel chipsets.
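One way to confirm that IRQ 9 is the busy line is to read its counters from /proc/interrupts twice and compare. The sketch below only does the parsing; the function name irq_counts and the sample text are hypothetical, and the numbers in the sample are made up to show the layout.

```python
def irq_counts(proc_interrupts_text, irq):
    """Return the per-CPU counts for one IRQ line, or None if it is absent."""
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return None
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == str(irq):
            # The first n_cpus fields after the label are the counters.
            return [int(x) for x in rest.split()[:n_cpus]]
    return None


# Hypothetical sample in the layout of /proc/interrupts; numbers are made up.
SAMPLE = """\
           CPU0       CPU1
  1:        120          0   IO-APIC    1-edge      i8042
  9:     500000         12   IO-APIC    9-fasteoi   acpi
"""

counts = irq_counts(SAMPLE, 9)
print(counts, sum(counts))
```

On a real machine the text would come from reading /proc/interrupts instead of SAMPLE.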
The handling of interrupts has two parts: the top half and the bottom half. The top half, which receives the hardware interrupt, needs to run as quickly as possible. The bottom half, where the tasks corresponding to the interrupt are executed, is not as time-critical in a non-real-time system and can be deferred for later execution. There are several options for this deferral: softirqs, tasklets, workqueues and threaded interrupts. In this case, ACPI uses a workqueue.
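The split can be illustrated with a user-space analogy. This is not kernel code: a queue.Queue stands in for the workqueue, a Python thread stands in for the kworker, and the event names are invented.

```python
import queue
import threading

work_queue = queue.Queue()  # stands in for a kernel workqueue
handled = []                # results produced by the "bottom half"


def top_half(event):
    # Do the minimum: record the event and return immediately.
    work_queue.put(event)


def worker():
    # The "bottom half": process deferred events until told to stop.
    while True:
        event = work_queue.get()
        if event is None:
            break
        handled.append(f"handled {event}")


t = threading.Thread(target=worker)
t.start()
for ev in ("power-button", "lid-close", "battery"):
    top_half(ev)
work_queue.put(None)  # sentinel: no more work
t.join()
print(handled)
```

The point of the design is that top_half never waits for the actual processing; if events arrive faster than the worker can drain them, the worker simply stays busy, which is what a kworker pinned at 100% looks like.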
So far we can figure out that my computer is probably handling an excessive number of ACPI interrupts. We have two options: fix whatever is raising the interrupts, or stop handling them.
Stopping the handling of the interrupts seems to be the fairly easy option. The UEFI firmware on my laptop is closed-source software, and open-source alternatives such as coreboot don't support my laptop either. This document in the kernel explains how to use sysfs to check ACPI firmware behavior, including interrupts.
However, one of the main functions of ACPI is to make the platform understand random hardware without special driver support. So while the SCI handles a few well known (fixed feature) interrupts sources, such as the power button, it can also handle a variable number of a "General Purpose Events" (GPE).
A GPE vectors to a specified handler in AML, which can do a anything the BIOS writer wants from OS context. GPE 0x12, for example, would vector to a level or edge handler called _L12 or _E12. The handler may do its business and return. Or the handler may send send a Notify event to a Linux device driver registered on an ACPI device, such as a battery, or a processor.
To figure out where all the SCI's are coming from, /sys/firmware/acpi/interrupts contains a file listing every possible source, and the count of how many times it has triggered
Besides this, user can also write specific strings to these files to enable/disable/clear ACPI interrupts in user space, which can be used to debug some ACPI interrupt storm issues.
As suggested, use grep . -r /sys/firmware/acpi/interrupts/ to list the possible sources and their trigger counts:
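The same listing can be sorted so the noisiest source stands out. This is a rough sketch that assumes each file under /sys/firmware/acpi/interrupts begins with an integer count followed by status flags; the function names read_counts and busiest are made up for illustration.

```python
from pathlib import Path


def read_counts(directory="/sys/firmware/acpi/interrupts"):
    """Map each source name to its trigger count.

    Assumes each file starts with an integer count followed by status flags.
    Returns an empty dict if the directory does not exist.
    """
    base = Path(directory)
    if not base.is_dir():
        return {}
    counts = {}
    for path in base.iterdir():
        try:
            fields = path.read_text().split()
        except OSError:
            continue  # skip unreadable entries
        if fields and fields[0].isdigit():
            counts[path.name] = int(fields[0])
    return counts


def busiest(counts, n=5):
    """Return the n sources with the highest counts, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    for name, count in busiest(read_counts()):
        print(f"{name:>12} {count}")
```

Running it twice a few seconds apart and comparing the counts gives a rate, which is more telling than the absolute numbers.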
As listed, GPE61 was triggered 2461007 times in only a few minutes. Disable GPE61 with echo disable > /sys/firmware/acpi/interrupts/gpe61 (as root). The setting does not survive a reboot, so to disable it every time the system boots, add the command as a cron job for the root user.
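If the boot-time job is a script rather than a one-line echo, the write can be wrapped with a small sanity check. A minimal sketch, assuming the three action strings named in the kernel document quoted above (enable, disable, clear); the function name set_gpe is hypothetical, and the call needs root on a real system.

```python
from pathlib import Path

ALLOWED_ACTIONS = {"enable", "disable", "clear"}


def set_gpe(name, action="disable", base="/sys/firmware/acpi/interrupts"):
    """Write an action string to one interrupt source file.

    Mirrors `echo <action> > <base>/<name>`; must run as root on real sysfs.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    target = Path(base) / name
    target.write_text(action + "\n")
    return target
```

Calling set_gpe("gpe61") from whatever runs at boot has the same effect as the echo command; the base parameter exists only so the function can be tried against a scratch directory first.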
One may ask: why not disable the interrupt right at the PIC? That would likely disable all the other, normal ACPI interrupts as well.
Following on from the above investigation, this could be a stepping stone to exploring the interrupt mechanism in the Linux kernel. Next steps:
Try tools/workqueue/wq_monitor.py or tools/workqueue/wq_dump.py.
Examine what GPE61 on my laptop really does. Find out whether disabling it has any impact on the system. I am not sure whether my battery died so quickly because of this.
Article:
…The ACPI code had bound a workqueue thread to CPU 0 because some operations corrupt the system if run anywhere else; …
Comment:
(reply to someone being shocked) On some HPs, at least, certain ACPI operations trigger SMIs that then appear to be run on the CPU that triggered the SMI. HP's SMI handler seems to fail to restore CPU state if it runs on anything other than CPU 0.