contributed by < Julian-Chu >
$ sudo insmod khttpd.ko port=1999
How does this command pass port=1999 into the kernel as a parameter for module initialization? First, look at main.c, where the port parameter is declared and initialized.
For the details of module_param, read "Linux 核心模組掛載機制" or expand the code below.
Fully expanding the macro:
It declares a variable of type struct kernel_param in the ELF __param section, and also adds the information parmtype=port:ushort to the .modinfo section.
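A rough sketch of what the expansion boils down to; this is simplified and not the literal macro output, since the attribute spelling and the layout of struct kernel_param vary across kernel versions:

```c
#include <linux/moduleparam.h>
#include <linux/stat.h>

/* Simplified sketch of what module_param(port, ushort, S_IRUGO) boils down
 * to; not the literal macro output -- details differ across kernel versions. */
static unsigned short port;   /* the variable declared in main.c */

/* 1. Record the parameter type in .modinfo ("parmtype=port:ushort"). */
static const char __param_type_port[]
    __attribute__((section(".modinfo"), used, aligned(1)))
    = "parmtype=port:ushort";

/* 2. Emit a struct kernel_param entry into the __param section so that
 *    load_module() can find it and bind it to 'port'. */
static const char __param_str_port[] = "port";
static const struct kernel_param __param_port
    __attribute__((section("__param"), used, aligned(sizeof(void *))))
    = {
        .name = __param_str_port,
        .ops  = &param_ops_ushort,  /* set/get callbacks for the ushort type */
        .perm = S_IRUGO,            /* sysfs permission bits */
        .arg  = &port,              /* where the parsed value is written */
    };
```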
Using objdump, we can see that not only port but also backlog is listed there.
Next, examine load_module.
A simplified flow is as follows:
module_param places the relevant parameters into the ELF __param section -> load_module reads the ELF __param section to obtain the module's kernel_param entries -> they are matched against args and the values are updated.
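For reference, the matching step happens inside load_module(); a simplified excerpt-style sketch, not the exact code of any particular kernel version:

```c
/* Excerpt-style sketch of the relevant step in load_module() (kernel/module.c).
 * mod->kp / mod->num_kp were filled from the module's __param section, and
 * mod->args is the "port=1999 ..." string passed down from insmod. */
char *after_dashes;

after_dashes = parse_args(mod->name, mod->args, mod->kp, mod->num_kp,
                          -32768, 32767, NULL, unknown_module_param_cb);
if (IS_ERR(after_dashes))
    err = PTR_ERR(after_dashes);
```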
The main flow of khttpd is as follows:
Compared with the main flow in CS:APP (socket -> bind -> listen -> accept), it looks consistent; only the APIs used are different.
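A minimal sketch of the in-kernel counterparts; the function name, `addr`, and `backlog` below are placeholders for illustration, not khttpd's actual code:

```c
/* In-kernel counterparts of socket/bind/listen/accept (error paths trimmed). */
#include <linux/in.h>
#include <linux/net.h>
#include <net/sock.h>

static int listen_and_accept(unsigned short port, int backlog)
{
    struct socket *listen_sock, *client_sock;
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port = htons(port),
    };
    int err;

    err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &listen_sock);
    if (err < 0)
        return err;
    err = kernel_bind(listen_sock, (struct sockaddr *)&addr, sizeof(addr));
    if (!err)
        err = kernel_listen(listen_sock, backlog);
    if (!err)
        err = kernel_accept(listen_sock, &client_sock, 0); /* blocks */
    return err;
}
```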
Improvement: the cost of creating a new kthread for every request may be too high; a worker pool could be tried instead.
Q: the kernel networking APIs all operate directly on sockets and never mention file descriptors. Why this difference?
This is a case where Linux does not carry through the UNIX design philosophy of "everything is a file": network devices are too special to be either character devices or block devices, so they are accessed in a completely different way.
(jserv)
htstress.c
What is the role of the epoll system call used here, and how does such an HTTP benchmarking tool work?
The main function mainly handles parsing the arguments, creating the workers (pthreads), and printing the results.
There are many command-line options to choose from.
epoll is mainly used in worker and init_conn. epoll itself monitors file descriptors; in htstress it is mainly used to monitor the socket fds and then handle the request and response according to events such as EPOLLIN and EPOLLOUT. A simplified flow is shown in the figure below:
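The loop follows the standard epoll pattern; a minimal user-space sketch with simplified names, not the actual htstress code:

```c
/* Minimal level-triggered epoll loop: write the request on EPOLLOUT,
 * read the response on EPOLLIN. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#define MAX_EVENTS 256

static void event_loop(int sockfd)
{
    int efd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLOUT, .data.fd = sockfd };
    struct epoll_event events[MAX_EVENTS];

    if (efd < 0 || epoll_ctl(efd, EPOLL_CTL_ADD, sockfd, &ev) < 0) {
        perror("epoll");
        exit(1);
    }

    for (;;) {
        int n = epoll_wait(efd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].events & EPOLLOUT) {
                /* send (part of) the HTTP request, then switch to EPOLLIN */
            } else if (events[i].events & EPOLLIN) {
                /* read the HTTP response; close or reconnect when done */
            }
        }
    }
}
```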
The main points such a benchmarking tool needs to consider are as follows:
Thoughts:
My current inference is that the use cases differ. Since ET (edge-triggered) only fires when the state changes, its advantage is that epoll_wait returns fewer fds, which speeds up the epoll_wait loop, and fewer epoll_ctl calls are needed; but this is not a change in time complexity such as O(n) -> O(log n), only a reduction of n. In addition, ET requires less EPOLLIN/EPOLLOUT switching than LT.
Beyond the epoll loop itself there is another side to consider: with ET, after it fires once it will not fire again until the data has been drained, so once you get the fd you must process everything in the buffer in one go. A single thread that does not have to wait long for data behaves much like LT; if it does have to wait, it may end up repeatedly retrying reads or blocking on a single fd. In a multi-threaded design using the reactor pattern, where read and write work is dispatched to other threads, ET yields a more efficient event loop.
Back to htstress: it is single-threaded and only monitors 256 events, so LT already meets the performance requirement and the implementation is simpler; there is no need to use ET.
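For contrast, an edge-triggered consumer has to drain the fd until EAGAIN before waiting again; a small sketch of that constraint (illustrative names, fd assumed to be non-blocking):

```c
#include <errno.h>
#include <unistd.h>

/* Edge-triggered variant: the fd must be drained until EAGAIN, otherwise no
 * further EPOLLIN will be delivered for data already sitting in the buffer.
 * Registration would use ev.events = EPOLLIN | EPOLLET on a non-blocking fd. */
static void drain_fd(int fd, char *buf, size_t buflen)
{
    for (;;) {
        ssize_t n = read(fd, buf, buflen);
        if (n > 0)
            continue;   /* process the chunk, keep reading until empty */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;      /* drained: wait for the next edge */
        break;          /* 0 (EOF) or a real error */
    }
}
```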
In level-triggered mode, EPOLLOUT keeps firing; the set of monitored events has to be changed to avoid this.
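A sketch of that switch; `efd` and `fd` are placeholder names:

```c
#include <stdio.h>
#include <sys/epoll.h>

/* Once the request has been fully sent, stop watching EPOLLOUT and watch
 * EPOLLIN instead; otherwise a level-triggered EPOLLOUT keeps firing. */
static int watch_readable(int efd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

    if (epoll_ctl(efd, EPOLL_CTL_MOD, fd, &ev) < 0) {
        perror("epoll_ctl(EPOLL_CTL_MOD)");
        return -1;
    }
    return 0;
}
```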
The buffer read by read() is not stored on the econn and carried over to the next epoll iteration; the simplest way to tell 4xx from 5xx is to read the whole buffer within the same epoll iteration and check it there, with no extra storage needed.
kecho
kecho already uses CMWQ; state its advantages and usage.
Advantages:
It decouples kthreads (worker threads) from workqueues: developers can create the workqueues they need with different settings, which simplifies development. In most scenarios there is no need to manage kthreads (worker threads) yourself or implement your own workqueue, avoiding unnecessary context switches and the risk of deadlocks.
Usage:
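A minimal CMWQ usage sketch, assuming one dedicated workqueue plus a per-connection work item; the identifiers are illustrative, not necessarily kecho's exact code:

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

static struct workqueue_struct *echo_wq;

struct echo_work {
    struct work_struct work;
    /* per-connection state, e.g. a struct socket *, goes here */
};

static void echo_work_fn(struct work_struct *w)
{
    struct echo_work *ew = container_of(w, struct echo_work, work);
    /* serve the connection here */
    kfree(ew);
}

static int __init echo_init(void)
{
    /* WQ_UNBOUND: work items are not pinned to the submitting CPU */
    echo_wq = alloc_workqueue("echo_wq", WQ_UNBOUND, 0);
    if (!echo_wq)
        return -ENOMEM;
    return 0;
}

/* called once per accepted connection */
static int enqueue_connection(void)
{
    struct echo_work *ew = kmalloc(sizeof(*ew), GFP_KERNEL);

    if (!ew)
        return -ENOMEM;
    INIT_WORK(&ew->work, echo_work_fn);
    queue_work(echo_wq, &ew->work);
    return 0;
}

static void __exit echo_exit(void)
{
    flush_workqueue(echo_wq);    /* wait for queued items to finish */
    destroy_workqueue(echo_wq);
}

module_init(echo_init);
module_exit(echo_exit);
MODULE_LICENSE("GPL");
```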
According to the git log, __create_workqueue was renamed to alloc_workqueue at that time, but the parameters were not changed then. In the current version we can still see:
create_workqueue
create_freezeable_workqueue
create_singlethread_workqueue
To this day they still have not been removed.
Guess: the kernel developers provide a public API that lets users control the details and customize behavior, but many places still use the old API. Changing everything at once could cause many problems, so the best they can do is keep the existing API, encourage developers to use the new one, and provide a transition period in which existing code can be migrated without too much disruption.
There are two worker-pools, one for normal work items and the other for high priority ones, for each possible CPU and some extra worker-pools to serve work items queued on unbound workqueues - the number of these backing pools is dynamic.
Subsystems and drivers can create and queue work items through special workqueue API functions as they see fit. They can influence some aspects of the way the work items are executed by setting flags on the workqueue they are putting the work item on. These flags include things like CPU locality, concurrency limits, priority and more. To get a detailed overview refer to the API description of alloc_workqueue() below.
When a work item is queued to a workqueue, the target worker-pool is determined according to the queue parameters and workqueue attributes and appended on the shared worklist of the worker-pool. For example, unless specifically overridden, a work item of a bound workqueue will be queued on the worklist of either normal or highpri worker-pool that is associated to the CPU the issuer is running on.
user-echo-server
How it works, especially the use of the epoll system call:
The main flow of user-echo-server is similar to the echo server in CS:APP, with epoll added to handle multiple connections.
The epoll system call is mainly used to monitor the listen fd and the connection fds.
The rough flow is as follows:
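A sketch of the usual pattern (not the literal user-echo-server code): the listen fd and every connection fd share one epoll instance; EPOLLIN on the listen fd means accept, EPOLLIN on a connection fd means read and echo back.

```c
/* Single epoll instance watching both the listening socket and all
 * connection sockets (error handling trimmed; names simplified). */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

static void serve(int listenfd, int epfd)
{
    struct epoll_event ev, events[MAX_EVENTS];

    ev.events = EPOLLIN;
    ev.data.fd = listenfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listenfd) {              /* new connection request */
                int connfd = accept(listenfd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = connfd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
            } else {                           /* data from a client */
                char buf[4096];
                ssize_t len = recv(fd, buf, sizeof(buf), 0);
                if (len <= 0) {                /* closed or error */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    send(fd, buf, len, 0);     /* echo back */
                }
            }
        }
    }
}
```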
How do kecho and user-echo-server compare in performance, illustrated with plots?
The core of bench is covered by the two functions bench and bench_worker. In contrast to htstress, which uses a fixed number of workers and handles all connections inside epoll, bench creates a new pthread for every connection to handle that connection and its requests.
CMWQ implementation
Not yet bound to a specific CPU
Under the 1000-thread condition, both the kthread implementation and user-echo-server show performance declining as the thread count rises; the gap between them may come from the cost of copying data between user space and kernel space.
kthread implementation
Comparing the performance of the kthread and CMWQ versions: the kthread implementation creates a new kthread for every connection socket; compared with CMWQ, which reuses CPU-bound worker kthreads, it adds the extra cost of kthread creation, scheduling, destruction, and context switches.
Increasing the number of threads in bench to 10000, CMWQ's performance remains stable.
drop-tcp-socket
How does the kernel module work, and what exactly are TIME-WAIT sockets?
How the drop-tcp-socket kernel module works:
echo "127.0.0.1:36986 127.0.0.1:12345" | sudo tee /proc/net/drop_tcp_sock
inet_twsk_deschedule_put kills the socket.
The two functions below return a socket based on the given network-namespace information.
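The lookup functions are not reproduced here; assuming the lookup has produced a struct sock for the given 4-tuple, the TIME-WAIT case might be handled roughly as follows (a hedged sketch, not the module's literal code):

```c
#include <net/inet_timewait_sock.h>
#include <net/tcp.h>

/* 'sk' is assumed to be the result of an inet hash lookup for the 4-tuple
 * written to /proc/net/drop_tcp_sock. */
static void drop_tw_sock(struct sock *sk)
{
    if (sk->sk_state == TCP_TIME_WAIT)
        /* unhash the timewait socket, cancel its timer and drop the
         * reference, which ends the TIME-WAIT state immediately */
        inet_twsk_deschedule_put(inet_twsk(sk));
}
```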
There are two versions of droptcp_proc_fops because, starting with Linux 5.6, the kernel provides the proc_ops structure for proc files; the previously used file_operations exposes methods that /proc does not need for the VFS (LKMPG: 7.1 The proc Structure).
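The usual way to support both interfaces is a kernel-version check; a sketch with illustrative names, not necessarily the module's exact handlers:

```c
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/version.h>

/* The write handler has the same signature under both interfaces. */
static ssize_t droptcp_write(struct file *f, const char __user *buf,
                             size_t count, loff_t *ppos)
{
    /* parse "src dst" pairs here and drop the matching sockets */
    return count;
}

/* The structure below is what gets passed to proc_create() at init time. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
static const struct proc_ops droptcp_proc_fops = {
    .proc_write = droptcp_write,
};
#else
static const struct file_operations droptcp_proc_fops = {
    .owner = THIS_MODULE,
    .write = droptcp_write,
};
#endif
```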
register_pernet_subsys vs register_pernet_device:
references:
The /proc File System
In Linux, there is an additional mechanism for the kernel and kernel modules to send information to processes — the /proc file system. Originally designed to allow easy access to information about processes (hence the name), it is now used by every bit of the kernel which has something interesting to report, such as /proc/modules which provides the list of modules and /proc/meminfo which gathers memory usage statistics.
The method to use the proc file system is very similar to the one used with device drivers — a structure is created with all the information needed for the /proc file, including pointers to any handler functions (in our case there is only one, the one called when somebody attempts to read from the /proc file). Then, init_module registers the structure with the kernel and cleanup_module unregisters it.
Normal file systems are located on a disk, rather than just in memory (which is where /proc is), and in that case the index-node (inode for short) number is a pointer to a disk location where the file’s inode is located. The inode contains information about the file, for example the file’s permissions, together with a pointer to the disk location or locations where the file’s data can be found.
When a string longer than 4095 bytes is sent via telnet it gets cut off; the cause has still not been found.
Using bench to send 4096-byte strings to kecho, bench's execution time increases by multiples and the machine's load rises very noticeably.
Noteworthy messages (to be investigated):
/dev/kmsg buffer overrun, some messages lost.
-104 # errno: -ECONNRESET (connection reset by peer)
This causes unnecessary long-term resource usage:
Comparing these four sets of results reveals some interesting phenomena.
Fork khttpd on GitHub, with the goal of providing file access functionality and fixing khttpd's run-time defects. In the process, the following should also be completed:
issue
Print out buf at line 10
Then test it with telnet
Sending HEAD requests with telnet, we can see that the data of the first TEST HEAD request remains in buf and is printed while the TEST2 HEAD request is being read, which may lead to incorrect parse results.
Fix:
Add line 3 to clear buf
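Conceptually the fix is just zeroing the buffer before each receive; a fragment-level sketch in which the buffer name, size macro, and receive helper are assumptions rather than khttpd's exact identifiers:

```c
/* Clear the receive buffer before each read so that leftover bytes from the
 * previous request cannot be parsed again. */
memset(buf, 0, RECV_BUFFER_SIZE);
ret = http_server_recv(socket, buf, RECV_BUFFER_SIZE - 1);
```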
The result is now correct.
A problem found while testing with Go's built-in HTTP client:
A warning message appears.
Consulting RFC 2616, it turns out that a CRLF is not required at the end of a message body, and the spec contains the following statement:
Certain buggy HTTP/1.0 client implementations generate extra CRLF's after a POST request. To restate what is explicitly forbidden by the BNF, an HTTP/1.1 client MUST NOT preface or follow a request with an extra CRLF.
After removing the CRLF from HTTP_RESPONSE_200_KEEPALIVE_DUMMY, the warning no longer appears.
TODO: the Go HTTP client has a built-in connection pool that can reuse connections, but it did not take effect in this case; to be investigated.
For stream sockets, kernel_recvmsg/recvmsg in blocking mode blocks when there is no data to read rather than returning 0; in both blocking and non-blocking mode, a return value of 0 means the connection has been closed.
For datagram sockets, it is possible for the returned data length to be 0.
These calls return the number of bytes received, or -1 if an error occurred. In the event of an error, errno is set to indicate the error.
When a stream socket peer has performed an orderly shutdown, the return value will be 0 (the traditional "end-of-file" return).
Datagram sockets in various domains (e.g., the UNIX and Internet domains) permit zero-length datagrams. When such a datagram is received, the return value is 0.
The value 0 may also be returned if the requested number of bytes to receive from a stream socket was 0.
read: a return value of 0 means EOF
On success, the number of bytes read is returned (zero indicates end of file)
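Putting the above together for the kernel side, a small sketch (identifiers are illustrative): the caller only needs to distinguish negative (error, e.g. -ECONNRESET), zero (peer closed), and positive (bytes received).

```c
#include <linux/net.h>
#include <linux/uio.h>

/* Receive once from a blocking stream socket.  Return value:
 * <0 error (e.g. -ECONNRESET), 0 peer performed an orderly shutdown (EOF),
 * >0 number of bytes now sitting in buf. */
static int recv_once(struct socket *sock, void *buf, size_t len)
{
    struct msghdr msg = { .msg_flags = 0 };
    struct kvec vec = { .iov_base = buf, .iov_len = len };

    return kernel_recvmsg(sock, &msg, &vec, 1, len, 0);
}
```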
kthread version, 200000 requests
netstat -atn | grep 1999
Currently only the LISTEN socket remains.
Looking at the code and testing with curl, the messages suggest that HTTP/1.1 keep-alive is already implemented in the current version.
The mistake above is that curl opens a separate new connection each of the two times, so although the message above is correct, the way of testing was wrong.
We can see "Re-using existing connection! (#0) with host 127.0.0.1", so the connection is being reused.
The curl test above is only a short burst of back-to-back requests.
Rewrote htstress.c into client.c: a single thread sending one request per second for 20 seconds.
The connection can be seen being reused.
tool:
ps auxf | grep khttpd
sudo netstat -tnope | grep keepalive
sudo netstat -tnope | grep 1999
netstat -atn | grep 1999
The normal Unix close function is also used to close a socket and terminate a TCP connection.
The default action of close with a TCP socket is to mark the socket as closed and return to the process immediately. The socket descriptor is no longer usable by the process: It cannot be used as an argument to read or write. But, TCP will try to send any data that is already queued to be sent to the other end, and after this occurs, the normal TCP connection termination sequence takes place
Clearing up a misunderstanding: I previously thought every request had to go through accept, but accept only handles connection requests. Once a connection is established and not yet terminated, accept will not return the same socket again (data-transfer requests are not connection requests); only after close shuts the socket and terminates the connection will accept take another connection request from the same client.
pthread_detach()
which tells the operating system that it can free this thread’s resources without another thread having to call pthread_join() to collect details about this thread’s termination. This is much like how the parent process needs to call wait()
man pthread_detach
The pthread_detach() function marks the thread identified by
thread as detached. When a detached thread terminates, its
resources are automatically released back to the system without
the need for another thread to join with the terminated thread.
Attempting to detach an already detached thread results in unspecified behavior.
Once a thread has been detached, it can't be joined with pthread_join(3) or be made joinable again.
Either pthread_join(3) or pthread_detach() should be called for
each thread that an application creates, so that system resources
for the thread can be released. (But note that the resources of
any threads for which one of these actions has not been done will
be freed when the process terminates.)
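A minimal example of the detach pattern described above; the per-connection handler is illustrative:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One detached thread per connection: its resources are reclaimed by the
 * system when it terminates, without anyone calling pthread_join(). */
static void *handle_conn(void *arg)
{
    int fd = (int)(long)arg;
    /* serve the connection, then close it */
    close(fd);
    return NULL;
}

static int spawn_handler(int connfd)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, handle_conn, (void *)(long)connfd) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return -1;
    }
    return pthread_detach(tid);  /* no join needed later */
}
```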
The Thundering Herd
a problem in computer science when many processes, all waiting for the same event are woken up by the operating system. A good example is our pre-forked server in which there are several child processes all blocked on the same socket in accept(). This problem occurs when the part of the kernel that wakes up all processes or threads leaves it to another part of the kernel, the scheduler, to figure out which process or thread will actually get to run, while all others go back to blocking. When the number of processes/threads increases, this determination wastes CPU cycles every time an event that potentially turns processes or threads runnable occurs. While this problem doesn’t seem to affect Linux anymore
Serializing accept(), AKA Thundering Herd, AKA the Zeeg Problem
INADDR_ANY
This is an IP address that is used when we don't want to bind a socket to any specific IP. Basically, while implementing communication, we need to bind our socket to an IP address. When we don't know the IP address of our machine, we can use the special IP address INADDR_ANY. It allows our server to receive packets that have been targeted by any of the interfaces.
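For example, a small user-space sketch:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Bind a listening TCP socket to every local interface via INADDR_ANY. */
static int listen_on_any(uint16_t port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local address */
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0)
        return -1;
    return fd;
}
```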
https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html
https://lwn.net/Articles/355700/
https://lwn.net/Articles/393171/
https://lwn.net/Articles/394084/
https://events.static.linuxfound.org/sites/events/files/slides/Async execution with wqs.pdf
https://eklitzke.org/blocking-io-nonblocking-io-and-epoll
man epoll_ctl
EPOLLIN The associated file is available for read(2) operations.
EPOLLOUT The associated file is available for write(2) operations.