2020q1 Homework4 (khttpd)

contributed by < OscarShiang >

tags: `linux2020`

測試環境

$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
$ cat /proc/version
Linux version 5.3.0-40-generic (buildd@lcy01-amd64-024) 
(gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) 

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
Stepping:            9
CPU MHz:             3092.228
CPU max MHz:         3500.0000
CPU min MHz:         400.0000
BogoMIPS:            6199.99
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
NUMA node0 CPU(s):   0-3

TODO: 列出硬體資訊，特別是 microarchitecture，這樣才能讓下方的效能分析數據得以對照
:notes: jserv

已利用 lscpu 印出訊息
Oscar[color= orange]

作業要求

參照 fibdrv 作業說明裡頭的「Linux 核心模組掛載機制」一節，解釋 $ sudo insmod khttpd.ko port=1999 這命令是如何讓 port=1999 傳遞到核心，作為核心模組初始化的參數呢？
參照 CS:APP 第 11 章，給定的 kHTTPd 和書中的 web 伺服器有哪些流程是一致？又有什麼是你認為 kHTTPd 可改進的部分？
htstress.c 用到 epoll 系統呼叫，其作用為何？這樣的 HTTP 效能分析工具原理為何？
當來自 web 客戶端 (即網頁瀏覽器或類似的軟體) 請求的網址為 /fib/N 時 (
$N$ 是自然數，最大可到
$2^{31} - 1$ )，例如 $ wget http://localhost:8081/fib/10 應該要讓 web 客戶端得到
$F i b o n a c c i (10)$ 即 55 這個字串，這個實作需要考慮到大數運算
自 Facebook 討論串找出合適的 Fibonacci 運算檢驗程式，用以比對上述 (1) 的運算結果，並且要能詳實記錄和分析處理時間 (從 HTTP 請求到 kHTTPd 回應);
指出 kHTTPd 實作的缺失 (特別是安全疑慮) 並予以改正;
引入 Concurrency Managed Workqueue (cmwq)，改寫 kHTTPd，分析效能表現和提出改進方案，可參考 kecho;
參照 in-kernel HTTP server，學習其中分析和實作手法，從而改進 kHTTPd;

核心模組初始化參數傳遞過程

根據 fibdrv 作業說明裡關於 Linux 核心模組掛載機制的說明中提到，當我們執行在終端機執行 insmod 命令的時候，實際上會讓核心執行到 finit_module 函式，其參數的結構如下：

int finit_module(int fd, const char *param_values,
                 int flags);

而在 finit_module 函式的 Manual 中有提到：

The param_values argument is a string containing space-delimited specifications of the values for module parameters (defined inside the module using module_param() and module_param_array()). The kernel parses this string and initializes the specified parameters. Each of the parameter specifications has the form:

name[=value[,value…]]

在核心將參數分析完後，會將其以 char *param_values 參數形式傳進 finit_module 函式中，並進一步將這些參數傳進 load_module 等函式之中。

在 khttpd 的 main.c 中我們可以看到：

static ushort port = DEFAULT_PORT;
module_param(port, ushort, S_IRUGO);
static ushort backlog = DEFAULT_BACKLOG;
module_param(backlog, ushort, S_IRUGO);

就是在使用 kernel 所提供的 marco 設定參數的使用，讓我們能夠利用 insmod 傳遞 port 與 backlog 數值到核心模組之中。

從 <linux/moduleparam.h> 中可以看到 module_param 的相關定義，可以看到其參數結構如下：

#define module_param(name, type, perm)				\
	module_param_named(name, name, type, perm)

name ：就是使用 insmod 傳遞參數時使用的名稱，在本作業中是 port
type ：定義變數的資料型態
perm ：設定該變數在 sysfs 中的存取權限

查看 S_IRUGO 在 <linux/stat.h> 中的定義：

#define S_IRUGO		(S_IRUSR|S_IRGRP|S_IROTH)

對應 Permission Bits (The GNU C Library) 的說明可以知道其提供 user, group owner 以及 other users 的讀取權限

epoll 在 htstress.c 中的作用

根據 epoll(7) - Linux Manual Page 的描述：

The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.

知道 epoll 可以用來作為監聽 file descriptor 的介面

在 khttpd/htstress.c 中的 init_conn 有使用到 epoll 的進行操作。

static void init_conn(int efd, struct econn *ec)
{
    ...

    struct epoll_event evt = {
        .events = EPOLLOUT,
        .data.ptr = ec,
    };

    if (epoll_ctl(efd, EPOLL_CTL_ADD, ec->fd, &evt)) {
        perror("epoll_ctl");
        exit(1);
    }
}

在該函式中，因為要建立連線，所以將傳入的 file descriptor 包裝成 struct epoll_event 後，將其透過 epoll_ctl 加進 epoll 監測的列表之中。

接著在建立連線後，在 worker 的函式中：

static void *worker(void *arg)
{
    int ret, nevts;
    struct epoll_event evts[MAX_EVENTS];
    char inbuf[INBUFSIZE];
    struct econn ecs[concurrency], *ec;

    (void) arg;

    int efd = epoll_create(concurrency);
    if (efd == -1) {
        perror("epoll");
        exit(1);
    }

    for (int n = 0; n < concurrency; ++n)
        init_conn(efd, ecs + n);

    for (;;) {
        do {
            nevts = epoll_wait(efd, evts, sizeof(evts) / sizeof(evts[0]), -1);
        } while (!exit_i && nevts < 0 && errno == EINTR);

        if (exit_i != 0) {
            exit(0);
        }        
        ...
    }
    ...
}

利用 epoll 監聽 evts 中的 file descriptor

HTTP 效能分析工具原理

首先我們先從 make check 的指令呼叫開始看起：

./htstress -n 100000 -c 1 -t 4 http://localhost:8081/

分別以參數的形式將 max_requests, concurrency, num_threads 以及 request URL 變數進行設定。

在分析完 URL 之後，接著呼叫 start_time() 函式開始計時，並根據設定的 thread 數量執行 pthread_create，因為現在正在使用的也是一個 thread，所以只需要建立 num_thread - 1 個就可以了

pthread_create(3) - Linux manual page 中提到 thread 的終止條件如下：

The new thread terminates in one of the following ways:

It calls pthread_exit(3), specifying an exit status value that is
available to another thread in the same process that calls
pthread_join(3).

It returns from start_routine(). This is equivalent to calling
pthread_exit(3) with the value supplied in the return statement.

It is canceled (see pthread_cancel(3)).

Any of the threads in the process calls exit(3), or the main thread
performs a return from main(). This causes the termination of all
threads in the process.

所以在建立完 thread 後，呼叫 worker(0)，讓 main thread 也開始執行測試。

而測試的操作是被定義在 worker 之中，使用 send 函式傳送資料，並利用 recv 系統呼叫取得 socket 的回傳資料，進行 request 結果的統計。

因為使用多個執行緒並行處理，在執行的過程中除了要傳送 requests 之外也要統計 request 總次數和 good requests 與 bad requests 等資料，為了避免因為多執行序同時計算而導致的錯誤產生，我們需要使用 GCC Atomic Builtins 所提供的 __sync_fetch_and_add 與 __sync_fetch_and_sub 等等函式來進行運算。

你可以試著用 __atomic 改寫上述 __sync*，然後送 pull request 過來。
:notes: jserv

已提交 pull requests
Oscar[color= orange]

void worker(void *args) {
    ...
                if (!ret) {
                    close(ec->fd);

                    int m = __sync_fetch_and_add(&num_requests, 1);

                    if (max_requests && (m + 1 > (int) max_requests))
                        __sync_fetch_and_sub(&num_requests, 1);
                    else if (ec->flags & BAD_REQUEST)
                        __sync_fetch_and_add(&bad_requests, 1);
                    else
                        __sync_fetch_and_add(&good_requests, 1);

                    if (max_requests && (m + 1 >= (int) max_requests)) {
                        end_time();
                        return NULL;
                    }

                    if (ticks && m % ticks == 0)
                        printf("%d requests\n", m);

                    init_conn(efd, ec);
                }
    ...
}

當 requests 的總數達到 max_requests 時，呼叫 end_time() 停止計時，並關閉執行序。

最後將統計的結果印出，結束分析。

處理客戶端請求 Fibonacci 數計算

首先為了運算 Fibinacci 的數值，我將 sysprog21/bignum 從 User Space 移植到 Kernel Space

修改編譯器參數加上 -msse2 讓程式可以利用浮點數計算 bignum 數字的長度。

使用 kernel_fpu_begin() / kernel_fpu_end() 來保護 FPU context
:notes: jserv

已在 bignum/format.c 中加入 kernel_fpu_begin() 與 kernel_fpu_end()
Oscar[color= orange]

最後實作計算 fib 的介面 eval_fib 並定義在 bignum/fibonacci.h 中，供 http_server.c 呼叫。

接著修改 http_server.c:http_server_response 函式的實作，並將 macro 中的 HTTP_RESPONSE_200_DUMMY 與 HTTP_RESPONSE_200_KEEPALIVE_DUMMY 修改並加上可以填上變數的欄位，如下所示：

#define HTTP_RESPONSE_200_KEEPALIVE_DUMMY                      \
    ""                                                         \
    "HTTP/1.1 200 OK" CRLF "Server: " KBUILD_MODNAME CRLF      \
    "Content-Type: text/plain" CRLF "Content-Length: %lu" CRLF \
    "Connection: Keep-Alive" CRLF CRLF "%s" CRLF

接著針對 request->method != HTTP_GET 的情況實作 response_msg 函式來分析 URL 與回傳訊息的製作：

static char *response_msg(char *url, int keep_alive)
{
    char *msg;
    char *path = strnstr(url, "fib", strlen(url));
    if (!path) {
        msg = kstrdup("Hello World", GFP_KERNEL);
    } else {
        char **ptr = &path;
        strsep(ptr, "/");
        uint32_t n;
        kstrtou32(*ptr, 10, &n);
        msg = eval_fib(n);
    }

    char *connect = keep_alive ? "keep_Alive" : "Close";
    size_t msg_len = strlen(msg);
    size_t res_len = strlen(HTTP_RESPONSE_200) + get_log10(msg_len) + 1 +
                     strlen(connect) + msg_len;
    char *buf = kmalloc(res_len, GFP_KERNEL);
    snprintf(buf, res_len, HTTP_RESPONSE_200, msg_len, connect, msg);

    kfree(msg);
    return buf;
}

在實作完 response_msg 後將 http_server_response 做對應的修改：

static int http_server_response(struct http_request *request, int keep_alive)
{
    char *response;

    pr_info("requested_url = %s\n", request->request_url);
    if (request->method != HTTP_GET)
       response = keep_alive ? HTTP_RESPONSE_501_KEEPALIVE : HTTP_RESPONSE_501;
    else
-       response = keep_alive ? HTTP_RESPONSE_200_KEEPALIVE_DUMMY
-                             : HTTP_RESPONSE_200_DUMMY;
+       response = response_msg(request->request_url, keep_alive);

    http_server_send(request->socket, response, strlen(response));
    return 0;
}

處理 recv error

在使用 wget 取得 fibonacci 的計算結果時，發現 khttpd 會回傳 recv error: -104 的錯誤。

根據 wget(1) 的文件說明：

Normally, Wget asks the server to keep the connection open so that, when you download more than one document from the same server, they get transferred over the same TCP connection. This saves time and at the same time reduces the load on the server.

為了在使用 wget 從相同的伺服器下載時提高效能，wget 預設連線會 Keep-Alive，而根據 http_server_worker 的實作

static int http_server_worker(void *arg)
{
    ...
    
    while (!kthread_should_stop()) {
        int ret = http_server_recv(socket, buf, RECV_BUFFER_SIZE - 1);
        if (ret <= 0) {
            if (ret)
                pr_err("recv error: %d\n", ret);
            break;
        }
        http_parser_execute(&parser, &setting, buf, ret);

        if (request.complete && !http_should_keep_alive(&parser))
            break;
    }
    
    ...
}

在用戶使用 wget 取得資料之後，會根據 Keep-Alive 保持與用戶的連線，但在呼叫 http_server_recv 時卻會收到 -104 的 errno，導致 server 無法正常接收資料，進而關閉連線。所以在用戶第二次使用 wget 時，仍會開啟新的連線，而無法達到以 Keep-Alive 進行連線的目的。

根據 linux/errno.h 的定義，104 所對應的 error 是 ECONNRESET，表示在 server 端所看到的是 client 已經斷線而無法取得資料。

解決的方法就是在使用 wget 的時候可以搭配參數 --no-http-keep-alive 來取得資料，避免 server 因為等待 client 的資料而導致錯誤。

Fibonacci 驗證工具

在參考討論串的實作後，我使用 python 搭配 requests, bs4 等套件進行實作：

#!/usr/bin/env python3
from requests import get
from bs4 import BeautifulSoup
import re
import sys
import time

pattern = re.compile('Decimal')
fib_url = 'http://www.protocol5.com/Fibonacci/{}.htm'
local_url = 'http://localhost:8081/fib/{}'

RED = '\033[91m'
GREEN = '\033[92m'
WHITE = '\033[0m'

def printInColor(content, color = WHITE, end = '\n'):
    print(color + content, end = end)

def getFib(n):
    content = get(fib_url.format(n))
    soup = BeautifulSoup(content.text, 'lxml')
    fib = soup.find('h4', text = pattern).findNext('div')
    return fib.text

# main function
try:
    n = int(sys.argv[1])

    start = time.time()
    data = get(local_url.format(n))
    end = time.time()

    ans = getFib(n)

    printInColor('+++\tVerify the result of Fibonacci({})\n'.format(n));

    if data.text == ans:
        printInColor('---\tCorrect!', GREEN)
    else:
        printInColor('---\tYour answer: \t' + data.text)
        printInColor('---\tAnswer: \t' + ans)
        printInColor('---\tWrong', RED)

    print()
    printInColor('time cost: {}'.format(end - start))
    
except:
    printInColor('[ERROR] Failed to connect to the server.', RED)

以 Fibonacci(100) 為例進行測試：

+++	Verify the result of Fibonacci(100)

---	Correct!

time cost: 0.0028007030487060547

引入 Concurrency Managed Workqueue

根據 cmwq 說明文件指出 Multiple Thread Workqueue 的實作有可能會導致無法充分使用系統資源的情況發生：

An MT wq could provide only one execution context per CPU while an ST wq one for the whole system. Work items had to compete for those very limited execution contexts leading to various problems including proneness to deadlocks around the single execution context.

而 cmwq 運用 unbounded thread pool 的概念，讓程式碼能夠在 CPU 之間切換，讓資源的利用可以最大化。

並且為了避免因為新增過多的執行緒所造成的問題，cmwq 透過利用 thread pool 管理執行緒，當要執行 workqueue 之中的程式碼時，就會讓閒置的執行緒來執行，而沒有任務的執行緒就讓其保持 idle 的狀態

而因為透過 cmwq 來管理執行緒的運作，所以不用考慮移除核心模組之後，會有執行緒未被釋放的問題產生。

根據 cmwq 的說明文件可以得知其實作方法。

創建一個 cmwq 的方法其實很單純，就是呼叫定義在 <linux/workqueue.h> 的 alloc_workqueue 來進行即可。

同時為了管理在執行過程中所建立的 work，使用在 kernel API 中所提供的 linked list 來管理，在 daemon 結束的時候將為執行到的 work 從 list 中刪除。

為了管理相關的資料，在 http_server.h 中建立對應的資料結構：

struct http_service {
    bool is_stopped;
    struct list_head worker;
};

struct khttpd {
    struct socket *sock;
    struct list_head list;
    struct work_struct http_work;
};

接著在 http_server.c 中加入管理 work list 的函式 create_work 與 free_work

static struct work_struct *create_work(struct socket *sk)
{
    struct khttpd *work;

    if (!(work = kmalloc(sizeof(struct khttpd), GFP_KERNEL)))
        return NULL;

    work->sock = sk;

    INIT_WORK(&work->http_work, http_server_worker);

    list_add(&work->list, &daemon.worker);

    return &work->http_work;
}

static void free_work(void)
{
    struct khttpd *l, *tar;
    /* cppcheck-suppress uninitvar */

    list_for_each_entry_safe (tar, l, &daemon.worker, list) {
        kernel_sock_shutdown(tar->sock, SHUT_RDWR);
        flush_work(&tar->http_work);
        sock_release(tar->sock);
        kfree(tar);
    }
}

而因為要改用 cmwq 來管理執行序的分配，所以將 daemon 函式中對應的 kthread_create 修改為以 create_work 建立新的任務，並利用 queue_work 將新增的任務加入 workqueue 中等待執行。

在 daemon 函式要結束之前，呼叫 free_work 將所有儲存在 work list 中的任務釋放掉，最後使用 destroy_workqueue 將 workqueue 刪除。

修改為 cmwq 後，執行 htstress 進行測試的結果如下：

requests:      100000
good requests: 100000 [100%]
bad requests:  0 [0%]
socker errors: 0 [0%]
seconds:       2.364
requests/sec:  42303.761

Complete

與未使用 cmwq 的版本進行對照比較：

requests:      100000
good requests: 100000 [100%]
bad requests:  0 [0%]
socker errors: 0 [0%]
seconds:       4.506
requests/sec:  22194.292

Complete

發現在使用 cmwq 改寫後，server 的執行效率得到了大幅度的提昇，在 requests/sec 的項目上得到了接近 2 倍的改善。

2020q1 Homework4 (khttpd)

tags: linux2020

測試環境

作業要求

核心模組初始化參數傳遞過程

epoll 在 htstress.c 中的作用

HTTP 效能分析工具原理

處理客戶端請求 Fibonacci 數計算

處理 recv error

Fibonacci 驗證工具

引入 Concurrency Managed Workqueue

參考資料

Read more

Interesting Facts of Pthread

OS HW1: Benchmark Your Computer Black Box

程式設計(二) Lab 3 講解

Qt Graphics 系列介紹

tags: `linux2020`