# Linux Project: Automatically Filtering Ads with Netfilter

> Contributor: ItisCaleb
> [GitHub](https://github.com/ItisCaleb/Netfilter-Adblock)

We can already filter ads inside a web browser with an extension such as [AdBlock](https://getadblock.com/), but that requires per-browser setup and consumes additional system resources. If we filter web ads at the kernel level with [netfilter](https://www.netfilter.org/) instead, every application on the machine has a chance to benefit.

References:
* [How to drop 10 million packets per second](https://blog.cloudflare.com/how-to-drop-10-million-packets/)
* [Use the iptables firewall to block ads on your Linux machine](https://securitronlinux.com/debian-testing/use-the-iptables-firewall-to-block-ads-on-your-linux-machine/)
* [2020 development notes](https://hackmd.io/@ZhuMon/2020q1_final_project)

Related projects:
* [netfilter-blocking](https://github.com/kritpals/netfilter-blocking)
* [netfilter_block](https://github.com/tr0y-kim/netfilter_block)
* [adriver](https://github.com/Jongy/adriver)

## How [netfilter](https://www.netfilter.org/) blocks packets from specific sources

[netfilter](https://www.netfilter.org/) works by defining a series of hooks at fixed points in each [protocol stack](https://en.wikipedia.org/wiki/Protocol_stack). The IPv4 example given in the official documentation looks like this:

```graphviz
digraph "IPv4 Diagram" {
    rankdir=LR
    "Remote IN" -> "[NF_IP_PRE_ROUTING]" -> "ROUTE"
    ROUTE -> "[NF_IP_LOCAL_IN]"
    "[NF_IP_LOCAL_OUT]" -> "\ROUTE" -> "[NF_IP_POST_ROUTING]" -> "Remote Out"
    ROUTE -> "[NF_IP_FORWARD]" -> "[NF_IP_POST_ROUTING]"
}
```

* `NF_IP_PRE_ROUTING` is the hook that runs before routing
* `NF_IP_LOCAL_IN` is the hook that runs before a packet is delivered to a local process
* `NF_IP_FORWARD` is the hook that runs before a packet is forwarded to another network interface
* `NF_IP_POST_ROUTING` is the hook that runs after routing
* `NF_IP_LOCAL_OUT` is the hook for packets sent out by local processes

Current kernels name these constants `NF_INET_*`, which is what the kernel module code in this document uses.

Developers can use these hooks to filter packets, or even modify them. The Netfilter framework already ships iptables, which lets users filter packets by IP address.

A hook function supplied by the developer can return one of the following verdicts:
* `NF_DROP`: drop the packet.
* `NF_ACCEPT`: let the packet through.
* `NF_STOLEN`: transfer ownership of the packet to the hook function, which then has to manage the resources held by the packet itself.
* `NF_QUEUE`: send the packet to nfqueue.
* `NF_REPEAT`: call this hook function again.

## Blocking specific URLs with Netfilter

Following the code of [netfilter_block](https://github.com/tr0y-kim/netfilter_block), we can write an application that blocks specific URLs. The common ad hosts listed in [Use the iptables firewall to block ads on your Linux machine](https://securitronlinux.com/debian-testing/use-the-iptables-firewall-to-block-ads-on-your-linux-machine/) are a good starting block list.

To filter by domain, we can simply drop the DNS query, although this may also make the domain unreachable over every other protocol.

To process packets in user space, we can use the nfqueue facility provided by Netfilter. We use [GNU gperf](https://www.gnu.org/software/gperf/) to build a hash table of the hosts to block; whenever the host extracted from a packet is in our block list, the packet is dropped.

Note that almost all ad sites now serve their content over HTTPS. Browsers such as Chrome and Firefox block HTTP resources loaded from an HTTPS page:
* [Mixed Content Block in Chrome](https://chromestatus.com/feature/6263395770695680)
* [Mixed Content Block in Firefox](https://support.mozilla.org/en-US/kb/mixed-content-blocking-firefox)

Because most websites use HTTPS, an advertiser that does not use HTTPS cannot have its ads displayed on most websites. This already reduces the number of ads and gives a cleaner browsing experience.

To filter further, for example by path, we need a way to read the unencrypted content.

One approach resembles a [man-in-the-middle attack](https://en.wikipedia.org/wiki/Man-in-the-middle_attack): set up a proxy server so that a client opening a connection receives the proxy's own certificate, and all encryption and decryption is done with the proxy's own key pair.

Another approach uses [eBPF](https://ebpf.io/applications/) to trace functions in shared libraries, such as OpenSSL's `SSL_write`, and capture the data in user space before it is encrypted.

:::warning
If the SSL information can be obtained early at the user level, it may be possible to trace the packets. See [Debugging with eBPF Part 3: Tracing SSL/TLS connections](https://blog.px.dev/ebpf-openssl-tracing/)
:notes: jserv
:::

## Blocking known ad URLs with a kernel module

Following [adriver](https://github.com/Jongy/adriver), we block known ad URLs with a Linux kernel module. A kernel module can register hooks inside the kernel to process packets:

```c
static struct nf_hook_ops blocker_ops = {
    .hook = blocker_hook,
    .pf = NFPROTO_IPV4,
    .hooknum = NF_INET_LOCAL_OUT,
};

static int mod_init(void)
{
    return nf_register_net_hook(&init_net, &blocker_ops);
}

static void mod_exit(void)
{
    nf_unregister_net_hook(&init_net, &blocker_ops);
}
```

The hook function receives three parameters:
* `priv` is the object supplied in the `priv` field of `nf_hook_ops`
* `skb` is the `sk_buff` of the packet
* `state` carries information about the packet, including the devices and the network namespace

```c
static unsigned int blocker_hook(void *priv,
                                 struct sk_buff *skb,
                                 const struct nf_hook_state *state)
{
    ...
```

For `sk_buff`, see [Socket Buffer](http://vger.kernel.org/~davem/skb.html); this structure holds the packet data.

### Extracting TCP and UDP payloads from `sk_buff`

To obtain the host, we first have to extract the payload from the TCP or UDP segment.

An `sk_buff` is not necessarily contiguous in memory, so reading past the linear part could touch the wrong memory. We therefore call `skb_linearize()` to turn the `sk_buff` into one contiguous buffer before handing out a pointer to the payload.

The UDP version, `extract_udp_data()`, replaces `tcp->doff * 4` with `sizeof(struct udphdr)`, because the UDP header has a fixed length whereas the TCP header may carry additional options.

```c
static int extract_tcp_data(struct sk_buff *skb, char **data)
{
    struct iphdr *ip = NULL;
    struct tcphdr *tcp = NULL;
    size_t data_off, data_len;

    if (!skb || !(ip = ip_hdr(skb)) || IPPROTO_TCP != ip->protocol)
        return -1;  // not ip - tcp

    if (!(tcp = tcp_hdr(skb)))
        return -1;  // bad tcp

    /* data length = total length - ip header length - tcp header length */
    data_off = ip->ihl * 4 + tcp->doff * 4;

    /* No payload; this also guards the unsigned subtraction below. */
    if (skb->len <= data_off)
        return -1;
    data_len = skb->len - data_off;

    /* Make the buffer contiguous before exposing a pointer into it. */
    if (skb_linearize(skb))
        return -1;

    *data = skb->data + data_off;

    return data_len;
}
```

### Extracting the host and blocking

Once `extract_udp_data()` has returned the payload, we parse the DNS query directly and use the perfect hash function generated by gperf to check whether the host is in the block list. If it is, the packet is dropped.

The DNS wire format is described in [RFC 1035](https://datatracker.ietf.org/doc/html/rfc1035).

```c
/* Extract UDP data */
len = extract_udp_data(skb, &data);
if (len > 0) {
    if (ntohs(udp_hdr(skb)->dest) == 53) {
        /* Extract host from data */
        dns_protocol->parse_packet(data, len, &host);
        /* Drop packet if host is within block list */
        if (host) {
            result = in_word_set(host, strlen(host)) ? 1 : 0;
        }
        kfree(host);
    }
}
```

### Blocking by path

HTTP packets are handled in kernel space, so we can only use a simple [glob](https://man7.org/linux/man-pages/man7/glob.7.html) pattern to block them:

```c
else if (strncmp(data, "GET ", sizeof("GET ") - 1) == 0) {
    /* HTTP */
    result = glob_match("*ad[bcfgklnpqstwxyz_.=?-]*", data + 4);
}
```

HTTPS requests are handled in user space, so there we can use POSIX [<regex.h>](https://man7.org/linux/man-pages/man3/regcomp.3.html) for matching, and pass the result straight to the kernel module through a [character device driver](https://sysprog21.github.io/lkmpg/#character-device-drivers), encoded in the offset of an `lseek()` call.

Because a pid is a signed integer and valid pids are never negative, its most significant bit is free, and we store the result there:

```c
const char regexp[] = "[/_.?\\-]ad[bcfgklnpqstwxyz/_.=?\\-]";
regex_t preg;

void handle_sniff(void *ctx, int cpu, void *data, unsigned int data_sz)
{
    struct data_t *d = data;
    uint32_t result = 0;

    if (d->buf[0] == 'G' && d->buf[1] == 'E' && d->buf[2] == 'T') {
        int r = regexec(&preg, d->buf, 0, NULL, 0);
        if (!r)
            result = 1;
    }
    /* Bit 31 carries the verdict, the low 31 bits carry the pid. */
    lseek(fd, d->pid | (result << 31), SEEK_SET);
}
```

The Netfilter hook runs asynchronously with respect to the user-space program, and one process can send many packets, so we need a mechanism to match each packet with its verdict:

1. We declare two linked lists, `order` and `verdict`. The former records the timestamp at which each packet reached the hook, sorted by time; the latter records the pid and the verdict.
2. On every poll, the entry at the front of `order` looks for its own pid in `verdict` and takes the verdict.
3.
   If no entry is found, or the wait times out, eBPF did not observe the corresponding encryption call and the packet can only be let through.

The details are as follows:

```c
struct queue_st {
    struct list_head head;
    union {
        ktime_t timestamp;
        pid_t pid;
    };
};

struct list_head order_head, verdict_head;
```

Entries are added to `order` with `insert_order(time)`; only one caller can insert into `order` at a time.

```c
ktime_t time = ktime_get();
struct queue_st *order = insert_order(time);
```

```c
struct queue_st *insert_order(ktime_t timestamp)
{
    struct list_head *cur;
    struct queue_st *entry, *order;

    /* Only one caller may insert at a time; give up if the lock is taken. */
    if (!mutex_trylock(&insert_mutex))
        return NULL;

    /* Netfilter hooks run under rcu_read_lock(), so do not sleep here. */
    order = kmalloc(sizeof(struct queue_st), GFP_ATOMIC);
    if (!order) {
        mutex_unlock(&insert_mutex);
        return NULL;
    }
    order->timestamp = timestamp;

    /* Keep the list sorted oldest first: stop at the first newer entry. */
    list_for_each (cur, &order_head) {
        entry = list_entry(cur, struct queue_st, head);
        if (entry->timestamp > timestamp)
            break;
    }
    /* Insert before cur; if the loop ran to the end, this is the tail. */
    list_add_tail(&order->head, cur);

    mutex_unlock(&insert_mutex);
    return order;
}
```

`insert_verdict()` takes no lock, on the grounds that the user-space program calls it synchronously.

```c
void insert_verdict(pid_t pid)
{
    struct queue_st *verdict;
    verdict = kmalloc(sizeof(struct queue_st), GFP_KERNEL);
    if (!verdict)
        return;
    verdict->pid = pid;
    list_add_tail(&verdict->head, &verdict_head);
}
```

To fetch the verdict, the hook keeps calling `poll_verdict()` until it times out. The function uses the timestamp passed in to check whether the caller is the first entry of `order`; if so, it looks up the matching pid in `verdict` and takes the result.

```c
while (result == -1 && ktime_to_ms(ktime_sub(ktime_get(), time)) < 50) {
    result = poll_verdict(time, current->pid);
}
```

```c
int poll_verdict(ktime_t timestamp, pid_t pid)
{
    struct queue_st *verdict, *first, *match = NULL;
    int ret = -1;

    if (list_empty(&order_head) || list_empty(&verdict_head))
        return -1;

    first = list_first_entry(&order_head, struct queue_st, head);
    if (first->timestamp != timestamp)
        return -1;

    list_for_each_entry (verdict, &verdict_head, head) {
        /* Low 31 bits hold the pid, bit 31 holds the verdict. */
        pid_t cpid = verdict->pid & ((1U << 31) - 1);
        int result = (u32) verdict->pid >> 31;
        if (cpid == pid) {
            ret = result;
            match = verdict;
            break;
        }
    }

    /* Without a match the loop cursor is not a valid entry, so only
     * unlink and free a verdict that was actually found. */
    if (match) {
        list_del(&match->head);
        kfree(match);
    }
    list_del(&first->head);
    kfree(first);
    return ret;
}
```

If the verdict is to block, we finally send a TCP reset to the server to force the connection closed, and drop the packet:

```c
if (result > 0) {
    send_server_reset(skb, state);
    return NF_DROP;
}
```

## Capturing the HTTP header before encryption with eBPF

To use eBPF, we have to build an additional user-space program and have it pass data to our kernel module. We use [libbpf](https://github.com/libbpf/libbpf) to load our eBPF program.

We only need the HTTP path, not the complete HTTP request. A small buffer also fits directly on the eBPF stack, which is limited to 512 bytes, so no extra map has to be defined for it.

To map a verdict back to a packet, we also need the pid of the program.

```c
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
} tls_event SEC(".maps");

#define BUF_MAX_LEN 256

struct data_t {
    unsigned int pid;
    int len;
    unsigned char buf[BUF_MAX_LEN];
};
```

We then use `bpf_perf_event_output()` to send the request and the pid to our user-space program:

```c
SEC("uprobe")
int BPF_KPROBE(probe_SSL_write, void *ssl, char *buf, int num)
{
    unsigned long long current_pid_tgid = bpf_get_current_pid_tgid();
    unsigned int pid = current_pid_tgid >> 32;

    int len = num;
    if (len < 0)
        return 0;

    struct data_t data;
    data.pid = pid;
    /* The mask keeps the length provably bounded for the verifier. */
    data.len = (len < BUF_MAX_LEN ? (len & (BUF_MAX_LEN - 1)) : BUF_MAX_LEN);
    bpf_probe_read_user(data.buf, data.len, buf);

    bpf_perf_event_output(ctx, &tls_event, BPF_F_CURRENT_CPU, &data,
                          sizeof(struct data_t));
    return 0;
}
```

Finally, our program calls `perf_buffer__poll()` to receive the data emitted by the BPF program, which is delivered to the callback function we supply, `handle_sniff()`:

```c
pb = perf_buffer__new(bpf_map__fd(skel->maps.tls_event), 8, &handle_sniff,
                      NULL, NULL, NULL);
if (libbpf_get_error(pb)) {
    fprintf(stderr, "Failed to create perf buffer\n");
    return 0;
}
...
printf("All ok. Sniffing plaintext now\n");

while (1) {
    int err = perf_buffer__poll(pb, 1);
    if (err < 0) {
        printf("Error polling perf buffer: %d\n", err);
        break;
    }
}
```

## Known issues

### libbpf attach

libbpf can locate the address a probe should attach to from the path of a binary and a function name. Sometimes, however, it cannot read the corresponding symbol. In that case a script has to compute the offset of the required symbol, which is then passed to libbpf.

In addition, a browser's encryption functions are not necessarily found in directories that hold shared libraries, such as `/usr/lib` or `/lib/x86_64-linux-gnu`. Firefox, for example, keeps the relevant shared libraries in its own directory, while Chrome does not expose the relevant symbols, so reverse engineering is needed to find the offsets of the corresponding functions.

### HTTP/2

HTTP/2 cannot be handled at present; such packets can only be let through.
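
As an illustration of the pass-through case, the following is a minimal sketch of how the user-space handler could at least recognise the start of an HTTP/2 connection. It assumes the first plaintext buffer captured from `SSL_write` begins with the 24-octet HTTP/2 client connection preface; `is_http2_preface()` and `HTTP2_PREFACE` are hypothetical names that are not part of the project.

```c
#include <stddef.h>
#include <string.h>

/* The 24-octet client connection preface defined by the HTTP/2 spec. */
static const char HTTP2_PREFACE[] = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n";
#define HTTP2_PREFACE_LEN (sizeof(HTTP2_PREFACE) - 1)

/* Hypothetical helper (not part of the project): returns 1 when the
 * plaintext buffer starts with the HTTP/2 client preface, 0 otherwise. */
static int is_http2_preface(const unsigned char *buf, size_t len)
{
    if (!buf || len < HTTP2_PREFACE_LEN)
        return 0;
    return memcmp(buf, HTTP2_PREFACE, HTTP2_PREFACE_LEN) == 0;
}
```

A handler such as `handle_sniff()` could call this before running the regular expression and report "no verdict" for that pid. Later HTTP/2 frames are binary and carry HPACK-compressed headers, so the `GET` prefix check used for HTTP/1.1 would not match them anyway.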