
Linux 核心專題: 透過 Netfilter 自動過濾廣告

執行人: Cheng5840
期末專題影片

Reviewed by charliechiou

鳥哥文章最後更新內容時間為 2011 年,這之間 Netfilter 的行為或是可完成的內容是否有更動 ?

Reviewed by RealBigMickey

目前的封包判斷邏輯似乎由 user-space 中的 handle_sniff() 進行,為什麼不直接使用 eBPF 進行邏輯判斷,而是等待 user-space 回傳判斷結果,增加的通訊成本呢?
這樣的設計是否有特別考量?例如是否是為了保留彈性,還是出於 regex 執行限制的考量?

Reviewed by jserv

改寫書寫,避免只是張貼程式碼,這是專題報告,展現自身專業的所在。

TODO: 重現去年實驗並紀錄相關問題

2024 年報告
確保可在 Linux v6.8 以上運作
彙整你的認知並整理相關背景知識
探討針對 YouTube 一類網站過濾廣告的限制和解決方案

TODO: 以 eBPF 解析 SSL/TLS 加密封包

針對 HTTPS 內容,藉由 eBPF 來解析封包,從而進行內容的過濾和阻擋
應當研讀以 eBPF 打造 TCP 伺服器來強化相關認知

去年解說影片
2025 年 Linux 核心設計課程期末專題: 透過 Netfilter 自動過濾廣告

How to decode SSL in case of HTTPS packets

問題探討

為何不完全依賴 eBPF 進行封包判斷?

eBPF 執行環境受限:eBPF 程式執行於 kernel space,雖然具備近 native 的效能,但也必須通過 kernel verifier 的嚴格檢查。許多高階功能,例如正規表示式(regex)比對、記憶體動態配置(如 malloc)、遞迴、系統呼叫等,在 eBPF 中都是不允許的。

正規表示式無法在 eBPF 執行:由於我們希望攔截如 "GET /ad.js" 等類型的廣告封包內容,因此需使用 regex 對 SSL 明文做語意判斷。而 regex 處理只能由 user-space 的 C library 完成(如 regexec()),這是 kernel space 無法實現的功能。

保留彈性與可擴展性:封包過濾邏輯可能隨使用者需求變動,若將邏輯硬編於 kernel 中,不利於更新與除錯。使用者空間實作可大幅提高系統的靈活性與維護性。

封包判斷的 50ms 超時設計?

queue 操作中的競爭控制

本專案設計中,有兩個重要的鏈結串列:order_head 與 verdict_head,分別存放待判斷的封包與 user-space 回傳的判決。由於這些串列在不同上下文下會同時被修改,必須保證其操作具有 thread safety。

int ret = mutex_trylock(&insert_mutex);
if (ret != 0) { ... } 

insert_order() 中使用 mutex_trylock: 保護 order_head,避免多個核心同時插入造成 race condition。使用 trylock 而非 lock() 是為了避免阻塞在 kernel context 中導致 sleep,而破壞 kernel module 的安全性。若鎖被佔用,則本次封包會直接略過,不會被插入 queue

if (atomic_read(&device_opened) && data[0] == 0x17) { ... }

使用 atomic_read() 來讀取 device 開啟狀態: device_opened 為一全域 flag,用來確認是否有使用者空間程式正在運行,避免無人判斷卻仍將封包送入 queue。

使用 atomic_t 能保證多核心環境下的同步性,而不需要額外加鎖。
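atomic flag 的語意可用 C11 stdatomic 在 user-space 做個類比(僅為概念示意,kernel 中用的是 atomic_t API):

```c
#include <assert.h>
#include <stdatomic.h>

/* 類比 kernel 的 atomic_t device_opened:
 * open 時設為 1,release 時清為 0,hook 端只做原子讀取 */
static atomic_int device_opened;

static void dev_open(void)    { atomic_store(&device_opened, 1); }
static void dev_release(void) { atomic_store(&device_opened, 0); }

/* 對應 block_hook() 的判斷:有 user-space 程式在聽、且是 TLS Application Data */
static int should_enqueue(unsigned char first_byte)
{
    return atomic_load(&device_opened) && first_byte == 0x17;
}
```

如此 hook 端讀取 flag 不需要任何鎖,也不會因等待鎖而在 kernel context 中 sleep。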

阻擋廣告效果不好

專案依賴 user-space 的 SSL_write() hook 來取得明文:目前的 eBPF uprobe 掛在 libssl.so 的 SSL_write() 上,攔截的是 outgoing request 的明文,例如 curl → SSL_write("GET /some_ad_script.js")。因此只能擋下「使用 libssl 的 outgoing TLS request」;但許多瀏覽器或廣告 SDK 使用 BoringSSL、NSS、quiche 等非 OpenSSL 實作,甚至根本沒用標準 library,這些流量完全攔不到。

開發環境

$ uname -rs
Linux 6.11.0-21-generic

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
    CPU family:           6
    Model:                142
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             11
    CPU(s) scaling MHz:   76%
    CPU max MHz:          4600.0000
    CPU min MHz:          400.0000
    BogoMIPS:             3999.93
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     8 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7

source: https://github.com/ItisCaleb/Netfilter-Adblock

$ make user
./generate_hash.sh .
gcc -o adblock adblock.c tls.c http.c -lnetfilter_queue
adblock.c:14:10: fatal error: http.h: No such file or directory
   14 | #include "http.h"
      |          ^~~~~~~~
compilation terminated.
cc1: fatal error: tls.c: No such file or directory
compilation terminated.
cc1: fatal error: http.c: No such file or directory
compilation terminated.
make: *** [Makefile:21: user] Error 1

專案中沒有 http.h, tls.c, http.c

雖然在 https://github.com/steven523/Netfilter-Adblock 有這些檔案,但不知道為何該專案內沒有。

$ make kernel
./generate_hash.sh .
make unload
make[1]: Entering directory '/home/daniel/NCKU/linux2025/Netfilter-Adblock'
sudo rmmod adblock || true >/dev/null
rmmod: ERROR: Module adblock is not currently loaded
make[1]: Leaving directory '/home/daniel/NCKU/linux2025/Netfilter-Adblock'
make -C /lib/modules/6.11.0-21-generic/build M=/home/daniel/NCKU/linux2025/Netfilter-Adblock modules
make[1]: Entering directory '/usr/src/linux-headers-6.11.0-21-generic'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
  You are using:           gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
  CC [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/kadblock.o
In file included from /home/daniel/NCKU/linux2025/Netfilter-Adblock/kadblock.c:17:
hosts.gperf:88:1: warning: no previous prototype for ‘in_word_set’ [-Wmissing-prototypes]
hosts.gperf: In function ‘hash’:
hosts.gperf:54:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:56:7: note: here
hosts.gperf:58:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:60:7: note: here
hosts.gperf:61:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:63:7: note: here
hosts.gperf:65:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:67:7: note: here
hosts.gperf:68:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:70:7: note: here
hosts.gperf:72:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:74:7: note: here
hosts.gperf:75:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:77:7: note: here
hosts.gperf:78:14: warning: this statement may fall through [-Wimplicit-fallthrough=]
hosts.gperf:80:7: note: here
  CC [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/dns.o
  CC [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/send_close.o
  CC [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/verdict_ssl.o
  LD [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/adblock.o
  MODPOST /home/daniel/NCKU/linux2025/Netfilter-Adblock/Module.symvers
  CC [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/adblock.mod.o
  LD [M]  /home/daniel/NCKU/linux2025/Netfilter-Adblock/adblock.ko
  BTF [M] /home/daniel/NCKU/linux2025/Netfilter-Adblock/adblock.ko
Skipping BTF generation for /home/daniel/NCKU/linux2025/Netfilter-Adblock/adblock.ko due to unavailability of vmlinux
make[1]: Leaving directory '/usr/src/linux-headers-6.11.0-21-generic'
make load
make[1]: Entering directory '/home/daniel/NCKU/linux2025/Netfilter-Adblock'
sudo insmod adblock.ko
make[1]: Leaving directory '/home/daniel/NCKU/linux2025/Netfilter-Adblock'

儘管有一些非致命警告,但模組有正確載入

後來發現是因為沒有執行 git submodule update --init --recursive 下載相依的 submodule

$ lsmod | grep adblock
adblock               217088  0

目前在載入 adblock module 後,執行 curl http://googleads.g.doubleclick.net 依舊會得到結果,似乎不如預期。

$ curl http://googleads.g.doubleclick.net
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://marketingplatform.google.com/about/enterprise/">here</A>.
</BODY></HTML>

struct iphdr *ip = ip_hdr(skb);
struct tcphdr *tcp = tcp_hdr(skb);
pr_info("[adblock] Intercepted TCP data, %pI4:%u -> %pI4:%u, length=%d, content=%.50s\n",
        &ip->saddr, ntohs(tcp->source),
        &ip->daddr, ntohs(tcp->dest),
        len, data);

print_hex_dump(KERN_INFO, "payload: ", DUMP_PREFIX_OFFSET, 16, 1, data,
               len > 128 ? 128 : len, true);

為了要印出封包內容而新增 print_hex_dump()

$ curl http://neverssl.com

$ dmesg -w 

[1290160.551307] [adblock] Intercepted TCP data, 192.168.211.41:46132 -> 34.223.124.45:80, length=75, content=GET / HTTP/1.1
                 Host: neverssl.com
                 User-Agent: cu
[1290160.551330] payload: 00000000: 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d 0a  GET / HTTP/1.1..
[1290160.551337] payload: 00000010: 48 6f 73 74 3a 20 6e 65 76 65 72 73 73 6c 2e 63  Host: neverssl.c
[1290160.551343] payload: 00000020: 6f 6d 0d 0a 55 73 65 72 2d 41 67 65 6e 74 3a 20  om..User-Agent: 
[1290160.551347] payload: 00000030: 63 75 72 6c 2f 38 2e 35 2e 30 0d 0a 41 63 63 65  curl/8.5.0..Acce
[1290160.551352] payload: 00000040: 70 74 3a 20 2a 2f 2a 0d 0a 0d 0a                 pt: */*....

其他封包

[1290189.378798] [adblock] Intercepted TCP data, 192.168.211.41:40958 -> 20.189.172.73:443, length=517, content=\x16\x03\x01\x02
[1290189.378822] payload: 00000000: 16 03 01 02 00 01 00 01 fc 03 03 85 6a ac 5c f3  ............j.\.
[1290189.378830] payload: 00000010: 8e 19 d8 03 27 36 f8 bc 08 24 e5 d7 4d 89 db 55  ....'6...$..M..U
[1290189.378836] payload: 00000020: 66 8e 3b d1 c3 f7 38 7f 8e fd 4f 20 72 c7 4e e1  f.;...8...O r.N.
[1290189.378840] payload: 00000030: 2c 6a 08 96 3d 2f 84 f0 8b 2b 19 96 d9 de 0f be  ,j..=/...+......
[1290189.378845] payload: 00000040: c8 2f fb 93 a4 fe 93 33 51 48 4d 90 00 24 13 01  ./.....3QHM..$..
[1290189.378850] payload: 00000050: 13 02 13 03 c0 2f c0 2b c0 30 c0 2c c0 27 cc a9  ...../.+.0.,.'..
[1290189.378854] payload: 00000060: cc a8 c0 09 c0 13 c0 0a c0 14 00 9c 00 9d 00 2f  .............../
[1290189.378859] payload: 00000070: 00 35 01 00 01 8f 00 00 00 2e 00 2c 00 00 29 77  .5.........,..)w
...
[1290192.930104] [adblock] Intercepted TCP data, 192.168.211.41:41914 -> 20.189.172.73:443, length=517, content=\x16\x03\x01\x02
[1290192.930127] payload: 00000000: 16 03 01 02 00 01 00 01 fc 03 03 d0 ed 81 02 dc  ................
[1290192.930134] payload: 00000010: 15 8f cc b4 f6 a0 87 41 95 99 03 89 48 87 7d 99  .......A....H.}.
[1290192.930140] payload: 00000020: f7 8e da 8e 7e c4 2a ee 6f c9 30 20 eb 05 b9 47  ....~.*.o.0 ...G
[1290192.930145] payload: 00000030: d4 16 f3 0b a3 f8 45 46 f5 a7 de 03 67 1b b0 9e  ......EF....g...
[1290192.930150] payload: 00000040: bb b9 82 43 43 d6 61 7a d8 78 4b 0d 00 24 13 01  ...CC.az.xK..$..
[1290192.930154] payload: 00000050: 13 02 13 03 c0 2f c0 2b c0 30 c0 2c c0 27 cc a9  ...../.+.0.,.'..
[1290192.930159] payload: 00000060: cc a8 c0 09 c0 13 c0 0a c0 14 00 9c 00 9d 00 2f  .............../
[1290192.930163] payload: 00000070: 00 35 01 00 01 8f 00 00 00 2e 00 2c 00 00 29 77  .5.........,..)w
...

Netfilter

參考: 鳥哥

Linux 的 Netfilter 機制到底可以做些什麼事情呢?其實可以進行的分析工作主要有:

  • 拒絕讓 Internet 的封包進入主機的某些埠口
    這個應該不難瞭解吧!例如你的 port 21 這個 FTP 相關的埠口,若只想要開放給內部網路的話,那麼當 Internet 來的封包想要進入你的 port 21 時,就可以將該資料封包丟掉!因為我們可以分析的到該封包表頭的埠口號碼呀!
  • 拒絕讓某些來源 IP 的封包進入
    例如你已經發現某個 IP 主要都是來自攻擊行為的主機,那麼只要來自該 IP 的資料封包,就將他丟棄!這樣也可以達到基礎的安全呦!
  • 拒絕讓帶有某些特殊旗標 (flag) 的封包進入
    最常拒絕的就是帶有 SYN 的主動連線的旗標了!只要一經發現,嘿嘿!你就可以將該封包丟棄呀!
  • 分析硬體位址 (MAC) 來決定連線與否
    如果你的區域網路裡面有比較搗蛋的但是又具有比較高強的網路功力的高手時,如果你使用 IP 來抵擋他使用網路的權限,而他卻懂得反正換一個 IP 就好了,都在同一個網域內嘛! 同樣還是在搞破壞~怎麼辦?沒關係,我們可以鎖死他的網路卡硬體位址啊!因為 MAC 是銲在網路卡上面的,所以你只要分析到該使用者所使用的 MAC 之後,可以利用防火牆將該 MAC 鎖住,呵呵!除非他能夠一換再換他的網路卡來取得新的 MAC,否則換 IP 是沒有用的啦!


當一個網路封包要進入到主機之前,會先經由 NetFilter 進行檢查,那就是 iptables 的規則了。 檢查通過則接受 (ACCEPT) 進入本機取得資源,如果檢查不通過,則可能予以丟棄 (DROP) ! 上圖中主要的目的在告知你:『規則是有順序的』!例如當網路封包進入 Rule 1 的比對時, 如果比對結果符合 Rule 1 ,此時這個網路封包就會進行 Action 1 的動作,而不會理會後續的 Rule 2, Rule 3 等規則的分析了。


source: https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg

簡易版


source : https://linux.vbird.org/linux_server/centos6/0250simple_firewall.php

Netfilter 與 eBPF 分別運作在 OSI 七層模型的哪些 layer?

eBPF

Linux 核心設計: eBPF
Linux 核心設計: 透過 eBPF 觀察作業系統行為
Linux Extended BPF (eBPF) Tracing Tools

eBPF: Unlocking the Kernel[OFFICIAL DOCUMENTARY]

every packet that goes to facebook.com runs through eBPF at the XDP layer
Cilium:最初的 eBPF 是組合語言層級的工具,對一般 end users 難以使用;為了把 eBPF 強大的功能帶給 end users,因此創立 Cilium
18:00 最初是想要建立一個全新的網路層
BPF LSM
An eBPF application usually consists out of at least two parts:
A user-space program (USP) that declares the kernel space program and attaches it to the relevant tracepoint/probe.
A kernel-space program (KSP) is what gets triggered and runs inside the kernel once the tracepoint/probe is met. This is where the actual eBPF logic is implemented.

eBPF 能夠在封包尚未進入 kernel 網路堆疊前(XDP 層)先過濾封包
然而,不論是 Netfilter 或 eBPF,它們拿到 SSL/TLS 加密過的封包(HTTPS)時,看到的都只是密文,因此無法依內容進行有效過濾。
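補充:TLS 雖然加密了 payload,但 record 外層的標頭仍是明文,第一個位元組即 content type;這也是稍後 block_hook() 能以 data[0] == 0x17 篩出 Application Data、以及前面 dmesg 中 ClientHello 以 16 03 01 開頭的原因。概念示意:

```c
#include <assert.h>
#include <string.h>

/* TLS record 的第一個位元組是 content type(RFC 5246 / RFC 8446):
 * 0x14 ChangeCipherSpec、0x15 Alert、0x16 Handshake、0x17 Application Data */
static const char *tls_record_type(unsigned char b)
{
    switch (b) {
    case 0x14: return "change_cipher_spec";
    case 0x15: return "alert";
    case 0x16: return "handshake";
    case 0x17: return "application_data";
    default:   return "not_tls";
    }
}
```

也就是說,即使看不到明文,仍能從 record 標頭辨識封包「是哪一種 TLS 訊息」,只是無從得知內容。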

因此該專案嘗試直接封鎖特定 IP(廣告投遞商)往來的封包,以達到過濾廣告的目的?

https://eunomia.dev/tutorials/0-introduce/
What Makes eBPF So Powerful?

  • Direct Kernel Interaction: eBPF programs execute within the kernel, interacting with system-level events such as network packets, system calls, or tracepoints.
  • Safe Execution: eBPF ensures safety through a verifier that checks the logic of the program before it runs, preventing potential kernel crashes or security breaches.
  • Minimal Overhead: eBPF achieves near-native execution speed by employing a Just-In-Time (JIT) compiler, which translates eBPF bytecode into optimized machine code for the specific architecture.

eBPF skeleton

libbpf Overview
BPF application typically goes through the following phases:

  • Open phase: BPF object file is parsed: BPF maps, BPF programs, and global variables are discovered, but not yet created. After a BPF app is opened, it’s possible to make any additional adjustments (setting BPF program types, if necessary; pre-setting initial values for global variables, etc), before all the entities are created and loaded.

  • Load phase: BPF maps are created, various relocations are resolved, BPF programs are loaded into the kernel and verified. At this point, all the parts of a BPF application are validated and exist in kernel, but no BPF program is yet executed. After the load phase, it’s possible to set up initial BPF map state without racing with the BPF program code execution.

  • Attachment phase: This is the phase at which BPF programs get attached to various BPF hook points (e.g., tracepoints, kprobes, cgroup hooks, network packet processing pipeline, etc). This is the phase at which BPF starts performing useful work and read/update BPF maps and global variables.

  • Tear down phase: BPF programs are detached and unloaded from the kernel. BPF maps are destroyed and all the resources used by the BPF app are freed.

Generated BPF skeleton has corresponding functions to trigger each phase:

<name>__open() – creates and opens BPF application;
<name>__load() – instantiates, loads, and verifies BPF application parts;
<name>__attach() – attaches all auto-attachable BPF programs (it’s optional, you can have more control by using libbpf APIs directly);
<name>__destroy() – detaches all BPF programs and frees up all used resources

BIO(basic I/O) in openSSL

IPC in Linux

device files

$ ls -l 的輸出中,mode 共有 10 個字元。第一個字元是檔案類型;後面 9 個字元則是傳統的三組 rwx 權限

| 第一字元 | 檔案型態 |
| --- | --- |
| - | 一般 (regular) 檔案 |
| d | 目錄 (directory) |
| l | 符號連結 (symbolic link) |
| b | 區塊裝置 (block device) |
| c | 字元裝置 (character device) |
| p | 命名管線 (FIFO) |
| s | Unix domain socket |
# cat /proc/devices
Character devices:
/*major name*/
  1 mem
  4 /dev/vc/0
...
261 accel
511 adbdev

Block devices:
  7 loop
...
254 mdp
259 blkext

Are the major, minor number unique
From The Linux Programming Interface, §14.1

Each device file has a major ID number and a minor ID number. The major ID identifies the general class of device, and is used by the kernel to look up the appropriate driver for this type of device. The minor ID uniquely identifies a particular device within a general class. The major and minor IDs of a device file are displayed by the ls ­-l command.
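major/minor 與 dev_t 的組合拆解,可用 glibc 的 makedev()/major()/minor()(宣告於 <sys/sysmacros.h>)驗證;此處以後文 ls -l 看到的 nvme0n1p1 (259, 1) 為例:

```c
#include <assert.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* 組合 (major, minor) 成 dev_t,再拆解回來驗證 */
static void check_devnum(void)
{
    dev_t dev = makedev(259, 1);   /* 例:nvme0n1p1 的 (259, 1) */
    assert(major(dev) == 259);
    assert(minor(dev) == 1);
}
```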

block device

區塊裝置會以固定大小的資料區塊(通常 512 bytes 或 4KB)為單位進行 I/O 操作,支援 random access,常用於 硬碟、SSD、USB 隨身碟
$ ls -l /dev/ | grep '^b' 或 $ lsblk

# ls -l /dev/ | grep '^b'
/* --------------------------major---minor--------------------*/
brw-rw----   1 root   disk      7,     0  6月 23 15:12 loop0
brw-rw----   1 root   disk      7,     1  6月 23 15:12 loop1
...
brw-rw----   1 root   disk    259,     0  6月 23 15:12 nvme0n1
brw-rw----   1 root   disk    259,     1  6月 23 15:12 nvme0n1p1
brw-rw----   1 root   disk    259,     2  6月 23 15:12 nvme0n1p2

character device

負責處理以一個一個字元(byte)為單位進行資料傳輸的裝置。這種裝置資料是線性、串流的,不支援 random access。

你在 user-space 開啟 /dev/xxx 時,Linux 核心會根據 major number 找到對應的 device driver,再透過 file_operations 提供的 callback 處理。

device driver 不能簡稱為「驅動」,否則會造成理解的困難,使用明確的話語「裝置驅動程式」

pipes

socket

Code Trace

kadblock.c

mod_init()

模組的啟動入口,負責註冊字元設備、初始化資料結構,並把 Netfilter hook 掛上去

static int mod_init(void)
{
    major = register_chrdev(0, DEV_NAME, &fops);
    if (major < 0) {
        pr_alert("Registering char device failed with %d\n", major);
        return major;
    }
    cls = class_create(THIS_MODULE, DEV_NAME);
    device_create(cls, NULL, MKDEV(major, 0), NULL, DEV_NAME);

    init_verdict();
    return nf_register_net_hook(&init_net, &blocker_ops);
}

register_chrdev()

static inline int register_chrdev(unsigned int major, const char *name,
				  const struct file_operations *fops)
{
	return __register_chrdev(major, 0, 256, name, fops);
}
  • major : 要註冊的 major number,若傳 0,kernel 會自動分配一個
  • name : 裝置名稱,用於 /proc/devices 與內部識別
  • fops : 指向 struct file_operations 的指標,定義 open/read/write 等行為
    return
  • ≥0 : 成功,回傳分配的 major number(若你傳入 0)或你指定的值
  • <0 : 失敗,通常是 -EBUSY(已存在)、-ENOMEM(資源不足)等

pr_alert("msg") 等同 printk(KERN_ALERT "msg")(KERN_ALERT 與格式字串是字串串接,不是獨立參數)

mod_init 流程:
呼叫 register_chrdev(0, "adbdev", &fops),動態取得一個 major number,註冊字元設備,接著透過 class_create("adbdev") 先在 /sys/class/adbdev 建 class,再用 device_create 建立真正的 /dev/adbdev 節點,供 usermode 程式操作

  1. mod_init 及 mod_exit 可以分別加上 __init 及 __exit 嗎? 這樣是否會節省記憶體使用,__init 及 __exit 具體行為尚待釐清
  2. linux network namespace 是甚麼需要再了解
    network_namespaces(7) — Linux manual page
  3. 現在的 mod_init 是回傳 nf_register_net_hook(),那如果 hook 註冊失敗,先前建立的 device 依然存在但功能缺失,應該要自動清除。另外,自 Linux v6.4 起 class_create() 已移除 module 參數,在 v6.11 上應寫成 class_create(DEV_NAME)。因此是否要改成
-static int mod_init(void)
+static int __init mod_init(void)
 {
+    int ret;
+
     major = register_chrdev(0, DEV_NAME, &fops);
     if (major < 0) {
-        pr_alert("Registering char device failed with %d\n", major);
+        pr_alert("Registering char device failed: %d\n", major);
         return major;
     }
-    cls = class_create(THIS_MODULE, DEV_NAME);
-    device_create(cls, NULL, MKDEV(major, 0), NULL, DEV_NAME);
+
+    cls = class_create(DEV_NAME);
+    if (IS_ERR(cls)) {
+        ret = PTR_ERR(cls);
+        unregister_chrdev(major, DEV_NAME);
+        pr_alert("class_create failed: %d\n", ret);
+        return ret;
+    }
+
+    dev = device_create(cls, NULL, MKDEV(major, 0), NULL, DEV_NAME);
+    if (IS_ERR(dev)) {
+        ret = PTR_ERR(dev);
+        class_destroy(cls);
+        unregister_chrdev(major, DEV_NAME);
+        pr_alert("device_create failed: %d\n", ret);
+        return ret;
+    }
 
     init_verdict();
-    return nf_register_net_hook(&init_net, &blocker_ops);
+    ret = nf_register_net_hook(&init_net, &blocker_ops);
+    if (ret) {
+        device_destroy(cls, MKDEV(major, 0));
+        class_destroy(cls);
+        unregister_chrdev(major, DEV_NAME);
+        pr_alert("nf_register_net_hook failed: %d\n", ret);
+    }
+    return ret;
 }

Error Codes in Linux

extract_tcp_data()

static int extract_tcp_data(struct sk_buff *skb, char **data)
{
    struct iphdr *ip = NULL;
    struct tcphdr *tcp = NULL;
    size_t data_off, data_len;

    if (!skb || !(ip = ip_hdr(skb)) || IPPROTO_TCP != ip->protocol)
        return -1;  // not ip - tcp

    if (!(tcp = tcp_hdr(skb)))
        return -1;  // bad tcp

    /* data length = total length - ip header length - tcp header length */
    data_off = ip->ihl * 4 + tcp->doff * 4;
    data_len = skb->len - data_off;

    if (data_len == 0)
        return -1;

    if (skb_linearize(skb))
        return -1;

    *data = skb->data + data_off;

    return data_len;
}
data_off = ip->ihl * 4 + tcp->doff * 4;
data_len = skb->len - data_off;

ip->ihl 為 Header 長度(以 32-bit words 為單位),因此乘以 4 後即為 IP header 的實際 byte 數,tcp->doff 是 TCP Header Length(同樣單位為 32-bit words)
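舉例來說:無 option 的 IP header(ihl = 5,即 20 bytes)加上帶 12 bytes option 的 TCP header(doff = 8,即 32 bytes),payload 偏移即 52 bytes。以下是這條公式的最小示意:

```c
#include <assert.h>

/* 依 extract_tcp_data() 的公式計算 payload 偏移;
 * ihl 與 doff 的單位都是 32-bit words,所以各乘以 4 */
static unsigned payload_offset(unsigned ihl, unsigned doff)
{
    return ihl * 4 + doff * 4;
}
```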

  1. 為甚麼 data_off 只需要計算 IP header、TCP header 的偏移量?

    因為在 Netfilter hook 階段,skb->data 已經指向 IP header,MAC header 先前已被跳過。具體來說(由 GPT 回答,有待求證):
    driver 呼叫的 eth_type_trans() 中,透過 skb_pull() 把 skb->data 從 MAC header 移到 IP header
    這是在 Netfilter NF_INET_PRE_ROUTING 之前發生的事

    所以在 hook 階段,ip = ip_hdr(skb) 等價於 ip = (struct iphdr *) skb->data

skb->len = skb->tail - skb->data

block_hook()

先確認封包是否為 TLS Application Data(payload 首位元組為 0x17),若是則建立 order 並等待 user-space 判決:

if (len > 0) {
    ...
    if (atomic_read(&device_opened) && data[0] == 0x17) {

        /*  TLS application */
        ktime_t time = ktime_get();
        struct queue_st *order = insert_order(time);
        result = -1;

        while (result == -1 &&
               ktime_to_ms(ktime_sub(ktime_get(), time)) < 50) {
            result = poll_verdict(time, current->pid);
        }
        if (result == -1) {
            list_del(&order->head);
            kfree(order);
        }
    }
        

ktime_t time = ktime_get(); 取得當前的高精度時間(nanosecond 精度),作為該封包的唯一「時間戳記」,也是暫存隊列的識別碼

struct queue_st *order = insert_order(time); 新增一個封包處理單(order)到 order_head 鍊結串列中,order 結構儲存該封包的時間戳記與用來比對的狀態,目的是等一下讓 user-space 根據時間戳去回傳 verdict

result = -1; 初始設為未決定狀態

while(...){...} 不斷輪詢 poll_verdict() 函式,查看有沒有 user-space 程式根據這個時間戳傳入 verdict,最多等 50ms。如果超過時間或得到結果就跳出。

static loff_t adbdev_lseek(struct file *file, loff_t offset, int orig)
{
    pid_t pid = offset;
    insert_verdict(pid);
    return 0;
}

static struct file_operations fops = {.owner = THIS_MODULE,
                                      .open = adbdev_open,
                                      .release = adbdev_release,
                                      .llseek = adbdev_lseek};

當呼叫 ssl_sniff.c 中的 handle_sniff()->lseek(fd, pid | (verdict << 31), 0); 時,在 kernel 就會觸發 adbdev_lseek()

verdict

verdict_ssl.c

核心模組攔截到 TLS 封包時不立即決定是否放行,而是暫存該封包的處理單 queue_st 到 order_head 串列中,讓 user-space 來分析封包內容並做決定。

insert_order()

int ret = mutex_trylock(&insert_mutex);
if (ret != 0) {
    ...   /* 成功取得鎖,插入新的 order 後返回 */
}
return NULL;

嘗試取得互斥鎖,避免其他執行緒同時修改 order_head,若沒成功,就不插入任何東西,返回 NULL

  1. 甚麼時候會有其他執行緒?

  2. mutex_trylock() return 甚麼?

    A:
    mutex_trylock() 與 mutex_lock() 差異:
    mutex_trylock() : 嘗試取得鎖,如果鎖已被持有,不會阻塞,而是立即回傳 0,取得成功則回傳 1

    mutex_lock(): 嘗試取得鎖,如果鎖已經被其他 context 持有,會進入 blocking 直到可以取得鎖為止
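值得注意的是,kernel 的 mutex_trylock() 成功回傳 1、失敗回傳 0,與 POSIX 的 pthread_mutex_trylock()(成功回傳 0、被佔用回傳 EBUSY)恰好相反。以下以 pthread 做 user-space 類比,僅為語意示意:

```c
#include <assert.h>
#include <pthread.h>

/* 仿 kernel mutex_trylock() 的回傳語意:1 = 取得鎖,0 = 被佔用;
 * 底層的 pthread_mutex_trylock() 不會阻塞,立即回傳 */
static int kernel_style_trylock(pthread_mutex_t *m)
{
    return pthread_mutex_trylock(m) == 0;
}
```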

list_for_each (cur, &order_head) {
    order = list_entry(cur, struct queue_st, head);
    if (order->timestamp < timestamp)
        break;
}

為了讓 order_head 保持依照 timestamp 排序,先找到第一個 timestamp 小於新封包的 entry,然後用 list_add() 把新的 order 插在它前面;如此 order_head 會是「由新到舊」(時間遞減)的順序

  1. 為甚麼 order_head 順序是時間遞減,這樣舊的封包不就要等新進來的先處理完嗎? 舊的有可能都沒辦法被處理到

    A: 因為 poll_verdict() 內的 first = list_first_entry(&order_head, struct queue_st, head); if (!first || first->timestamp != timestamp) return -1; 是將最新封包跟 linked list 第一筆比較 timestamp

    那如果真是如此,為甚麼 insert_order() 中會使用 list_add(&order->head, cur); 這樣似乎會出錯。
    例子:
    order_head → [A:300] → [B:200] → [C:100] → NULL

    list_for_each (cur, &order_head) {
        order = list_entry(cur, struct queue_st, head);
        if (order->timestamp < 250)
            break;
    }
    

    第一次:order = A,timestamp = 300 → 300 > 250,不 break
    第二次:order = B,timestamp = 200 → 200 < 250 → break!
    此時 cur 指到 B 節點 → 所以接下來執行 list_add(&new_order->head, cur);
    把新節點插入在 cur(B)後面 → 出錯 !!!
    order_head → [A:300] → [B:200]→ [new:250] → [C:100]

    所以是否應將 list_add(&order->head, cur); 改為 list_add_tail(&order->head, cur); ??
    list_add source code

    /*
     * Insert a new entry between two known consecutive entries.
     *
     * This is only for internal list manipulation where we know
     * the prev/next entries already!
     */
    static inline void __list_add(struct list_head *new,
                      struct list_head *prev,
                      struct list_head *next)
    {
        if (!__list_add_valid(new, prev, next))
            return;

        next->prev = new;
        new->next = next;
        new->prev = prev;
        WRITE_ONCE(prev->next, new);
    }

    /**
     * list_add - add a new entry
     * @new: new entry to be added
     * @head: list head to add it after
     *
     * Insert a new entry after the specified head.
     * This is good for implementing stacks.
     */
    static inline void list_add(struct list_head *new, struct list_head *head)
    {
        __list_add(new, head, head->next);
    }
    
  2. 甚麼情況會有在 kernel-space 時,封包不按照 timestamp call inser_order() 的情況 ?
    如果不會發生這種狀況,那是不是可以不用再走訪 order_head,而是直接插入 head 的下一個。

order = kmalloc(sizeof(struct queue_st), GFP_KERNEL);
order->timestamp = timestamp;
list_add(&order->head, cur);
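上述疑慮可用 user-space 的簡化模型驗證:在「由新到舊」的串列中,必須插在 cur 之前(對應 list_add_tail(new, cur) 的語意)才能維持排序。以下是仿 list.h 介面的假設性重現:

```c
#include <assert.h>

/* 仿 include/linux/list.h 的最小 doubly linked list */
struct node {
    struct node *prev, *next;
    long ts;
};

static void list_init(struct node *h) { h->prev = h->next = h; }

/* 對應 __list_add():把 n 插在 prev 與 next 之間 */
static void insert_between(struct node *n, struct node *prev, struct node *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/* 對應 list_add_tail(n, cur) 的語意:插在 cur 之前 */
static void add_before(struct node *n, struct node *cur)
{
    insert_between(n, cur->prev, cur);
}

/* 在遞減串列中找到第一個 ts 較小的節點,插在它之前;
 * 若改用 list_add()(插在 cur 之後)就會得到 300, 200, 250, 100 的錯序 */
static void insert_sorted_desc(struct node *head, struct node *n)
{
    struct node *cur;
    for (cur = head->next; cur != head; cur = cur->next)
        if (cur->ts < n->ts)
            break;
    add_before(n, cur);
}
```

就這個簡化模型而言,插入 250 後串列確實維持 300 → 250 → 200 → 100 的遞減順序。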

insert_verdict()

這個函式的目的是:接收來自 user-space (ssl_sniff.c) 的判決資訊,並加到 verdict_head 鏈結串列中

void insert_verdict(pid_t pid)
{
    struct queue_st *verdict;
    verdict = kmalloc(sizeof(struct queue_st), GFP_KERNEL);  // 分配記憶體
    verdict->pid = pid;                                       // 儲存含判決的 pid
    list_add_tail(&verdict->head, &verdict_head);             // 插入鏈結串列末端
}

pid 設計:

  • MSB:代表判決結果(1=block, 0=allow)
  • 其餘 31 位元:代表 pid_t,即使用者程式的 PID
  1. 為甚麼會想到 pid 這樣設計 ?
    以及為甚麼會想到以 union 作為 queue_st 的 member
struct queue_st {
    struct list_head head;
    union {
        ktime_t timestamp;
        pid_t pid;
    };
};
  1. 為甚麼是使用 list_add_tail 插入末端而不是頭 ?
  2. 為甚麼 order_head, verdict_head 都是使用 queue_st ?
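這個 MSB 打包/解包可以獨立驗證(對應 handle_sniff() 的 d->pid | result << 31,以及 poll_verdict() 的解碼):

```c
#include <assert.h>
#include <stdint.h>

/* MSB 放判決(1 = block, 0 = allow),低 31 位元放 pid */
static uint32_t pack(uint32_t pid, uint32_t verdict)
{
    return pid | (verdict << 31);
}

static uint32_t unpack_pid(uint32_t v)     { return v & ((1U << 31) - 1); }
static uint32_t unpack_verdict(uint32_t v) { return v >> 31; }
```

由於 Linux 的 pid 上限(pid_max)遠小於 2^31,借用 MSB 不會與真實 pid 衝突。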

poll_verdict()

poll_verdict(timestamp, pid)
比對第一筆 order_head 的 timestamp 是否為我們要等的,再檢查 verdict_head 中有沒有這個 pid 的結果,如果有就回傳 0 或 1(NF_ACCEPT 或 NF_DROP),同時刪掉 queue 與 verdict

first = list_first_entry(&order_head, struct queue_st, head); if (!first || first->timestamp != timestamp) return -1;

為甚麼只跟第一個 element 比 ?

list_for_each_entry (verdict, &verdict_head, head) {
        pid_t cpid = verdict->pid & ((1U << 31) - 1);
        int result = (u32) verdict->pid >> 31;
        if (cpid == pid) {
            ret = result;
            break;
        }
    }
list_del(&verdict->head);
list_del(&first->head);
kfree(verdict);
kfree(first);

即使 ret = -1,也會清除該筆封包與 verdict。這設計是強制一筆封包最多等一次

| 函式 | 來源 | 作用 |
| --- | --- | --- |
| insert_order() | kernel | 暫存等待判決的封包(由 TLS 封包觸發) |
| insert_verdict() | user-space (ssl_sniff.c) | 回傳對某筆封包的允許/封鎖判斷 |
| poll_verdict() | kernel | 檢查是否已有 verdict 結果(最多等 50ms)並做出動作 |

insert_verdict() 與 poll_verdict() 如何與 user-space 和 kernel-space 互動尚不了解 !!

ssl_sniff.c

ssl_sniff.c 是使用者空間程式,它透過 eBPF perf buffer 監控 TLS 封包內容(在函式像是 SSL_write() 中攔截明文),再根據正規表示式比對是否為廣告,並透過寫入 /dev/adbdev 回報封包判決(允許或封鎖)給 kernel 模組 adblock.ko

handle_sniff()

void handle_sniff(void *ctx, int cpu, void *data, unsigned int data_sz)
{
    struct data_t *d = data;
    uint32_t result = 0;
    if (d->buf[0] == 'G' && d->buf[1] == 'E' && d->buf[2] == 'T') {
        int r = regexec(&preg, d->buf, 0, NULL, 0);
        if (!r)
            result = 1;
    }
    lseek(fd, d->pid | result << 31, 0);
}

透過 lseek() 寫入一個 MSB 含判斷結果的 pid,觸發 kernel /dev/adbdev 的 .llseek() 實作,其中 .llseek() 呼叫的是 insert_verdict(pid)

  1. 為什麼要用 lseek,而不是使用 ioctl ?
  2. 兩者在效能上有什麼差異?

main

int main(int argc, char *argv[])
{
    if (argc == 1 || !(argc & 1)) {
        printf("wrong argument count\n");
        printf("Usage: %s <libpath1> <func1> <libpath2> <func2>\n", argv[0]);
        exit(0);
    }

    int ret = regcomp(&preg, regexp, REG_NOSUB | REG_ICASE);
    assert(ret == 0);


    struct ssl_sniff_bpf *skel;
    struct perf_buffer *pb = NULL;
    skel = ssl_sniff_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to open and load BPF skeleton\n");
        return 1;
    }

    for (int i = 1; i < argc; i += 2) {
        printf("Attaching %s in %s\n", argv[i + 1], argv[i]);
        struct bpf_uprobe_opts *ops = malloc(sizeof(struct bpf_uprobe_opts));
        ops->sz = sizeof(*ops);
        ops->ref_ctr_offset = 0x6;
        ops->retprobe = false;
        ops->func_name = argv[i + 1];
        struct bpf_link *link = bpf_program__attach_uprobe_opts(skel->progs.probe_SSL_write, -1,
                                        argv[i], 0, ops);
        if(!link)
            printf("Error attaching %s in %s\n", argv[i + 1], argv[i]);
    }

    pb = perf_buffer__new(bpf_map__fd(skel->maps.tls_event), 8, &handle_sniff,
                          NULL, NULL, NULL);
    if (libbpf_get_error(pb)) {
        fprintf(stderr, "Failed to create perf buffer\n");
        return 0;
    }

    printf("Opening adbdev...\n");
    fd = open("/dev/adbdev", O_WRONLY);
    if (fd < 0) {
        printf(
            "Failed to open adbdev.\nIt could be due to another program"
            "using it or the kernel module not being loaded.\n");
        exit(1);
    }

    printf("All ok. Sniffing plaintext now\n");
    while (1) {
        int err = perf_buffer__poll(pb, 1);
        if (err < 0) {
            printf("Error polling perf buffer: %d\n", err);
            break;
        }
    }
    return 0;
}

ssl_sniff.bpf.c

屬於 kernel-space
這支程式的功能是:攔截 user-space 中的 SSL_write() 呼叫,把 TLS 明文資料透過 perf buffer 傳到 user-space

<bpf/bpf_tracing.h> 提供 tracing 語法糖,例如 BPF_KPROBE

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
} tls_event SEC(".maps");

建立一個 perf event buffer map,類型為 BPF_MAP_TYPE_PERF_EVENT_ARRAY,可用來 即時將資料傳送給使用者空間程式,這種 map 是用於「事件推送」而非查詢(與 hash map 相對)

#define BUF_MAX_LEN 256
struct data_t {
    unsigned int pid;
    int len;
    unsigned char buf[BUF_MAX_LEN];
};

要傳給使用者空間的資料格式:

  • pid:來源進程的 process ID
  • len:實際寫入的資料長度(上限 256)
  • buf:從 SSL_write() 中取出的明文資料

SEC("uprobe")
int BPF_KPROBE(probe_SSL_write, void *ssl, char *buf, int num)
{
    unsigned long long current_pid_tgid = bpf_get_current_pid_tgid();
    unsigned int pid = current_pid_tgid >> 32;
    int len = num;
    if (len < 0)
        return 0;

    struct data_t data = {};
    data.pid = pid;
    data.len = (len < BUF_MAX_LEN ? (len & BUF_MAX_LEN - 1) : BUF_MAX_LEN);
    bpf_probe_read_user(data.buf, data.len, buf);
    bpf_perf_event_output(ctx, &tls_event, BPF_F_CURRENT_CPU, &data,
                          sizeof(struct data_t));

    return 0;
}

定義了一個 hook 點,掛載在使用者空間的 SSL_write() 函式,攔截 SSL_write() 可以拿到明文

#include <openssl/ssl.h>
int SSL_write(SSL *ssl, const void *buf, int num);

openSSL doc

SSL_write_ex() and SSL_write() write num bytes from the buffer buf into the specified ssl connection. On success SSL_write_ex() will store the number of bytes written in *written.

BPF_KPROBE() 是語法糖,會展開為:

SEC("uprobe/SSL_write")
int probe_SSL_write(struct pt_regs *ctx, ...)

BPF_F_CURRENT_CPU 表示推送到目前執行的 CPU 上對應的 perf buffer

  1. 在呼叫 SSL_write() 之前,訊息是明文,但經過 SSL_write() 之後就是密文。這部分在 OpenSSL doc 中並沒有明確提及;Stack Overflow 有提到 SSL_write 及 SSL_read 會自動做加解密的動作,但它們是怎麼被實作的呢?
    似乎要了解 BIO(Basic I/O)in OpenSSL

  2. probe (kprobe, uprobe) 是怎麼運作的?

  3. User-space 如何接收資料?
    user space 程式(ssl_sniff.c)會註冊 perf_buffer__new(map_fd, ... , callback); 然後透過 perf_buffer__poll(); 不斷接收這些從 kernel 傳來的 data_t 結構資料。

Linux source

ip_hdr()

https://github.com/torvalds/linux/blob/master/include/linux/ip.h

static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
{
	return (struct iphdr *)skb_network_header(skb);
}

https://github.com/torvalds/linux/blob/master/include/linux/skbuff.h

static inline unsigned char *skb_network_header(const struct sk_buff *skb)
{
	return skb->head + skb->network_header;
}

https://github.com/torvalds/linux/blob/master/include/linux/skbuff.h#L883

iphdr

https://github.com/torvalds/linux/blob/master/include/uapi/linux/ip.h#L87

struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
	__u8	ihl:4,
		version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
	__u8	version:4,
  		ihl:4;
#else
#error	"Please fix <asm/byteorder.h>"
#endif
	__u8	tos;
	__be16	tot_len;
	__be16	id;
	__be16	frag_off;
	__u8	ttl;
	__u8	protocol;
	__sum16	check;
	__struct_group(/* no tag */, addrs, /* no attrs */,
		__be32	saddr;
		__be32	daddr;
	);
	/*The options start here. */
};
| 欄位 | 說明 |
| --- | --- |
| version | IP 版本,IPv4 固定為 4 |
| ihl | Header 長度(以 32-bit words 計算) |
| tos | 服務類型(Type of Service) |
| tot_len | 封包總長度(標頭 + 資料) |
| id | 封包識別碼,用於分段重組 |
| frag_off | 分段資訊與 flags(如 Don't Fragment) |
| ttl | 存活時間(Hop 數限制) |
| protocol | 上層協定號碼(如 TCP=6、UDP=17) |
| check | 標頭 checksum 檢查 |
| saddr | Source IP(IPv4) |
| daddr | Destination IP(IPv4) |

tcphdr

https://github.com/torvalds/linux/blob/master/include/uapi/linux/tcp.h#L25C1-L60C3

struct tcphdr {
	__be16	source;
	__be16	dest;
	__be32	seq;
	__be32	ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
	__u16	ae:1,
		res1:3,
		doff:4,
		fin:1,
		syn:1,
		rst:1,
		psh:1,
		ack:1,
		urg:1,
		ece:1,
		cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
	__u16	doff:4,
		res1:3,
		ae:1,
		cwr:1,
		ece:1,
		urg:1,
		ack:1,
		psh:1,
		rst:1,
		syn:1,
		fin:1;
#else
#error	"Adjust your <asm/byteorder.h> defines"
#endif
	__be16	window;
	__sum16	check;
	__be16	urg_ptr;
};
| 欄位 | 說明 |
| --- | --- |
| source | TCP 來源埠號 |
| dest | TCP 目的埠號 |
| seq | 封包序列號 |
| ack_seq | 確認應答序號 |
| doff | Header 長度(以 32-bit words 為單位) |
| fin ~ cwr | TCP 控制旗標(如 SYN、ACK、FIN) |
| window | 流量控制視窗大小 |
| check | TCP 檢查碼 |
| urg_ptr | 緊急指標 |

atomic_read()

Linux 提供 atomic_t 型別來做多核心安全的數值讀寫。它能保證在 SMP(多處理器)環境下:

  • 操作是 不可中斷的
  • 不需要加鎖(比 mutex 更輕量)
  • 適合儲存計數器、旗標等數值

sk_buff


https://amsekharkernel.blogspot.com/2014/08/what-is-skb-in-linux-kernel-what-are.html

regexec()

regular-expressions.info
regex(3) — Linux manual page

regexec() is used to match a null-terminated string against the
compiled pattern buffer in *preg, which must have been initialised
with regcomp(). eflags is the bitwise OR of zero or more of the
following flags:

regcomp() returns zero for a successful compilation or an error
code for failure.
regexec() returns zero for a successful match or REG_NOMATCH for
failure.
regerror() returns the size of the buffer required to hold the
string.

範例程式:

#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main() {
    regex_t regex;              // 存放編譯後的正規表示式
    const char *pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$";
    const char *test_str = "user@example.com";
    int ret;

    // 編譯正規表示式
    ret = regcomp(&regex, pattern, REG_EXTENDED); // 編譯成功會回傳 0,失敗回傳錯誤碼
    if (ret != 0) {  // 編譯失敗
        char msgbuf[100]; // 存放錯誤訊息
        regerror(ret, &regex, msgbuf, sizeof(msgbuf)); // 將錯誤碼轉為錯誤訊息,存進 msgbuf
        fprintf(stderr, "Regex compilation failed: %s\n", msgbuf);
        return 1;
    }

    // 使用 regexec() 比對字串
    ret = regexec(&regex, test_str, 0, NULL, 0);
    if (ret == 0) {
        printf(" Match found: \"%s\"\n", test_str);
    } else if (ret == REG_NOMATCH) {
        printf(" No match: \"%s\"\n", test_str);
    } else {
        char msgbuf[100];
        regerror(ret, &regex, msgbuf, sizeof(msgbuf));
        fprintf(stderr, "Regex match error: %s\n", msgbuf);
    }

    // 清除 regex_t 資源
    regfree(&regex);

    return 0;
}

Windows 無法直接編譯該程式(<regex.h> 屬於 POSIX 介面,MSVC 未提供)

lseek()

#include <unistd.h>
off_t lseek(int fd, off_t offset, int whence);

用來在檔案中移動讀寫位址

lseek() repositions the file offset of the open file description associated with the file descriptor fd to the argument offset according to the directive whence as follows:

  • SEEK_SET The file offset is set to offset bytes.
  • SEEK_CUR The file offset is set to its current location plus offset bytes.
  • SEEK_END The file offset is set to the size of the file plus offset bytes.

lseek() allows the file offset to be set beyond the end of the
file (but this does not change the size of the file). If data is
later written at this point, subsequent reads of the data in the
gap (a "hole") return null bytes ('\0') until data is actually
written into the gap.
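檔案洞的行為可用以下小程式驗證:在一個空的暫存檔 seek 到 offset 100 再寫入 1 byte,讀回洞中的 100 bytes 皆為 '\0':

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* 以 mkstemp 建暫存檔,示範 lseek 超過檔尾後形成的「洞」 */
static void demo_hole(void)
{
    char path[] = "/tmp/hole-XXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);

    /* 檔案大小為 0,直接 seek 到 offset 100 再寫入 1 byte */
    assert(lseek(fd, 100, SEEK_SET) == 100);
    assert(write(fd, "x", 1) == 1);

    /* 讀回洞中的 100 bytes:全部是 '\0' */
    char buf[100];
    assert(lseek(fd, 0, SEEK_SET) == 0);
    assert(read(fd, buf, sizeof(buf)) == 100);
    for (int i = 0; i < 100; i++)
        assert(buf[i] == '\0');

    close(fd);
    unlink(path);
}
```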