Linux Kernel Project: Building a Firewall with XDP

Contributors: jhin1228, D4nnyLee
Project presentation video (this is a private video and cannot be viewed)

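Question list

Task summary

Study XDP Firewall, explain how it works, and go through its GitHub issues to find things that can be improved or implementation bugs that can be reported.

Reference: A Beginners Guide to eBPF Programming for Networking. eBPF hooks into networking components in kernel space (such as the TCP/IP stack) with logic written as conditional checks, so that a packet arriving at the NIC is classified and acted on directly in the kernel.

TODO: Analyze how XDP Firewall works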

eXpress Data Path (XDP)

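eXpress Data Path (XDP) is one of the BPF program types of the extended Berkeley Packet Filter (eBPF). Each program type has its own hook point and its own set of helper functions, and the input and output formats of the eBPF program differ as well.

The bpf_prog_type values are listed below: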

/* /usr/src/linux-headers-5.4.0-148/include/uapi/linux/bpf.h */
enum bpf_prog_type {
    BPF_PROG_TYPE_UNSPEC,
    BPF_PROG_TYPE_SOCKET_FILTER,
    BPF_PROG_TYPE_KPROBE,
    BPF_PROG_TYPE_SCHED_CLS,
    BPF_PROG_TYPE_SCHED_ACT,
    BPF_PROG_TYPE_TRACEPOINT,
    BPF_PROG_TYPE_XDP, // XDP
    BPF_PROG_TYPE_PERF_EVENT,
    BPF_PROG_TYPE_CGROUP_SKB,
    BPF_PROG_TYPE_CGROUP_SOCK,
    BPF_PROG_TYPE_LWT_IN,
    BPF_PROG_TYPE_LWT_OUT,
    BPF_PROG_TYPE_LWT_XMIT,
    BPF_PROG_TYPE_SOCK_OPS,
    BPF_PROG_TYPE_SK_SKB,
    BPF_PROG_TYPE_CGROUP_DEVICE,
    BPF_PROG_TYPE_SK_MSG,
    BPF_PROG_TYPE_RAW_TRACEPOINT,
    BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
    BPF_PROG_TYPE_LWT_SEG6LOCAL,
    BPF_PROG_TYPE_LIRC_MODE2,
    BPF_PROG_TYPE_SK_REUSEPORT,
    BPF_PROG_TYPE_FLOW_DISSECTOR,
    BPF_PROG_TYPE_CGROUP_SYSCTL,
    BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
    BPF_PROG_TYPE_CGROUP_SOCKOPT,
};

XDP processes packets arriving from a network device early in the receive path. Depending on the hook point, there are three modes: generic XDP (in the kernel networking stack), native XDP (in the NIC driver), and offloaded XDP (on the NIC itself).

[Figure: the three XDP hook points]

Native and offloaded XDP require support from the network device itself.

eBPF Architecture

[Figure: overall eBPF architecture]

The eBPF architecture can be divided into three parts: writing and compiling the BPF program, loading the BPF bytecode into the kernel and setting up the hook point, and passing data between kernel space and user space through BPF maps.

Loading a BPF program into the Linux kernel

[Figure: loading a BPF program into the kernel]

Taking XDP Firewall as an example, the packet-filtering logic is written as a BPF program, compiled by Clang/LLVM into BPF bytecode (BPF instructions), and handed to a loader (BCC, libbpf, ...) that loads the BPF ELF object file into the kernel through a system call.

Inside the kernel, a struct bpf_prog is created first; this structure represents the BPF bytecode in the kernel. The bytecode is then copied from user space to kernel space and verified for safety. Finally a file descriptor is allocated and returned to the user-space process for later use.

BPF bytecode: machine code that can be executed by a virtual machine. It is called bytecode because every opcode in the BPF instruction set is one byte long.

BPF instruction (BPF bytecode): BPF instructions form a virtual instruction set, similar to the instruction set that assembly language targets.

BPF virtual machine: can be thought of as an interpreter (the kernel can also JIT-compile the bytecode to native code).

[Figure: BPF virtual machine architecture]
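To make the "one-byte opcode" remark concrete, here is a minimal userspace sketch of the instruction encoding. It mirrors the field layout of struct bpf_insn from the UAPI header; the demo_ names and the two encoder helpers are assumptions made for illustration, not kernel or libbpf functions.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirror of struct bpf_insn from include/uapi/linux/bpf.h.
 * The demo_ names are ours; only the field layout follows the header. */
struct demo_bpf_insn {
    uint8_t code;        /* opcode, exactly one byte */
    uint8_t dst_reg : 4; /* destination register */
    uint8_t src_reg : 4; /* source register */
    int16_t off;         /* signed offset */
    int32_t imm;         /* signed immediate constant */
};

/* Every instruction slot is 8 bytes wide. */
static_assert(sizeof(struct demo_bpf_insn) == 8, "one BPF instruction is 8 bytes");

/* Opcode building blocks, with the values used by the UAPI headers. */
enum {
    DEMO_BPF_ALU64 = 0x07, /* instruction class: 64-bit ALU */
    DEMO_BPF_JMP   = 0x05, /* instruction class: jump */
    DEMO_BPF_MOV   = 0xb0, /* ALU operation: move */
    DEMO_BPF_EXIT  = 0x90, /* jump operation: exit */
    DEMO_BPF_K     = 0x00  /* source operand: immediate */
};

/* Encode "dst = imm" as a 64-bit move of an immediate. */
static struct demo_bpf_insn demo_mov64_imm(uint8_t dst, int32_t imm)
{
    struct demo_bpf_insn insn = {
        .code = DEMO_BPF_ALU64 | DEMO_BPF_MOV | DEMO_BPF_K,
        .dst_reg = dst & 0x0f,
        .imm = imm,
    };
    return insn;
}

/* Encode "exit", which returns the value in r0 to the caller. */
static struct demo_bpf_insn demo_exit(void)
{
    struct demo_bpf_insn insn = { .code = DEMO_BPF_JMP | DEMO_BPF_EXIT };
    return insn;
}
```

Under these assumptions, the smallest XDP program that drops every packet would be the two instructions demo_mov64_imm(0, 1) followed by demo_exit(), 16 bytes in total, since 1 is the numeric value of XDP_DROP.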

In XDP Firewall, xdpfw.c acts as the loader and loads the BPF instructions into the kernel through the following call chain:

// Open phase
int loadbpfobj()
  --int bpf_prog_load_xattr() // libbpf.c
      --static struct bpf_object *__bpf_object__open_xattr()
          --static struct bpf_object *__bpf_object__open()
              
// Load phase
int loadbpfobj()
  --int bpf_prog_load_xattr() // libbpf.c
      --int bpf_object__load_xattr()
        --static int bpf_object__load_progs()
          --int bpf_program__load()
            --static int load_program()
              --int bpf_load_program_xattr() // bpf.c
                --static inline int sys_bpf_prog_load()
                  --static inline int sys_bpf()
                    --SYSCALL_DEFINE3() // syscall.c
                      --static int bpf_prog_load()

For more on BPF skeletons and the BPF application lifecycle, see this article.

XDP and Hook

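With the BPF program loaded into the kernel, the next questions are how the hook point for the BPF machine code is set up and what happens when execution reaches that hook point. The discussion continues with the XDP program type.

In XDP Firewall, once xdpfw.c has loaded xdpfw_kern.c, it enters the attach phase, which proceeds as follows: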

// Attach phase
int attachxdp()
  --int bpf_set_link_xdp_fd() // netlink.c
    --static int __bpf_set_link_xdp_fd_replace()
      nla->nla_type = NLA_F_NESTED | IFLA_XDP;
      nla_xdp->nla_type = IFLA_XDP_FD;
        --static int do_setlink() // /net/core/rtnetlink.c
          --int dev_change_xdp_fd() // /net/core/dev.c
            --static int dev_xdp_install()

In do_setlink(), dev_change_xdp_fd() is called according to the XDP flags set earlier.

static int do_setlink(const struct sk_buff *skb,
		      struct net_device *dev, struct ifinfomsg *ifm,
		      struct netlink_ext_ack *extack,
		      struct nlattr **tb, char *ifname, int status)
{
    ...
    if (tb[IFLA_XDP]) {
        ...
        if (xdp[IFLA_XDP_FD]) {
            err = dev_change_xdp_fd(dev, extack,
                        nla_get_s32(xdp[IFLA_XDP_FD]),
                        xdp_flags);
            ...
        }
    }
    ...
}

dev_change_xdp_fd() associates the XDP program (the struct bpf_prog that the fd refers to) with the specified NIC interface.

In this experiment dev->netdev_ops is ixgbe_netdev_ops, and bpf_op is ixgbe_xdp.

This shows that an XDP program is hooked onto the specified interface through the NETLINK_ROUTE family of netlink.

static const struct net_device_ops ixgbe_netdev_ops = {
        ...
	.ndo_bpf		= ixgbe_xdp,
	.ndo_xdp_xmit		= ixgbe_xdp_xmit,
	...
};

int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
		      int fd, u32 flags)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	struct bpf_prog *prog = NULL;
	bpf_op_t bpf_op, bpf_chk;
        ...

	bpf_op = bpf_chk = ops->ndo_bpf;
        ...
	if (fd >= 0) {
            ...
		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
					     bpf_op == ops->ndo_bpf);
            ...
	} else {
            ...
	}

	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
    ...
}

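dev_xdp_install() installs the XDP program through the callback of the network device's driver and performs further setup according to where the XDP program is to be hooked.

In this experiment the hook is in the driver, so xdp.command = XDP_SETUP_PROG;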

static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
			   struct netlink_ext_ack *extack, u32 flags,
			   struct bpf_prog *prog)
{
	struct netdev_bpf xdp;

	memset(&xdp, 0, sizeof(xdp));
	if (flags & XDP_FLAGS_HW_MODE)
		xdp.command = XDP_SETUP_PROG_HW;
	else
		xdp.command = XDP_SETUP_PROG;
	xdp.extack = extack;
	xdp.flags = flags;
	xdp.prog = prog;

	return bpf_op(dev, &xdp);
}

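ixgbe_xdp_setup() records the XDP program (xdp_prog) on each rx_ring.

// Userspace illustration of xchg() from /tools/testing/selftests/powerpc/benchmarks/context_switch.c;
// the kernel's own xchg() is an architecture-specific macro.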
static unsigned long xchg(unsigned long *p, unsigned long val)
{
	return __atomic_exchange_n(p, val, __ATOMIC_SEQ_CST);
}

// /drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
{
	struct ixgbe_adapter *adapter = netdev_priv(dev); // Get network device private data
        ...

	/* verify ixgbe ring attributes are sufficient for XDP */
	for (i = 0; i < adapter->num_rx_queues; i++) {
            ...
	}

        ...

	/* If transitioning XDP modes reconfigure rings */
	if (need_reset) {
            ...
	} else {
		for (i = 0; i < adapter->num_rx_queues; i++)
                    // Record the XDP program (xdp_prog) on the rx_ring
			(void)xchg(&adapter->rx_ring[i]->xdp_prog,
			    adapter->xdp_prog);
	}
        ...
}

static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
	struct ixgbe_adapter *adapter = netdev_priv(dev);

	switch (xdp->command) {
	case XDP_SETUP_PROG:
		return ixgbe_xdp_setup(dev, xdp->prog);
	case XDP_QUERY_PROG:
		xdp->prog_id = adapter->xdp_prog ?
			adapter->xdp_prog->aux->id : 0;
		return 0;
	case XDP_SETUP_XSK_UMEM:
		return ixgbe_xsk_umem_setup(adapter, xdp->xsk.umem,
					    xdp->xsk.queue_id);

	default:
		return -EINVAL;
	}
}

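Reference: __atomic_exchange_n (GCC built-in)

With the XDP program hooked, the next questions are when the hook is triggered and what happens afterwards.

An ordinary network device (SmartNICs aside) has no network processor of its own, so an incoming packet is dropped unless the host processes it. There are two common ways to receive packets:

Busy-polling: dedicate specific CPU cores and processes to the network device and spend 100% of their time receiving packets, as DPDK does.
IRQ (hardware interrupt): when the device receives a packet, it raises an IRQ to tell the CPU that a packet needs processing. Under heavy traffic the interrupt overhead becomes too high, which is why DPDK uses polling instead.

NAPI improves on pure IRQ handling under heavy traffic by combining polling and interrupts:

While NAPI's poll() is running, it pulls packets from the ring buffer in batches (the number of packets per call is bounded by the budget); packets that arrive during this period are picked up without raising IRQs.
Outside poll(), an arriving packet raises an IRQ, and the kernel then calls poll() to receive packets.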
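The interplay between the IRQ and poll() can be sketched as a small userspace state machine. This is a simplified model under stated assumptions, not driver code: struct demo_nic and both functions are hypothetical, and the "fewer packets than budget means the ring is empty, so re-enable the IRQ" rule is the usual NAPI convention.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one RX queue. */
struct demo_nic {
    int pending;         /* packets waiting in the RX ring */
    bool irq_enabled;    /* RX interrupt currently unmasked */
    bool poll_scheduled; /* a poll() pass is queued */
    int irq_count;       /* interrupts actually raised */
};

/* n packets arrive: an IRQ fires only while interrupts are enabled.
 * The IRQ handler masks further IRQs and schedules the poll routine. */
static void demo_packets_arrive(struct demo_nic *nic, int n)
{
    nic->pending += n;
    if (nic->irq_enabled) {
        nic->irq_count++;
        nic->irq_enabled = false;
        nic->poll_scheduled = true;
    }
}

/* One poll() pass: handle at most budget packets. If fewer than budget
 * were handled, the ring is empty, so polling stops and IRQs come back. */
static int demo_poll(struct demo_nic *nic, int budget)
{
    assert(budget > 0);
    int done = nic->pending < budget ? nic->pending : budget;

    nic->pending -= done;
    if (done < budget) {
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

In this model, 150 packets arriving in two bursts cost a single interrupt and three poll passes with a budget of 64 (64, 64, then 22 packets), after which interrupts are enabled again.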

Next, consider the flow after DMA has copied a packet into the Rx ring buffer and an IRQ has been raised (using the ixgbe driver as an example):

// /include/linux/interrupt.h
enum
{
    ...
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    ...
};

// /drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
static irqreturn_t ixgbe_msix_clean_rings() / static irqreturn_t ixgbe_intr()
  --static inline void napi_schedule_irqoff() // /include/linux/netdevice.h
    --void __napi_schedule_irqoff() // net/core/dev.c
      --static inline void ____napi_schedule()
        --__raise_softirq_irqoff(NET_RX_SOFTIRQ) // /kernel/softirq.c

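__raise_softirq_irqoff() raises the NET_RX_SOFTIRQ softirq, whose handler is net_rx_action().

net_rx_action() runs in softirq context, either on return from the hardware interrupt or in the per-CPU ksoftirqd thread.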

// /net/core/dev.c
static __latent_entropy void net_rx_action()
  --budget -= napi_poll(n, &repoll); // In this case, call ixgbe_poll()
    --int ixgbe_poll() // /drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
      --static int ixgbe_clean_rx_irq()
        --static struct sk_buff *ixgbe_run_xdp()

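When the XDP hook is in the driver, execution reaches line 14 of the listing below, the call to ixgbe_run_xdp(). At this point skb is still only a pointer (8 bytes on a 64-bit system); no sk_buff has been built and no packet data has been copied into one.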

static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
			       struct ixgbe_ring *rx_ring,
			       const int budget)
{
        ...
	while (likely(total_rx_packets < budget)) {
                ...
		struct sk_buff *skb;
                ...

		/* retrieve a buffer from the ring */
		if (!skb) {
                ...
			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
		}

		if (IS_ERR(skb)) {
			unsigned int xdp_res = -PTR_ERR(skb);

			if (xdp_res & (IXGBE_XDP_TX | IXGBE_XDP_REDIR)) {
				xdp_xmit |= xdp_res;
				ixgbe_rx_buffer_flip(rx_ring, rx_buffer, size);
			} else {
				rx_buffer->pagecnt_bias++;
			}
			total_rx_packets++;
			total_rx_bytes += size;
		} else if (skb) {
			ixgbe_add_rx_frag(rx_ring, rx_buffer, skb, size);
		} else if (ring_uses_build_skb(rx_ring)) {
			skb = ixgbe_build_skb(rx_ring, rx_buffer,
					      &xdp, rx_desc);
		} else {
			skb = ixgbe_construct_skb(rx_ring, rx_buffer,
						  &xdp, rx_desc);
		}
            ...
    }
        ...
}

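In ixgbe_run_xdp(), the XDP action returned by bpf_prog_run_xdp() determines how the packet is handled.

XDP_PASS: process the packet normally, that is, hand it to the kernel networking stack.
XDP_TX: send the packet back out of the interface it arrived on; suitable for proxies and load balancers.
XDP_REDIRECT: send the packet out of another interface, hand it to another CPU, or redirect it through AF_XDP straight to a user-space process.
XDP_ABORTED: like XDP_DROP, except that an error is also reported through a tracepoint (trace_xdp_exception).
XDP_DROP: drop packets that match the filtering rules at the XDP hook; suitable for DDoS mitigation.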

#define IXGBE_XDP_PASS		0
#define IXGBE_XDP_CONSUMED	BIT(0)
#define IXGBE_XDP_TX		BIT(1)
#define IXGBE_XDP_REDIR	BIT(2)

static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
				     struct ixgbe_ring *rx_ring,
				     struct xdp_buff *xdp)
{
	int err, result = IXGBE_XDP_PASS;
	struct bpf_prog *xdp_prog;
	struct xdp_frame *xdpf;
	u32 act;

	rcu_read_lock();
	xdp_prog = READ_ONCE(rx_ring->xdp_prog);

	if (!xdp_prog)
		goto xdp_out;

	prefetchw(xdp->data_hard_start); /* xdp_frame write */

	act = bpf_prog_run_xdp(xdp_prog, xdp); // XDP hook point!
	switch (act) {
	case XDP_PASS:
		break;
	case XDP_TX:
		xdpf = convert_to_xdp_frame(xdp);
		if (unlikely(!xdpf)) {
			result = IXGBE_XDP_CONSUMED;
			break;
		}
		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
		break;
	case XDP_REDIRECT:
		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
		if (!err)
			result = IXGBE_XDP_REDIR;
		else
			result = IXGBE_XDP_CONSUMED;
		break;
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fallthrough */
	case XDP_ABORTED:
		trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
		/* fallthrough -- handle aborts by dropping packet */
	case XDP_DROP:
		result = IXGBE_XDP_CONSUMED;
		break;
	}
xdp_out:
	rcu_read_unlock();
	return ERR_PTR(-result);
}

bpf_prog_run_xdp() is what actually runs the XDP program once the hook is reached.
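ixgbe_run_xdp() returns its verdict as ERR_PTR(-result), and ixgbe_clean_rx_irq() decodes it with IS_ERR() and PTR_ERR(). The userspace sketch below re-implements these helpers to show why IXGBE_XDP_PASS (0) becomes a NULL skb, which lets the driver go on to build an sk_buff, while any other verdict becomes an "error pointer". The demo_ names are hypothetical; the helper logic follows the kernel's err.h, with MAX_ERRNO taken as 4095.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Verdict bits with the same values as the IXGBE_XDP_* macros above. */
#define DEMO_XDP_PASS     0u
#define DEMO_XDP_CONSUMED (1u << 0)
#define DEMO_XDP_TX       (1u << 1)
#define DEMO_XDP_REDIR    (1u << 2)

/* Small negative numbers are stored in the top 4095 values of the address space. */
static inline void *demo_err_ptr(long error) { return (void *)error; }
static inline long demo_ptr_err(const void *ptr) { return (long)ptr; }
static inline bool demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* What the end of ixgbe_run_xdp() does: wrap the verdict as a pointer. */
static void *demo_run_xdp(unsigned int result)
{
    assert(result <= DEMO_MAX_ERRNO);
    return demo_err_ptr(-(long)result);
}

enum demo_rx_path { DEMO_PATH_BUILD_SKB, DEMO_PATH_XDP_HANDLED };

/* What the caller does: decode the verdict, or fall through to building an skb. */
static enum demo_rx_path demo_classify(const void *skb, unsigned int *xdp_res)
{
    if (demo_is_err(skb)) {
        *xdp_res = (unsigned int)-demo_ptr_err(skb);
        return DEMO_PATH_XDP_HANDLED;
    }
    *xdp_res = DEMO_XDP_PASS;
    return DEMO_PATH_BUILD_SKB; /* skb == NULL: no sk_buff exists yet */
}
```

The design choice this illustrates is that a single return value carries both "no skb yet" and "XDP already dealt with the packet", without an extra output parameter.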

eBPF maps

The following shows how the loader creates the eBPF maps and returns their fds to the user-space process:

// Create ebpf map and get map fd
int loadbpfobj()
  --int bpf_prog_load_xattr() // libbpf.c
    --static int bpf_object__create_maps() // Get map fd
      --static int bpf_object__create_map()
        --int bpf_create_map_xattr() // bpf.c
          --sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
            --static int map_create() // /kernel/bpf/syscall.c
     
    // BPF program relocation
    --static int bpf_object__relocate() 
        --static int bpf_program__relocate()
        ...
                
    // BPF program load
    --static int bpf_object__load_progs()
    ...

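In map_create(), find_and_alloc_map() first allocates memory for the map. Each map type allocates in its own way, but all of them return a struct bpf_map *, the common structure that can represent a map of any type.

Once the memory has been allocated, bpf_map_new_fd() maps the struct bpf_map * to a file descriptor.

There is a predefined array of struct bpf_map_ops pointers, bpf_map_types, in which each struct bpf_map_ops holds the operations of one map type (allocation, element lookup, and so on). find_and_alloc_map() indexes bpf_map_types[type] to get the operations for the requested type directly, rather than walking a chain of if-else tests.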
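The table-driven dispatch described above can be sketched in plain C. All demo_ names, the two map types, and the slot counts are assumptions made for illustration; only the pattern (an array of per-type operation tables indexed by the map type) reflects what the text describes.

```c
#include <assert.h>
#include <stddef.h>

enum demo_map_type { DEMO_MAP_UNSPEC, DEMO_MAP_HASH, DEMO_MAP_ARRAY, DEMO_MAP_TYPE_MAX };

struct demo_map;

/* Per-type operations, playing the role of struct bpf_map_ops. */
struct demo_map_ops {
    int (*map_alloc)(struct demo_map *map, unsigned int max_entries);
};

/* Common representation, playing the role of struct bpf_map. */
struct demo_map {
    const struct demo_map_ops *ops;
    unsigned int slots; /* storage slots the type decided to reserve */
};

static int demo_hash_alloc(struct demo_map *map, unsigned int max_entries)
{
    map->slots = max_entries * 2; /* hypothetical: a hash keeps spare buckets */
    return 0;
}

static int demo_array_alloc(struct demo_map *map, unsigned int max_entries)
{
    map->slots = max_entries;
    return 0;
}

static const struct demo_map_ops demo_hash_ops = { .map_alloc = demo_hash_alloc };
static const struct demo_map_ops demo_array_ops = { .map_alloc = demo_array_alloc };

/* One entry per map type; types without an entry stay NULL. */
static const struct demo_map_ops *const demo_map_types[DEMO_MAP_TYPE_MAX] = {
    [DEMO_MAP_HASH] = &demo_hash_ops,
    [DEMO_MAP_ARRAY] = &demo_array_ops,
};

/* Look up the operations by index and let the type allocate itself. */
static int demo_find_and_alloc_map(struct demo_map *map, unsigned int type,
                                   unsigned int max_entries)
{
    assert(map != NULL);
    if (type >= DEMO_MAP_TYPE_MAX || demo_map_types[type] == NULL)
        return -1; /* unknown or unsupported type */
    map->ops = demo_map_types[type];
    return map->ops->map_alloc(map, max_entries);
}
```

Adding a map type in this scheme means adding one operations table and one array entry; the lookup code itself does not change.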

static int map_create(union bpf_attr *attr)
{
    ...
    struct bpf_map *map;
    int f_flags;
    int err;

    ...

    /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
    map = find_and_alloc_map(attr);
    if (IS_ERR(map))
        return PTR_ERR(map);

    err = bpf_obj_name_cpy(map->name, attr->map_name);
    if (err)
        goto free_map;

    ...

    err = bpf_map_new_fd(map, f_flags);
    if (err < 0) {
        /* failed to allocate fd.
         * bpf_map_put_with_uref() is needed because the above
         * bpf_map_alloc_id() has published the map
         * to the userspace and the userspace may
         * have refcnt-ed it through BPF_MAP_GET_FD_BY_ID.
         */
        bpf_map_put_with_uref(map);
        return err;
    }

    return err;

    ... // handle error and free space
}

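bpf_map_new_fd() is simple: it creates an anonymous file, stores the map allocated earlier in the file's private_data field, and associates an unused fd with the file. The map can then be reached through this fd.

The anon prefix of the function name stands for anonymous, and the "bpf-map" string in the code is the class of the file, not its file name.

private_data is a field in which developers can store their own structure.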

int bpf_map_new_fd(struct bpf_map *map, int flags)
{
    ...
        
    return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
                            flags | O_CLOEXEC);
}

int anon_inode_getfd(const char *name, const struct file_operations *fops,
             void *priv, int flags)
{
    int error, fd;
    struct file *file;

    error = get_unused_fd_flags(flags);
    if (error < 0)
        return error;
    fd = error;

    file = anon_inode_getfile(name, fops, priv, flags);
    if (IS_ERR(file)) {
        error = PTR_ERR(file);
        goto err_put_unused_fd;
    }
    fd_install(fd, file);

    return fd;

err_put_unused_fd:
    ...  // handle error
}

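The path from an fd back to its map can be seen in map_lookup_elem().

fdget() first obtains the file that corresponds to the fd, and __bpf_map_get() then reads the file's private_data field to obtain the struct bpf_map allocated in map_create(), that is, the map the fd refers to.

The function then looks up the value for the key according to the map type, copies the result into value (a buffer obtained with kmalloc()), and finally uses copy_to_user() to copy it to the user-space pointer (uvalue).

In the variables below, the u prefix means user space.

map->ops in the code is the per-type operations table obtained from bpf_map_types[type], as mentioned earlier.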

// Lookup value from the map represented by fd
int bpf_map_lookup_elem()
  --sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    --static int map_lookup_elem();  // /kernel/bpf/syscall.c

static int map_lookup_elem(union bpf_attr *attr)
{
    void __user *ukey = u64_to_user_ptr(attr->key);
    void __user *uvalue = u64_to_user_ptr(attr->value);
    int ufd = attr->map_fd;
    struct bpf_map *map;
    void *key, *value, *ptr;
    u32 value_size;
    struct fd f;
    int err;

    ...

    f = fdget(ufd);
    map = __bpf_map_get(f);
    if (IS_ERR(map))
        return PTR_ERR(map);
    if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
        err = -EPERM;
        goto err_put;
    }

    ...

    key = __bpf_copy_key(ukey, map->key_size); // Copy key from userspace
    if (IS_ERR(key)) {
        err = PTR_ERR(key);
        goto err_put;
    }

    ...

    err = -ENOMEM;
    value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
    if (!value)
        goto free_key;

    ...

    preempt_disable();
    this_cpu_inc(bpf_prog_active);
    if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
        map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
        err = bpf_percpu_hash_copy(map, key, value);
    } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
        err = bpf_percpu_array_copy(map, key, value);
    } ...
    else {
        rcu_read_lock();
        if (map->ops->map_lookup_elem_sys_only)
            ptr = map->ops->map_lookup_elem_sys_only(map, key);
        else
            ptr = map->ops->map_lookup_elem(map, key);
        if (IS_ERR(ptr)) {
            err = PTR_ERR(ptr);
        } 

        ...

        rcu_read_unlock();
    }
    this_cpu_dec(bpf_prog_active);
    preempt_enable();

done:
    if (err)
        goto free_value;

    err = -EFAULT;
    if (copy_to_user(uvalue, value, value_size) != 0)
        goto free_value;

    err = 0;

free_value:
    kfree(value);
free_key:
    kfree(key);
err_put:
    fdput(f);
    return err;
}
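The fd-to-map path (fd, then file, then private_data) can be modelled in userspace as follows. This is a sketch with hypothetical demo_ names and a fixed-size fd table; the one detail borrowed from the kernel is that the file's operations pointer is checked before private_data is trusted as a map.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_FDS 8

struct demo_file_ops { const char *class_name; };

static const struct demo_file_ops demo_bpf_map_fops = { "bpf-map" };
static const struct demo_file_ops demo_other_fops = { "other" };

struct demo_file {
    const struct demo_file_ops *f_op; /* identifies what kind of file this is */
    void *private_data;               /* owner-defined payload */
};

struct demo_map { int value; };

/* Hypothetical per-process fd table. */
static struct demo_file *demo_fd_table[DEMO_MAX_FDS];

/* Like anon_inode_getfd(): fill in the file, bind it to the first unused fd. */
static int demo_getfd(struct demo_file *file, const struct demo_file_ops *fops, void *priv)
{
    assert(file != NULL && fops != NULL);
    for (int fd = 0; fd < DEMO_MAX_FDS; fd++) {
        if (demo_fd_table[fd] == NULL) {
            file->f_op = fops;
            file->private_data = priv;
            demo_fd_table[fd] = file;
            return fd;
        }
    }
    return -1; /* no unused fd left */
}

/* Like fdget() followed by __bpf_map_get(): resolve the fd, verify that the
 * file really is a BPF map file, then return the map stored in private_data. */
static struct demo_map *demo_map_get(int fd)
{
    if (fd < 0 || fd >= DEMO_MAX_FDS || demo_fd_table[fd] == NULL)
        return NULL;
    if (demo_fd_table[fd]->f_op != &demo_bpf_map_fops)
        return NULL; /* some other kind of file */
    return demo_fd_table[fd]->private_data;
}
```

The f_op comparison is what keeps an arbitrary fd (a socket, a regular file) from being misinterpreted as a map.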

`xdpfw_kern.c`

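This file defines the program to be loaded into and run by the kernel, together with the maps used to exchange data with the user-space program. To set them apart from other functions and variables, the XDP program and the maps are marked with SEC() so that they are placed in dedicated sections; the loader can then find the program and maps quickly by parsing the ELF section table.

`xdp_prog_main()`

The ctx parameter gives access to the packet contents: the packet starts at ctx->data and ends at ctx->data_end.

The function does four things: checks integrity, checks the blacklist, records pps/bps, and filters packets.

Integrity check
The function first checks that the packet is complete. As the code below shows, it also checks the packet's protocol (currently only TCP, UDP, and ICMP are supported).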

int xdp_prog_main(struct xdp_md *ctx)
{
    // Initialize data.
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Scan ethernet header.
    struct ethhdr *eth = data;

    ...

    // Initialize IP headers.
    struct iphdr *iph = NULL;
    struct ipv6hdr *iph6 = NULL;
    __u128 srcip6 = 0;

    // Set IPv4 and IPv6 common variables.
    if (eth->h_proto == htons(ETH_P_IPV6))
    {
        iph6 = (data + sizeof(struct ethhdr));

        ...
    }
    else
    {
        iph = (data + sizeof(struct ethhdr));

        ...
    }

    // Check IP header protocols.
    if ((iph6 && iph6->nexthdr != IPPROTO_UDP && iph6->nexthdr != IPPROTO_TCP && iph6->nexthdr != IPPROTO_ICMP) && (iph && iph->protocol != IPPROTO_UDP && iph->protocol != IPPROTO_TCP && iph->protocol != IPPROTO_ICMP))
    {
        return XDP_PASS;
    }
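The BPF verifier only accepts packet accesses that are preceded by a comparison against data_end, so "checking integrity" amounts to proving that each header lies entirely inside [data, data_end). The sketch below shows that check as a userspace helper; demo_hdr_at() and the two length constants are hypothetical names, and the parts elided with "..." in the excerpt above presumably contain checks of this kind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN      14 /* Ethernet header length in bytes */
#define DEMO_IPV4_MIN_HLEN 20 /* IPv4 header length without options */

/* Return a pointer to a header of len bytes located offset bytes into the
 * packet, or NULL if any part of it would lie beyond data_end. */
static const void *demo_hdr_at(const void *data, const void *data_end,
                               size_t offset, size_t len)
{
    const uint8_t *start = data;
    const uint8_t *end = data_end;

    assert(end >= start);
    size_t avail = (size_t)(end - start);

    /* Written so that neither comparison can overflow. */
    if (offset > avail || len > avail - offset)
        return NULL;
    return start + offset;
}
```

A caller would first request the Ethernet header at offset 0, then the IP header at offset DEMO_ETH_HLEN, and treat a NULL result as a truncated packet.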

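Blacklist check
If a packet is judged malicious in the later filtering stage (XDP_DROP is returned) and the matching rule sets blocktime, any packet from the same source IP received within the next blocktime seconds is dropped right at this stage.

The time is tracked by storing the point at which the block should be lifted in ip_blacklist_map or ip6_blacklist_map (depending on whether the address is IPv4 or IPv6). On each pass, this stage checks whether that point has passed to decide whether to keep blocking.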

__u64 now = bpf_ktime_get_ns();

// Check blacklist map.
__u64 *blocked = NULL;

if (iph6)
{
    blocked = bpf_map_lookup_elem(&ip6_blacklist_map, &srcip6);
}
else if (iph)
{
    blocked = bpf_map_lookup_elem(&ip_blacklist_map, &iph->saddr);
}

if (blocked != NULL && *blocked > 0)
{
    if (now > *blocked)
    {
        // Remove element from map.
        if (iph6)
        {
            bpf_map_delete_elem(&ip6_blacklist_map, &srcip6);
        }
        else if (iph)
        {
            bpf_map_delete_elem(&ip_blacklist_map, &iph->saddr);
        }
    }
    else
    {
        ...
        
        // They're still blocked. Drop the packet.
        return XDP_DROP;
    }
}

...

matched:
    if (action == 0)
    {
        // Before dropping, update the blacklist map.
        if (blocktime > 0)
        {
            __u64 newTime = now + (blocktime * 1000000000);

            if (iph6)
            {
                bpf_map_update_elem(&ip6_blacklist_map, &srcip6, &newTime, BPF_ANY);
            }
            else if (iph)
            {
                bpf_map_update_elem(&ip_blacklist_map, &iph->saddr, &newTime, BPF_ANY);
            }
        }

        ...

        return XDP_DROP;
    }

*blocked and newTime both hold the time at which the block is lifted, in nanoseconds.
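The expiry arithmetic can be isolated into two small functions. This is a userspace sketch with hypothetical demo_ names; it follows the comparisons in the excerpt above (an entry blocks while its value is non-zero and now has not yet passed it) and assumes a 64-bit blocktime so that multiplying by 10^9 cannot wrap for realistic values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Time (ns) at which a block starting now and lasting blocktime_s seconds ends.
 * The multiplication is done in 64 bits; in 32 bits it would already
 * overflow for a blocktime of 5 seconds. */
static uint64_t demo_block_until(uint64_t now_ns, uint64_t blocktime_s)
{
    assert(blocktime_s <= (UINT64_MAX - now_ns) / DEMO_NSEC_PER_SEC);
    return now_ns + blocktime_s * DEMO_NSEC_PER_SEC;
}

/* Mirrors the check in the blacklist stage: a stored value of 0 never blocks,
 * and the block is lifted once now is strictly greater than the stored time. */
static bool demo_still_blocked(uint64_t now_ns, uint64_t blocked_until_ns)
{
    return blocked_until_ns > 0 && now_ns <= blocked_until_ns;
}
```

Storing the end time rather than the start time means the per-packet check is a single comparison and needs no access to the rule that caused the block.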

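Recording pps/bps
ip_stats_map and ip6_stats_map record the pps (packets per second) and bps (bytes per second) of each source IP. pps and bps are reset to zero once more than a second has passed since the last reset, and then count the packets and bytes handled during the following second.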

// Update IP stats (PPS/BPS).
__u64 pps = 0;
__u64 bps = 0;

struct ip_stats *ip_stats = NULL;

if (iph6)
{
    ip_stats = bpf_map_lookup_elem(&ip6_stats_map, &srcip6);
}
else if (iph)
{
    ip_stats = bpf_map_lookup_elem(&ip_stats_map, &iph->saddr);
}

if (ip_stats)
{
    // Check for reset.
    if ((now - ip_stats->tracking) > 1000000000)
    {
        ip_stats->pps = 0;
        ip_stats->bps = 0;
        ip_stats->tracking = now;
    }

    // Increment PPS and BPS using built-in functions.
    __sync_fetch_and_add(&ip_stats->pps, 1);
    __sync_fetch_and_add(&ip_stats->bps, ctx->data_end - ctx->data);

    pps = ip_stats->pps;
    bps = ip_stats->bps;
}
else
{
    // Create new entry.
    struct ip_stats new;

    new.pps = 1;
    new.bps = ctx->data_end - ctx->data;
    new.tracking = now;

    pps = new.pps;
    bps = new.bps;

    if (iph6)
    {
        bpf_map_update_elem(&ip6_stats_map, &srcip6, &new, BPF_ANY);
    }
    else if (iph)
    {
        bpf_map_update_elem(&ip_stats_map, &iph->saddr, &new, BPF_ANY);
    }
}

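Note that the values stored in the map are updated with __sync_fetch_and_add(). The ip_stats entry lives in a map shared by every instance of the BPF program, and several CPUs may be running xdp_prog_main() at once, so atomic operations are needed to avoid a race condition on the counters.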
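The one-second window and the atomic increments can be sketched in userspace with C11 atomics standing in for __sync_fetch_and_add(). The demo_ names are hypothetical. Like the excerpt above, this sketch makes each increment atomic but does not make the reset atomic with respect to concurrent increments, so a packet counted on another CPU during a reset may be lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

struct demo_ip_stats {
    _Atomic uint64_t pps; /* packets seen in the current window */
    _Atomic uint64_t bps; /* bytes seen in the current window */
    uint64_t tracking;    /* start of the current window, in ns */
};

/* Account one packet of pkt_len bytes observed at time now_ns. */
static void demo_account(struct demo_ip_stats *s, uint64_t now_ns, uint64_t pkt_len)
{
    assert(s != NULL);

    /* Start a new window once more than one second has elapsed. */
    if (now_ns - s->tracking > DEMO_NSEC_PER_SEC) {
        atomic_store(&s->pps, 0);
        atomic_store(&s->bps, 0);
        s->tracking = now_ns;
    }

    /* Atomic read-modify-write: concurrent callers cannot lose an increment. */
    atomic_fetch_add(&s->pps, 1);
    atomic_fetch_add(&s->bps, pkt_len);
}
```

A plain s->pps++ would compile to separate load, add, and store steps, and two CPUs interleaving those steps could both write back the same value.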

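Packet filtering
The last stage simply compares the packet against the filtering rules one by one. On the first match it jumps to the follow-up steps with goto matched instead of checking the remaining rules.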

for (__u8 i = 0; i < MAX_FILTERS; i++)
{
    __u32 key = i;

    struct filter *filter = bpf_map_lookup_elem(&filters_map, &key);

    // Check if ID is above 0 (if 0, it's an invalid rule).
    if (!filter || filter->id < 1)
    {
        break;
    }

    // Check if the rule is enabled.
    if (!filter->enabled)
    {
        continue;
    }

    ...

    // Matched.
    #ifdef DEBUG
    bpf_printk("Matched rule ID #%d.\n", filter->id);
    #endif

    action = filter->action;
    blocktime = filter->blocktime;

    goto matched;
}

return XDP_PASS;

matched:
    if (action == 0)
    {
        #ifdef DEBUG
        //bpf_printk("Matched with protocol %d and sAddr %lu.\n", iph->protocol, iph->saddr);
        #endif

        ...

        if (stats)
        {
            stats->dropped++;
        }

        return XDP_DROP;
    }
    else
    {
        if (stats)
        {
            stats->allowed++;
        }
    }

    return XDP_PASS;

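To let xdp_prog_main() access the user-defined rules, the loader (xdpfw.c) calls updateconfig() after loading the program into the kernel to parse the rules into the cfg->filters array, and then calls updatefilters() to store every rule in cfg->filters into the kernel map, one at a time, with bpf_map_update_elem().

TODO: Find areas for improvement in XDP Firewall and work on them

Simplifying TCP flag matching

The following code in src/xdpfw_kern.c matches the TCP flags of a packet:

Here tcph is the packet's TCP header, and filter->tcpopts is the TCP-related part of the filtering rule.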

            // URG flag.
            if (filter->tcpopts.do_urg && filter->tcpopts.urg != tcph->urg)
            {
                continue;
            }

            // ACK flag.
            if (filter->tcpopts.do_ack && filter->tcpopts.ack != tcph->ack)
            {
                continue;
            }

            ...

            // CWR flag.
            if (filter->tcpopts.do_cwr && filter->tcpopts.cwr != tcph->cwr)
            {
                continue;
            }

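Each flag is checked against the rule with its own if statement.

Since each flag occupies only one bit, mapping the flags to different bits of an integer and using bitwise operations reduces the whole comparison to a single if.

<linux/tcp.h> defines a macro that reads all the flags of a struct tcphdr as one 32-bit integer, as well as a mask for each flag so that individual flags are easy to read.

tcp_flag_word(), TCP_FLAG_CWR, TCP_FLAG_ECE, and so on

The flag fields of struct tcpopts can therefore be replaced by two integers: enabled_flags is a mask that replaces the do_* fields, and expected_flags replaces the values of the individual flags.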

diff --git a/src/xdpfw.h b/src/xdpfw.h
index 4be467b..8290204 100644
--- a/src/xdpfw.h
+++ b/src/xdpfw.h
@@ -41,29 +41,8 @@ struct tcpopts
     __u16 dport;

     // TCP flags.
-    unsigned int do_urg : 1;
-    unsigned int urg : 1;
-
-    unsigned int do_ack : 1;
-    unsigned int ack : 1;
-
-    unsigned int do_rst : 1;
-    unsigned int rst : 1;
-
-    unsigned int do_psh : 1;
-    unsigned int psh : 1;
-
-    unsigned int do_syn : 1;
-    unsigned int syn : 1;
-
-    unsigned int do_fin : 1;
-    unsigned int fin : 1;
-
-    unsigned int do_ece : 1;
-    unsigned int ece : 1;
-
-    unsigned int do_cwr : 1;
-    unsigned int cwr : 1;
+    __u32 enabled_flags;
+    __u32 expected_flags;
 };

With this change, a single if statement matches all the TCP flags.

if ((tcp_flag_word(tcph) & filter->tcpopts.enabled_flags) !=
    filter->tcpopts.expected_flags)
{
    continue;
}
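The mask-and-compare idea can be tried out in userspace. The sketch below uses hypothetical host-order flag bits (numbered as in the flags byte of the TCP header) instead of the kernel's TCP_FLAG_* constants, which are big-endian values meant to be compared with tcp_flag_word(); a real rule would have to be stored in that same byte order. The helper names are also assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-order flag bits, as in the flags byte of the TCP header. */
#define DEMO_TCP_FIN 0x01u
#define DEMO_TCP_SYN 0x02u
#define DEMO_TCP_RST 0x04u
#define DEMO_TCP_PSH 0x08u
#define DEMO_TCP_ACK 0x10u
#define DEMO_TCP_URG 0x20u
#define DEMO_TCP_ECE 0x40u
#define DEMO_TCP_CWR 0x80u

struct demo_tcpopts {
    uint32_t enabled_flags;  /* which flags the rule cares about (old do_*) */
    uint32_t expected_flags; /* required value of each enabled flag */
};

/* Equivalent of setting do_<flag> = 1 and <flag> = value in the old layout. */
static void demo_rule_set_flag(struct demo_tcpopts *opts, uint32_t flag, bool value)
{
    opts->enabled_flags |= flag;
    if (value)
        opts->expected_flags |= flag;
    else
        opts->expected_flags &= ~flag;
}

/* One comparison replaces the eight per-flag if statements. */
static bool demo_flags_match(const struct demo_tcpopts *opts, uint32_t pkt_flags)
{
    /* Invariant: a bit set in expected_flags but not in enabled_flags
     * would make the rule impossible to match. */
    assert((opts->expected_flags & ~opts->enabled_flags) == 0);
    return (pkt_flags & opts->enabled_flags) == opts->expected_flags;
}
```

For example, a rule built with demo_rule_set_flag(&o, DEMO_TCP_SYN, true) and demo_rule_set_flag(&o, DEMO_TCP_ACK, false) matches a bare SYN regardless of the other flags but rejects SYN+ACK. The invariant in demo_flags_match() is the main thing the config parser has to guarantee when it fills the two fields.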

TODO: Submit a pull request to the upstream project

Experiments

Experiment topology

We use iperf3 to measure the firewall's impact on packet throughput.

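In the experiments 192.168.2.200 is the server and 192.168.1.100 is the client, as the iperf3 output below shows.
The server listens on port 5201, and the client sends a large volume of packets to measure throughput.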

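Kernel version: the middle host runs 5.4.0-152-generic
XDP-Firewall: based on @b54c466

Because the kernel is too old, the latest XDP-Firewall does not build, so an older XDP-Firewall version is used for now.
libbpf: based on @7fc4d50

Rule matching has the following properties:

As soon as some value in the packet fails to match a rule, that rule is skipped and the next rule is checked.
Fields that a rule does not specify still cost one if in the firewall to decide whether they need to be compared.

To make the firewall spend as much time as possible on matching, the experiments therefore only match on ICMP options (the ICMP comparison is the last one in the loop).

The rule set consists of 90 copies of the following filter: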

{
    enabled = true,
    action = 1,
    icmp_enabled = true,
    icmp_type = 18
}

Firewall disabled

First we measure the baseline with the firewall disabled.

Connecting to host 192.168.2.200, port 5201
[  4] local 192.168.1.100 port 51468 connected to 192.168.2.200 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.06 GBytes  1111979 KBytes/sec   43   1.13 MBytes
[  4]   1.00-2.00   sec  1.10 GBytes  1149260 KBytes/sec    0   1.13 MBytes
[  4]   2.00-3.00   sec  1.10 GBytes  1149279 KBytes/sec    0   1.13 MBytes
[  4]   3.00-4.00   sec  1.10 GBytes  1149330 KBytes/sec    0   1.13 MBytes
[  4]   4.00-5.00   sec  1.10 GBytes  1149215 KBytes/sec    0   1.13 MBytes
[  4]   5.00-6.00   sec  1.10 GBytes  1149162 KBytes/sec    0   1.13 MBytes
[  4]   6.00-7.00   sec  1.10 GBytes  1149248 KBytes/sec    0   1.13 MBytes
[  4]   7.00-8.00   sec  1.10 GBytes  1149219 KBytes/sec    0   1.13 MBytes
[  4]   8.00-9.00   sec  1.10 GBytes  1149207 KBytes/sec    0   1.13 MBytes
[  4]   9.00-10.00  sec  1.10 GBytes  1149323 KBytes/sec    0   1.73 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.9 GBytes  1145522 KBytes/sec   43             sender
[  4]   0.00-10.00  sec  10.9 GBytes  1145187 KBytes/sec                  receiver

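Firewall enabled @b54c466

The data below shows that enabling the firewall lowers the bandwidth by about 16.5 MBytes/sec (sender: 1145522 to 1128657 KBytes/sec).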

Connecting to host 192.168.2.200, port 5201
[  4] local 192.168.1.100 port 34710 connected to 192.168.2.200 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.06 GBytes  1116387 KBytes/sec    0   1.64 MBytes
[  4]   1.00-2.00   sec  1.08 GBytes  1128982 KBytes/sec    0   1.64 MBytes
[  4]   2.00-3.00   sec  1.09 GBytes  1138609 KBytes/sec    0   1.73 MBytes
[  4]   3.00-4.00   sec  1.08 GBytes  1127468 KBytes/sec    0   1.81 MBytes
[  4]   4.00-5.00   sec  1.05 GBytes  1105900 KBytes/sec   29   1.81 MBytes
[  4]   5.00-6.00   sec  1.08 GBytes  1130410 KBytes/sec    0   1.81 MBytes
[  4]   6.00-7.00   sec  1.09 GBytes  1139034 KBytes/sec    0   1.81 MBytes
[  4]   7.00-8.00   sec  1.08 GBytes  1136199 KBytes/sec    0   1.81 MBytes
[  4]   8.00-9.00   sec  1.08 GBytes  1137253 KBytes/sec    0   1.81 MBytes
[  4]   9.00-10.00  sec  1.07 GBytes  1126335 KBytes/sec    0   1.81 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.8 GBytes  1128657 KBytes/sec   29             sender
[  4]   0.00-10.00  sec  10.8 GBytes  1128394 KBytes/sec                  receiver

Improved TCP flag matching @2a0125c

After simplifying the TCP flag matching, the bandwidth rises by about 1.4 MBytes/sec (sender: 1128657 to 1130091 KBytes/sec).

Connecting to host 192.168.2.200, port 5201
[  4] local 192.168.1.100 port 55482 connected to 192.168.2.200 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.07 GBytes  1117284 KBytes/sec    0   1.75 MBytes
[  4]   1.00-2.00   sec  1.09 GBytes  1146851 KBytes/sec    0   1.75 MBytes
[  4]   2.00-3.00   sec  1.08 GBytes  1128799 KBytes/sec   26   1.75 MBytes
[  4]   3.00-4.00   sec  1.06 GBytes  1110556 KBytes/sec    0   1.75 MBytes
[  4]   4.00-5.00   sec  1.09 GBytes  1148022 KBytes/sec    0   1.92 MBytes
[  4]   5.00-6.00   sec  1.08 GBytes  1130176 KBytes/sec    0   1.94 MBytes
[  4]   6.00-7.00   sec  1.08 GBytes  1129429 KBytes/sec    0   1.94 MBytes
[  4]   7.00-8.00   sec  1.09 GBytes  1138621 KBytes/sec    0   1.94 MBytes
[  4]   8.00-9.00   sec  1.07 GBytes  1122506 KBytes/sec    0   1.94 MBytes
[  4]   9.00-10.00  sec  1.08 GBytes  1128666 KBytes/sec    0   1.94 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.8 GBytes  1130091 KBytes/sec   26             sender
[  4]   0.00-10.00  sec  10.8 GBytes  1129758 KBytes/sec                  receiver
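
For reference, the two differences quoted above can be reproduced from the sender summary lines of the three runs, assuming 1 MByte = 1024 KBytes as in iperf3's units:

```latex
\begin{align*}
\Delta_{\text{firewall}} &= 1145522 - 1128657 = 16865\ \text{KBytes/sec}
  \approx 16.5\ \text{MBytes/sec} \quad (\approx 1.5\%\ \text{of the baseline}) \\
\Delta_{\text{flags}} &= 1130091 - 1128657 = 1434\ \text{KBytes/sec}
  \approx 1.4\ \text{MBytes/sec} \quad (\approx 0.13\%)
\end{align*}
```

Each figure comes from a single 10-second run. Within the firewall-enabled run alone, the per-second readings range from 1105900 to 1139034 KBytes/sec, a spread far larger than 1434 KBytes/sec, so the second difference may well be within run-to-run noise; repeated runs would be needed to confirm it.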