An introduction to data-plane processing in Linux and the implementation details of GTP5G

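# Linux Data-Plane Processing and GTP5G Implementation Details

## Linux Data-Plane Processing

This section draws on and consolidates the following material:

1. Network Performance in the Linux Kernel by Maxime Chevallier [[1]](https://bootlin.com/pub/conferences/2021/fosdem/chevallier-network-performance-in-the-linux-kernel/chevallier-network-performance-in-the-linux-kernel.pdf): a very approachable slide deck that explains in just a few pages how the Linux kernel processes packets, and also points out several performance bottlenecks.
2. The Performance Tuning Guide published by Red Hat [[2]](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html-single/performance_tuning_guide/index#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Networking): the content is very practical, but it focuses on how to tune rather than why, which is its main shortcoming.
3. Linux Network Scaling: Receiving Packets [[4]](https://garycplin.blogspot.com/2017/06/linux-network-scaling-receives-packets.html), a technical blog post by a SUSE engineer.

### Packet Reception

![Source](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*NGn5oOoyKczLmMNy2woIGg.png)

The MAC receives the data and uses DMA to write it into main memory; the memory address of that data is recorded in the receive queue. The NIC then raises an IRQ so that the CPU knows a packet has arrived at the host:

![Source](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*5rZJf2yaF7QoU-S9Hrk7lA.png)

When the CPU receives the IRQ, it runs the corresponding IRQ handler provided by the operating system. The IRQ handler acknowledges the packet, and the packet is eventually processed in softirq context (dropped, forwarded, or handed to an upper layer):

![Source](https://miro.medium.com/v2/resize:fit:1224/format:webp/0*tyq2kv0uMjhEtUgP.png)

Meanwhile, the NIC can keep receiving other packets (in parallel with the IRQ handler).

### NAPI

* For packet reception, the CPU is mainly responsible for processing at L3 and above.
* The interrupt handler does only very basic work and is responsible for disabling and re-enabling interrupts.
* To cope with heavy traffic interrupting the CPU too frequently, the Linux kernel introduced NAPI starting with version 2.6. The kernel switches to polling so that as many packets as possible are pulled into the network stack per interrupt.
* NAPI stops dequeuing when the queue has no more packets to process or when the CPU budget is exhausted.
* Once the queue has been drained, NAPI re-enables interrupts. If the budget ran out first, polling is rescheduled instead.

### Network Stack

![source: ref[3]](https://miro.medium.com/v2/resize:fit:774/format:webp/0*fxmVZJDHNetEBULb.png)

After the NIC driver has processed a packet, the packet first reaches the PCAP hook (this is where tcpdump captures packets on the network device), and only then enters Traffic Control and the Netfilter hooks. However, even though the kernel has heavily optimized the data path, high-throughput scenarios still require some additional tuning.

Traffic Steering
----------------

Modern processors are usually multi-core, and modern NICs usually have multiple tx/rx queues. Without specific tuning, most packets end up being handled by a single core, and when a very large volume of packets floods into the host, one core alone cannot keep up with the interrupts.

![Source](https://miro.medium.com/v2/resize:fit:756/format:webp/1*QvuiR8fJYJeHVD5cZd4SCQ.png)

So, is there a way to have packets processed by as many CPU cores as possible? The answer is yes, but some constraints currently apply:

* Packets cannot be handed to CPU cores at random.
* The processing order has to be determined in advance.
* Memory locality has to be considered (NUMA nodes, L1/L2 caches).
* Packets need to be distributed per flow.

![Source](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*pZfDF0zgBdg-zG4nWotbMw.png)

* At L3 (the IP layer) alone, a flow is identified by a 2-tuple.
* At L4, a flow is identified by a 5-tuple.
* The kernel can hash the 2-tuple or 5-tuple and use the hash to pick a CPU ID (that is, which CPU handles the packets of this flow).

These are still only rough concepts. Next, we look concretely at the traffic steering methods the kernel currently supports.

### RPS : Receive Packet Steering

![Source](https://miro.medium.com/v2/resize:fit:596/format:webp/1*6f95oIHJoMOstQGaOAZELQ.png)

* Proposed by Google.
* Requires no NIC support, but the kernel must be built with `CONFIG_RPS`.
* The idea of RPS is that one core handles the interrupt, and that interrupted core then distributes the packets to other cores. The distribution can be based on the 2-tuple and 5-tuple hashes mentioned above.

```
$ echo 0x03 > /sys/class/net/eth0/queues/rx-0/rps_cpus
$ echo 0x0c > /sys/class/net/eth0/queues/rx-1/rps_cpus
```

The example above:

* distributes packets from rx-0 across CPU 0 and CPU 1 (0x03 is binary 11, which is the CPU mask notation).
* distributes packets from rx-1 across CPU 2 and CPU 3 (0x0c is binary 1100).

### RSS : Receive Side Scaling

![Source](https://miro.medium.com/v2/resize:fit:1280/format:webp/0*uM3ZFEApf04w-khq.png)

RSS is a technique that spreads packet processing across multiple RX/TX queues. When a NIC with RSS enabled receives a packet, it classifies the packet into one of several RX queues, usually by means of a hash function. The official Linux kernel documentation [[5]](https://www.kernel.org/doc/Documentation/networking/scaling.txt) also notes:

> The filter used in RSS is typically a hash function over the network and/or transport layer headers-- for example, a 4-tuple hash over IP addresses and TCP ports of a packet. The most common hardware implementation of RSS uses a 128-entry indirection table where each entry stores a queue number. The receive queue for a packet is determined by masking out the low order seven bits of the computed hash for the packet (usually a Toeplitz hash), taking this number as a key into the indirection table and reading the corresponding value.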
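
The lookup described in the quote can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from any driver or from the kernel: `RSS_INDIR_SIZE`, `rss_fill_indir`, and `rss_select_queue` are hypothetical names, the packet hash is assumed to have been computed already (for example by a Toeplitz hash in hardware), and filling the table in contiguous weighted blocks is just one simple way to honor per-queue weights.

```c
#include <stddef.h>
#include <stdint.h>

/* Table size taken from the quoted kernel text (128 entries, 7 index bits). */
#define RSS_INDIR_SIZE 128u

/*
 * Hypothetical helper: fill the indirection table so that queue q owns
 * roughly weights[q] / sum(weights) of the entries, in contiguous blocks.
 * Returns 0 on success, -1 if the input cannot be used.
 */
int rss_fill_indir(uint8_t table[RSS_INDIR_SIZE],
                   const unsigned *weights, size_t nqueues)
{
    unsigned total = 0;
    unsigned acc = 0;
    size_t idx = 0;

    for (size_t q = 0; q < nqueues; q++)
        total += weights[q];
    if (nqueues == 0 || nqueues > 256 || total == 0)
        return -1;

    for (size_t q = 0; q < nqueues; q++) {
        acc += weights[q];
        /* End of this queue's block, proportional to the cumulative weight. */
        size_t end = (size_t)RSS_INDIR_SIZE * acc / total;
        while (idx < end)
            table[idx++] = (uint8_t)q;
    }
    return 0;
}

/*
 * Hypothetical helper: keep the low-order seven bits of the hash and use
 * them as the key into the indirection table, as the quote describes.
 */
unsigned rss_select_queue(const uint8_t table[RSS_INDIR_SIZE], uint32_t hash)
{
    return table[hash & (RSS_INDIR_SIZE - 1u)];
}
```

With weights `{4, 2, 1, 1}`, this fill gives queues 0 to 3 a total of 64, 32, 16, and 16 entries respectively, so about half of all hash values land on queue 0. Because the queue depends only on the hash, every packet of a given flow reaches the same RX queue.
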
![Source](https://miro.medium.com/v2/resize:fit:874/format:webp/1*Kl2WXSx_5fwD_sME2v06hA.png)

In the figure above, rxq 0 has a weight of 4, rxq 1 has a weight of 2, and rxq 2 and rxq 3 each have a weight of 1. Next, let's look at an example of enabling and configuring RSS with ethtool:

```
$ ethtool -K eth0 rx-hashing on           # enable RSS
$ ethtool -X eth0 equal 3                 # spread packets received on eth0 across the first three RX queues
$ ethtool -X eth0 hkey <magic hash key>   # set the magic hash key
$ ethtool -X eth0 weight 1 2 2 1          # set the indirection table
$ ethtool -N eth0 rx-flow-hash tcp4 sdfn  # set the hashed fields
```

For latency-sensitive use cases, CPU affinity matters as much as the filters configured with ethtool:

```
$ cat /proc/interrupts                                    # find the IRQ number of the rx queue
$ echo <CPU_MASK> > /proc/irq/<IRQ_NUMBER>/smp_affinity   # set the CPU affinity
```

* `<CPU_MASK>` is written in hexadecimal.
* Some systems run irqbalance; it has to be stopped before setting the CPU affinity by hand.
* Hyperthreading brings no benefit for interrupt handling [[5]](https://github.com/torvalds/linux/blob/v4.11/Documentation/networking/scaling.txt#L92), so the number of rx queues should be equal to or smaller than the number of physical cores.

### RFS : Receive Flow Steering

Although RPS can distribute packets per flow, it does not take the userspace application into account. The application using the socket may therefore run on a different CPU from the one processing its packets, which prevents the cache from being used effectively and hurts performance.

RFS tracks packet flows and flow consumers so that kernel packet processing runs, as far as possible, on the same CPU as the application consuming the packets:

![Source](https://miro.medium.com/v2/resize:fit:874/format:webp/1*8ToD1wBcOw3AD0Fu_6OUNQ.png)

The following example shows how to configure RFS:

```
$ echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
$ echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
$ echo 4096 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
# ...
```

* Assume there are 8 CPUs.
* The 8 CPUs map to 8 RX queues.
* `rps_sock_flow_entries` is the sum of `rps_flow_cnt` over all queues (8 * 4096 = 32768).

The corresponding kernel structure is `rps_sock_flow_table`:

```
/*
 * The rps_sock_flow_table contains mappings of flows to the last CPU
 * on which they were processed by the application (set in recvmsg).
 * Each entry is a 32bit value. Upper part is the high-order bits
 * of flow hash, lower part is CPU number.
 * rps_cpu_mask is used to partition the space, depending on number of
 * possible CPUs : rps_cpu_mask = roundup_pow_of_two(nr_cpu_ids) - 1
 * For example, if 64 CPUs are possible, rps_cpu_mask = 0x3f,
 * meaning we use 32-6=26 bits for the hash.
 */
struct rps_sock_flow_table {
    u32 mask;

    u32 ents[] ____cacheline_aligned_in_smp;
};
```

* Each entry is a 32-bit value: the upper bits store the flow hash (flow id) and the lower bits store the CPU number.
* `rps_cpu_mask` partitions the entry and depends on the number of CPUs. With 64 CPUs the mask is 0x3f (six 1 bits in binary), so the remaining 32 - 6 = 26 bits hold the flow hash.

```
/*
 * get_rps_cpu is called from netif_receive_skb and returns the target
 * CPU from the RPS map of the receiving queue for a given skb.
 * rcu_read_lock must be held on entry.
 */
static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
                       struct rps_dev_flow **rflowp)
{
    // ...

    /* Avoid computing hash if RFS/RPS is not active for this rxqueue */

    flow_table = rcu_dereference(rxqueue->rps_flow_table);
    map = rcu_dereference(rxqueue->rps_map);
    if (!flow_table && !map)
        goto done;

    skb_reset_network_header(skb);
    hash = skb_get_hash(skb);
    if (!hash)
        goto done;

    sock_flow_table = rcu_dereference(net_hotdata.rps_sock_flow_table);
    if (flow_table && sock_flow_table) {
        struct rps_dev_flow *rflow;
        u32 next_cpu;
        u32 ident;

        /* First check into global flow table if there is a match.
         * This READ_ONCE() pairs with WRITE_ONCE() from rps_record_sock_flow().
         */
        ident = READ_ONCE(sock_flow_table->ents[hash & sock_flow_table->mask]);
        if ((ident ^ hash) & ~net_hotdata.rps_cpu_mask)
            goto try_rps;
```

* First, get `flow_table`, the rps_flow_table that belongs to the rxqueue.
* Get the hash of the socket buffer.
* Get the global table, `sock_flow_table`.
* AND the socket buffer hash with `sock_flow_table->mask` to obtain the index of the entry, and read that entry into `ident`.
* When the hash bits stored in the entry match the upper bits of the socket buffer hash, the upper bits of `ident ^ hash` are all 0. The lower bits mix the recorded CPU id with the low hash bits and carry no meaning for the comparison.
* ANDing `ident ^ hash` with the inverted mask (`~rps_cpu_mask`) keeps only those upper bits. A result of 0 means the entry belongs to this flow, so RFS can be used.
* Conversely, a non-zero result means the information recorded in the table does not match the packet, and the code falls back to RPS.

```
        next_cpu = ident & net_hotdata.rps_cpu_mask;

        /* OK, now we know there is a match,
         * we can look at the local (per receive queue) flow table
         */
        rflow = &flow_table->flows[hash & flow_table->mask];
        tcpu = rflow->cpu;

        /*
         * If the desired CPU (where last recvmsg was done) is
         * different from current CPU (one in the rx-queue flow
         * table entry), switch if one of the following holds:
         *   - Current CPU is unset (>= nr_cpu_ids).
         *   - Current CPU is offline.
         *   - The current CPU's queue tail has advanced beyond the
         *     last packet that was enqueued using this table entry.
         *     This guarantees that all previous packets for the flow
         *     have been dequeued, thus preserving in order delivery.
         */
        if (unlikely(tcpu != next_cpu) &&
            (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
             ((int)(READ_ONCE(per_cpu(softnet_data, tcpu).input_queue_head) -
              rflow->last_qtail)) >= 0)) {
            tcpu = next_cpu;
            rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
        }

        if (tcpu < nr_cpu_ids && cpu_online(tcpu)) {
            *rflowp = rflow;
            cpu = tcpu;
            goto done;
        }
    }
    // ...
```

* Get the desired CPU, `next_cpu`.
* Get the current CPU, `tcpu`.
* If the conditions listed in the comment above hold, switch the current CPU and update rflow.

### aRFS: Accelerated Receive Flow Steering

On some NICs, the hardware itself can steer packets to rxqs. aRFS asks the driver to program steering rules for each flow.

* Requires `CONFIG_RFS_ACCEL`.
* `$ ethtool -K eth0 ntuple on` enables n-tuple filtering offload.
* Both the NIC and the driver must support aRFS.

![Source](https://miro.medium.com/v2/resize:fit:1280/format:webp/0*8L1-z6zTwgzxAbZ4.png)

Flow steering can be done with either ethtool or tc:

```
# ethtool example
$ ethtool -K eth0 ntuple on                                     # enable n-tuple filtering offload
$ ethtool -N eth0 flow-type udp4 dst-port 1234 action 2 loc 0   # steer UDP traffic for port 1234 to rxq 2
$ ethtool -N eth0 flow-type udp4 action -1 loc 1                # drop all UDP traffic other than port 1234

# tc example
$ ethtool -K eth0 hw-tc-offload on
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 protocol ip parent ffff: flower ip_proto tcp \
    dst_port 80 action drop                                     # drop all TCP traffic to port 80
```

### XPS: Transmit Packet Steering

* Cache misses can still occur on the transmit path, even when all sending happens on a single CPU.
* XPS is a mechanism for selecting which tx queue is used to send a packet.
* Each tx queue can be assigned to specific CPUs.
* Make sure tx and rx happen on the same CPU.

![Source](https://miro.medium.com/v2/resize:fit:818/format:webp/1*X8ky8aAbb6H-DgiIHWFeCQ.png)

## GTP5G Implementation

At its core, GTP5G is a kernel module that implements a virtual network device. Let's look at the built-in kernel mechanisms it relies on to understand how GTP5G works.

1\.
udp_tunnel

`modinfo` shows that GTP5G depends on the `udp_tunnel` module:

```shell
ian@ian:~$ modinfo --field=depends gtp5g
udp_tunnel
```

The GTP5G source contains the following function:

```c
struct sock *gtp5g_encap_enable(int fd, int type, struct gtp5g_dev *gtp){
    struct udp_tunnel_sock_cfg tuncfg = {NULL};
    struct socket *sock;
    struct sock *sk;
    int err;

    GTP5G_LOG(NULL, "enable gtp5g for the fd(%d) type(%d)\n", fd, type);

    /* Resolve the file descriptor passed from user space to a kernel socket */
    sock = sockfd_lookup(fd, &err);
    if (!sock) {
        GTP5G_ERR(NULL, "Failed to find the socket fd(%d)\n", fd);
        return NULL;
    }

    /* Only UDP sockets can carry GTP-U */
    if (sock->sk->sk_protocol != IPPROTO_UDP) {
        GTP5G_ERR(NULL, "socket fd(%d) is not a UDP\n", fd);
        sk = ERR_PTR(-EINVAL);
        goto out_sock;
    }

    lock_sock(sock->sk);
    /* sk_user_data already set: the socket is bound to another tunnel */
    if (sock->sk->sk_user_data) {
        GTP5G_ERR(NULL, "Failed to set sk_user_datat of socket fd(%d)\n", fd);
        sk = ERR_PTR(-EBUSY);
        goto out_sock;
    }

    sk = sock->sk;
    sock_hold(sk);

    /* Attach the gtp5g device and its receive/destroy callbacks to the socket */
    tuncfg.sk_user_data = gtp;
    tuncfg.encap_type = type;
    tuncfg.encap_rcv = gtp5g_encap_recv;
    tuncfg.encap_destroy = gtp5g_encap_disable_locked;

    setup_udp_tunnel_sock(sock_net(sock->sk), sock, &tuncfg);

out_sock:
    release_sock(sock->sk);
    sockfd_put(sock);
    return sk;
}
```

This function sets up the UDP encapsulation association: it binds a UDP socket (fd) to a GTP5G device so that the kernel can receive GTP-U packets through that socket.

![image](https://hackmd.io/_uploads/S1lRXI0y-e.png)
*How udp_tunnel works. Image source: https://blog.csdn.net/chenmo187J3X1/article/details/101794791*

![image](https://hackmd.io/_uploads/BJgdvC01Wg.png)
*UDP encapsulation. Image source: https://linuxgeeks.github.io/2017/12/26/210127-%E5%9F%BA%E4%BA%8EUDP%E5%8D%8F%E8%AE%AE%E6%90%AD%E5%BB%BA%E9%9A%A7%E9%81%93/*

2\.
rtnetlink

:::info
Recommended reading: https://hackmd.io/@zoo868e/LKN-Netlink
:::

rtnetlink is a netlink-based protocol in Linux used to exchange network routing information between user space and the kernel. It lets users read or modify the kernel's routing tables, IP addresses, network interfaces, neighbor settings, and other network configuration, so that the network can be managed dynamically.

The GTP5G module registers its own rtnetlink operations, so that a client can create a link device (gtp5g) over the rtnetlink protocol:

```c
struct rtnl_link_ops gtp5g_link_ops __read_mostly = {
    .kind         = "gtp5g",
    .maxtype      = IFLA_GTP5G_MAX,
    .policy       = gtp5g_policy,
    .priv_size    = sizeof(struct gtp5g_dev),
    .setup        = gtp5g_link_setup,
    .validate     = gtp5g_validate,
    .newlink      = gtp5g_newlink,
    .dellink      = gtp5g_dellink,
    .get_size     = gtp5g_get_size,
    .fill_info    = gtp5g_fill_info,
};
```

3\. generic netlink

During initialization, GTP5G calls `genl_register_family(&gtp5g_genl_family);` to register the `gtp5g_genl_ops` referenced by `gtp5g_genl_family` with `genl`:

```c
struct genl_family gtp5g_genl_family __ro_after_init = {
    .name       = "gtp5g",
    .version    = 0,
    .hdrsize    = 0,
    .maxattr    = GTP5G_ATTR_MAX,
    .netnsok    = true,
    .module     = THIS_MODULE,
    .ops        = gtp5g_genl_ops,
    .n_ops      = ARRAY_SIZE(gtp5g_genl_ops),
    .mcgrps     = gtp5g_genl_mcgrps,
    .n_mcgrps   = ARRAY_SIZE(gtp5g_genl_mcgrps),
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 0, 0)
    .resv_start_op = GTP5G_ATTR_MAX,
#endif
};

static const struct genl_ops gtp5g_genl_ops[] = {
    {
        .cmd = GTP5G_CMD_ADD_PDR,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_add_pdr,
        // .policy = gtp5g_genl_pdr_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_DEL_PDR,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_del_pdr,
        // .policy = gtp5g_genl_pdr_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_PDR,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_get_pdr,
        .dumpit = gtp5g_genl_dump_pdr,
        // .policy = gtp5g_genl_pdr_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_ADD_FAR,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_add_far,
        // .policy = gtp5g_genl_far_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_DEL_FAR,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_del_far,
        // .policy = gtp5g_genl_far_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_FAR,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_get_far,
        .dumpit = gtp5g_genl_dump_far,
        // .policy = gtp5g_genl_far_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_ADD_QER,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_add_qer,
        // .policy = gtp5g_genl_qer_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_DEL_QER,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_del_qer,
        // .policy = gtp5g_genl_qer_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_QER,
        // .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
        .doit = gtp5g_genl_get_qer,
        .dumpit = gtp5g_genl_dump_qer,
        // .policy = gtp5g_genl_qer_policy,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_ADD_URR,
        .doit = gtp5g_genl_add_urr,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_DEL_URR,
        .doit = gtp5g_genl_del_urr,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_URR,
        .doit = gtp5g_genl_get_urr,
        .dumpit = gtp5g_genl_dump_urr,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_ADD_BAR,
        .doit = gtp5g_genl_add_bar,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_DEL_BAR,
        .doit = gtp5g_genl_del_bar,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_BAR,
        .doit = gtp5g_genl_get_bar,
        .dumpit = gtp5g_genl_dump_bar,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_VERSION,
        .doit = gtp5g_genl_get_version,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_REPORT,
        .doit = gtp5g_genl_get_usage_report,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_MULTI_REPORTS,
        .doit = gtp5g_genl_get_multi_usage_reports,
        .flags = GENL_ADMIN_PERM,
    },
    {
        .cmd = GTP5G_CMD_GET_USAGE_STATISTIC,
        .doit = gtp5g_genl_get_usage_statistic,
        .flags = GENL_ADMIN_PERM,
    },
};
```

The `gtp5g_genl_ops` array above lists every command that is registered with genl. Let's look at one of them, `gtp5g_genl_add_pdr`:

```c
int gtp5g_genl_add_pdr(struct sk_buff *skb, struct genl_info *info)
{
    struct gtp5g_dev *gtp;
    struct pdr *pdr;
    int ifindex;
    int netnsfd;
    u64 seid = 0;
    u16 pdr_id;
    int err;

    /* Optional attributes: fall back to -1 when the client omits them */
    if (info->attrs[GTP5G_LINK]) {
        ifindex = nla_get_u32(info->attrs[GTP5G_LINK]);
    } else {
        ifindex = -1;
    }

    if (info->attrs[GTP5G_NET_NS_FD]) {
        netnsfd = nla_get_u32(info->attrs[GTP5G_NET_NS_FD]);
    } else {
        netnsfd = -1;
    }

    rtnl_lock();
    rcu_read_lock();
    // ...
```

4\. net device

`gtp5g_newlink` is GTP5G's implementation of the `.newlink` function pointer in `rtnl_link_ops`:

```c
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6,15,0)
static int gtp5g_newlink(struct net_device *dev,
    struct rtnl_newlink_params *params,
    struct netlink_ext_ack *extack)
#else
static int gtp5g_newlink(struct net *src_net, struct net_device *dev,
    struct nlattr *tb[], struct nlattr *data[],
    struct netlink_ext_ack *extack)
#endif
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6,15,0)
    struct nlattr **data;
    data = params->data;
#endif
    struct gtp5g_dev *gtp;
    struct gtp5g_net *gn;
    struct sock *sk;
    unsigned int role = GTP5G_ROLE_UPF;
    u32 fd1;
    int hashsize, err;

    gtp = netdev_priv(dev);

    if (!data[IFLA_GTP5G_FD1]) {
        GTP5G_ERR(NULL, "Failed to create a new link\n");
        return -EINVAL;
    }
    fd1 = nla_get_u32(data[IFLA_GTP5G_FD1]);
    /* Bind the UDP socket to this device (see gtp5g_encap_enable) */
    sk = gtp5g_encap_enable(fd1, UDP_ENCAP_GTP1U, gtp);
    if (IS_ERR(sk))
        return PTR_ERR(sk);
    gtp->sk1u = sk;

    if (data[IFLA_GTP5G_ROLE]) {
        role = nla_get_u32(data[IFLA_GTP5G_ROLE]);
        if (role > GTP5G_ROLE_RAN) {
            if (sk)
                gtp5g_encap_disable(sk);
            return -EINVAL;
        }
    }
    gtp->role = role;

    /* Default to 1024 buckets when no hash size is given */
    if (!data[IFLA_GTP5G_PDR_HASHSIZE])
        hashsize = 1024;
    else
        hashsize = nla_get_u32(data[IFLA_GTP5G_PDR_HASHSIZE]);

    err = dev_hashtable_new(gtp, hashsize);
    if (err < 0) {
        gtp5g_encap_disable(gtp->sk1u);
        GTP5G_ERR(dev, "Failed to create a hash table\n");
        goto out_encap;
    }

    err = register_netdevice(dev);
    if (err < 0) {
        netdev_dbg(dev, "failed to register new netdev %d\n", err);
        gtp5g_hashtable_free(gtp);
        gtp5g_encap_disable(gtp->sk1u);
        goto out_hashtable;
    }

    gn = net_generic(dev_net(dev), GTP5G_NET_ID());
    list_add_rcu(&gtp->list, &gn->gtp5g_dev_list);
    list_add_rcu(&gtp->proc_list, get_proc_gtp5g_dev_list_head());

    GTP5G_LOG(dev, "Registered a new 5G GTP interface\n");

    return 0;
out_hashtable:
    gtp5g_hashtable_free(gtp);
out_encap:
    gtp5g_encap_disable(gtp->sk1u);
    return err;
}
```

The `gtp5g_dev` referenced there is the virtual device type implemented by the GTP5G module. Its net device operations are defined as follows:

```c
const struct net_device_ops gtp5g_netdev_ops = {
    .ndo_init           = gtp5g_dev_init,
    .ndo_uninit         = gtp5g_dev_uninit,
    .ndo_start_xmit     = gtp5g_dev_xmit,
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 11, 0)
    .ndo_get_stats64    = dev_get_tstats64,
#else
    .ndo_get_stats64    = ip_tunnel_get_stats64,
#endif
};
```

5\. RCU (Read Copy Update)

- https://hackmd.io/@sysprog/linux-rcu

References
----------

1. [https://bootlin.com/pub/conferences/2021/fosdem/chevallier-network-performance-in-the-linux-kernel/chevallier-network-performance-in-the-linux-kernel.pdf](https://bootlin.com/pub/conferences/2021/fosdem/chevallier-network-performance-in-the-linux-kernel/chevallier-network-performance-in-the-linux-kernel.pdf)
2. [https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html-single/performance_tuning_guide/index#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Networking](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html-single/performance_tuning_guide/index#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Networking)
3. [https://techblog.criteo.com/demystification-of-tc-de3dfe4067c2](https://techblog.criteo.com/demystification-of-tc-de3dfe4067c2)
4. [https://garycplin.blogspot.com/2017/06/linux-network-scaling-receives-packets.html](https://garycplin.blogspot.com/2017/06/linux-network-scaling-receives-packets.html)
5. [https://www.kernel.org/doc/Documentation/networking/scaling.txt](https://www.kernel.org/doc/Documentation/networking/scaling.txt)
6. [coder-kung-fu](https://github.com/yanfeizhang/coder-kung-fu?tab=readme-ov-file)