
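# Netlink Sockets

Chapter 1 discussed the role of the Linux kernel networking subsystem and the three layers in which it operates. The netlink socket interface first appeared in the 2.2 Linux kernel as the AF_NETLINK socket. It was created as a more flexible alternative to the awkward IOCTL communication method. IOCTL handlers cannot send asynchronous messages from the kernel to userspace, whereas netlink sockets can. IOCTL also carries another layer of complexity: IOCTL numbers have to be defined. The netlink operation model is quite simple: a userspace program opens and registers a netlink socket with the socket API, and this socket handles bidirectional communication with a kernel netlink socket, usually to send messages that configure various system settings and to receive the kernel's responses.

This chapter describes the implementation and API of the netlink protocol and discusses its advantages and drawbacks. It also introduces the newer generic netlink protocol, discusses its implementation and advantages, and gives some examples that use the libnl libraries. Finally, it discusses the socket monitoring interface.

## The Netlink Family

The netlink protocol is an [Inter Process Communication(IPC)](https://en.wikipedia.org/wiki/Inter-process_communication) mechanism based on [Linux Netlink as an IP Services Protocol](https://datatracker.ietf.org/doc/html/rfc3549). It provides a bidirectional communication channel between userspace and the kernel, or between parts of the kernel itself. The protocol is implemented mainly in the following four files under `net/netlink`:

- `af_netlink.c`: provides the protocol API
- `af_netlink.h`: the accompanying internal header
- `genetlink.c`: provides an API that makes netlink messages easier to work with
- `diag.c`: provides an API for retrieving information about netlink sockets, mainly used to monitor them

Although netlink sockets can be used for communication between two userspace processes, this is uncommon; the usual IPC choice for that case is Unix domain sockets and their API.

The advantages of netlink are:

1. No polling is needed. To receive data, a program simply calls `recvmsg()` and blocks until data arrives. For example, `rtnl_listen()` in `lib/libnetlink.c` of the `iproute2` source code uses this API
2. The kernel can initiate asynchronous messages to userspace; userspace does not have to trigger anything first (for example by calling an `IOCTL` or writing to a `sysfs` entry)
3. One-to-many use cases are supported (multicast transmission)

### How to create netlink socket

Both in the kernel and in userspace, a netlink socket is ultimately created by `__netlink_create`.

- Userspace: `socket` system call with `SOCK_RAW` sockets or `SOCK_DGRAM` sockets
- Kernelspace: `netlink_kernel_create`

![image](https://hackmd.io/_uploads/ryjoOiPQC.png)

## Netlink Sockets Libraries

The author recommends using the `libnl` API when implementing userspace applications. The package is a collection of libraries that provide APIs to the Linux kernel interfaces built on the netlink protocol. Note that `iproute2`, mentioned above, ships its own helper library (`lib/libnetlink.c`) rather than linking against `libnl`. Besides the core library (`libnl`), the package includes support for the generic netlink family (`libnl-genl`), the routing family (`libnl-route`), and the netfilter family (`libnl-nf`).

## The sockaddr_nl Structure

```clike
/* include/uapi/linux/netlink.h */
struct sockaddr_nl {
    __kernel_sa_family_t nl_family; /* AF_NETLINK */
    unsigned short       nl_pad;    /* zero */
    __u32                nl_pid;    /* port ID */
    __u32                nl_groups; /* multicast groups mask */
};
```

- nl_family: always `AF_NETLINK`
- nl_pad: always 0
- nl_pid: the unicast address of the netlink socket
  - Userspace: sometimes the process ID, but it can also be set to 0 or left unset, in which case `netlink_autobind`, called during `bind`, fills it in. If two userspace processes communicate and `bind` is not called, the application itself must make sure each `nl_pid` is unique.
  - Kernel: always 0
- nl_groups: specifies the multicast groups to use

Netlink is not used only for networking. Other subsystems such as SELinux, `audit`, and `uevent` use it as well. rtnetlink sockets are the netlink sockets used specifically for networking: routing messages, neighbouring messages, link messages, and other networking subsystem messages.

## Userspace Packages for Controlling TCP/IP Networking

![image](https://hackmd.io/_uploads/ry2GG3PQA.png)

This section mainly covers the differences between `net-tools` and `iproute2` and the tools each package contains. `iproute2` mostly sends requests from userspace to the kernel over netlink sockets to obtain data; only a few special cases (`ip tuntap`) use IOCTL. `net-tools` communicates mainly through IOCTL. In addition, some `iproute2` features have no counterpart in `net-tools`.

- `iproute2`
  - `ip`: manages network tables and network interfaces
  - `tc`: manages traffic control
  - `ss`: prints socket statistics
  - `lnstat`: prints Linux network statistics
  - `bridge`: manages bridge addresses and devices
- `net-tools`
  - `ifconfig`
  - `arp`
  - `route`
  - `netstat`
  - `hostname`
  - `rarp`

ref: [net-tools vs iproute2](https://www.linkedin.com/pulse/net-tools-vs-iproute2-harshvardhan-kumar-singh-f8uhc)

## Kernel Netlink Sockets

The kernel can create different netlink sockets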
to handle different types of messages; for example, `rtnetlink_net_init` creates a netlink socket that handles `NETLINK_ROUTE` messages.

```clike
static int __net_init rtnetlink_net_init(struct net *net)
{
    ...
    struct netlink_kernel_cfg cfg = {
        .groups   = RTNLGRP_MAX,
        .input    = rtnetlink_rcv,   /* handler for messages from userspace */
        .cb_mutex = &rtnl_mutex,
        .flags    = NL_CFG_F_NONROOT_RECV,
    };

    sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
    ...
}
```

The rtnetlink socket is aware of network [namespaces](https://hackmd.io/@0xff07/r1wCFz0ut): the network namespace object (struct `net`) contains a member named `rtnl` (the rtnetlink socket). In `rtnetlink_net_init`, after `netlink_kernel_create` creates the rtnetlink socket, the result is assigned to the `rtnl` pointer of `net`.

### `netlink_kernel_create`

```clike
struct sock *netlink_kernel_create(struct net *net, int unit, struct netlink_kernel_cfg *cfg)
```

- `net`: the network namespace
- `unit`: the netlink protocol. For example, `NETLINK_ROUTE` is for rtnetlink messages, `NETLINK_XFRM` is for IPsec, and `NETLINK_AUDIT` is for the audit subsystem. The Linux kernel defines more than 20 netlink protocols; [include/uapi/linux/netlink.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/netlink.h) lists all of them.
- `cfg`: options used when creating the netlink socket

```clike
/* include/linux/netlink.h */
struct netlink_kernel_cfg {
    unsigned int groups;
    unsigned int flags;
    void         (*input)(struct sk_buff *skb);
    struct mutex *cb_mutex;
    void         (*bind)(int group);
};
```

- `groups`: multicast groups. A socket can join groups by setting `nl_groups` in its `sockaddr_nl` object, but that bitmask allows at most 32 groups. Since 2.6.14, the `NETLINK_ADD_MEMBERSHIP`/`NETLINK_DROP_MEMBERSHIP` socket options can be used to join or leave a group, which makes more groups available. `nl_socket_add_membership`/`nl_socket_drop_membership` in `libnl` use this method
- `flags`: can be
  - `NL_CFG_F_NONROOT_RECV`: a non-superuser can bind to a multicast group. `netlink_bind` checks this flag; if it is not set, a non-superuser's attempt to bind to a multicast group fails with `-EPERM`

    ![image](https://hackmd.io/_uploads/r17PAnwmR.png)
  - `NL_CFG_F_NONROOT_SEND`: when set, a non-superuser can send multicasts
- `input`: if `NULL`, the socket cannot receive data from userspace, but the kernel can still send messages to userspace. In the rtnetlink kernel socket, `input` is `rtnetlink_rcv`, so data sent from
  userspace to the kernel is handled by `rtnetlink_rcv`. ~~uevent, however, only needs to send data to userspace, so [lib/kobject_uevent.c](https://github.com/torvalds/linux/blob/master/lib/kobject_uevent.c) does not set `input`.~~ A 2018 [patch](https://github.com/torvalds/linux/commit/692ec06d7c92af8ca841a6367648b9b3045344fd) added an `input` callback to uevent
- `cb_mutex`: the mutex to use. When none is given, the default ~~`cb_def_mutex`~~ is used. This was changed to [`nlk_cb_mutex_keys`](https://github.com/torvalds/linux/blob/61307b7be41a1f1039d1d1368810a1d92cb97b44/net/netlink/af_netlink.c#L95) in 2017, and `cb_def_mutex` no longer exists after this 2024 [commit](https://github.com/torvalds/linux/commit/e39951d965bf58b5aba7f61dc1140dcb8271af22); everything now uses `nlk->nl_cb_mutex`. The rtnetlink socket uses `rtnl_mutex`.

`netlink_kernel_create` adds an entry to `nl_table` by calling `netlink_insert`. `nl_table` is protected by a read-write lock, `nl_table_lock`, and a specific entry can be looked up with `netlink_lookup` by protocol and port number. Callbacks for a particular message type are registered with [`rtnl_register`](https://github.com/torvalds/linux/blob/61307b7be41a1f1039d1d1368810a1d92cb97b44/net/core/rtnetlink.c#L309).

The first step in working with an rtnetlink socket is calling `rtnl_register`. Of the three callbacks doit, dumpit, and calcit, usually only one is set; they are the implementations that actually handle the message.

```clike
extern void rtnl_register(int protocol, int msgtype,
                          rtnl_doit_func, rtnl_dumpit_func, rtnl_calcit_func);
```

- `protocol`: the protocol family, or `PF_UNSPEC` when none is specified. All protocol families are listed in [include/linux/socket.h](https://github.com/torvalds/linux/blob/master/include/linux/socket.h).
- `msgtype`: the message type, for example `RTM_NEWLINK` or `RTM_NEWNEIGH`. All message types are listed in [include/uapi/linux/rtnetlink.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/rtnetlink.h).
- `rtnl_doit_func`: the implementation for add/delete/modify requests
- `rtnl_dumpit_func`: used to retrieve information
- ~~`rtnl_calcit_func`~~: computed the buffer size; it was [removed](https://github.com/torvalds/linux/commit/b97bac64a589d0158cf866e8995e831030f68f4f) in 2017

The three callbacks above are stored in `rtnl_msg_handlers`, a table of registered callbacks indexed by protocol number. Each element of the table is an instance of `rtnl_link` and is registered through `rtnl_register`.

rtnetlink messages can be sent with `rtmsg_ifinfo`. For example, `dev_open` brings a network device up and then calls
`rtmsg_ifinfo`. Inside `rtmsg_ifinfo`:

1. `nlmsg_new` is called first to allocate an `sk_buff` large enough to hold the message
2. `rtnl_fill_ifinfo` fills the message into the `sk_buff`
3. The message is sent out with `nlmsg_notify` (reached through `rtnl_notify`)

![image](https://hackmd.io/_uploads/SJ_lyCPm0.png)

## The Netlink Message Header

RFC 3549 specifies the netlink message format. The Linux kernel defines `nlmsghdr` in [include/uapi/linux/netlink.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/netlink.h#L45):

```clike
/* include/uapi/linux/netlink.h */
struct nlmsghdr {
    __u32 nlmsg_len;
    __u16 nlmsg_type;
    __u16 nlmsg_flags;
    __u32 nlmsg_seq;
    __u32 nlmsg_pid;
};
```

- `nlmsg_len`: the length of the message, including the header
- `nlmsg_type`: the message type. Only the types defined in `netlink.h` are listed here; users can define their own types and register callback functions for them. `rtnetlink.h`, for example, defines many message types; see `man 7 rtnetlink` for more. Note that a user-defined type value must not be lower than `NLMSG_MIN_TYPE` (0x10); the values below it are reserved for control messages.
  - `NLMSG_NOOP`: the message should be discarded
  - `NLMSG_ERROR`: an error message
  - `NLMSG_DONE`: the end of a multipart message
  - `NLMSG_OVERRUN`: data was lost
  - ...etc
- `nlmsg_flags`: common flags are listed below; see [include/uapi/linux/netlink.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/netlink.h#L60) for details
  - `NLM_F_REQUEST`: marks a request message
  - `NLM_F_MULTI`: message length is usually limited by PAGE_SIZE, so an especially long message is split into parts and this flag is used. Every part carries the flag, and the last message sent has the type `NLMSG_DONE`
  - `NLM_F_ACK`: the receiver should send an ACK back to the sender. The netlink ACK is sent by [`netlink_ack`](https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.c#L2477).
  - `NLM_F_DUMP`: retrieve information about a table/entry
  - `NLM_F_ROOT`: specify the tree root
  - `NLM_F_MATCH`: return all matching entries
  - `NLM_F_REPLACE`: overwrite an existing entry
  - `NLM_F_EXCL`: do not touch the entry if it already exists
  - `NLM_F_CREATE`: create the entry if it does not exist
  - `NLM_F_APPEND`: add an entry at the end of the list
  - `NLM_F_ECHO`: echo this request
  - ...etc
- `nlmsg_seq`: sequence number
- `nlmsg_pid`: the sender's port ID
  - Kernel: 0
  - Userspace: may be the process ID

![image](https://hackmd.io/_uploads/HJEUgv_QR.png)

The header is followed by the payload. The payload of a netlink message is a set of Type-Length-Value ([TLV](https://github.com/torvalds/linux/blob/master/include/uapi/linux/netlink.h#L229)) attributes. TLV is also used elsewhere in networking, for example in IPv6 (RFC 2460); it provides good extensibility and the attributes can be nested.

```clike
/* include/uapi/linux/netlink.h */
struct nlattr {
    __u16 nla_len;
    __u16 nla_type;
};
```

- `nla_len`: the length of the attribute in bytes, including this header
- [`nla_type`](https://github.com/torvalds/linux/blob/master/include/net/netlink.h#L172): the attribute type. The data types an attribute value can have (used by the policies described below) include:
  - `NLA_U32`
  - `NLA_STRING`
  - `NLA_NESTED`
  - `NLA_UNSPEC`
  - ...etc

Every attribute must be aligned to 4 bytes (NLA_ALIGNTO). Each "family" can use `nla_policy` to define the message attributes it expects to receive. ~~This structure is identical to `nla_type`.~~ The "policy" used for validation is an array of `nla_policy` entries.

- ~~Validating `NLA_STRING`: len must be the length of the string, not counting `\0`~~
- ~~Validating `NLA_UNSPEC` or unknown: len must be the length of the attribute~~
- ~~Validating `NLA_FLAG`: validated by type~~

The current `nla_policy` is very different; see [include/net/netlink.h](https://github.com/torvalds/linux/blob/master/include/net/netlink.h#L335) for details.

When the kernel receives a generic netlink message, `genl_rcv_msg` handles it. If `NLM_F_DUMP` is set, `netlink_dump_start` is used to retrieve the information; otherwise the message is parsed with `nlmsg_parse`. `nlmsg_parse` first validates the attributes with `validate_nla`; attributes of a type beyond the maximum the family supports are silently ignored. If validation fails, an error code is returned without running `doit`; otherwise `doit` handles the message and its error code is returned.

## NETLINK_ROUTE Message

rtnetlink (NETLINK_ROUTE) messages are not limited to the routing subsystem. They are also used for neighbouring subsystem messages, interface setup messages, firewalling messages, netlink queuing messages, policy routing messages, and many others. NETLINK_ROUTE messages fall roughly into the following families:

- LINK(network interfaces)
- ADDR(network addresses)
- ROUTE(routing messages)
- NEIGH(neighbouring subsystem messages)
- RULE(policy routing rules)
- QDISC(queueing discipline)
- TCLASS(traffic classes)
- ACTION(packet action API, see [net/sched/act_api.c](https://github.com/torvalds/linux/blob/master/net/sched/act_api.c))
- NEIGHTBL(neighbouring table)
- ADDRLABEL(address labeling)

Each family has messages for three purposes (for ROUTE, for example, `RTM_NEWROUTE`, `RTM_DELROUTE`, and `RTM_GETROUTE`):

1. creation information
2. deletion information
3. retrieving information

When an error occurs, it is reported with the `nlmsgerr` structure:

```clike
/* include/uapi/linux/netlink.h */
struct nlmsgerr {
    int             error;
    struct nlmsghdr msg;
};
```

![image](https://hackmd.io/_uploads/HkjCRetQR.png)

If a message is not constructed correctly (for example, it has an invalid `nlmsg_type`), the sender receives a reply that contains an error message; in this example the error is `-EOPNOTSUPP`. The sender can request an ACK by setting `NLM_F_ACK` in `nlmsg_flags`. When the kernel receives such a request, it replies with an `NLMSG_ERROR` message whose error code is 0. See the implementation of [`netlink_ack`](https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.c#L2477) for details.

### Adding and Deleting a Routing Entry in a Routing Table

A routing entry is added by sending an `RTM_NEWROUTE` netlink message from userspace. `rtnetlink_rcv` receives the data and hands it to [`inet_rtm_newroute`](https://github.com/torvalds/linux/blob/master/net/ipv4/fib_frontend.c#L885) for processing.

```shell
ip route add 192.168.2.11 via 192.168.2.20
```

```graphviz
digraph insert_routing_entry{
    A[label = "rtnetlink_rcv"]
    B[label = "inet_rtm_newroute"]
    C[label = "fib_table_insert"]
    D[label = "rtmsg_fib"]
    E[label = "rtnl_notify"]
    A->B[label = "the received add message"]
    B->C[label = "add a new entry to the routing table"]
    C->D[label = "notify everyone registered for RTM_NEWROUTE"]
    D->E[label = "send the message out"]
}
```

A routing entry can also be deleted:

```shell
ip route del 192.168.2.11
```

```graphviz
digraph delete_routing_entry{
    A[label = "rtnetlink_rcv"]
    B[label = "inet_rtm_delroute"]
    C[label = "fib_table_delete"]
    D[label = "rtmsg_fib"]
    E[label = "rtnl_notify"]
    A->B[label = "the received delete message"]
    B->C[label = "delete the entry from the routing table"]
    C->D[label = "notify everyone registered for RTM_DELROUTE"]
    D->E[label = "send the message out"]
}
```

To monitor the routing table, so that every change anyone makes to it is printed:

```shell
ip monitor route
```

## Generic Netlink Protocol

The Generic Netlink Protocol was designed to overcome the limit of 32 netlink protocol families. With NETLINK_GENERIC, netlink acts as a multiplexer; the protocol is built on top of the netlink protocol and uses its API. Normally, adding a new protocol family means adding it to `include/linux/netlink.h`, but the Generic Netlink Protocol does not require this. Because it provides a general-purpose communication channel, it is also used by subsystems outside networking, for example the [acpi subsystem](https://en.wikipedia.org/wiki/ACPI). On the kernel side, the protocol is set up by
[`genl_pernet_init`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L1879). As with rtnetlink, the socket is stored in the network namespace, in `genl_sock`. Data sent from userspace to the kernel through this socket is handled by `genl_rcv`. In the kernel version the book describes, using the protocol requires calling `genl_register_family` and `genl_register_ops`.

```clike
/* net/netlink/genetlink.c */
static int __net_init genl_pernet_init(struct net *net)
{
    ...
    struct netlink_kernel_cfg cfg = {
        .input    = genl_rcv,
        .cb_mutex = &genl_mutex,
        .flags    = NL_CFG_F_NONROOT_RECV,
    };

    net->genl_sock = netlink_kernel_create(net, NETLINK_GENERIC, &cfg);
    ...
}
```

```graphviz
digraph genl_init{
    A[label = genl_init]
    B[label = register_pernet_subsys]
    C[label = register_pernet_operations]
    D[label = __register_pernet_operations]
    E[label = "exist node in list first_device", shape = diamond]
    F[label = ops_init]
    G[label = genl_pernet_init]
    H[label = netlink_kernel_create]
    done[label = Done]
    A->B[label = genl_pernet_ops]
    B->C[label = "first_device, genl_pernet_ops"]
    C->D[label = "first_device, genl_pernet_ops"]
    D->E[label = "first_device, genl_pernet_ops"]
    E->F[label = "exist, net in list first_device"]
    F->G[label = "genl_pernet_ops, net device"]
    G->H[label = "net device"]
    H->E
    E->done[label = "non-exist"]
}
```

The wireless subsystem also uses netlink sockets, for example in [`nl80211_init`](https://github.com/torvalds/linux/blob/master/net/wireless/nl80211.c#L20228). The current version registers the notifier right after registering the family, because the [family](https://github.com/torvalds/linux/blob/master/net/wireless/nl80211.c#L17496) structure already contains the ops. Part of the family definition is shown below.

```clike
static struct genl_family nl80211_fam = {
    .id        = GENL_ID_GENERATE, /* don't bother with a hardcoded ID */
    .name      = "nl80211",        /* have users key off the name instead */
    .hdrsize   = 0,                /* no private header */
    .version   = 1,                /* no particular meaning now */
    .maxattr   = NL80211_ATTR_MAX,
    .netnsok   = true,
    .pre_doit  = nl80211_pre_doit,
    .post_doit = nl80211_post_doit,
};
```

- `name`: must be unique
- `id`: `GENL_ID_GENERATE` tells the generic netlink controller to assign a channel number when the family is registered; the value is then replaced by a number in the range 16 (GENL_MIN_ID) to 1023 (GENL_MAX_ID)
- `hdrsize`: the length of the private header
- `maxattr`: the maximum number of attributes supported
- `netnsok`: whether the family can handle network namespaces
- `pre_doit`: a callback invoked before `doit`
- `post_doit`: a callback invoked after `doit`

Beyond these, the most important member is `.ops`, which points to an array of `genl_ops`. The [`genl_ops`](https://github.com/torvalds/linux/blob/master/include/net/genetlink.h#L209) structure is shown below.

```clike
/**
 * struct genl_ops - generic netlink operations
 * @cmd: command identifier
 * @internal_flags: flags used by the family
 * @flags: GENL_* flags (%GENL_ADMIN_PERM or %GENL_UNS_ADMIN_PERM)
 * @maxattr: maximum number of attributes supported
 * @policy: netlink policy (takes precedence over family policy)
 * @validate: validation flags from enum genl_validate_flags
 * @doit: standard command callback
 * @start: start callback for dumps
 * @dumpit: callback for dumpers
 * @done: completion callback for dumps
 */
struct genl_ops {
    int (*doit)(struct sk_buff *skb, struct genl_info *info);
    int (*start)(struct netlink_callback *cb);
    int (*dumpit)(struct sk_buff *skb, struct netlink_callback *cb);
    int (*done)(struct netlink_callback *cb);
    const struct nla_policy *policy;
    unsigned int maxattr;
    u8 cmd;
    u8 internal_flags;
    u8 flags;
    u8 validate;
};
```

The current Linux kernel also has [`genl_small_ops`](https://github.com/torvalds/linux/blob/master/include/net/genetlink.h#L187):

```clike
/**
 * struct genl_small_ops - generic netlink operations (small version)
 * @cmd: command identifier
 * @internal_flags: flags used by the family
 * @flags: GENL_* flags (%GENL_ADMIN_PERM or %GENL_UNS_ADMIN_PERM)
 * @validate: validation flags from enum genl_validate_flags
 * @doit: standard command callback
 * @dumpit: callback for dumpers
 *
 * This is a cut-down version of struct genl_ops for users who don't need
 * most of the ancillary infra and want to save space.
 */
struct genl_small_ops {
    int (*doit)(struct sk_buff *skb, struct genl_info *info);
    int (*dumpit)(struct sk_buff *skb, struct netlink_callback *cb);
    u8 cmd;
    u8 internal_flags;
    u8 flags;
    u8 validate;
};
```

When userspace wants to send a message to the kernel over this protocol, it needs to know the family ID. Since userspace only knows the family name, it sends a `CTRL_CMD_GETFAMILY` request to generic netlink to obtain the family ID. In the kernel, this request is handled by [`ctrl_getfamily`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L1429).

### Creating and Sending Generic Netlink Messages

![image](https://hackmd.io/_uploads/B14fM2i70.png)

Because the protocol is built on netlink, a message starts with the netlink header, followed by the [generic netlink message header](https://github.com/torvalds/linux/blob/master/include/uapi/linux/genetlink.h#L13).

```clike
struct genlmsghdr {
    __u8  cmd;
    __u8  version;
    __u16 reserved;
};
```

- `cmd`: the generic netlink message type. For 802.11, for example, [`nl80211_commands`](https://github.com/torvalds/linux/blob/master/include/uapi/linux/nl80211.h#L1335C6-L1335C23) defines many different types
- `version`: identifies the version
- `reserved`: reserved for future use

[`genlmsg_new`](https://github.com/torvalds/linux/blob/master/include/net/genetlink.h#L606) allocates enough space for a generic netlink message. Once the space is available, [`genlmsg_put`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L893) creates the generic netlink header. To send a unicast generic netlink message, use [`genlmsg_unicast`](https://github.com/torvalds/linux/blob/master/include/net/genetlink.h#L548). There are two ways to send a multicast message:

1. `genlmsg_multicast`: sends the message to the default network namespace (`init_net`)
2. [`genlmsg_multicast_allns`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L1969): sends the message to every network namespace

In userspace, a generic netlink socket can be created with `socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC)` and then used with `bind`, `sendmsg`, and `recvmsg`. The author, however, recommends using the API provided by `libnl` for generic netlink sockets. Finally, the author mentions that the iproute2 command `genl` lists all registered generic netlink families:

```shell
genl ctrl list
```

## Socket Monitoring Interface

The `ss` command shows socket information and statistics for the different socket types. It obtains this information mainly through `sock_diag`, which is provided over a netlink socket. The feature was added mainly to support checkpoint/restore functionality for Linux in userspace (CRIU). To make a socket type available through NETLINK_SOCK_DIAG, first create a [`sock_diag_handler`](https://github.com/torvalds/linux/blob/master/include/linux/sock_diag.h#L15) and then register it with [`sock_diag_register`](https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L205). For UNIX domain sockets, for example, the information can then be obtained with `ss -x` or `ss --unix`. Many protocols expose information this way, for example [`inet_diag_init`](https://github.com/torvalds/linux/blob/master/net/ipv4/inet_diag.c#L1553) in IPv4.

Netlink socket entry information is also available through `/proc/net/netlink`, where it is produced by [`netlink_seq_show`](https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.c#L2754). Some information is not exposed by `/proc/net/netlink`, however, such as `dst_group`, `dst_portid`, or groups above 32. That is why `net/netlink/diag.c` was added, so that users can read this socket information with `ss`.

## General interfaces

- [`netlink_rcv_skb`](https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.c#L2538): handles received netlink messages. It performs sanity checks on each message, for example that its length does not exceed the permitted length. Control messages are not passed to the supplied callback function. When a message carries the ACK flag, or handling it fails, an error message is sent through `netlink_ack`.
- [`netlink_alloc_skb`](https://github.com/torvalds/linux/commit/c5b0db3263b92526bc0c1b6380c0c99f91f069fc): no longer used, because the wrapper is no longer necessary. [`alloc_skb`](https://github.com/torvalds/linux/blob/master/include/linux/skbuff.h#L1305) is now called directly
- [`nlk_sk`](https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.h#L57): returns, for the socket
  `sk`, the `netlink_sock` object that contains it
- [`netlink_kernel_create`](https://github.com/torvalds/linux/blob/master/include/linux/netlink.h#L60): creates a kernel netlink socket
- [`nlmsg_hdr`](https://github.com/torvalds/linux/blob/master/include/linux/netlink.h#L16): returns the netlink message header that `skb->data` points to
- [`__nlmsg_put`](https://github.com/torvalds/linux/blob/master/net/netlink/af_netlink.c#L2151): builds a netlink message header and puts it into the `skb`
- [`nlmsg_new`](https://github.com/torvalds/linux/blob/master/include/net/netlink.h#L1013): allocates space for a new netlink message through `alloc_skb`, sized according to its parameters. If the payload is set to 0, `alloc_skb` is called with `NLMSG_HDRLEN`
- [`nlmsg_msg_size`](https://github.com/torvalds/linux/blob/master/include/net/netlink.h#L569): returns the header length plus the `payload` length
- [`rtnl_register`](https://github.com/torvalds/linux/blob/master/net/core/rtnetlink.c#L309): registers `doit`, `dumpit`, and `calcit` for an rtnetlink message type
- [`rtnetlink_rcv_msg`](https://github.com/torvalds/linux/blob/master/net/core/rtnetlink.c#L6487): processes a received rtnetlink message
- [`rtnl_fill_ifinfo`](https://github.com/torvalds/linux/blob/master/net/core/rtnetlink.c#L1811): creates the netlink message header (`nlmsghdr`) and an `ifinfomsg` object, and places the `ifinfomsg` right after the `nlmsghdr`
- [`rtnl_notify`](https://github.com/torvalds/linux/blob/master/net/core/rtnetlink.c#L752): sends an rtnetlink message
- [`genl_register_mc_group`](https://github.com/torvalds/linux/commit/2a94fe48f32ccf7321450a2cc07f2b724a444e5b): groups are no longer registered this way. The new approach adds an `mcgrps` member to the family that points to an array of groups, so users no longer need to supply a group index to send a message to a group
- [`genl_unregister_mc_group`](https://github.com/torvalds/linux/commit/06fb555a273dc8ef0d876f4e864ad11cfcea63e0): since groups are no longer registered with `genl_register_mc_group`, they no longer need to be unregistered either. Only `genl_unregister_family` uses this functionality now
- [`genl_register_ops`](https://github.com/torvalds/linux/commit/d91824c08fbcb265ec930d863fa905e8daa836a4): `ops` are no longer registered with `genl_register_ops`. Instead, [`genl_validate_ops`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L571) checks whether the `ops` member of a newly created family is valid
- [`genl_unregister_ops`](https://github.com/torvalds/linux/commit/3686ec5e84977eddc796903177e7e0a122585c11): with `genl_register_ops` gone, `genl_unregister_ops` is not needed either
- [`genl_register_family`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L780)/[`genl_unregister_family`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L855): register/unregister a family. Registration calls `genl_validate_ops` to check that the `ops` are valid
- [`genl_register_family_with_ops`](https://github.com/torvalds/linux/commit/489111e5c25b93be80340c3113d71903d7c82136): no longer needed, because `ops` registration changed (the `ops` are now assigned directly when the family is created)
- [`genlmsg_put`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L893): puts the generic netlink header into a netlink message
- [`genl_lock`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L33)/[`genl_unlock`](https://github.com/torvalds/linux/blob/master/net/netlink/genetlink.c#L39): lock/unlock the generic netlink mutex through `mutex_lock`/`mutex_unlock`
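
To make the header and TLV layout from "The Netlink Message Header" more concrete, the following is a minimal userspace sketch that builds a netlink message in a plain memory buffer, without opening a socket. It only relies on the structures and macros from `<linux/netlink.h>` (`NLMSG_LENGTH`, `NLMSG_ALIGN`, `NLMSG_HDRLEN`, `NLA_HDRLEN`, `NLA_ALIGN`). The helper names `demo_put_header` and `demo_put_attr`, the message type `DEMO_MSG_TYPE`, and the attribute number `DEMO_ATTR_VALUE` are hypothetical and exist only for this illustration; they are not part of the kernel, `libnl`, or any real netlink family.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical message type: user-defined types must not be below
 * NLMSG_MIN_TYPE (0x10), which is reserved for control messages. */
#define DEMO_MSG_TYPE   (NLMSG_MIN_TYPE + 1)
/* Hypothetical attribute type number, not taken from any real family. */
#define DEMO_ATTR_VALUE 1

/* Zero the buffer and write a netlink header at its start.
 * buf must be at least 4-byte aligned. */
static struct nlmsghdr *demo_put_header(void *buf, size_t cap, uint32_t seq)
{
    struct nlmsghdr *nlh = buf;

    assert(cap >= (size_t)NLMSG_HDRLEN);
    memset(buf, 0, cap);                    /* also zeroes future padding bytes */
    nlh->nlmsg_len   = NLMSG_LENGTH(0);     /* header only, no payload yet */
    nlh->nlmsg_type  = DEMO_MSG_TYPE;
    nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    nlh->nlmsg_seq   = seq;
    nlh->nlmsg_pid   = 0;                   /* sender port ID, left as 0 in this sketch */
    return nlh;
}

/* Append one TLV attribute at the current end of the message.
 * Returns 0 on success, -1 if the attribute does not fit. */
static int demo_put_attr(struct nlmsghdr *nlh, size_t cap, uint16_t type,
                         const void *data, uint16_t len)
{
    size_t start    = NLMSG_ALIGN(nlh->nlmsg_len); /* attributes start 4-byte aligned */
    size_t attr_len = (size_t)NLA_HDRLEN + len;    /* header + value, without padding */
    struct nlattr *nla;

    if (attr_len > UINT16_MAX)                     /* nla_len is only 16 bits wide */
        return -1;
    if (start + NLA_ALIGN(attr_len) > cap)
        return -1;

    nla = (struct nlattr *)((char *)nlh + start);
    nla->nla_len  = (uint16_t)attr_len;            /* length in bytes, not a count */
    nla->nla_type = type;
    memcpy((char *)nla + NLA_HDRLEN, data, len);

    /* The total message length includes the padding after the value. */
    nlh->nlmsg_len = (uint32_t)(start + NLA_ALIGN(attr_len));
    return 0;
}
```

A caller would pass a 4-byte aligned buffer (for example a `uint32_t` array) to `demo_put_header`, then call `demo_put_attr` once per attribute. A 3-byte value yields `nla_len` 7 while the message grows by 8 bytes, which shows the difference between the attribute length and its aligned size. For real applications the `libnl` API recommended above is the better choice.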