Linux kernel networking (Netlink, data flow, netfilter, etc.)

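Reading notes and key points on *Linux Kernel Networking* by Rami Rosen. Work in progress: everything except ch8 (IPv6) and ch14 (network namespaces) has been read. The next steps are to review, build a mental tree of how the pieces relate, and then think about how all of this applies to writing router, BSP, and driver code.

[Linux data flow diagram: after reading a chapter and writing your own summary, compare it with this author's data flow chart](https://github.com/gfreewind/kernel_net_chart/blob/master/kernel_skb_path.jpg)

The book covers Wi-Fi only briefly. Extra resources to read:

[Linux 802.11 Driver Developer's Guide](https://docs.kernel.org/driver-api/80211/index.html)
[Linux 802.11 Data Flow](https://github.com/WeitaoZhu/wi-fi_books/blob/master/driver-mac80211_intro.pdf)
[ath9k driver structure](https://blog.csdn.net/weixin_44258973/article/details/108003367)
[ath9k analysis](https://blog.csdn.net/lizuobin2/article/details/53678299)

The data flow document is very clear: finish ch12 of the book first, then read the data flow document, then the ath9k articles. For actual development, study the developer's guide.

Chapters:

* ch1. Introduction
* ch2. Netlink sockets
* ch3. ICMP: Internet Control Message Protocol
* ch4. IPv4
* ch5. IPv4 routing subsystem
* ch6. Advanced routing (multicast)
* ch7. Linux neighbouring subsystem
* ch8. IPv6 (skip)
* ch9. Netfilter
* ch10. IPsec
* ch11. Layer 4 protocols
* ch12. Wireless in Linux

After reading ch1 to ch4 together with ch11, writing a simple [socket programming](https://www.geeksforgeeks.org/socket-programming-cc/) exercise started to make sense: `sockaddr` really is filled in this way by a userspace app, and so is `AF_INET`. The key is still to know the overall architecture: which variables to set and which functions to call. These points should also go into Anki flashcards for spaced repetition. So far the decks only cover IPv6, Wi-Fi, C details and pitfalls, and interview questions; IPv4 and the Cisco LAN Switching book need review cards as well.

:::spoiler data flow and fast recap!
ch2 Netlink/Socket data flow diagram:
![LKN-ch2.drawio](https://hackmd.io/_uploads/SJgXt5snRyg.png)
ch3 ICMP data flow diagram:
![LKN-ch3.drawio](https://hackmd.io/_uploads/S1mev3nRkg.png)
[Detailed walkthrough of the IPv4 Linux data flow](https://blog.csdn.net/weixin_58966834/article/details/137572176)
ch4 IPv4 data flow diagram:
![LKN-ch4.drawio](https://hackmd.io/_uploads/BkHyuy6R1x.png)
:::

### ch1. introduction
:::success
Terms:
`socket`: an endpoint for sending and receiving data over the network; a kernel-side software structure that receives application data and talks to the protocol layers.
`skb`: socket buffer. Historically each incoming packet raised an interrupt; now packets wait in a buffer until the kernel polls for them. (The SKB is the most fundamental data structure in the Linux networking code; every sent or received packet is handled through it.)
```cpp
struct sk_buff {
    . . .
    struct sock *sk;          // the socket that handles it; can be netlink or generic
    struct net_device *dev;   // abstraction of the NIC driver
    . . .
    __u8 pkt_type:3,
    . . .
    __be16 protocol;
    . . .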
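    sk_buff_data_t tail;
    sk_buff_data_t end;
    unsigned char *head, *data;
    sk_buff_data_t transport_header; // move the data pointer only through the skb API (e.g. skb_push()); never offset it directly
    sk_buff_data_t network_header;
    sk_buff_data_t mac_header;
    . . .
};
```
Note: the network namespace (struct net, ch14) contains netns_ipv4, which also stores pointers to several tables, such as ch9's nat_table and xt_table (each iptables table is represented by an xt_table).

Each SKB has a dev member, which is an instance of the net_device structure (the abstraction of the network card).
```cpp
struct net_device {
    unsigned int irq; /* device IRQ number */
    . . .
    const struct net_device_ops *netdev_ops;
    . . .
    unsigned int mtu;
    . . .
    unsigned int promiscuity;
    . . .
    unsigned char *dev_addr;
    . . .
};
```
A network namespace makes a process believe it has the network resources to itself.
Networking stack (protocol stack): a concrete software implementation of a network protocol suite.
:::

Flow: a packet comes in through a net_device (the kernel object for the NIC driver) and is handed to the registered L3 handler; IPv4 packets are handled by ip_rcv(). It first validates the packet, then enters the netfilter packet filtering system (before routing): the NF_INET_PRE_ROUTING hook invokes the registered callbacks (function pointers passed to other code). If the pre-routing stage does not drop the packet, ip_rcv_finish() runs and the destination is decided: if it is a local host address the packet is passed upward, otherwise it is forwarded. The decision uses a lookup in the routing subsystem, which produces a destination cache entry, a dst_entry object.

:::warning
`hook`: intercepts (detects) function calls or messages as they pass.
`callback`: a reference to executable code that is passed as an argument to other code, which is expected to call it back as part of its own work. In kernel networking a hook callback works like this: a packet arrives, so ip_rcv() is called.
Benefit of callbacks: they make timing and relationships between components explicit and ease maintenance.
`Netlink`: a set of Linux kernel interfaces. Communication between userspace and the kernel is done with netlink sockets (adding or deleting a route, configuring the neighbour table, and so on; changing routing and network settings from userspace goes through the netlink socket interface).
`network namespace`: as the name suggests, it isolates the operating system's network resources. Processes isolated in a network namespace believe they own a complete network stack.
:::

Packets generated by the local host originate from L4 sockets. There are two main socket types: datagram sockets (UDP) and stream sockets (TCP).
ioctl is the function a device driver uses to manage the device's I/O channel.

:::spoiler quick recap
Explain skb_dst(), the kernel netlink socket, the five netfilter hook points, and what a callback is.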
The skb_dst() method checks whether there is a "dst" object attached to the SKB; dst is an instance of `dst_entry` and represents the *result of a lookup* in the routing subsystem.

Netlink is the communication mechanism between userspace and the kernel. Kernel netlink sockets handle that communication by exchanging netlink messages of different types (rtnetlink, NETLINK_XFRM, and so on).

callback: a reference to executable code that is passed as an argument to other code (like a function pointer); the receiving code is expected to call it back as part of its own work. This keeps timing and relationships explicit and eases maintenance.

The five netfilter hook points: NF_INET_PRE_ROUTING, NF_INET_LOCAL_IN, NF_INET_FORWARD, NF_INET_LOCAL_OUT, NF_INET_POST_ROUTING.
:::

:::warning
#### kernel netlink sockets: the core engine of handling communication between userspace and the kernel by exchanging netlink messages of different types.
`daemon`: a program that runs in the background rather than under the direct interactive control of a user.
session: a persistent network mechanism that *establishes an association between client and server so that they can exchange packets*.
AF stands for `Address Family` and PF stands for Protocol Family. INET stands for INTERNET.
SKB: the single data structure used for packet data inside the kernel.
The skb_dst() method checks whether there is a "dst" object attached to the SKB; dst is an instance of `dst_entry` and represents the *result of a lookup* in the routing subsystem.
network stack: a concrete software implementation of a network protocol suite.
:::

netlink protocols: NETLINK_ROUTE for rtnetlink messages, NETLINK_XFRM for IPsec, or NETLINK_AUDIT for the audit subsystem.
netlink_kernel_cfg consists of optional parameters for the netlink socket creation.

:::spoiler ( . meaning in a struct initializer)
What does dot (.) mean in a struct initializer? (A very common idiom in Linux kernel networking code.)
This is a C99 feature (designated initializers) that allows you to set specific fields of the struct by name in an initializer. For example, given `struct ex { int first; };` you can write either `struct ex example1 = {1};` or `struct ex example1 = {.first = 1};`.
:::

:::spoiler (linux coding style __ meaning kernel usage, dont conflict with usr function)
Anything with __ in it is reserved for implementation use.
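This means that compiler writers and standard library writers can use those identifiers without worrying about a name clash with user code (e.g. __u8). u8 is non-standard but almost certainly means the same as uint8_t. A __be prefix denotes big-endian, also known as network byte order (not significant for a single byte).

In the Linux kernel, underscores are typically used for the following reasons:
1. Readability: variable and function names use underscores to separate words.
2. Avoiding name clashes: kernel developers often use underscore prefixes or suffixes to distinguish functions and variables that might otherwise have similar names.
3. Scope indication: an underscore prefix (e.g. _my_function) signals that a function or variable is internal to a file or module and should not be used from outside.
4. Macro definitions: macros usually separate words with underscores to make them easier to read, e.g. #define MAX_BUFFER_SIZE 1024.
5. Struct and type names: underscores help clarify naming, e.g. struct file_operations.

The coding style also recommends all caps for macro-defined constants, enum values, and the like.
:::

To study: IPv4 (ARP, NAT, DNS, routing, ICMPv4, how to set netmasks, VLAN, netfilter, OSPF, ACL; once the basics are in place, also learn how industry designs program registers in C to implement VLAN and similar features, fast forwarding, ...) and Cisco LAN switching.

### ch2. netlink socket
The netlink protocol is a socket-based Inter Process Communication (IPC) mechanism, used for communication between userspace and the kernel.

At its core, Linux kernel networking works like this: a packet comes in at L2 (the network device, net_device) and is passed to L3 (network layer, ip_rcv()); if the destination is the local host it goes up to L4 (TCP/UDP), otherwise it enters the routing subsystem to be forwarded.

net_device: this structure holds key data such as the device IRQ, `MTU`, `dev_addr`, and `netdev_ops`. There is also `promiscuity`: while it is greater than 0, packets not destined for the local host are not dropped.

What is NAPI?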
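Originally every received packet raised an interrupt, which is inefficient. `NAPI switches to polling with buffering`: incoming packets sit in a buffer until the kernel polls for them, which copes much better with high load.

:::spoiler Summarize the essence of Linux kernel networking, sockaddr_nl, netlink socket creation, and netlink message creation and handling (registration)
At its core, Linux kernel networking works like this: a packet comes in at L2 (the network device, net_device) and is passed to L3 (network layer, ip_rcv()); if the destination is the local host it goes up to L4 (TCP/UDP), otherwise it enters the routing subsystem. Along the way it passes five fixed hook points where filtering, IPsec, and other processing can change where the packet goes.

net_device receives the packet -> ip_rcv() (sanity checks) -> NF_INET_PRE_ROUTING -> ip_rcv_finish() -> (the routing subsystem decides between forwarding and local delivery to L4, so the flow splits in two) ip_forward() / ip_mr_input() for multicast (or ip_local_deliver()) -> NF_INET_FORWARD (or NF_INET_LOCAL_IN) -> ip_output() (or ip_local_deliver_finish()) -> NF_INET_POST_ROUTING -> ip_finish_output() -> down to the L2 network driver (or up to L4, e.g. TCP)

sockaddr_nl: the structure representing a netlink socket address. It contains nl_family, nl_pid, and nl_groups: the protocol family, the unicast address (0 for the kernel; for a userspace process, its process ID), and the multicast group (also called the multicast mask).

Several kinds of kernel netlink sockets can be created in the kernel, each handling different messages (e.g. rtnetlink handles NETLINK_ROUTE messages). For example, rtnetlink_net_init() calls `sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);` which registers handling of NETLINK_ROUTE messages and creates this kernel netlink socket using the cfg parameters prepared beforehand.
:::

Where a packet goes is not decided by the routing lookup alone (IPsec and TTL also matter). Netfilter (the foundation of the iptables firewall) provides the packet filtering mechanism. Callbacks are `registered` with the system (in the kernel, functions commonly have to be registered to declare which messages they handle) at five hook points, including `NF_INET_PRE_ROUTING`. In practice a method invokes the NF_HOOK macro, for example:
```cpp=
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
{
    ......
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
    ......
}
```
What are the five netfilter hook points?
```cpp
enum nf_inet_hooks {
    NF_INET_PRE_ROUTING,
    NF_INET_LOCAL_IN,
    NF_INET_FORWARD,
    NF_INET_LOCAL_OUT,
    NF_INET_POST_ROUTING,
    NF_INET_NUMHOOKS,
};
```
:::success
SKBs must be manipulated through the SKB API: never move the pointers directly, use skb_pull() and its relatives. An SKB holds important members such as `sk`, `dev`, and the `sk_buff_data_t` fields transport_header / network_header / mac_header.
Every SKB has a dev member, a net_device instance that represents the network device which received the packet.
Registering a protocol handler such as ip_rcv is done with dev_add_pack().
The sockaddr_nl structure represents a netlink socket address.
:::

ch11 discusses SCTP and DCCP, two newer L4 transport protocols (this came up in an interview!).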
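The receive, forward, and transmit paths described above each cross a fixed subset of the five hook points. The following is a minimal userspace sketch of that ordering, not kernel code: the enum is a local copy of the hook identifiers, and `toy_path` and `toy_hooks_for_path()` are hypothetical names invented for illustration.

```c
#include <stddef.h>

/* Local copy of the hook identifiers, for illustration only. */
enum nf_inet_hooks {
    NF_INET_PRE_ROUTING,
    NF_INET_LOCAL_IN,
    NF_INET_FORWARD,
    NF_INET_LOCAL_OUT,
    NF_INET_POST_ROUTING,
    NF_INET_NUMHOOKS
};

/* Hypothetical classification of a packet's journey through the stack. */
enum toy_path {
    TOY_PATH_LOCAL_DELIVER, /* received, destined for this host */
    TOY_PATH_FORWARD,       /* received, routed on to another host */
    TOY_PATH_LOCAL_OUT      /* generated by a local socket */
};

/*
 * Write the hooks that a packet on `path` traverses, in order, into `out`.
 * Returns the number of hooks written, or 0 if the path is unknown or
 * `out` (capacity `cap`) is too small.
 */
size_t toy_hooks_for_path(enum toy_path path, enum nf_inet_hooks *out, size_t cap)
{
    static const enum nf_inet_hooks local_in[] = {
        NF_INET_PRE_ROUTING, NF_INET_LOCAL_IN
    };
    static const enum nf_inet_hooks forward[] = {
        NF_INET_PRE_ROUTING, NF_INET_FORWARD, NF_INET_POST_ROUTING
    };
    static const enum nf_inet_hooks local_out[] = {
        NF_INET_LOCAL_OUT, NF_INET_POST_ROUTING
    };
    const enum nf_inet_hooks *src;
    size_t n, i;

    switch (path) {
    case TOY_PATH_LOCAL_DELIVER:
        src = local_in;  n = sizeof(local_in) / sizeof(local_in[0]);
        break;
    case TOY_PATH_FORWARD:
        src = forward;   n = sizeof(forward) / sizeof(forward[0]);
        break;
    case TOY_PATH_LOCAL_OUT:
        src = local_out; n = sizeof(local_out) / sizeof(local_out[0]);
        break;
    default:
        return 0;
    }
    if (n > cap)
        return 0;
    for (i = 0; i < n; i++)
        out[i] = src[i];
    return n;
}
```

Calling `toy_hooks_for_path(TOY_PATH_FORWARD, buf, NF_INET_NUMHOOKS)` fills `buf` with PRE_ROUTING, FORWARD, POST_ROUTING. In the real kernel each of these points is an NF_HOOK call whose registered callbacks may accept or drop the packet; the sketch only captures the order.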
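ch11 also discusses the socket and sock structures, which provide the interfaces toward userspace and toward the network layer respectively.
Looking up a MAC address from an IP address is the job of the neighbouring subsystem.

How netlink operates: open and register a netlink socket with the socket API; it then handles two-way communication with the kernel netlink socket (typically *sending messages to configure system settings and receiving the kernel's responses*).
![create_process](https://hackmd.io/_uploads/SyD2oYX6Jg.png)

The sockaddr_nl structure represents a netlink socket address:
```cpp=
struct sockaddr_nl {
    __kernel_sa_family_t nl_family; // AF_NETLINK
    unsigned short nl_pad;          // always 0
    __u32 nl_pid;                   // port ID, the unicast address of the netlink socket: 0 for the kernel, the PID for a userspace process
    __u32 nl_groups;                // multicast groups mask
};
```
Inspection tools: the net-tools package (IOCTL-based) includes netstat, ifconfig, arp, route, and others.

#### kernel netlink socket
Several kinds of kernel netlink sockets can be created in the kernel, each handling different messages (e.g. the kernel netlink socket that handles NETLINK_ROUTE messages for rtnetlink is created in rtnetlink_net_init()):
```cpp=
static int __net_init rtnetlink_net_init(struct net *net)
{
    ...
    struct netlink_kernel_cfg cfg = {
        .groups   = RTNLGRP_MAX,
        .input    = rtnetlink_rcv,
        .cb_mutex = &rtnl_mutex,
        .flags    = NL_CFG_F_NONROOT_RECV,
    };
    // registers handling of NETLINK_ROUTE messages and creates this kernel netlink socket with the cfg prepared above
    sk = netlink_kernel_create(net, NETLINK_ROUTE, &cfg);
    ...
}
```
The input member specifies the callback function; only when it is set can the kernel netlink socket receive messages from userspace (sending from the kernel to userspace is still possible without it). cb_mutex is left unset by most netlink sockets; rtnetlink and generic netlink are the exceptions that set their own mutex.

:::success
Before a kernel netlink socket can handle a message type, callbacks must be registered with `rtnl_register(int protocol, int msgtype, rtnl_doit_func doit, rtnl_dumpit_func dumpit, rtnl_calcit_func calcit);`
The doit callback handles add, delete, and modify requests; dumpit retrieves information; calcit computes the buffer size.
Most modules you work with have a handler table (xx_handler); rtnetlink has the rtnl_msg_handlers table. This table is indexed by protocol number. Each entry is itself a table indexed by message type, and each element is an instance of rtnl_link, a structure made of three pointers to these callbacks. So registering a callback with rtnl_register() adds it to this table.
For example, in rtnetlink_init() you *register callbacks for some messages*, like RTM_NEWLINK (creating a new link), RTM_DELLINK (deleting a link), RTM_GETROUTE (dumping the route table).
The netlink_kernel_create() method makes an entry in a table named nl_table by calling the netlink_insert() method.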
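The two-level handler table that rtnl_register() fills can be pictured with a small userspace model. This is only a sketch under simplifying assumptions: the table sizes, the callback signatures, and every `toy_` name are hypothetical, and the calcit pointer is left out.

```c
#include <stddef.h>

#define TOY_MAX_PROTO 4 /* hypothetical sizes, far smaller than the kernel's */
#define TOY_MAX_MSG   8

typedef int (*toy_doit_fn)(int msgtype);
typedef int (*toy_dumpit_fn)(int msgtype);

/* Modeled on the idea of rtnl_link: a bundle of callback pointers. */
struct toy_link {
    toy_doit_fn   doit;
    toy_dumpit_fn dumpit;
};

/* First index: protocol number. Second index: message type. */
static struct toy_link toy_handlers[TOY_MAX_PROTO][TOY_MAX_MSG];

static int toy_in_range(int protocol, int msgtype)
{
    return protocol >= 0 && protocol < TOY_MAX_PROTO &&
           msgtype >= 0 && msgtype < TOY_MAX_MSG;
}

/* Register callbacks for one (protocol, message type) pair. */
int toy_register(int protocol, int msgtype, toy_doit_fn doit, toy_dumpit_fn dumpit)
{
    if (!toy_in_range(protocol, msgtype))
        return -1;
    toy_handlers[protocol][msgtype].doit = doit;
    toy_handlers[protocol][msgtype].dumpit = dumpit;
    return 0;
}

/* Dispatch a request to its doit callback; -1 if none is registered. */
int toy_dispatch_doit(int protocol, int msgtype)
{
    toy_doit_fn fn;

    if (!toy_in_range(protocol, msgtype))
        return -1;
    fn = toy_handlers[protocol][msgtype].doit;
    return fn ? fn(msgtype) : -1;
}

/* Example callback: pretends to create a link and echoes the message type. */
int toy_newlink(int msgtype)
{
    return 100 + msgtype;
}
```

After `toy_register(1, 2, toy_newlink, NULL)`, a call to `toy_dispatch_doit(1, 2)` runs `toy_newlink`. The design point is that the receiver never needs to know the handlers in advance: it indexes the table by protocol and message type and calls whatever was registered.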
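Access to the nl_table is protected by a read write lock named nl_table_lock; lookup in this table is done by the netlink_lookup() method, specifying the protocol and the port id. Registration of a callback for a specified message type is done by rtnl_register().
![send_msg](https://hackmd.io/_uploads/Bk1Nx5Q6ye.png)
:::

Every netlink message starts with the netlink message header `nlmsghdr`, which contains len, type (four basic types: NLMSG_NOOP, NLMSG_ERROR, NLMSG_DONE, and NLMSG_OVERRUN for buffer overrun), flags (request, dump, ACK, and so on), seq (sequence number; not important here), and pid (sending port ID: 0 for the kernel, the PID for userspace). The payload that follows is a set of attributes in `TLV` (type-length-value) format. A validity policy can be defined for attributes, describing what is expected of received attributes.

#### NETLINK_ROUTE messages
:::warning
Messages (each starting with nlmsghdr) come from many subsystems: the neighbouring subsystem, the firewall, netlink queueing, policy routing, and the various rtnetlink messages. Operations are driven by messages (hook, register, callback).
NETLINK_ROUTE messages are divided into several families:
* LINK (network interfaces)
* ADDR (network addresses)
* ROUTE (routing messages)
* NEIGH (neighbouring subsystem messages)

and so on. Each family has three kinds of messages, for `creating`, `deleting`, and `retrieving`. For routes in rtnetlink: RTM_NEWROUTE creates a route, RTM_DELROUTE deletes one, and RTM_GETROUTE retrieves routing information; there is also RTM_SETLINK for modifying a link.
:::
![nlemsg](https://hackmd.io/_uploads/BymVFaQpJx.png)

A netlink error message consists of an nlmsghdr and an error code; when the error code is not 0, the netlink message header of the original request that caused the error is appended after the error code field.

All transferred data is carried in SKBs, while control operations are done through netlink messages. For example: the userspace command ip route add sends an add message through an rtnetlink socket from userspace to the rtnetlink kernel netlink socket, which calls rtnetlink_rcv() on receipt. Adding the route is done by inet_rtm_newroute(), and the insertion into the forwarding table by fib_table_insert(). In addition, `all listeners registered for RTM_NEWROUTE must be notified`: rtmsg_fib() builds a netlink message with RTM_NEWROUTE as a parameter and sends it by calling rtnl_notify().

#### Generic netlink protocol
One drawback of netlink is that the number of protocol families is limited to 32, which is why generic netlink was developed. The generic netlink kernel socket is created by netlink_kernel_create() and stored in net, the network namespace object (the structure that isolates network resources so that users believe they have them to themselves): `net->genl_sock = netlink_kernel_create(net, NETLINK_GENERIC, &cfg);` Here genl_sock is the generic netlink socket member. After the generic netlink kernel socket is created, the control family (genl_ctrl) is registered:
```cpp
static struct genl_family genl_ctrl = {
    .id      = GENL_ID_CTRL,
    .name    = "nlctrl",
    .version = 0x2,
    .maxattr = CTRL_ATTR_MAX,
    .netnsok = true,
};
```
Multicast groups can be registered by defining a genl_multicast_group object and calling genl_register_mc_group().

*To use generic netlink sockets in the kernel*, do these two things (generic netlink is like registering your own protocol):
* Create a genl_family object and register it by calling genl_register_family().
* Create a genl_ops object and register it by calling genl_register_ops().

So far, creating any object in the kernel has followed the same pattern: create it, fill in parameters, register it, and define its operators (ops: as noted above, the key is to set the doit, dumpit, 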
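calcit, and other callback functions in the structure), and so on. That is the rough overall architecture. (A side guess: how do all these callbacks return? The stack holds the return address, and if another function has to be called afterwards, that is decided by the message or hook, so the overall design seems logically consistent.)
```cpp
struct genl_ops {
    u8 cmd;             // command identifier; each op defines a single command and its doit/dumpit handlers
    u8 internal_flags;  // private flags defined by the family
    int (*doit)(struct sk_buff *skb, struct genl_info *info);
    int (*dumpit)(...);
    int (*done)(...);   // callback run after a dump completes
};
```
Creating and sending a generic netlink message:
![nlmsgge](https://hackmd.io/_uploads/HyF_y07akx.png)

libnl-genl provides the generic netlink API, which can be used to manage the controller, ops, and command registration.

sock_diag is a netlink-based subsystem for obtaining information about sockets (it backs the ss monitoring tool). It was added to the kernel to support CRIU (checkpoint/restore) in userspace.
CRIU: a software tool for Linux that implements checkpoint/restore functionality in userspace.

Summary: covered the netlink mechanism (communication between userspace and the kernel), netlink socket creation, skb, netlink message creation and handling (registration), and the advantages of generic netlink sockets along with how to create them and register messages.

### ch3. ICMP
skip temporarily
:::success
ICMP is the mechanism for sending L3 error and control messages; it provides error handling and diagnostics. Messages fall into two classes: error messages and informational messages. For example, ping (in the iputils package) is a userspace app that opens a raw socket, sends ICMP_ECHO, and reports the result from the ICMP_ECHOREPLY it receives.
traceroute determines the path between a host and a destination IP by varying the TTL: when the TTL reaches 0, an ICMP_TIME_EXCEEDED message comes back (the TTL starts at 1 and is incremented after each such reply; UDP is used by default).
ping_rcv() handles received ping replies (ICMP_ECHOREPLY).
ICMP initialization: inet_init() calls icmp_init(), which in turn creates the kernel ICMP sockets used for sending and receiving messages (by calling icmp_sk_init()).
icmphdr: 8-bit type, 8-bit code, 16-bit checksum
![icmphdr](https://hackmd.io/_uploads/HyHMERmayl.png)
ICMPv4 defines an array of icmp_control objects, icmp_pointers, indexed by message type to select the action.
:::
```cpp
struct icmp_control {
    void (*handler)(struct sk_buff *skb);
    short error; // set when this ICMP message is an error message
};
static const struct icmp_control icmp_pointers[NR_ICMP_TYPES + 1];
```
In this array, entries for error messages such as ICMP_DEST_UNREACH have error = 1, while informational messages such as ICMP_ECHO have error = 0.

ping_rcv() handles received ping replies (ICMP_ECHOREPLY). Before kernel 3.0 (which version does industry use?), sending a ping required creating a raw socket in userspace, and that raw socket handled the reply, in ip_local_deliver_finish().
![icmp_loca](https://hackmd.io/_uploads/ryrZLAm61e.png)
As in the figure above, a listening raw socket handles the reply. Kernel 3.0 integrated ICMP sockets (ping sockets). The change is that the ping sender no longer has to be a raw socket: it can create socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP) and use that instead. Since this is not a raw socket, the echo reply is not delivered to any raw socket (none is listening for it), so the reply is handled by the ping_rcv() callback.

ICMP_REDIRECT messages are handled by icmp_redirect(). Hosts should not send redirects; they are sent by gateways (e.g. 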
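the router a host is attached to).

icmp_reply() and icmp_send() both use the icmp_bxm (ICMP build xmit message) structure.
![icmp_xbm](https://hackmd.io/_uploads/ByUsD07aJg.png)
skb: in icmp_reply() it is the request packet; in icmp_send() it is the packet that caused the ICMP message to be sent.
`times[3]` holds three timestamps (their importance is not yet clear to me), filled in by icmp_timestamp(): the sender's originate timestamp, the receive timestamp, and the transmit timestamp. For time synchronization, NTP is far more commonly used than ICMP timestamps.

Receiving ICMPv4 messages: ip_local_deliver_finish() handles packets destined for the local host. ICMP packets are passed to the handler registered for ICMP, icmp_rcv(). Packets with a bad checksum are dropped; for broadcast or multicast packets it checks whether the message is ICMP_ECHO, ICMP_ADDRESS, or similar; then the matching handler from the icmp_control dispatch table described above is called.

Sending ICMPv4 messages, two ways: icmp_reply() sends the responses to ICMP_ECHO and ICMP_TIMESTAMP, and icmp_send() sends messages the host initiates. Both perform the actual transmission through icmp_push_reply().
Port-unreachable example: on receiving a UDPv4 packet the kernel looks up a matching UDP socket. If there is none, it checks the checksum: a bad checksum means a silent drop, a correct one means an ICMP destination unreachable reply. In the example below, the key points are the socket lookup (lookup_skb) and icmp_send() replying with ICMP_PORT_UNREACH.
![port_unreach](https://hackmd.io/_uploads/rkUiKC7p1e.png)

ICMPv6 adds new functionality: MLD (Multicast Listener Discovery) and ND (Neighbour Discovery).
![dif-icmpv4 and v6](https://hackmd.io/_uploads/H1d_9C761e.png)
Initialization is done by icmpv6_init() and icmpv6_sk_init(). The callback handler is set to icmpv6_rcv(), which means that when a packet whose next header is 58 (the ICMPv6 value) arrives, icmpv6_rcv() is called.
Receiving ICMPv6 messages: handled by icmpv6_rcv(), which takes a single SKB as its argument, runs sanity checks, increments the InMsgs SNMP counter, and then follows the data flow in the figure below.
![icmp-path](https://hackmd.io/_uploads/Bk3LnRm61l.png)

ICMP sockets (ping sockets): a security-hardening patch from Linux (Owl distribution) added a new socket type (IPPROTO_ICMP). A new ICMPv4 ping socket is created with socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP), not with socket(PF_INET, SOCK_RAW, IPPROTO_ICMP), because it is not a raw socket.

:::spoiler what is raw socket
A raw socket is a low-level network communication endpoint that does not depend on higher-level abstractions; it allows IP packets to be sent and received directly, without any transport-layer protocol formatting.
:::

procfs entries: the kernel lets userspace configure various subsystems by writing values to entries under /proc, called procfs entries; they are represented by variables in the netns_ipv4 structure.

The userspace tool iptables lets users set rules that tell the kernel how to handle the traffic those rules match; the rules are processed in the netfilter subsystem.
![Netfilter-packet-flow.svg](https://hackmd.io/_uploads/BJPxJJEaJe.png)
[About qdisc at ingress and egress](https://blog.csdn.net/qq_44577070/article/details/123967699)

Summary: covered the ICMPv4 and ICMPv6 architecture, structures, send and receive implementation, header format, and the ICMPv6 additions MLD and ND. (The v6 parts will be filled in later, since switch software/firmware work is currently mostly v4.)

:::spoiler Explain the ICMPv4 header layout, initialization, receive and send paths, and the data flow.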
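ICMP provides the mechanism for L3 error and control messages, with error handling and diagnostics.
icmphdr: 8-bit type, 8-bit code, 16-bit checksum.
There are two classes, error messages and informational messages. For example, ping opens a raw socket in userspace and sends ICMP_ECHO; the ICMP_ECHOREPLY is picked up by ping_rcv() to report the response.
traceroute determines the path between a host and a destination IP by relying on ICMP_TIME_EXCEEDED coming back when the TTL reaches 0, incrementing the TTL to discover each hop; UDP by default.
An array of icmp_control objects (error: error or informational; handler: the method), icmp_pointers, is indexed by `message type` to select the action.
icmp_reply() (handles ECHO and ICMP_TIMESTAMP) and icmp_send() (sends host-initiated messages) both use the icmp_bxm (ICMP build xmit message) structure, and both perform the actual transmission through icmp_push_reply().
ICMPv4 protocol registration happens in inet_init(), like other IPv4 protocols (it calls icmp_init(), which calls icmp_sk_init() to create the ICMP sockets). The icmp_protocol structure has the members handler (who processes received packets: icmp_rcv) and netns_ok (indicates network namespace support).
// quick recap: so far the common protocol registration calls are dev_add_pack and inet_add_protocol. [What is the difference between the two?](https://blog.csdn.net/wangquan1992/article/details/109723653) Think of it this way: icmp and tcp are processed after the basic IP handling, so they use inet_add_protocol.
Receiving ICMPv4 messages: ip_local_deliver_finish() handles packets destined for the local host; ICMP packets are passed to the registered handler icmp_rcv().
:::

### ch4. IPv4
![ipv4-header](https://hackmd.io/_uploads/rkHOJ1Vaye.png)
The IPv4 header is represented by the iphdr structure:
```cpp
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u8 ihl:4,      // header length in units of 4 bytes
         version:4;  // 4 for IPv4 (would be 6 for IPv6)
#elif defined(__BIG_ENDIAN_BITFIELD)
    __u8 version:4,
         ihl:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
    __u8 tos;        // unsigned 8 bits, supports QoS (quality of service): a control mechanism that gives different users or
                     // flows different priorities to guarantee a level of performance, e.g. one VLAN gets priority over
                     // another; QoS optimizes the allocation of network resources and smooths out bursty traffic
    __be16 tot_len;  // like u16 but stored big-endian (network byte order); the type makes explicit where byte-order conversion is needed
    __be16 id;
    __be16 frag_off; // top 3 bits are flags (MF, DF, CE); low 13 bits are the fragment offset
    __u8 ttl;        // time to live
    __u8 protocol;   // upper-layer protocol
    __sum16 check;   // checksum
    __be32 saddr;    // source IP address
    __be32 daddr;    // destination IP address
    /* IPv4 options follow, so the header size varies from 20 to 60 bytes */
};
```
#### ipv4 create, receive:
Registering IPv4: the EtherType of IPv4 packets is 0x0800 (ETH_P_IP, i.e. the type field of the L2 Ethernet header is 0x0800). Every protocol must define a protocol handler and initialize it so that the network stack can process the packets that belong to that protocol.
```cpp
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};

static int __init inet_init(void)
{
    ...
    dev_add_pack(&ip_packet_type);
    ...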
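}
```
:::success
dev_add_pack() makes ip_rcv the protocol handler for IPv4 packets; dev_add_pack is a way of registering a handler.
Receiving an IPv4 packet: the packet is received by the net_device and handed to ip_rcv(). ip_rcv() performs sanity checks, and ip_rcv_finish() does the real work. The NF_HOOK_COND macro (a variant of NF_HOOK) takes a boolean (its last argument) and runs the hook only when it is true. If the packet is not dropped and its destination is the local host, ip_local_deliver() passes it upward (through the LOCAL_IN netfilter hook); if it must be forwarded it goes through the routing subsystem (and later the POST_ROUTING netfilter hook).
![ipv4-rcv-flow](https://hackmd.io/_uploads/B1gGNkV6kl.png)
(For simplicity this flow ignores fragmentation, reassembly, IPsec, and so on. IPsec packets are decoded after ip_rcv_finish() and run through ip_rcv() again; reassembly happens in ip_local_deliver(), which proceeds only once all fragments have arrived. For the detailed flow, see the netfilter diagram on Wikipedia; it makes sense once the whole book has been read.)
skb_dst() checks for the dst object associated with the SKB (a dst_entry instance representing the routing lookup result); the routing subsystem also sets the input callback function to ip_local_deliver, ip_mr_input, and so on.
:::

The skb_dst() method checks whether there is a `dst object attached to the SKB`; dst is an instance of dst_entry (include/net/dst.h) and represents the result of a lookup in the routing subsystem. The lookup is done according to the routing tables and the packet headers.

Receiving multicast packets: ip_rcv -> ip_rcv_finish -> `ip_route_input_noref`. The lookup first calls ip_check_mc_rcu() to check whether the local host belongs to the destination multicast group. If so, the dst input callback is set to ip_local_deliver. (Clever: the caller only invokes the input callback, while the subsystems and intermediate handlers decide where the packet goes next by setting it.)

#### ip options
Two methods deal with IPv4 options: ip_options_build() takes an ip_options object and writes its content into the IPv4 header, and `ip_options_compile(struct net *net, struct ip_options *opt, struct sk_buff *skb)` parses the IPv4 header of the given SKB and builds an ip_options object from it. (On the receive path, options found in the IPv4 header are parsed by ip_options_compile() into an ip_options object; on the transmit path, ip_options_build() writes the options, the exact opposite direction.)
On the receive path, the ip_options object produced by ip_options_compile() is stored in the control buffer (cb) of the SKB.
![ip_rcv_option](https://hackmd.io/_uploads/BkRr8AVpJx.png)

#### Sending IPv4 packets
There are two main methods for sending IPv4 packets from L4: ip_queue_xmit() is used by protocols that handle fragmentation themselves (TCP; TCP also uses ip_build_and_send_pkt() to send SYN ACK messages), and ip_append_data() is used by protocols that do not handle fragmentation (UDP, ICMPv4).
![ipv4-xmit](https://hackmd.io/_uploads/Byf38kVaJe.png)
The rtable object is the result of a routing subsystem lookup (rtable embeds a dst structure).
```cpp
int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
{
    ...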
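    rt = (struct rtable *)__sk_dst_check(sk, 0);
    if (rt == NULL) {
        // look up with ip_route_output_ports() in the routing subsystem; on failure the packet is dropped
        rt = ip_route_output_ports(sock_net(sk), fl4, sk, daddr, inet->inet_saddr,
                                   inet->inet_dport, inet->inet_sport, sk->sk_protocol,
                                   RT_CONN_FLAGS(sk), sk->sk_bound_dev_if);
    }
    ...
    // Build the IPv4 header. We come from L4, so skb->data points at the transport header (review the skb
    // structure in ch2). skb->data must be moved with SKB API methods (skb_push() here, skb_pull() or
    // skb_pull_inline() on receive), never by operating on the pointer directly.
    skb_push(skb, sizeof(struct iphdr) + (inet_opt ? inet_opt->opt.optlen : 0));
    skb_reset_network_header(skb);
    iph = ip_hdr(skb);
}
```
The final transmission of the SKB is done by ip_send_skb().
Fragmentation is done in ip_fragment(); reassembly of received fragments is done in ip_defrag().
In `ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))`, the output callback is the transmit method to use. There are two paths: a fast path (for packets whose SKB frag_list is not NULL) and a slow path.

:::success
Note: ch4.5 (options) and ch4.7 (fragmentation) contain a lot of detail to understand on a later re-read; only the big picture and the goals are recorded here.
:::

Reassembly: all fragments of a packet are reassembled into one buffer. ip_local_deliver() calls ip_defrag(), which takes two arguments: the SKB and a 32-bit field indicating where the method was called from (IP_DEFRAG_LOCAL_DELIVER). Reassembly is based on a hash table of ipq objects; the hash function (ipqhashfn) takes four arguments: the packet ID, source address, destination address, and protocol.
```cpp
struct ipq {
    struct inet_frag_queue q; // this queue is where the fragment reassembly buffer lives
    u32 user;
    __be32 saddr;
    __be32 daddr;
    __be16 id;
    u8 protocol;
    u8 ecn; /* RFC3168 support */
    int iif;
    unsigned int rid;
    struct inet_peer *peer;
};
```
ip_defrag() first calls ip_evictor() to make sure there is enough memory, then calls ip_find(), which returns an ipq object stored in the variable qp (a pointer to an ipq object); next, ip_frag_queue() adds the fragment to a linked list of fragments (qp->q.fragments); once all fragments have been added, ip_frag_reasm() builds a new packet from them.

Forwarding uses ip_forward(), which includes checking the MTU (fragmenting when the packet does not fit) and setting the priority; the priority can be set with setsockopt() (SOL_SOCKET with SO_PRIORITY). // These all showed up in the simple GeeksforGeeks socket programming exercise.

Summary: IP packet creation, header structure, handling of the header and options, registering the IPv4 protocol handler (inet_init() calls dev_add_pack(&ip_packet_type)), the IPv4 receive and transmit paths, fragmentation and reassembly.

:::spoiler Explain IPv4 packet creation in Linux, the header structure, registration, and the receive and transmit paths.
Registering IPv4: the EtherType of IPv4 packets is 0x0800 (ETH_P_IP, i.e. the type field of the L2 Ethernet header is 0x0800). Every protocol must define a protocol handler and initialize it so that the network stack can process the packets that belong to that protocol. The protocol is registered with dev_add_pack(); the protocol structure has the members .type = 0x0800 and .func = ip_rcv. dev_add_pack() makes ip_rcv the protocol handler for IPv4 packets; dev_add_pack is a way of registering a handler.
Receiving an IPv4 packet: the packet is received by the net_device and handed to ip_rcv(), which performs sanity checks, 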
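then calls ip_rcv_finish() to do the real work. The NF_HOOK_COND macro (a variant of NF_HOOK) takes a boolean (its last argument) and runs the hook only when it is true. If the packet is not dropped and its destination is the local host, ip_local_deliver() passes it upward (it goes through the NF_INET_LOCAL_IN hook, and if everything accepts it, ip_local_deliver_finish() runs); if it must be forwarded it goes through the routing subsystem to ip_forward() or, for multicast, ip_mr_input() (and later the NF_INET_POST_ROUTING hook). skb_dst() checks for the dst object associated with the SKB (a dst_entry instance representing the routing lookup result); the routing subsystem also sets the input callback function to ip_local_deliver or ip_mr_input.
Building options: ip_options_build() takes an ip_options object and writes its content into the IPv4 header; ip_options_compile() parses the IPv4 header of the given SKB and builds an ip_options object from it.
Transmit: ip_queue_xmit() is used by protocols that handle fragmentation themselves (TCP); ip_append_data() is used by protocols that do not (UDP, ICMPv4). The rtable object is the result of a routing subsystem lookup (rtable embeds a dst structure). The final transmission of the SKB is done by ip_send_skb().
Fragmentation is done in ip_fragment(); reassembly of received fragments is done in ip_defrag(), based on a hash table of ipq objects whose hash function (ipqhashfn) takes four arguments: the packet ID, source address, destination address, and protocol.
:::

### ch5. IPv4 routing subsystem
This chapter introduces the routing subsystem and its main data structures (the routing table, and the forwarding information base, FIB, which can be thought of as a new table of best paths built from the main table), and how lookups in the routing subsystem work.

It was not obvious to me whether the Linux FIB is a routing table or a forwarding table; [this article explains how to look at it](https://zhuanlan.zhihu.com/p/415032187). There are two rather similar tables: the Routing Information Base (RIB, the routing table) and the Forwarding Information Base (FIB, the forwarding table). A router's core job is to find the best path for each passing packet; the routes for the best paths to the different networks are collected into a new table, the FIB. The routing table plays the key role in route selection; the FIB plays the key role in packet forwarding.

Flow: an incoming packet is forwarded according to the routing table configured on the router. When a default gateway is specified, packets that match no other routing entry are sent out through the default gateway (compare with a switch, which floods to all ports, or to the ports of the packet's VLAN). In CIDR notation the default route is 0.0.0.0/0; routing lookups always use longest-prefix match.

[rtable vs fib_table](https://blog.csdn.net/qq_53111905/article/details/126251996) An rtable (a routing cache entry associated with an SKB) holds the lookup result for the packet; the result dst (destination cache) is embedded in the rtable. The difference from the FIB side is that rtable and dst describe where this particular data flow goes, whereas the FIB structures, including fib_result, exist to store and record lookup information. There is a local FIB table (a fib_table structure with id 255) and a main table (a fib_table with id 254). Does each entry of a fib_table routing table specify the next hop toward a particular subnet? (That is not visible from the structure itself, and how does it relate to fib_info...)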
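The longest-prefix-match rule mentioned above can be sketched with a linear scan over a tiny table. This is an assumption-laden illustration rather than the kernel's method (the FIB uses a trie, not a linear scan), and `toy_route`, `toy_mask()`, and `toy_lpm()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified route entry; addresses are in host byte order. */
struct toy_route {
    uint32_t prefix;      /* network address, e.g. 0xC0A80100 is 192.168.1.0 */
    uint8_t  prefix_len;  /* 0..32; 0 is the default route 0.0.0.0/0 */
    int      next_hop_id; /* opaque identifier of the next hop */
};

/* Build the netmask for a prefix length. */
static uint32_t toy_mask(uint8_t len)
{
    /* A shift by 32 is undefined in C, so /0 and /32 are handled separately. */
    if (len == 0)
        return 0;
    if (len >= 32)
        return 0xFFFFFFFFu;
    return 0xFFFFFFFFu << (32 - len);
}

/* Return the matching route with the longest prefix, or NULL if none match. */
const struct toy_route *toy_lpm(const struct toy_route *tbl, size_t n, uint32_t dst)
{
    const struct toy_route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t mask = toy_mask(tbl[i].prefix_len);

        if ((dst & mask) != (tbl[i].prefix & mask))
            continue; /* destination is outside this prefix */
        if (best == NULL || tbl[i].prefix_len > best->prefix_len)
            best = &tbl[i];
    }
    return best;
}
```

With a table holding 0.0.0.0/0, 192.168.0.0/16, and 192.168.1.0/24, a destination of 192.168.1.5 matches all three and the /24 entry wins; a destination of 8.8.8.8 matches only the default route, which is how unmatched traffic ends up at the default gateway.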
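Each routing entry contains a fib_info object; fib_info objects are stored in the fib_info_hash hash table.

#### Lookup in the routing subsystem and the FIB tables
*Lookup in the routing subsystem*: there are two stages. The routing cache is searched first; on a miss, the routing tables are searched. The lookup is done with `int fib_lookup(struct net *net, const struct flowi4 *flp, struct fib_result *res)`. The flowi4 object holds the key fields for the lookup (source address, destination address, ToS, and so on) and serves as the lookup key; the fib_result object is filled in during the lookup.
fib_lookup() searches the local FIB table first and, if nothing is found, the main FIB table. After a successful lookup a dst object is created (a dst_entry instance representing the destination cache) and embedded in an rtable structure (which represents a routing entry that can be associated with an SKB); during the lookup the input/output callback functions of the dst_entry are set to select the next processing step.
```cpp
struct dst_entry {
    ...
    int (*input)(struct sk_buff *);
    int (*output)(struct sk_buff *);
    ...
};

// rtable represents a routing entry that can be associated with an SKB; its first member is dst,
// whose input/output callbacks decide where the packet goes next
struct rtable {
    struct dst_entry dst;       // the lookup result, an instance of dst_entry
    int rt_genid;
    unsigned int rt_flags;      // rtable flags, e.g. RTCF_BROADCAST / RTCF_MULTICAST mark a broadcast / multicast destination
    __u16 rt_type;
    __u8 rt_is_input;
    __u8 rt_uses_gateway;       // 1 if the next hop is a gateway, 0 if the next hop is directly connected
    int rt_iif;
    /* Info on neighbour */
    __be32 rt_gateway;
    u32 rt_pmtu;                // smallest MTU along the path
    struct list_head rt_uncached;
};
```
For packets destined for the local host, the input callback of the dst object is set to ip_local_deliver(); for forwarded packets it is set to ip_forward(); for locally generated packets the output callback is set to ip_output(); for multicast packets input is set to ip_mr_input().
```cpp
struct fib_result {
    unsigned char prefixlen;    // prefix length, i.e. the netmask; set in check_leaf()
    unsigned char nh_sel;
    unsigned char type;         // most important: decides how the packet is handled (forward, send ICMP, deliver locally, drop, ...)
    unsigned char scope;
    u32 tclassid;
    struct fib_info *fi;        // key point: fib_result refers to fib_table, the main data structure of the routing subsystem
                                // (id 254 main table, id 255 local table), and each entry of that table has a fib_info storing the
                                // routing entry parameters (fib_scope: host, loopback, global, link; fib_priority: lower value means
                                // higher priority; fib_dev: the network device that transmits the packet to the next hop;
                                // fib_protocol: RTPROT_KERNEL means the entry was created by the kernel, RTPROT_STATIC means it was
                                // added by the system administrator)
    struct fib_table *table;    // points to the FIB table used for the lookup
    struct list_head *fa_head;  // points to a fib_alias list (links related routes; an optimization that avoids creating a fib_info for every similar route)
};
```
:::success
How does a lookup in the routing subsystem work?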
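Call fib_lookup() with a flowi4 key (SA/DA/ToS): the local FIB table is searched first, then the main FIB table. Afterwards a dst result object is created, its callbacks are set to steer the next step of the data flow, dst is embedded in an rtable, and fib_result is filled in.
The main data structure of the routing subsystem is fib_table: each entry of this routing table specifies the next hop toward a particular subnet, and each entry contains a fib_info object that stores the most important routing entry parameters.
```cpp
struct fib_table {
    struct hlist_node tb_hlist;
    u32 tb_id;                  // routing table identifier: main table 254 (RT_TABLE_MAIN), local table 255 (RT_TABLE_LOCAL)
    int tb_default;
    int tb_num_default;         // number of default routes in the table
    unsigned long tb_data[0];   // placeholder for the routing entries object (a trie)
};

// fib_info holds the priority fib_priority, the outgoing device fib_dev, fib_protocol, and so on
struct fib_info {
    struct hlist_node fib_hash;
    struct hlist_node fib_lhash;
    struct net *fib_net;        // the network namespace this fib_info belongs to
    int fib_treeref;
    atomic_t fib_clntref;
    unsigned int fib_flags;
    unsigned char fib_dead;
    unsigned char fib_protocol;
    unsigned char fib_scope;
    unsigned char fib_type;
    __be32 fib_prefsrc;
    u32 fib_priority;
    ...
};
```
[fib struct relation](https://blog.csdn.net/MENGHUANBEIKE/article/details/103195777)
![fib-struct-relation](https://hackmd.io/_uploads/ByrTQ7zCkl.jpg)
[Route zones (fn_zone)](https://zhuanlan.zhihu.com/p/429907528): routes with the same netmask length are considered to be in the same route zone.

cache: routing lookup results are usually cached in the next-hop object (fib_nh represents the next hop), which holds nh_dev (the outgoing network device), nh_oif (the outgoing interface index), nh_scope, and other information.
```cpp
struct fib_nh {
    struct net_device *nh_dev;  // outgoing network device
    struct hlist_node nh_hash;
    struct fib_info *nh_parent;
    unsigned int nh_flags;
    unsigned char nh_scope;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
    int nh_weight;
    int nh_power;
#endif
#ifdef CONFIG_IP_ROUTE_CLASSID
    __u32 nh_tclassid;
#endif
    int nh_oif;
    __be32 nh_gw;
    __be32 nh_saddr;
    int nh_saddr_genid;
    struct rtable __rcu * __percpu *nh_pcpu_rth_output; // on the transmit path, setting this field caches the fib_result in the fib_nh
    // What I still cannot see clearly: how exactly does rtable relate to fib_table??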
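    // All I know is that the cached lookup result dst is embedded in rtable, and dst looks oddly similar to fib_nh.
    // Perhaps no direct link is needed: rtable and dst are the routing result and route cache for this SKB and data flow,
    // while fib_table and fib_info record the information in the routing table.
    struct rtable __rcu *nh_rth_input;  // on the receive path, setting this field caches the fib_result in the next-hop object
    struct fnhe_hash_bucket *nh_exceptions;
};
```
Policy routing: when CONFIG_IP_MULTIPLE_TABLES is not set, two tables are created: the local table (a fib_table structure with id 255, holding the routing entries for local addresses) and the main table. Only the kernel can add routing entries to the local table; entries in the main table are added by the system administrator (with the ip route add command). The tables are created by fib4_rules_init() and accessed with fib_get_table().

fib_alias: creating a separate fib_info for several routes to the same destination or subnet that differ only in ToS would be wasteful, so the FIB alias structure was introduced.
```cpp
struct fib_alias {
    struct list_head fa_list;
    struct fib_info *fa_info;   // through this pointer several fib_alias objects share one fib_info while having different ToS values
    u8 fa_tos;
    u8 fa_type;
    u8 fa_state;
    struct rcu_head rcu;        // RCU (Read-Copy Update) is a kernel data synchronization mechanism
};
```
#### ICMPv4 redirect messages
:::

:::spoiler what is network namespace?
A [network namespace](https://yuminlee2.medium.com/linux-networking-network-namespaces-cb6b00ad6ba4) virtualizes network functionality for processes and is mostly used for virtualization and isolation. Processes in different namespaces see different network resources, e.g. routing tables, virtual NIC lists, sockets, and firewall settings.
:::

:::spoiler Explain the lookup steps in the routing subsystem and the structures involved.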
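The rtable structure represents a routing entry and dst_entry is the destination cache. The main data structure is fib_table: each entry of this routing table specifies the next hop toward a particular subnet, and each entry contains a fib_info object that stores the most important routing entry parameters.
The routing table plays the key role in route selection; the FIB plays the key role in packet forwarding.
An incoming packet is forwarded according to the routing table configured on the router. When a default gateway is specified, packets that match no other routing entry are sent out through the default gateway. In CIDR notation the default route is 0.0.0.0/0, and routing lookups use longest-prefix match.
*Lookup in the routing subsystem*: there are two stages. The routing cache is searched first; on a miss, the routing tables are searched. The lookup is done with `int fib_lookup(struct net *net, const struct flowi4 *flp, struct fib_result *res)`.
The flowi4 object holds the key fields for the lookup (source address, destination address, ToS, and so on) and serves as the lookup key; the fib_result object is filled in during the lookup.
fib_lookup() searches the local FIB table first and, if nothing is found, the main FIB table. After a successful lookup a dst object is created (a dst_entry instance representing the destination cache) and embedded in an rtable structure, and the input/output callback functions of the dst_entry are set to select the next processing step.
rtable represents a routing entry that can be associated with an SKB; its first member is dst, whose input/output callbacks decide where the packet goes next (e.g. ip_local_deliver / ip_forward / ip_output / ip_mr_input).
Routing lookup results are usually cached in the next-hop object (fib_nh represents the next hop).
:::

### ch6 Advanced routing: multicast and policy routing
This chapter introduces multicast (as opposed to broadcast), how IGMP manages joining and leaving IPv4 multicast groups, and policy routing, which lets routing decisions depend on configuration rather than on the destination address alone.

The CIDR prefix of the IPv4 multicast address range is 224.0.0.0/4.
Handling multicast routing requires a userspace routing daemon that works together with the kernel. Multicast Routing cannot be handled solely by the kernel code without this userspace Routing daemon, as opposed to Unicast Routing.
The mrouted daemon is based on DVMRP. The pimd daemon is based on PIM (Protocol Independent Multicast), called independent because it does not depend on any particular routing protocol for topology discovery.
PIM has four different modes: PIM-SM (PIM Sparse Mode), PIM-DM (PIM Dense Mode), PIM Source-Specific Multicast (PIM-SSM) and Bidirectional PIM.
The multicast policy routing protocol is implemented using the Policy Routing API (for example, it calls the fib_rules_lookup() method to perform a lookup, creates a fib_rules_ops object, and registers it with the fib_rules_register() method, and so on).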
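The 224.0.0.0/4 prefix mentioned above means that an address is multicast exactly when its top four bits are 1110. A minimal userspace sketch of that test follows; the function name `toy_is_ipv4_multicast()` is made up for illustration, and the address is assumed to be in network byte order, as it would be inside a `struct in_addr`.

```c
#include <stdint.h>
#include <arpa/inet.h>

/*
 * Return non-zero when `addr` (network byte order) falls inside
 * 224.0.0.0/4, i.e. its top four bits are 1110.
 */
int toy_is_ipv4_multicast(uint32_t addr)
{
    return (ntohl(addr) & 0xF0000000u) == 0xE0000000u;
}
```

For example, `toy_is_ipv4_multicast(inet_addr("224.0.0.1"))` is true for the all-hosts group, while 192.168.1.1 and 240.0.0.1 both fall outside the /4 range.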
![simple-multicast](https://hackmd.io/_uploads/HJIF5y60Je.png)

#### IGMP Protocol
There are three versions of IGMP:
* v1: two types of messages: host membership report (a host joins a multicast group) and host membership query (a router sends queries to discover which host multicast groups have members on its attached local networks).
* v2: adds three new messages. The query gets two subtypes, general query and group-specific query, and there is a new v2 membership report and a leave group message.
* v3: a major revision of the protocol that adds a feature called `source filtering`. To support the source filtering feature, the socket API was extended; see RFC 3678, "Socket Interface Extensions for Multicast Source Filters."

I should also mention that the multicast router periodically (about every two minutes) sends a membership query to 224.0.0.1, the all-hosts multicast group address. A host that receives a membership query responds with a membership report. This is implemented in the kernel by the igmp_rcv() method: getting an IGMP_HOST_MEMBERSHIP_QUERY message is handled by the igmp_heard_query() method.

The basic multicast data structure is mr_table:
```cpp=
struct mr_table {
    struct list_head list;
#ifdef CONFIG_NET_NS
    struct net *net;
#endif
    u32 id;                         // id: The multicast routing table id; it is RT_TABLE_DEFAULT (253)
    struct sock __rcu *mroute_sk;   // pointer represents a reference to the userspace socket that the kernel keeps.
    struct timer_list ipmr_expire_timer;
    struct list_head mfc_unres_queue;
    struct list_head mfc_cache_array[MFC_LINES];
    struct vif_device vif_table[MAXVIFS];
    . . .
};
```
mroute_sk: The interaction between the userspace and the kernel is based on calling the setsockopt() method, on sending IOCTLs from userspace, and on building IGMP packets and passing them to the Multicast Routing daemon by calling the sock_queue_rcv_skb() method from the kernel.
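The message kinds listed for the three IGMP versions can be told apart from the type field of the header, and a general query differs from a group-specific query in whether the group address is zero. The sketch below illustrates that classification in userspace; the struct layout is a simplified stand-in, all `TOY_`/`toy_` names are hypothetical, and only the numeric type values are taken from the IGMP specifications.

```c
#include <stdint.h>

/* IGMP type values as assigned by the protocol specifications. */
#define TOY_IGMP_MEMBERSHIP_QUERY 0x11
#define TOY_IGMP_V1_REPORT        0x12
#define TOY_IGMP_V2_REPORT        0x16
#define TOY_IGMP_V2_LEAVE         0x17
#define TOY_IGMP_V3_REPORT        0x22

/* Simplified stand-in for the 8-byte IGMP header. */
struct toy_igmphdr {
    uint8_t  type;
    uint8_t  code;  /* max response time in v2 and later */
    uint16_t csum;
    uint32_t group; /* 0 in a general query */
};

enum toy_igmp_kind {
    TOY_IGMP_GENERAL_QUERY,
    TOY_IGMP_GROUP_QUERY,
    TOY_IGMP_REPORT,
    TOY_IGMP_LEAVE,
    TOY_IGMP_UNKNOWN
};

/* Classify a header by its type field and group address. */
enum toy_igmp_kind toy_igmp_classify(const struct toy_igmphdr *h)
{
    switch (h->type) {
    case TOY_IGMP_MEMBERSHIP_QUERY:
        /* A zero group address asks about all groups. */
        return h->group == 0 ? TOY_IGMP_GENERAL_QUERY : TOY_IGMP_GROUP_QUERY;
    case TOY_IGMP_V1_REPORT:
    case TOY_IGMP_V2_REPORT:
    case TOY_IGMP_V3_REPORT:
        return TOY_IGMP_REPORT;
    case TOY_IGMP_V2_LEAVE:
        return TOY_IGMP_LEAVE;
    default:
        return TOY_IGMP_UNKNOWN;
    }
}
```

A receive routine in the spirit of igmp_rcv() would switch on such a classification and hand queries to one handler and reports to another. Checksum validation and the v3 group-and-source-specific query are deliberately left out of this sketch.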
(As noted earlier, multicast routing always works together with the daemon that manages multicast, which drives the kernel from userspace through setsockopt() and IOCTLs.)

### ch7 Linux neighbouring subsystem
IP addresses identify the source and the final destination; every hop in between is made by looking up the MAC address for the next-hop IP and rewriting the source and destination MAC addresses. The neighbouring subsystem builds the L2 header for outgoing packets and, through requests and replies, resolves an L3 address to its L2 address.
```cpp
struct neighbour {
    struct neighbour __rcu *next;
    struct neigh_table *tbl;
    struct neigh_parms *parms;
    unsigned long confirmed;
    unsigned long updated;
    rwlock_t lock;
    atomic_t refcnt;
    struct sk_buff_head arp_queue;
    unsigned int arp_queue_len_bytes;
    struct timer_list timer;
    unsigned long used;
    atomic_t probes;
    __u8 flags;
    __u8 nud_state;
    __u8 type;
    __u8 dead;
    seqlock_t ha_lock;
    unsigned char ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
    struct hh_cache hh;
    int (*output)(struct neighbour *, struct sk_buff *);
    const struct neigh_ops *ops;
    struct rcu_head rcu;
    struct net_device *dev;
    u8 primary_key[0];
};
```
ha: The hardware address of the neighbour object; in the case of Ethernet, it is the MAC address of the neighbour.
hh: A hardware header cache of the L2 header (An hh_cache object).
next: points to the next neighbour in the hash table.
tbl: the neighbouring table associated with the neighbour.
output: a pointer to the transmit method.
primary_key: the IP address of the neighbour; lookups in the neighbouring table are done by this key.

To avoid sending a solicitation request for every packet transmitted, the kernel keeps the mapping between L3 and L2 addresses in a data structure called the neighbouring table: in IPv4 it is the ARP table (or ARP cache), in IPv6 the NDISC table.
```cpp
struct neigh_table {
    struct neigh_table *next;
    int family;
    int entry_size;
    int key_len;
    __u32 (*hash)(const void *pkey, const struct net_device *dev, __u32 *hash_rnd);
    int (*constructor)(struct neighbour *);
    int (*pconstructor)(struct pneigh_entry *);
    void (*pdestructor)(struct pneigh_entry *);
    void (*proxy_redo)(struct sk_buff *skb);
    char *id;
    struct neigh_parms parms;
    /* HACK. gc_* should follow parms without a gap!
     */
    int gc_interval;
    int gc_thresh1;
    int gc_thresh2;
    int gc_thresh3;
    unsigned long last_flush;
    struct delayed_work gc_work;
    struct timer_list proxy_timer;
    struct sk_buff_head proxy_queue;
    atomic_t entries;
    rwlock_t lock;
    unsigned long last_rand;
    struct neigh_statistics __percpu *stats;
    struct neigh_hash_table __rcu *nht;
    struct pneigh_entry **phash_buckets;
};
```

* key_len: length of the lookup key; for IPv4 it is 4 bytes.
* next: the next entry in the global linked list neigh_tables (each protocol creates its own neigh_table instance).
* hash: the hash function that maps an L3 address to a hash value.
* gc_work: the asynchronous garbage collection handler.
* nht: the neighbour hash table.

neigh_table_init(): in IPv4, the ARP module defines the ARP table and passes it as an argument to neigh_table_init().

You may wonder why the neigh_table_init_no_netlink() method is needed instead of performing all of the initialization in neigh_table_init(). The neigh_table_init_no_netlink() method performs all of the initialization of a neighbouring table except for linking it into the global linked list of neighbouring tables, neigh_tables. The split was originally made for ATM; ATM later found a better solution, but the split was kept.

Each L3 protocol that uses the neighbouring subsystem also registers a protocol handler with dev_add_pack(). For IPv4, the handler for ARP packets (packets whose type in the Ethernet header is 0x0806) is the arp_rcv() method:

```cpp
static struct packet_type arp_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_ARP),
    .func = arp_rcv,
};

void __init arp_init(void)
{
    . . .
    dev_add_pack(&arp_packet_type);
    . . .
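}
```

neigh_ops describes the neighbour's protocol family and its operation callbacks. It is set by the neighbouring table's constructor method.

```cpp
struct neigh_ops {
    int family;
    void (*solicit)(struct neighbour *, struct sk_buff *);
    void (*error_report)(struct neighbour *, struct sk_buff *);
    int (*output)(struct neighbour *, struct sk_buff *);
    int (*connected_output)(struct neighbour *, struct sk_buff *);
};
```

* solicit: This method is responsible for sending the neighbour solicitation requests; in ARP it is the arp_solicit() method.
* output: When the L3 address of the next hop is known but the L2 address is not resolved, the output callback should be neigh_resolve_output().

Creating and deleting neighbours:

* A neighbour is created by __neigh_create(), which first calls neigh_alloc() to allocate and initialize the neighbour object and then calls the constructor of the given neighbouring table (arp_constructor() for ARP).
* If the number of entries exceeds the size of the hash table while creating a neighbour with __neigh_create(), the table is grown by calling neigh_hash_grow().
* `ip neigh show` displays neighbouring table entries together with their NUD (Neighbour Unreachability Detection) state, while `arp` displays only the IPv4 neighbouring table.
* Adding a neighbour with `ip neigh add` is handled by the neigh_add() method.
* Deleting a neighbour with `ip neigh del` is handled by the neigh_delete() method.
* In ARP, arp_netdev_event() is registered as the callback for netdev network events; IPv6 uses ndisc_netdev_event().

#### ARP protocol

ARP finds the MAC address for a given IPv4 address. If the MAC address is unknown, an ARP request containing the IPv4 address is broadcast; a host that is configured with (or knows) this address answers with a unicast ARP reply. arp_tbl is an instance of the neigh_table structure.

```cpp
// All the key information is in the header: the length and value of the
// sender and target MAC and IP addresses. ar_op is the opcode (request or
// reply). Right after ar_op come the sender MAC and IP, then the target MAC
// and IP. arp_process() reads these addresses at the matching offsets of the
// ARP header.
struct arphdr {
    __be16 ar_hrd;        /* format of hardware address */
    __be16 ar_pro;        /* format of protocol address */
    unsigned char ar_hln; /* length of hardware address */
    unsigned char ar_pln; /* length of protocol address */
    __be16 ar_op;         /* ARP opcode (command) */
#if 0
    /*
     * Ethernet looks like this : This bit is variable sized however...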
     */
    unsigned char ar_sha[ETH_ALEN]; /* sender hardware address */
    unsigned char ar_sip[4];        /* sender IP address */
    unsigned char ar_tha[ETH_ALEN]; /* target hardware address */
    unsigned char ar_tip[4];        /* target IP address */
#endif
};
```

![arp-header](https://hackmd.io/_uploads/SyVAto1Rkx.png)

Sending an ARP request: ip_finish_output2() first calls __ipv4_neigh_lookup_noref() to look up the next-hop IPv4 address in the ARP table.

```cpp
static inline int ip_finish_output2(struct sk_buff *skb)
{
    struct dst_entry *dst = skb_dst(skb);
    struct rtable *rt = (struct rtable *)dst;
    struct net_device *dev = dst->dev;
    unsigned int hh_len = LL_RESERVED_SPACE(dev);
    struct neighbour *neigh;
    u32 nexthop;
    . . .
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    if (!IS_ERR(neigh)) {
        int res = dst_neigh_output(dst, neigh, skb);
        . . .
    }
    . . .
}

// Let's take a look in the dst_neigh_output() method:
static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
                                   struct sk_buff *skb)
{
    const struct hh_cache *hh;

    if (dst->pending_confirm) {
        unsigned long now = jiffies;

        dst->pending_confirm = 0;
        /* avoid dirtying neighbour */
        if (n->confirmed != now)
            n->confirmed = now;
    }
    // When you reach this method for the first time with this flow,
    // nud_state is not NUD_CONNECTED, and the output callback is the
    // neigh_resolve_output() method:
    hh = &n->hh;
    if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
        return neigh_hh_output(hh, skb);
    else
        return n->output(n, skb);
}
```

### ch9 netfilter

#### netfilter architecture

netfilter is a software framework for managing packets. It provides network address translation (NAT), packet content modification, and packet filtering (a single subsystem that handles traffic filtering and firewall functions). It has five hook points: pre_routing, local_in, forwarding, post_routing, and local_out.
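As a quick way to remember the five hook points just listed, the sketch below maps the three usual packet paths to the hooks they cross. The hook order mirrors the kernel's enum nf_inet_hooks; the `demo_` names and the path enum are assumptions made for this illustration only.

```cpp
#include <assert.h>
#include <stddef.h>

/* Same order as enum nf_inet_hooks in the kernel (hypothetical names). */
enum demo_hook {
    DEMO_PRE_ROUTING = 0,
    DEMO_LOCAL_IN,
    DEMO_FORWARD,
    DEMO_LOCAL_OUT,
    DEMO_POST_ROUTING
};

/* The three ways a packet can travel through the stack. */
enum demo_path {
    DEMO_PATH_INPUT,   /* received, destined to this host */
    DEMO_PATH_FORWARD, /* received, routed to another host */
    DEMO_PATH_OUTPUT   /* generated by this host */
};

/* Writes the hooks crossed on the given path into out[] (at most 3)
 * and returns how many were written. */
static size_t demo_hooks_for_path(enum demo_path path, enum demo_hook out[3])
{
    assert(out != NULL);
    switch (path) {
    case DEMO_PATH_INPUT:
        out[0] = DEMO_PRE_ROUTING;
        out[1] = DEMO_LOCAL_IN;
        return 2;
    case DEMO_PATH_FORWARD:
        out[0] = DEMO_PRE_ROUTING;
        out[1] = DEMO_FORWARD;
        out[2] = DEMO_POST_ROUTING;
        return 3;
    case DEMO_PATH_OUTPUT:
        out[0] = DEMO_LOCAL_OUT;
        out[1] = DEMO_POST_ROUTING;
        return 2;
    }
    return 0;
}
```

The routing decision sits between PRE_ROUTING and LOCAL_IN/FORWARD, which is why a forwarded packet never crosses LOCAL_IN or LOCAL_OUT.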
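Different HOOK macros are invoked depending on the kind of traffic. netfilter is also the foundation on which the userspace iptables software firewall is built.

![Netfilter-packet-flow.svg](https://hackmd.io/_uploads/HkmUrJSpJe.png)

Callbacks registered at the hook points along the packet path inspect, modify, or drop packets.

iptables is the front end of netfilter. It provides the management layer: adding and deleting netfilter rules, displaying statistics, and so on. iptables uses setsockopt() and getsockopt() for all communication between userspace and the kernel.

```cpp
static inline int NF_HOOK(uint8_t pf, unsigned int hook, struct sk_buff *skb,
                          struct net_device *in, struct net_device *out,
                          int (*okfn)(struct sk_buff *))
{
    return NF_HOOK_THRESH(pf, hook, skb, in, out, okfn, INT_MIN);
}
```

* pf: protocol family, for example NFPROTO_IPV4
* hook: one of the five hook points
* skb: the packet being processed
* in: the input network device
* out: the output network device
* okfn: function pointer to the method that is called once the callbacks are done (this makes the data flow clear)

A netfilter hook must return one of the following values (netfilter verdicts):

* NF_DROP: drop the packet
* NF_ACCEPT: let the packet continue normally
* NF_STOLEN: the packet does not continue; the hook method has taken it over
* NF_QUEUE: queue the packet for userspace
* NF_REPEAT: call the hook again

Registering a hook callback: first define an nf_hook_ops structure, then register it with int nf_register_hook(struct nf_hook_ops *reg).

```cpp
struct nf_hook_ops {
    struct list_head list;
    nf_hookfn *hook;       // the hook callback function to register
    struct module *owner;  // the module that owns the hook (THIS_MODULE in the examples below)
    u_int8_t pf;           // protocol family, NFPROTO_IPV4
    unsigned int hooknum;  // one of the five hook points
    int priority;          // callbacks are ordered by ascending priority
};
```

#### Connection tracking (conntrack)

In modern networks it is not enough to process packets by looking at headers alone; traffic also has to be handled per session. Connection tracking lets the kernel keep track of sessions, and its main purpose is to lay the groundwork for NAT. ipv4_conntrack_ops is an array of nf_hook_ops objects; the key point is setting priority, pf, hooknum, and the hook callback.

```cpp
static struct nf_hook_ops ipv4_conntrack_ops[] __read_mostly = {
    {
        .hook = ipv4_conntrack_in,
        .owner = THIS_MODULE,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_CONNTRACK,
    },
    {
        .hook = ipv4_conntrack_local,
        .owner = THIS_MODULE,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_LOCAL_OUT,
        .priority = NF_IP_PRI_CONNTRACK,
    },
    ...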
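```

The two most important connection tracking callbacks are ipv4_conntrack_in at NF_INET_PRE_ROUTING and ipv4_conntrack_local at NF_INET_LOCAL_OUT. Both have priority -200; at a netfilter hook point, the callback with the lowest priority value runs first. Both methods go on to call nf_conntrack_in(), which does the IPv4 connection tracking. ipv4_conntrack_ops is registered by calling nf_register_hooks() from nf_conntrack_l3proto_ipv4_init(). (Anything that is an "ops" has to be registered, as seen many times in earlier chapters; messages get registered too.)

![conntrack_flow](https://hackmd.io/_uploads/HJfYT9Iakl.png)

The basic element of connection tracking is nf_conntrack_tuple:

```cpp
struct nf_conntrack_tuple {
    struct nf_conntrack_man src;

    /* These are the parts of the tuple which are fixed. */
    struct {
        union nf_inet_addr u3;
        union {
            /* Add other protocols here. */
            __be16 all;
            struct {
                __be16 port;
            } tcp;
            struct {
                __be16 port;
            } udp;
            struct {
                u_int8_t type, code;
            } icmp;
            ...
        } u;
        /* The protocol. */
        u_int8_t protonum;
        /* The direction (for tuplehash) */
        u_int8_t dir;
    } dst;
};
```

The basic element naturally contains a union for the protocol being tracked, plus the protocol number and the direction. An nf_conntrack_tuple represents a flow in one specific direction.

nf_conn represents a connection tracking entry. The tuple hash, status, and master members matter most:

```cpp
struct nf_conn {
    /* Reference count; the SKB holds a pointer to it too. It is
       incremented for every associated expected connection created. */
    struct nf_conntrack ct_general;
    spinlock_t lock;

    /* XXX should I move this to the tail ? - Y.K */
    /* These are my tuples; original and reply */
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];

    /* Have we seen traffic both ways yet? */
    unsigned long status;

    /* If we were expected by an expectation, this will be it */
    struct nf_conn *master;

    /* Timer function; drops refcnt when it goes off. */
    struct timer_list timeout;
    . . .
    /* Extensions */
    struct nf_ct_ext *ext;
#ifdef CONFIG_NET_NS
    struct net *ct_net;
#endif
    /* Storage reserved for other modules, must be the last member */
    union nf_conntrack_proto proto;
};
```

Note on tuplehash: it holds one tuple per direction. tuplehash[0] (IP_CT_DIR_ORIGINAL) is the original direction and tuplehash[1] (IP_CT_DIR_REPLY) is the reply direction. My question was whether "original direction" means the data flows toward me; it does not depend on who I am: the original direction is the direction of the first packet seen for the connection, and the reply direction is the opposite one.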
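The two-direction idea behind tuplehash can be sketched in userspace as follows. This is a simplified model, not the kernel's code: `demo_tuple` flattens nf_conntrack_tuple to plain IPv4 addresses and ports, all `demo_` names are hypothetical, and NAT (which would change the reply tuple) is deliberately left out.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_DIR_ORIGINAL = 0, DEMO_DIR_REPLY = 1, DEMO_DIR_MAX = 2 };

/* Simplified stand-in for nf_conntrack_tuple (assumption: IPv4 + ports). */
struct demo_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t protonum;
    uint8_t dir;
};

/* Simplified stand-in for nf_conn: one tuple per direction. */
struct demo_conn {
    struct demo_tuple tuplehash[DEMO_DIR_MAX];
};

/* The reply tuple is the original with source and destination swapped. */
static void demo_invert_tuple(struct demo_tuple *reply,
                              const struct demo_tuple *orig)
{
    reply->src_ip = orig->dst_ip;
    reply->dst_ip = orig->src_ip;
    reply->src_port = orig->dst_port;
    reply->dst_port = orig->src_port;
    reply->protonum = orig->protonum;
    reply->dir = (orig->dir == DEMO_DIR_ORIGINAL) ? DEMO_DIR_REPLY
                                                  : DEMO_DIR_ORIGINAL;
}

/* Compares addresses, ports and protocol; the dir field is ignored. */
static int demo_tuple_equal(const struct demo_tuple *a,
                            const struct demo_tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->protonum == b->protonum;
}

/* Creates the entry from the first packet seen for the connection. */
static void demo_conn_init(struct demo_conn *ct,
                           const struct demo_tuple *first_packet)
{
    assert(ct != NULL && first_packet != NULL);
    ct->tuplehash[DEMO_DIR_ORIGINAL] = *first_packet;
    ct->tuplehash[DEMO_DIR_ORIGINAL].dir = DEMO_DIR_ORIGINAL;
    demo_invert_tuple(&ct->tuplehash[DEMO_DIR_REPLY],
                      &ct->tuplehash[DEMO_DIR_ORIGINAL]);
}

/* Returns the direction a packet belongs to, or -1 if it is not part
 * of this connection. */
static int demo_conn_direction(const struct demo_conn *ct,
                               const struct demo_tuple *pkt)
{
    int dir;

    for (dir = 0; dir < DEMO_DIR_MAX; dir++)
        if (demo_tuple_equal(&ct->tuplehash[dir], pkt))
            return dir;
    return -1;
}
```

With this model, a packet from the client matches tuplehash[0] and the server's answer matches tuplehash[1], regardless of which side the tracking host is on.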
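* status: the state of the entry; it starts as IP_CT_NEW and becomes IP_CT_ESTABLISHED once the connection is established.
* master: the master connection of an expected connection.
* Expected connection: some protocols use separate control and data flows and open a new socket for the data, which is complicated for netfilter to track. Netfilter therefore uses connection tracking helpers, which create expectation objects to mark the flows as related.

The key point of nf_conntrack_in() is how it tracks L3 and L4 flows:

```cpp
unsigned int nf_conntrack_in(struct net *net, u_int8_t pf,
                             unsigned int hooknum, struct sk_buff *skb)
{
    struct nf_conn *ct, *tmpl = NULL;
    enum ip_conntrack_info ctinfo;
    struct nf_conntrack_l3proto *l3proto;
    struct nf_conntrack_l4proto *l4proto;
    unsigned int *timeouts;
    unsigned int dataoff;
    u_int8_t protonum;
    int set_reply = 0;
    int ret;
    ...
    // Check whether the L3 protocol can be tracked
    l3proto = __nf_ct_l3proto_find(pf);
    // Check whether the L4 protocol can be tracked
    ret = l3proto->get_l4proto(skb, skb_network_offset(skb),
                               &dataoff, &protonum);
    ...
    // resolve_normal_ct() does several things. First it computes the hash
    // of the tuple (the basic element of connection tracking is
    // nf_conntrack_tuple; is this related to the nf_conn member
    // tuplehash[IP_CT_DIR_MAX], which holds the original and reply tuples?).
    // Next it calls __nf_conntrack_find_get() with the computed hash to
    // look for a matching tuple; if none is found, it calls init_conntrack()
    // to create an nf_conntrack_tuple_hash object.
    // Every SKB has an nfct member that represents the connection state.
    ct = resolve_normal_ct(net, tmpl, skb, dataoff, pf, protonum,
                           l3proto, l4proto, &set_reply, &ctinfo);
    ...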
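    // (the rest is very long, about two pages of code)
}
```

Connection tracking helpers and expected connections: some protocols use separate control and data flows and open a new socket for the data, which is complicated for netfilter to track. Netfilter therefore uses connection tracking helpers, which create expectation objects to mark the flows as related. A helper is represented by nf_conntrack_helper and registered with nf_conntrack_helper_register(); helpers are stored in a hash table (nf_ct_helper_hash).

*For FTP the registered helper method is help(). Its implementation is another two pages of code and I could not pick out the core of it; that will have to wait until I work on it for real, or until I modify an existing product architecture with this book as a reference.*

#### iptables

iptables consists of two parts: the kernel part and the userspace part (the front end used to add and delete netfilter rules, and so on). Every table is represented by an xt_table; tables are registered with ipt_register_table() and removed with ipt_unregister_table(). A network namespace contains the IPv4-specific and IPv6-specific objects netns_ipv4 and netns_ipv6, which hold pointers to xt_table objects.

A simple example, the filter table:

```cpp
#define FILTER_VALID_HOOKS ((1 << NF_INET_LOCAL_IN) | \
                            (1 << NF_INET_FORWARD) | \
                            (1 << NF_INET_LOCAL_OUT))

static const struct xt_table packet_filter = {
    .name = "filter",
    .valid_hooks = FILTER_VALID_HOOKS,
    .me = THIS_MODULE,
    .af = NFPROTO_IPV4,
    .priority = NF_IP_PRI_FILTER,
};

// To initialize the table, xt_hook_link() is called first. It sets the hook
// callback of packet_filter's nf_hook_ops objects to iptable_filter_hook().
static struct nf_hook_ops *filter_ops __read_mostly;

static int __init iptable_filter_init(void)
{
    ...
    filter_ops = xt_hook_link(&packet_filter, iptable_filter_hook);
    ...
}

// Next, ipt_register_table() is called. (The IPv4 netns object net->ipv4
// holds a pointer to the filter table, iptable_filter.)
// So: 1. set up the filter table structure, 2. register the nf_hook_ops
// callback and tie it to this table, 3. register the table with the actual
// net resource.
// The general pattern: define a new structure for the table or message to be
// handled, set its parameters and the functions to call, and finally tie it
// to a real resource.
// (Just like the ch2 netlink messages: define the header and message
// structure, then the doit, dumpit and calcit callbacks, and then register
// them so kernel_netlink_socket dispatches to the right function.)
static int __net_init iptable_filter_net_init(struct net *net)
{
    ...
    net->ipv4.iptable_filter = ipt_register_table(net, &packet_filter, repl);
    ...
    return PTR_RET(net->ipv4.iptable_filter);
}
```

The filter table also supports the LOG target. The only rule in this example is a logging rule: `iptables -A INPUT -p udp --dport=5001 -j LOG --log-level 1` writes incoming UDP packets with destination port 5001 to the system log. --log-level accepts levels 0 to 7, with 0 the most urgent.

-t selects the table to use, for example `iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE`. Without -t, the filter table is used by default (the iptable_filter table held in the netns_ipv4 structure).

Delivery to the local host (the callbacks are invoked from the five fixed hook points, according to the hooks the filter table registered):

```cpp
int ip_local_deliver(struct sk_buff *skb)
{
    ...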
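    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}
```

This method contains the LOCAL_IN hook, so iptable_filter_hook() is called:

```cpp
static unsigned int iptable_filter_hook(unsigned int hook, struct sk_buff *skb,
                                        const struct net_device *in,
                                        const struct net_device *out,
                                        int (*okfn)(struct sk_buff *))
{
    const struct net *net;
    . . .
    net = dev_net((in != NULL) ? in : out);
    . . .
    return ipt_do_table(skb, hook, in, out, net->ipv4.iptable_filter);
}
// ipt_do_table() calls the LOG callback ipt_log_packet(), which writes the
// packet headers to the system log. The packet then reaches
// ip_local_deliver_finish() and is handed to L4.
```

![iptables_data_flow](https://hackmd.io/_uploads/HksOLi86kl.png)

Forwarding: after the routing subsystem lookup, a packet that has to be forwarded goes through ip_forward():

```cpp
int ip_forward(struct sk_buff *skb)
{
    ...
    return NF_HOOK(NFPROTO_IPV4, NF_INET_FORWARD, skb, skb->dev,
                   rt->dst.dev, ip_forward_finish);
    ...
}
// As before, the filter table registered the hook callback
// iptable_filter_hook at the NF_INET_FORWARD hook point. It calls
// ipt_do_table(), which in turn calls ipt_log_packet().
```

#### NAT (network address translation)

NAT is a form of dynamic IP masquerading: the router translates between external and internal IP addresses.

* SNAT: translation for traffic leaving the internal network; the source IP is modified.
* DNAT: translation for traffic entering the internal network; the destination IP is modified.

The NAT table is also an xt_table object, and it is registered at every hook point except NF_INET_FORWARD. (My way of picturing it: DNAT has to find the target host and port, so LOCAL_OUT is covered; SNAT covers LOCAL_IN; and the real (or masqueraded) IP has to be worked out around routing, so PRE_ROUTING and POST_ROUTING are registered as well.)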
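The valid_hooks field of a table is just a bitmask with one bit per hook point. The sketch below rebuilds the filter mask and a NAT mask covering every hook except FORWARD, and tests membership. The hook numbers follow the kernel's enum nf_inet_hooks; the `DEMO_` macros and `demo_table_uses_hook()` are hypothetical helpers written for this note.

```cpp
#include <assert.h>

/* Hook numbers in the same order as enum nf_inet_hooks. */
enum demo_nf_hook {
    DEMO_NF_PRE_ROUTING = 0,
    DEMO_NF_LOCAL_IN = 1,
    DEMO_NF_FORWARD = 2,
    DEMO_NF_LOCAL_OUT = 3,
    DEMO_NF_POST_ROUTING = 4,
    DEMO_NF_NUMHOOKS = 5
};

/* Filter table: LOCAL_IN, FORWARD, LOCAL_OUT. */
#define DEMO_FILTER_VALID_HOOKS ((1u << DEMO_NF_LOCAL_IN) | \
                                 (1u << DEMO_NF_FORWARD) |  \
                                 (1u << DEMO_NF_LOCAL_OUT))

/* NAT table: every hook except FORWARD. */
#define DEMO_NAT_VALID_HOOKS ((1u << DEMO_NF_PRE_ROUTING) |  \
                              (1u << DEMO_NF_POST_ROUTING) | \
                              (1u << DEMO_NF_LOCAL_OUT) |    \
                              (1u << DEMO_NF_LOCAL_IN))

/* Returns 1 if the table's mask includes the given hook, else 0. */
static int demo_table_uses_hook(unsigned int valid_hooks,
                                enum demo_nf_hook hook)
{
    assert(hook < DEMO_NF_NUMHOOKS);
    return (int)((valid_hooks >> hook) & 1u);
}
```

For example, `demo_table_uses_hook(DEMO_NAT_VALID_HOOKS, DEMO_NF_FORWARD)` is 0, which matches the statement that the NAT table is not registered at NF_INET_FORWARD.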
```cpp
static const struct xt_table nf_nat_ipv4_table = {
    .name = "nat",
    .valid_hooks = (1 << NF_INET_PRE_ROUTING) |
                   (1 << NF_INET_POST_ROUTING) |
                   (1 << NF_INET_LOCAL_OUT) |
                   (1 << NF_INET_LOCAL_IN),
    .me = THIS_MODULE,
    .af = NFPROTO_IPV4,
};
```

The NAT table is registered with ipt_register_table(). netns_ipv4 holds a pointer to the IPv4 NAT table (nat_table), so the registered table is assigned to nat_table. After that, the array of nf_hook_ops objects is registered; nf_nat_ipv4_ops is registered in iptable_nat_init().

```cpp
static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
    /* Before packet filtering, change destination */
    {
        .hook = nf_nat_ipv4_in,
        .owner = THIS_MODULE,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_PRE_ROUTING,   // DNAT, hence PRE_ROUTING
        .priority = NF_IP_PRI_NAT_DST,
    },
    /* After packet filtering, change source */
    {
        .hook = nf_nat_ipv4_out,
        .owner = THIS_MODULE,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_POST_ROUTING,  // SNAT, hence POST_ROUTING
        .priority = NF_IP_PRI_NAT_SRC,
    },
    /* Before packet filtering, change destination */
    {
        .hook = nf_nat_ipv4_local_fn,
        .owner = THIS_MODULE,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_LOCAL_OUT,
        .priority = NF_IP_PRI_NAT_DST,
    },
    /* After packet filtering, change source */
    {
        .hook = nf_nat_ipv4_fn,
        .owner = THIS_MODULE,
        .pf = NFPROTO_IPV4,
        .hooknum = NF_INET_LOCAL_IN,
        .priority = NF_IP_PRI_NAT_SRC,
    },
};
```

As in the netfilter data flow picture from Wikipedia: NAT hook callbacks and conntrack hook callbacks are registered at some of the same hook points, so they need an order. The callback with the lowest priority value is called first: ipv4_conntrack_in has priority NF_IP_PRI_CONNTRACK (-200) and nf_nat_ipv4_in has priority NF_IP_PRI_NAT_DST (-100), so connection tracking runs before NAT at PRE_ROUTING. That fits the fact that NAT looks up its entries in the connection tracking layer.

![DNAT rule](https://hackmd.io/_uploads/BJ0Y4IJ01x.png)

In the picture above, the rule rewrites (DNAT) the destination address of incoming packets for UDP destination port 9999 to 192.168.1.8.

The basic elements of the NAT implementation are nf_nat_l4proto and nf_nat_l3proto. Both structures contain the function pointer manip_pkt(), which modifies the packet headers. For example, the TCP implementation of manip_pkt():

```cpp
static bool tcp_manip_pkt(struct sk_buff *skb,
                          const struct nf_nat_l3proto *l3proto,
                          unsigned int iphdroff, unsigned int hdroff,
                          const struct nf_conntrack_tuple *tuple,
                          enum nf_nat_manip_type maniptype)
{
    struct tcphdr *hdr;
    __be16 *portptr, newport, oldport;
    int hdrsize = 8; /* TCP connection tracking guarantees this much */

    /* this could
       be an inner header returned in icmp packet; in such
       cases we cannot update the checksum field since it is outside of
       the 8 bytes of transport layer headers we are guaranteed */
    if (skb->len >= hdroff + sizeof(struct tcphdr))
        hdrsize = sizeof(struct tcphdr);

    if (!skb_make_writable(skb, hdroff + hdrsize))
        return false;

    hdr = (struct tcphdr *)(skb->data + hdroff);

    // Up to here: sanity checks and fetching the header. Next, depending on
    // maniptype, the source (NF_NAT_MANIP_SRC) or the destination
    // (NF_NAT_MANIP_DST) is modified, using the src/dst information taken
    // from the tuple. (NAT looks up the matching entry in the connection
    // tracking layer to do the translation, which is why the conntrack
    // section said its purpose is to lay the groundwork for NAT, besides
    // recording state such as timeouts and traffic filtering parameters.)
    //
    // Set newport according to maniptype:
    // - If you need to change the source port, maniptype is
    //   NF_NAT_MANIP_SRC. So you extract the port from the tuple->src.
    // - If you need to change the destination port, maniptype is
    //   NF_NAT_MANIP_DST. So you extract the port from the tuple->dst:
    if (maniptype == NF_NAT_MANIP_SRC) {
        /* Get rid of src port */
        newport = tuple->src.u.tcp.port;
        portptr = &hdr->source;
    } else {
        /* Get rid of dst port */
        newport = tuple->dst.u.tcp.port;
        portptr = &hdr->dest;
    }

    // The old port must be kept for recalculating the 16-bit checksum.
    // Worked example of the IPv4-style checksum:
    // take the bytes E3 4F 23 96 44 27 99 F3 [00 00], where the bracketed
    // checksum field starts as 0. Add the data two bytes at a time:
    // E34F + 2396 + 4427 + 99F3 = 1E4FF. Fold the high 16 bits into the low
    // 16 bits: E4FF + 0001 = E500. Take the one's complement:
    // ~E500 = 1AFF, which is the checksum. If the receiver sees the data
    // unchanged, it gets E500 + 1AFF = FFFF, which means the data is intact.
    oldport = *portptr;
    *portptr = newport;

    if (hdrsize < sizeof(*hdr))
        return true;

    // Recalculate the checksum:
    l3proto->csum_update(skb, iphdroff, &hdr->check, tuple, maniptype);
    inet_proto_csum_replace2(&hdr->check, skb, oldport, newport, 0);
    return true;
}
```

![nat and nf](https://hackmd.io/_uploads/ryPxDU1Aye.png)

![nat-ipv4-callback](https://hackmd.io/_uploads/Hk_Oh8JR1e.png)

The most important method is nf_nat_ipv4_fn(); the other NAT hook methods all call it. It calls ipt_do_table() to look for a matching entry in the given table and, when one is found, calls the target callback.

Summary: the netfilter architecture, the five hook points, the five verdicts, and how hooks are registered (nf_register_hook() registers an nf_hook_ops object); the conntrack mechanism (nf_conntrack_tuple, the nf_conn structure,
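the ipv4_conntrack_ops array of nf_hook_ops objects); the connection tracking data flow PRE_ROUTING hook -> ipv4_conntrack_in -> nf_conntrack_in; iptables (the xt_table structure); and NAT (nf_nat_ipv4_in, nf_nat_ipv4_out).

### ch10 IPsec

IPsec uses AH (Authentication Header) and ESP (Encapsulating Security Payload). There are two modes of operation: transport mode (the payload is encrypted) and tunnel mode (the whole IP packet is encrypted).

IKE (Internet Key Exchange): v1 has two phases. Main mode authenticates the peers and uses the Diffie-Hellman key exchange protocol or a pre-shared key to establish the session key; the second phase, quick mode, agrees on the encryption algorithms. v2 adds important features such as NAT traversal and needs only a single phase to establish and exchange the IPsec SA keys.

XFRM is the infrastructure of IPsec.

#### XFRM architecture

XFRM supports network namespaces (a lightweight form of process virtualization that gives a process its own isolated network resources). Every net contains an instance of the netns_xfrm structure, which holds the state hash tables, the garbage collector, counters, and so on.

```cpp
struct netns_xfrm {
    struct hlist_head *state_bydst;
    struct hlist_head *state_bysrc;
    struct hlist_head *state_byspi;
    . . .
    unsigned int state_num;
    . . .
    struct work_struct state_gc_work;
    . . .
    u32 sysctl_aevent_etime;
    u32 sysctl_aevent_rseqth;
    int sysctl_larval_drop;
    u32 sysctl_acq_expires;
};
```

XFRM init: in IPv4, initialization is done by calling xfrm_init() and xfrm4_init() from the ip_rt_init() method. Communication between userspace and the kernel goes through a NETLINK_XFRM netlink socket that sends and receives netlink messages (this socket is created in xfrm_user_net_init()).

```cpp
static int __net_init xfrm_user_net_init(struct net *net)
{
    struct sock *nlsk;
    struct netlink_kernel_cfg cfg = {
        .groups = XFRMNLGRP_MAX,
        .input = xfrm_netlink_rcv,
    };

    nlsk = netlink_kernel_create(net, NETLINK_XFRM, &cfg);
    . . .
    return 0;
}
```

Messages sent from userspace are received and handled by xfrm_netlink_rcv().

XFRM policy: A Security Policy is a rule that tells IPsec whether a certain flow should be processed or whether it can bypass IPsec processing. In the selector structure below, the important members are saddr, sport, sport_mask, and so on; the policy structure uses this basic data to decide whether a data flow is processed or bypassed.

```cpp
struct xfrm_selector {
    xfrm_address_t daddr; // Why an xfrm address type? xfrm_address_t is a
                          // union that can hold either an IPv4 or an IPv6
                          // address; a MAC address would not fit here
                          // because selectors work at L3.
    xfrm_address_t saddr;
    __be16 dport;
    __be16 dport_mask;
    __be16 sport;
    __be16 sport_mask;
    __u16 family;
    __u8 prefixlen_d;
    __u8 prefixlen_s;
    __u8 proto;
    int ifindex;
    __kernel_uid32_t user;
};
```

The xfrm_selector_match() method takes an xfrm_selector (among other arguments) and checks whether a data flow matches it.

```cpp
struct xfrm_policy {
    . . .
    struct hlist_node bydst;
    struct hlist_node byidx;

    /* This lock only affects elements except for entry.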
     */
    rwlock_t lock;
    atomic_t refcnt;
    struct timer_list timer;
    struct flow_cache_object flo;
    atomic_t genid;
    u32 priority;
    u32 index;
    struct xfrm_mark mark;
    struct xfrm_selector selector; // for the policy matching this selector, the
                                   // key members are type, action, flags, refcnt
    struct xfrm_lifetime_cfg lft;
    struct xfrm_lifetime_cur curlft;
    struct xfrm_policy_walk_entry walk;
    struct xfrm_policy_queue polq;
    u8 type;
    u8 action;
    u8 flags;
    u8 xfrm_nr;
    u16 family;
    struct xfrm_sec_ctx *security;
    struct xfrm_tmpl xfrm_vec[XFRM_MAX_DEPTH];
};
```

* refcnt: the XFRM policy reference counter; initialized in xfrm_policy_alloc() and incremented in xfrm_pol_hold().
* timer: the per-policy timer.
* lft: the policy lifetime; every XFRM policy has a lifetime.
* type: usually XFRM_POLICY_TYPE_MAIN (0). When the kernel supports subpolicies, two policies can be applied to the same packet, and XFRM_POLICY_TYPE_SUB (1) is used; this is common with IPv6.
* action: one of two values, XFRM_POLICY_ALLOW or XFRM_POLICY_BLOCK.

The kernel stores IPsec security policies in the SPD (Security Policy Database). It is managed by messages sent from a userspace socket, handled by methods such as xfrm_add_policy() and xfrm_get_policy() (which handles the XFRM_MSG_DELPOLICY message).

XFRM state (Security Associations): the xfrm_state structure represents an IPsec Security Association. The kernel stores the IPsec Security Associations in the Security Associations Database (SAD). xfrm_state objects are kept in three hash tables in netns_xfrm: state_bydst, state_bysrc, and state_byspi. The keys of these tables are computed by xfrm_dst_hash(), xfrm_src_hash(), and so on.

(Side note: linked lists (a table whose entries point to the next one) and hash tables (key-value lookup) really do show up everywhere. The point is to pick the structure that fits the data and the operations, just like the set/vector/unordered_map/linked list/tree choices in LeetCode practice. Do advanced tree structures appear less in firmware? The VLAN MIBs are a tree, though.)

Lookup in the SAD: done with xfrm_state_lookup_byaddr() and xfrm_state_find().

#### IPv4 ESP implementation and receiving IPsec packets

ESP is the most commonly used IPsec protocol; it supports both encryption and authentication.

SPI: A 32-bit Security Parameter Index.
Together with the source address, it identifies an SA. (The SPI is the parameter used to identify the associated SA; the SA is represented by the xfrm_state structure, which holds the flow's information such as keys and request id. Adding this structure is handled by xfrm_state_add().)

Sequence Number: 32 bits, incremented by 1 for each transmitted packet in order to protect against replay attacks.

![ESP-format](https://hackmd.io/_uploads/BkFfHOJC1g.png)

ESP init: the ESP parameters are set up with an xfrm_type and a net_protocol structure. The important parts are the input and output callbacks, which decide the direction of the flow, and the state handling used later on (esp4_init(); receive goes from xfrm4_rcv() to xfrm_input(); transmit goes through xfrm_lookup()).

```cpp
static const struct xfrm_type esp_type =
{
    .description = "ESP4",
    .owner = THIS_MODULE,
    .proto = IPPROTO_ESP,
    .flags = XFRM_TYPE_REPLAY_PROT,
    .init_state = esp_init_state,
    .destructor = esp_destroy,
    .get_mtu = esp4_get_mtu,
    .input = esp_input,
    .output = esp_output
};

static const struct net_protocol esp4_protocol = {
    .handler = xfrm4_rcv,
    .err_handler = esp4_err,
    .no_policy = 1,
    .netns_ok = 1,
};
```

```cpp
// Simplified xfrm_input(); many details are omitted, see the book
int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
{
    struct xfrm_state *x;

    do {
        . . .
        // Perform a lookup in the state_byspi hash table:
        x = xfrm_state_lookup(net, skb->mark, daddr, spi, nexthdr, family);
        ...
        // Returns the protocol number of the original packet,
        // before it was encrypted by ESP
        nexthdr = x->type->input(x, skb);
        ...
        // The original protocol number is stored in the
        // control buffer (cb) of the SKB
        XFRM_MODE_SKB_CB(skb)->protocol = nexthdr;
        ...
```

After esp_input() finishes, the xfrm4_transport_finish() method modifies several fields of the IPv4 header, recalculates the checksum (using ip_send_check()), and invokes any netfilter NF_INET_PRE_ROUTING hook callbacks.

Receive data flow of an IPv4 ESP packet:

![ipv4-rcv-flow](https://hackmd.io/_uploads/ByF2Pd1Cke.png)

Sending an IPsec packet: first a lookup is performed in the routing subsystem, then a lookup for an XFRM policy that can be applied to this flow.

![ipsec-xmit](https://hackmd.io/_uploads/BktzOuyC1l.png)

The xfrm_lookup() method is used whenever a packet is about to be sent. To work efficiently it uses bundles, which cache important information such as the route and the policy. These bundles are instances of the xfrm_dst structure and are stored by using the flow cache.
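Going back to the two ESP header fields described above (SPI and sequence number), here is a small sketch of reading them from the first 8 bytes of an ESP packet. Both fields are 32-bit big-endian values. The `demo_` names are hypothetical, and the freshness check is a deliberate simplification: it only accepts strictly increasing sequence numbers, whereas a real implementation tolerates reordering with a sliding window.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-order view of the fixed ESP header fields. */
struct demo_esp_hdr {
    uint32_t spi;    /* Security Parameter Index */
    uint32_t seq_no; /* sequence number */
};

/* Reads a 32-bit big-endian (network order) value. */
static uint32_t demo_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Parses SPI and sequence number; returns 0 on success, -1 when the
 * buffer is too short to hold both fields. */
static int demo_parse_esp(const uint8_t *buf, size_t len,
                          struct demo_esp_hdr *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 8)
        return -1;
    out->spi = demo_get_be32(buf);
    out->seq_no = demo_get_be32(buf + 4);
    return 0;
}

/* Simplified replay check (assumption: no reordering window):
 * a packet is fresh only if its sequence number is larger than the
 * highest one already accepted. */
static int demo_seq_is_fresh(uint32_t highest_seen, uint32_t seq_no)
{
    return seq_no > highest_seen;
}
```

A receiver following this sketch would parse the header, look up the SA by SPI, and drop the packet if `demo_seq_is_fresh()` returns 0.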
The most important parts of xfrm_dst are rtable, dst_entry, the policies, and xfrm_selector:

```cpp
struct xfrm_dst {
    union {
        struct dst_entry dst;
        struct rtable rt;
        struct rt6_info rt6;
    } u;
    struct dst_entry *route; // dst_entry is the result of the routing subsystem lookup
    struct flow_cache_object flo;
    struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
    int num_pols, num_xfrms;
#ifdef CONFIG_XFRM_SUB_POLICY
    struct flowi *origin;
    struct xfrm_selector *partner;
#endif
    u32 xfrm_genid;
    u32 policy_genid;
    u32 route_mtu_cached;
    u32 child_mtu_cached;
    u32 route_cookie;
    u32 path_cookie;
};
```

The xfrm_lookup() method is a very complex method, and the author does not go into its details.

![xfrm_lookup-dataflow](https://hackmd.io/_uploads/Hk0tKOk0Jl.png)

1. Check for a matching policy.
2. For the first packet of a flow, allocate a flow cache entry and bundle the information to speed up later packets.
3. Check the id.

The xfrm_lookup() method handles only the Tx path, so the flow direction (dir) is set to FLOW_DIR_OUT.

NAT traversal in IPsec: NAT rewrites IP addresses and has to recalculate checksums, but IPsec encrypts the packet. To solve this conflict, a NAT traversal standard for IPsec was developed (see RFC 3948 for details). These SBC (Session Border Controller) solutions perform NAT traversal of the media traffic, which is sent by Real Time Protocol (RTP), and sometimes also for the signaling traffic, which is sent by Session Initiation Protocol (SIP).

Openswan is an IPsec implementation for Linux.

Summary: the IPsec infrastructure is based on the XFRM architecture; XFRM policy and XFRM state are its basic structures; the chapter also outlines how ESP packets are received and sent.

### ch11 L4 protocol

Socket types:

* SOCK_STREAM: reliable, used for TCP
* SOCK_DGRAM: used for UDP
* SOCK_RAW: direct access to the IP layer; supports sending and receiving independent of the transport protocol
* SOCK_DCCP: unreliable but congestion controlled; combines traits of TCP and UDP

Userspace socket API:

* socket(): create a socket
* bind(): associate the socket with an IP address and port
* listen()
* accept()
* connect()

In the kernel, the socket structure provides the interface to userspace and the sock structure provides the interface to L3.

```cpp=
struct socket {
    socket_state state;
    kmemcheck_bitfield_begin(type);
    short type;                     // one of the socket types listed above
    kmemcheck_bitfield_end(type);
    unsigned long flags;
    . . .
    struct file *file;
    struct sock *sk;               // the sock object associated with this socket; the interface to L3
    const struct proto_ops *ops;   // most of the socket callbacks: read(), write(), send(), recv(), ...
};

struct sock {
    struct sk_buff_head sk_receive_queue; // queue that stores incoming packets
    int sk_rcvbuf;                        // size of the receive buffer, in bytes
    unsigned long sk_flags;               // status flags such as SOCK_DEAD
    int sk_sndbuf;
    struct sk_buff_head sk_write_queue;
    . . .
    unsigned int sk_shutdown : 2,
                 sk_no_check : 2,
                 sk_protocol : 8,
                 sk_type : 16;  // why bitfields? presumably to pack these small
                                // values into one word; sk_type is the type above
    . . .
    void (*sk_data_ready)(struct sock *sk, int bytes); // callback that notifies the socket that new data arrived
    void (*sk_write_space)(struct sock *sk);
};
```

#### socket create

`sockfd = socket(int socket_family, int socket_type, int protocol);` For IPv4 the family is AF_INET and the type is, for example, SOCK_STREAM. protocol can be 0 (the default protocol for the socket type, that is TCP or UDP) or IPPROTO_TCP / IPPROTO_UDP.

The socket() call is handled in the kernel by sys_socket(), which calls sock_create(). The actual creation depends on socket_family; for IPv4, inet_create() is called to create the sock object associated with the socket. Sending and receiving from userspace is handled in the kernel by sendmsg() and recvmsg(), which take a msghdr object as a parameter. (A thought that came up: isn't communication with the kernel normally done with netlink messages? Is this related? Note that sendmsg()/recvmsg() are generic socket calls; netlink is one socket family that uses them too.)

```cpp=
struct msghdr {
    void *msg_name;               /* Socket name: the destination socket address. To get it,
                                     cast the void pointer msg_name to a pointer to sockaddr_in */
    int msg_namelen;              /* Length of name */
    struct iovec *msg_iov;        /* Data blocks */
    __kernel_size_t msg_iovlen;   /* Number of blocks */
    void *msg_control;            /* Per protocol magic (eg BSD file descriptor passing);
                                     control messages, also called ancillary data */
    __kernel_size_t msg_controllen; /* Length of cmsg list */
    unsigned int msg_flags;
};
```

#### Receiving UDP packets

```cpp=
struct udphdr {
    __be16 source;
    __be16 dest;   // be16: big endian 16-bit data
    __be16 len;
    __sum16 check;
};
```

[Reference on types such as be16](https://blog.csdn.net/eZiMu/article/details/55190206)

As learned earlier, inet_add_protocol() (called in inet_init()) registers protocols on the TCP/IP side, while dev_add_pack() registers protocols on the IP/MAC side; for example, IPv4 on page 63 sets ip_rcv() as its receive handler, and ARP does the same with arp_rcv().

```cpp=
static const struct net_protocol udp_protocol = {
    .handler = udp_rcv,
    .err_handler = udp_err,
    .no_policy = 1,
    .netns_ok = 1,
};

struct proto udp_prot = {
    .name = "UDP",
    .owner = THIS_MODULE,
    .close = udp_lib_close,
    .connect = ip4_datagram_connect,
    .disconnect = udp_disconnect,
    .ioctl = udp_ioctl,
    . . .
    .setsockopt = udp_setsockopt, // lots of callbacks; the most common are send, recv, setsockopt
    .getsockopt = udp_getsockopt,
    .sendmsg = udp_sendmsg,
    .recvmsg = udp_recvmsg,
    .sendpage = udp_sendpage,
    . . .
};

static int __init inet_init(void)
{
    . . .
    // register udp_protocol, which sets udp_rcv as the receive handler
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    . . .
    int rc = -EINVAL;
    . . .
    // proto_register() registers the callbacks contained in the udp_prot structure
    rc = proto_register(&udp_prot, 1);
}
```

Sending UDP packets: the various APIs (send, sendmsg, and so on) all end up in udp_sendmsg(). A msghdr object is created and passed to the kernel. As usual, sanity checks come first; then udp_send_skb() or ip_append_data() is used. Sending the SKB requires a flowi4 object (a structure holding the destination port, destination address, and other information needed for routing; ref: ch5-2, p017). Finally, udp_sendmsg() runs rt = (struct rtable *)sk_dst_check(sk, 0); if the routing entry is NULL, a routing lookup must be performed.

```cpp=
int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len)
{
    ...
    if (len > 0xFFFF)
        return -EMSGSIZE;
    ...
    if (connected)
        rt = (struct rtable *)sk_dst_check(sk, 0);
    ...
    if (rt == NULL) {
        struct net *net = sock_net(sk);

        fl4 = &fl4_stack;
        // set up fl4 with tos, faddr, saddr, dport, ...
        flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
                           RT_SCOPE_UNIVERSE, sk->sk_protocol,
                           inet_sk_flowi_flags(sk)|FLOWI_FLAG_CAN_SLEEP,
                           faddr, saddr, dport, inet->inet_sport);
        security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
        rt = ip_route_output_flow(net, fl4, sk);
        ...
        // ip_append_data() then adds the data to the buffer, and
        // udp_push_pending_frames() is called to finish the transmission
        err = ip_append_data(sk, fl4, getfrag, msg->msg_iov, ulen,
                             sizeof(struct udphdr), &ipc, &rt,
                             corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
        ...
}
```

Receiving UDP packets: udp_rcv() calls __udp4_lib_rcv(), which reads the UDP header, length, source address, and destination address from the SKB. After the sanity checks it looks for a matching socket in the UDP socket hash table; if one is found, udp_queue_rcv_skb() handles the packet.

```cpp=
int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
                   int proto)
{
    struct sock *sk;
    struct udphdr *uh;
    unsigned short ulen;
    struct rtable *rt = skb_rtable(skb);
    __be32 saddr, daddr;
    struct net *net = dev_net(skb->dev);
    ...
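    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
    if (sk != NULL) {
        int ret = udp_queue_rcv_skb(sk, skb);
        sock_put(sk);

        /* a return value > 0 means to resubmit the input, but
         * it wants the return to be -protocol, or 0
         */
        if (ret > 0)
            return -ret;
        // Everything is fine; return 0 to denote success:
        return 0;
    }
```

udp_rcv() data flow:

![rcv_udp](https://hackmd.io/_uploads/ryeieeLCJe.png)

#### Receiving TCP packets

TCP achieves reliability on top of unreliable IP through `retransmission` and `sequence numbers`. To avoid the poor efficiency of sending one segment and waiting for its reply, it uses a sliding window to keep several segments in flight. The sizes of the sender buffer and the receiver buffer are used for flow control of the network (for example, many retransmissions, or a missing packet, indicate congestion, so the buffer size is adjusted and less is sent). TCP is connection oriented, but the NTHU OpenCourseWare computer networks course also calls it a byte-oriented protocol, which I did not fully understand at first.

TCP functionality has two parts: `connection management` and `data transfer`.

The TCP sender stores the bytes passed by the application and, once enough bytes have accumulated to form a segment, sends it to the TCP destination. The TCP receiver puts the segment into the receive buffer, where it waits for the application to read it.

The TCP header is 20 bytes long and can reach 60 bytes with options.

```cpp
struct tcphdr {
    __be16 source;
    __be16 dest;
    __be32 seq;
    __be32 ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u16 res1 : 4, // reserved, must be 0
          doff : 4,
          fin : 1,
          syn : 1,  // SYN, FIN and ACK are used to set up or tear down a connection
          rst : 1,
          psh : 1,  // push: pass received data to the application right away; 0 means buffer it
          ack : 1,
          urg : 1,  // urgent data, should take priority over non-urgent data
          ece : 1,
          cwr : 1;  // congestion window reduced
    ...
#endif
    __be16 window;
    __sum16 check;
    __be16 urg_ptr;
};
```

![tcp-hfr](https://hackmd.io/_uploads/SyaGf7LCJg.png)

:::spoiler TCP control flag and byte oriented
From Prof. Nen-Fu Huang's NTHU OpenCourseWare computer networks course:

ACK, SequenceNum, and AdvertisedWindow are used by the TCP sliding window algorithm. TCP is byte oriented: `every data byte has a sequence number`, and SequenceNum is the sequence number of the first byte in the segment.

The control flags SYN, FIN, and ACK are used to set up or tear down a connection. RST is the reply to an unexpected packet (an attack, or a packet arriving after TCP has closed). URG marks urgent data that should take priority over non-urgent data. PSH means the sender asks for a push (the received data should be passed to the application immediately); the receiver merely notifies the application.

The TCP sender stores the bytes passed by the application and, once enough bytes have accumulated to form a segment, sends it to the TCP destination host. The TCP receiver puts the segment into the receive buffer, where it waits for the application to read it.
:::

A tcp_protocol (a net_protocol object) is defined and, as before, registered with inet_add_protocol(). (Note the same handful of members each time: .handler, .err_handler, .no_policy, .netns_ok.) The policy member is still not clear to me (my only explanation is that policy is some extra description of attributes?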
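I should go back and review policy routing in ch5). Note that no_policy is about the IPsec (XFRM) policy check made when a packet is delivered locally, not about policy routing.

Then a tcp_prot object is defined and registered by calling proto_register():

```cpp
static const struct net_protocol tcp_protocol = {
    .early_demux = tcp_v4_early_demux,
    .handler = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy = 1,
    .netns_ok = 1,
};

static int __init inet_init(void)
{
    . . .
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    . . .
}

struct proto tcp_prot = {
    .name = "TCP",
    .owner = THIS_MODULE,
    .close = tcp_close,          // No bind or setsockopt? It looks a bit different from the
                                 // geek socket programming exercise. (setsockopt is among the
                                 // elided members; bind lives in proto_ops, not in struct proto.)
    .connect = tcp_v4_connect,
    .disconnect = tcp_disconnect,
    .accept = inet_csk_accept,
    .ioctl = tcp_ioctl,
    .init = tcp_v4_init_sock,
    . . .
};

static int __init inet_init(void)
{
    int rc;
    . . .
    rc = proto_register(&tcp_prot, 1);
    . . .
}
```

[geek socket programing practice](https://www.geeksforgeeks.org/socket-programming-cc/)

The function pointer init is set to tcp_v4_init_sock(), which performs the various initializations, such as calling tcp_init_xmit_timers() to set up the timers and setting the socket state.

TCP uses these four timers:

* Retransmit timer: resends packets that were not acknowledged within a given time. If the ACK arrives before the timer expires, the timer is cancelled.
* Delayed ACK timer: delays sending acknowledgment packets.
* Keep-alive timer: checks whether the connection is down.
* Zero window probe timer (also called the persistent timer): when the receiver's buffer is full it advertises a zero window; the sender then uses this timer to probe the receiver's window size.

TCP socket initialization: socket() -> the kernel calls tcp_v4_init_sock() -> tcp_init_sock() does the actual work (sets the state to TCP_CLOSE; calls tcp_init_xmit_timers() to initialize the TCP timers; initializes sk_sndbuf and sk_rcvbuf; initializes the out-of-order queue and the prequeue).

3-way handshake: the client sends a SYN and its state becomes TCP_SYN_SENT. The listening server socket creates a request socket to represent the new connection in the TCP_SYN_RECV state and sends back a SYN ACK. After receiving the SYN ACK, the client changes its state to TCP_ESTABLISHED and sends an ACK to the server. On receiving it, the server changes the request socket into a child socket in the TCP_ESTABLISHED state.

Receiving TCP packets from L3

*p298 not yet finished...*

Sending TCP packets

### ch12 Linux wireless subsystem

The book only gives an overview here, so I need to find books with a detailed treatment, plus Linux OpenWrt (at the company I will do BSP work on OpenWrt and handle boards every day; maybe find out which chip is used internally and buy one to practice on) and Linux Device Drivers (I already have this book and it is a classic, so read it!).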
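The chapter introduces the implementation of WiFi mechanisms such as block ack, power save, and frame aggregation.

![WiFi-mac-hdr](https://hackmd.io/_uploads/HJukdlcRye.png)

Duration: holds the NAV in `microseconds` (the virtual carrier sense mechanism: RTS/CTS first tells everyone else to stay off the medium). In power save mode (PS-Poll frames), this field carries the client's AID.

Addresses 1 to 4, as explained in the NTHU OpenCourseWare computer networks course, are set according to From DS / To DS. The basic rule is 1. receiver, 2. transmitter, 3. destination, 4. source, and the BSSID is inserted where needed (the MAC address of the WiFi AP, used to tell apart networks under the same SSID (service set)).

The first 2 bytes are the frame control field, which contains To DS / From DS, type / subtype, power save, retry, and so on.

![frame-ctl](https://hackmd.io/_uploads/Sk_dFgqRkl.png)

* protocol: always 0, since there is only one MAC version for WiFi.
* type: 00 management (IEEE80211_FTYPE_MGMT), 01 control (IEEE80211_FTYPE_CTL), 10 data (IEEE80211_FTYPE_DATA).
* subtype: for example 1011 (IEEE80211_STYPE_RTS) and 0100 (IEEE80211_STYPE_NULLFUNC).

Mechanisms such as retry, power save, and required ACKs all stem from the difference in transmission medium (compared with Ethernet). Keep thinking about the two big characteristics, unreliable radio and mobility, and it becomes clear what is new and what needs attention.

Power save in Linux: a client entering power save sends a null frame to tell the AP. From then on the AP stores frames for that client in the client's ps_tx_buf buffer; multicast and broadcast frames are stored in the bc_buf buffer. The client wakes up on a timer to check the beacon frames the AP sends periodically, and looks at the TIM (traffic indication map) to see whether the AP is holding frames for it. If so, it sends a null frame to tell the AP that its power save state has changed and that it can receive. The DTIM is a special TIM sent once every given number of beacon intervals; after it, the AP sends the buffered multicast and broadcast frames.

![buf-ap](https://hackmd.io/_uploads/HkjwnlqRyg.png)

![pspoll-flow](https://hackmd.io/_uploads/HyXohlcR1e.png)

#### Management layer

The 802.11 management architecture has three components:

* PLME: the physical layer management entity
* SME: the system management entity
* MLME: the MAC layer management entity

Scanning is passive or active:

* Passive: the client hops between channels and tries to receive beacon frames to confirm that an AP exists on the channel.
* Active: the client hops between channels and actively sends a probe request management frame (by calling ieee80211_send_probe_req()).

The whole scan is done by calling ieee80211_request_scan(). Channels and frequencies map to each other, and ieee80211_channel_to_frequency() does the conversion. (For example, 802.11a is OFDM at 5 GHz; 802.11n is OFDM with MIMO at 2.4 GHz or 5 GHz; 802.11ac is OFDM with MIMO (up to 8 antennas) at 5 GHz only. Channels are split into a primary and a secondary channel: sending at 40 MHz is only possible when both the primary 20 MHz and the secondary 20 MHz channels are idle.)

Beamforming adjusts the emission angle and ordering of the antenna signals so that the amplitude is stronger in a particular direction and reaches farther (at the cost of weaker coverage in the opposite direction).

Authentication: done by calling ieee80211_send_auth(). There are two kinds: open system authentication (really just association and not a true identity check; every device must implement it) and shared key authentication (originally WEP, which turned out to be insufficient -> replaced by 802.1X with EAP for authentication and key exchange, while encryption relied on TKIP as a safety belt for WEP: the RC4 weaknesses of reusing the IV (initialization vector) and the master key were addressed with key mixing and a new IV per frame; see my notes on 802.11: The Definitive Guide for details).

Association and reassociation: the client calls ieee80211_send_assoc(); if the association succeeds, the AP sends the client an AID.

mac80211 is an API for interacting with the low-level drivers. It is complex, so only one point is covered here: the important structure ieee80211_hw, which represents the hardware information. It contains a void * pointer (priv)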
指向一個私有區域(如marvell 無線驅動的 lbtf_private)，ieee80211_alloc_hw()為結構ieee80211_hw分配記憶體與初始化，下面是常見的與該結構相關方法: • int ieee80211_register_hw(struct ieee80211_hw *hw): Called by wireless drivers for registering the specified ieee80211_hw object. (註冊指定ieee80211_hw對象由驅動程式對其做調用 ) • void ieee80211_unregister_hw(struct ieee80211_hw *hw): Unregisters the specified 802.11 hardware device. • struct ieee80211_hw *ieee80211_alloc_hw(size_t priv_data_len, const struct ieee80211_ops *ops): Allocates an ieee80211_hw object and initializes it. (分配該對象且配置callback) • ieee80211_rx_irqsafe(): This method is for receiving a packet. It is implemented in net/mac80211/rx.c and called from low level wireless drivers. The ieee80211_ops object, which is passed to the ieee80211_alloc_hw() method as you saw earlier, consists of pointers to callbacks to the driver. Not all of these callbacks must be implemented by the drivers. The following is a short description of these methods: • tx(): The transmit handler called for each transmitted packet. It usually returns NETDEV_TX_OK (except for under certain limited conditions). (傳輸處理程序在傳輸個數據包都將調用他常返回NETDEV_TX_OK，疑問那收封包呢? 解碼很重要的呀在802.11 def guild那本也提到) • start(): Activates the hardware device and is called before the first hardware device is enabled. It turns on frame reception. • add_interface(): Called when a network device attached to the hardware is enabled. 與硬體相關聯的設備啟用 • config(): Handles configuration requests, such as hardware channel configuration. 處理配置請求，例如硬體的channel 配置 • configure_filter(): Configures the device’s Rx filter. Linux無線架構 ![Linux-wifi-arch](https://hackmd.io/_uploads/Hk3C-bqRkg.png) :::spoiler 關於ath層ath9k, ath5k ATH層是WIFI驅動的重要組成，負責硬體抽象層的處理，資料frame收送機制。[ath9k](https://blog.csdn.net/weixin_44258973/article/details/108003367) ![skb-rela](https://hackmd.io/_uploads/BkPP7f9Rkl.png) 幾個ath9k重要結構: sk_buff, ath_softc, ath_tx_control, ath_node, ath_buf, ath_txq... 
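
To picture how a driver such as ath9k plugs into mac80211 through an ops table (a driver hands over its callbacks, not every callback is mandatory, and a private pointer carries driver state), here is a minimal userspace sketch. Everything in it is hypothetical and for illustration only: `demo_ops`, `demo_hw`, `demo_register_hw()`, `demo_set_channel()` and the `toy_*` driver are invented names, not the real `struct ieee80211_ops`, `struct ieee80211_hw`, or any kernel API.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct demo_hw;

/* Hypothetical miniature ops table; NOT the real struct ieee80211_ops. */
struct demo_ops {
    int  (*start)(struct demo_hw *hw);                 /* mandatory in this sketch */
    void (*tx)(struct demo_hw *hw, const char *frame); /* mandatory in this sketch */
    int  (*config)(struct demo_hw *hw, int channel);   /* optional, may be NULL */
};

/* Hypothetical hardware object; priv plays the role of the private area. */
struct demo_hw {
    const struct demo_ops *ops;
    void *priv;
    int started;
    int tx_count;
};

/* "Framework" side: bind an ops table and private data to a hw object. */
static int demo_register_hw(struct demo_hw *hw, const struct demo_ops *ops,
                            void *priv)
{
    if (!ops || !ops->start || !ops->tx)
        return -1;              /* reject drivers missing mandatory callbacks */
    hw->ops = ops;
    hw->priv = priv;
    hw->started = 0;
    hw->tx_count = 0;
    return 0;
}

/* The framework reaches the driver only through the table. */
static int demo_set_channel(struct demo_hw *hw, int channel)
{
    assert(hw->ops != NULL);
    if (!hw->ops->config)
        return 0;               /* optional callback absent: nothing to do */
    return hw->ops->config(hw, channel);
}

/* "Driver" side: a toy driver with its own private state. */
struct toy_priv {
    int channel;
};

static int toy_start(struct demo_hw *hw)
{
    hw->started = 1;
    return 0;
}

static void toy_tx(struct demo_hw *hw, const char *frame)
{
    hw->tx_count++;
    printf("toy driver tx: %s\n", frame);
}

static int toy_config(struct demo_hw *hw, int channel)
{
    struct toy_priv *p = (struct toy_priv *)hw->priv;
    p->channel = channel;
    return 0;
}

/* One table with every callback, one that leaves config() out. */
static const struct demo_ops toy_ops = { toy_start, toy_tx, toy_config };
static const struct demo_ops toy_ops_no_config = { toy_start, toy_tx, NULL };
```

A caller would register `toy_ops` with a `demo_hw`, call `hw.ops->start(&hw)`, and then go through `demo_set_channel()`; with `toy_ops_no_config` the same call simply returns 0 because the optional callback is missing.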
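[Linux 802.11 驅動程式開發者指南](https://docs.kernel.org/driver-api/80211/index.html)
[Linux 802.11 Data Flow](https://github.com/WeitaoZhu/wi-fi_books/blob/master/driver-mac80211_intro.pdf)
[ath9k driver structure](https://blog.csdn.net/weixin_44258973/article/details/108003367)
[ath9k 解析](https://blog.csdn.net/lizuobin2/article/details/53678299)

These resources look really good. The data flow one is very clear: finish ch12 of this book first, then read the data flow, then ath9k; for development, study the developer guide.
:::

Another important structure, sta_info, represents a client station; it contains several counters and the ps_tx_buf that buffers unicast packets, among other things.

#### Overview of the receive path and the transmit path

ieee80211_rx() is the main handler for received packets. The driver passes the packet status, ieee80211_rx_status, to mac80211 embedded in the SKB control buffer (cb); use the IEEE80211_SKB_RXCB() macro to get it. It indicates, among other things, whether the packet passed the FCS check.

[關於MPDU 與MSDU](https://blog.csdn.net/kalexcyc/article/details/128107792) The key point of this article is that data is an MSDU before MAC-layer processing and becomes an MPDU afterwards (an MPDU is a data frame after 802.11 encapsulation). I never noticed this detail when reading the 802.11n Survival Guide, so it needs review. In an A-MSDU the aggregated MSDUs share a single MAC header and FCS (same receiver and transmitter), can have only one QoS attribute, and cannot be used for broadcast or multicast. So if one MSDU is corrupted the whole A-MSDU must be retransmitted, but it is more efficient (and the difference is even more visible with encryption) than an A-MPDU, which puts a MAC header and FCS on every MPDU.

With HT (802.11n), ieee80211_rx_reorder_ampdu() is called when necessary to reorder the A-MPDU, then __ieee80211_rx_handle_packet() is called, which eventually calls ieee80211_invoke_rx_handlers(). Next, the CALL_RXH macro calls the receive handlers one by one. The order matters: each handler checks whether the packet is its business (for example, a PS-Poll received by something that is not an AP is naturally dropped) and returns RX_CONTINUE if it does not handle it.

Transmit path: ieee80211_tx() is the main handler for transmitting packets; it eventually calls invoke_tx_handlers(), which uses the CALL_TXH macro to run the transmit handlers one by one.

```cpp
static bool ieee80211_tx(struct ieee80211_sub_if_data *sdata,
                         struct sk_buff *skb, bool txpending,
                         enum ieee80211_band band)
{
    struct ieee80211_local *local = sdata->local;
    struct ieee80211_tx_data tx;
    ieee80211_tx_result res_prepare;
    struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
    bool result = true;
    int led_len;

    /*
     * Sanity check: drop the SKB if it is shorter than 10 bytes.
     * (I wondered what "less than 10" means given all the members of
     * struct sk_buff: sock, net_device, L4/L3/L2 headers, protocol...
     * skb->len is the length of the packet data, not of the struct.
     * 10 bytes is frame control (2) + duration (2) + address 1 (6),
     * so anything shorter cannot even hold a minimal 802.11 header.)
     */
    if (unlikely(skb->len < 10)) {
        dev_kfree_skb(skb);
        return true;
    }

    /* Initialize the ieee80211_tx_data structure tx via ieee80211_tx_prepare() */
    led_len = skb->len;
    res_prepare = ieee80211_tx_prepare(sdata, &tx, skb);

    if (unlikely(res_prepare == TX_DROP)) {
        ieee80211_free_txskb(&local->hw, skb);
        return true;
    } else if (unlikely(res_prepare == TX_QUEUED)) {
        return true;
    }

    /* Run the TX handlers; if they all pass, call __ieee80211_tx() */
    . . .
    if (!invoke_tx_handlers(&tx))
        result = __ieee80211_tx(local, &tx.skbs, led_len,
                                tx.sta, txpending);
    return result;
}
```

802.11 fragments only unicast packets, and fragmentation cannot be combined with aggregation. Because wireless networks are now very fast (802.11n, 802.11ac), fragmentation is rarely used.

debugfs is a technique that supports exporting debugging information to userspace; it creates entries under sysfs (typically mounted at /sys/kernel/debug). debugfs is a virtual filesystem devoted to debugging information.

IEEE 802.11n: supports legacy mode and HT mode, works in the 2.4 GHz and 5 GHz bands, and uses MIMO (multiple antennas for sending and receiving, with signal processing such as interleaving on the corresponding radio chains) and beamforming. The most important packet-level improvement is aggregation, which comes as A-MPDU and A-MSDU and can use block ack to acknowledge a whole block.

* A-MSDU: aggregated MAC service data unit
* A-MPDU: aggregated MAC protocol data unit

Note that A-MSDU is supported only on the receive path, not on the transmit path (as the resource above explains, transmission has to go through the conversion in the 802.11 MAC layer and become an A-MPDU).

A block ack session has two sides, the originator and the recipient, and each session has a different TID (traffic identifier). The originator calls ieee80211_start_tx_ba_session() to start a block ack session, usually from the driver's rate control algorithm (for example, in the ath9k driver ath_tx_status() calls ieee80211_start_tx_ba_session()). The originator sets the state to HT_ADDBA_REQUESTED_MSK and sends an ADDBA request packet (containing the session parameters, the session TID, and so on). The recipient then replies with an ACK (in Wi-Fi every unicast packet must be acknowledged) and after that sends the ADDBA response.

A BAR is a control packet with subtype IEEE80211_STYPE_BACK_REQ. It contains the SSN (starting sequence number), which is the sequence number of the first MSDU in the block that needs to be acknowledged.

![BAR-req](https://hackmd.io/_uploads/SyGRoUsRJe.png)

```cpp
struct ieee80211_bar {
    __le16 frame_control;
    __le16 duration;
    __u8 ra[6];
    __u8 ta[6];
    __le16 control;
    __le16 start_seq_num;
} __packed;
```

Block ack sender/receiver flow:

![block-ack-flow](https://hackmd.io/_uploads/r1AbnUoAJe.png)

The other kind, delayed block ack, answers the BAR with an ordinary ACK and sends the BA some time later. (Immediate responses are handled by hardware, delayed responses by software.)

802.11s: fully connected topology (full mesh) vs mesh topology (partial mesh):

![full-connect](https://hackmd.io/_uploads/S1z33LsRyl.png)
![paritial-mesh](https://hackmd.io/_uploads/rkw16LjC1l.png)

802.11s defines a default routing protocol, HWMP (hybrid wireless mesh protocol), which handles layer-2 frames and uses MAC addresses rather than IP addresses. HWMP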
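routing is based on two types of routing (hence it is called hybrid). The first is on-demand routing, and the second is proactive routing. The main difference between the two mechanisms has to do with the time in which path establishment is initiated.

:::spoiler Can you explain the techniques that follow from Wi-Fi's two characteristics (mobility and an unreliable medium), as well as the RX/TX data flow, block ack, and aggregation?
:::

Summary: this chapter introduces the mac80211 stack, aggregation, block ack, and RX/TX in brief. I have finally finished the close reading. Next comes review and turning the key points into Anki cards, then Linux Device Drivers (LDD). For BSP work the priority is probably SoC topics such as SPI, GPIO, and memory. There will be a mountain of material, so the point is to find what is most relevant to the job and schedule that for study first.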
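
As a recap aid for the RX/TX handler chains mentioned in this chapter (CALL_RXH and CALL_TXH run handlers in a fixed order, and a handler that does not care about the frame returns "continue"), here is a minimal userspace sketch of that pattern. It is an illustration under stated assumptions, not mac80211 code: the `demo_*` names, the three `h_*` handlers, and the reduced `demo_rx_data` fields are all hypothetical, and the result codes are only loosely modelled on RX_CONTINUE, RX_DROP, and RX_QUEUED.

```cpp
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, loosely modelled on the mac80211 ones. */
enum demo_rx_result { DEMO_RX_CONTINUE, DEMO_RX_DROP, DEMO_RX_QUEUED };

/* Hypothetical, heavily reduced stand-in for the per-frame RX state. */
struct demo_rx_data {
    int fcs_ok;        /* did the frame pass the FCS check? */
    int is_pspoll;     /* is this a PS-Poll control frame? */
    int we_are_ap;     /* is the receiving interface an AP? */
    int handlers_run;  /* how many handlers looked at the frame */
};

typedef enum demo_rx_result (*demo_rx_handler)(struct demo_rx_data *rx);

/* First in the chain: frames that failed the FCS check stop here. */
static enum demo_rx_result h_check_fcs(struct demo_rx_data *rx)
{
    rx->handlers_run++;
    return rx->fcs_ok ? DEMO_RX_CONTINUE : DEMO_RX_DROP;
}

/* PS-Poll only makes sense on an AP; other frames are passed along. */
static enum demo_rx_result h_pspoll(struct demo_rx_data *rx)
{
    rx->handlers_run++;
    if (!rx->is_pspoll)
        return DEMO_RX_CONTINUE;
    return rx->we_are_ap ? DEMO_RX_QUEUED : DEMO_RX_DROP;
}

/* Last in the chain: consume whatever is left as data. */
static enum demo_rx_result h_data(struct demo_rx_data *rx)
{
    rx->handlers_run++;
    return DEMO_RX_QUEUED;
}

/* Run the handlers in order; the first non-CONTINUE result ends the walk. */
static enum demo_rx_result demo_invoke_rx_handlers(struct demo_rx_data *rx)
{
    static const demo_rx_handler chain[] = { h_check_fcs, h_pspoll, h_data };
    size_t i;

    assert(rx != NULL);
    for (i = 0; i < sizeof(chain) / sizeof(chain[0]); i++) {
        enum demo_rx_result res = chain[i](rx);
        if (res != DEMO_RX_CONTINUE)
            return res;
    }
    return DEMO_RX_CONTINUE;
}
```

The ordering is the design point: putting the cheap validity check first means later handlers never see a corrupted frame, and a PS-Poll arriving at a non-AP is dropped before the data handler runs.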