# Linux 核心專題筆記
## eBPF TCP 伺服器
- [2023開發紀錄](https://hackmd.io/@sysprog/ryBw0adH2)
- [2024開發紀錄](https://hackmd.io/@sysprog/H1AORs8I0)
- [Linux eBPF](https://hackmd.io/@sysprog/linux-ebpf)
### eBPF

- 在使用者層級編譯好程式碼BPF bytecode
- 透過loader載入到核心層級進行驗證,執行
- 結束後兩種方式輸出1. 輸出到per-event data, eBPF map
#### perf-event data 又稱Perf Event Buffer
- map型態`BPF_MAP_TYPE_PERF_EVENT_ARRAY`
- 每個CPU維護一個自己的Buffer
- 使用的[bpf helper](https://man7.org/linux/man-pages/man7/bpf-helpers.7.html)
```CPP=
long bpf_perf_event_output(void *ctx, struct bpf_map *map, u64
flags, void *data, u64 size)
```
#### bpf CO-RE
主要目標是解決不同核心版本下使用編譯過的bpf程式
透過BPF Type Format (BTF)定位現在核心版本下資料結構的offset
可以想像成提供一個虛擬索引去轉換不同核心版本下的實際位置
##### 參考資料
- [core guide](https://nakryiko.com/posts/bpf-core-reference-guide/)
- [btf hub](https://github.com/aquasecurity/btfhub)
## eBPF 函式庫
### BPF 函數定義
BPF_CALL2_5
定義BPF可以呼叫的函式
[核心程式碼位置](https://github.com/torvalds/linux/blob/master/kernel/bpf/helpers.c)
例子:bpg_map_update_elem實作
```cpp=
BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
void *, value, u64, flags)
{
WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
!rcu_read_lock_bh_held());
return map->ops->map_update_elem(map, key, value, flags);
}
```
### BPF Maps
Map 類型: [參見](https://docs.ebpf.io/linux/map-type/)
透過bpf syscall 存取
```cpp=
int map_fd;
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_ARRAY;
.key_size = sizeof(__u32);
.value_size = sizeof(__u32);
.max_entries = 256;
};
map_fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
```
靜態產生
```cpp=
#define BPF_MAP(_name, _type, _key_type, _value_type, _max_entries) \
struct { \
__uint(type, _type); \
__uint(key_size, sizeof(_key_type)); \
__uint(value_size, sizeof(_value_type)); \
__uint(max_entries, _max_entries); \
} _name SEC(".maps");
#define BPF_PERF_OUTPUT(_name) \
BPF_MAP(_name, BPF_MAP_TYPE_PERF_EVENT_ARRAY, int, int, 2048);
```
### Environment
#### kernel config
```bash!
cat /boot/config-6.11.0-17-generic | grep -i bpf
CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y
# BPF subsystem
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
# CONFIG_BPF_PRELOAD is not set
CONFIG_BPF_LSM=y
# end of BPF subsystem
CONFIG_CGROUP_BPF=y
CONFIG_IPV6_SEG6_BPF=y
CONFIG_NETFILTER_BPF_LINK=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_STREAM_PARSER=y
CONFIG_LWTUNNEL_BPF=y
# HID-BPF support
CONFIG_HID_BPF=y
# end of HID-BPF support
CONFIG_BPF_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y
CONFIG_TEST_BPF=m
```
#### 執行環境
```bash!
uname -a
Linux brianpan-Aspire-A14-52MT 6.11.0-17-generic #17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
sudo lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 42 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Core(TM) Ultra 5 226V
BIOS Model name: Intel(R) Core(TM) Ultra 5 226V To Be Filled By O.E.M.
CPU @ 0.4GHz
BIOS CPU family: 773
CPU family: 6
Model: 189
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 40%
CPU max MHz: 4500.0000
CPU min MHz: 400.0000
BogoMIPS: 6220.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
art arch_perfmon pebs bts rep_good nopl xtopology nons
top_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma c
x16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt t
sc_deadline_timer aes xsave avx f16c rdrand lahf_lm ab
m 3dnowprefetch cpuid_fault epb intel_ppin ssbd ibrs i
bpb stibp ibrs_enhanced tpr_shadow flexpriority ept vp
id ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms
invpcid rdt_a rdseed adx smap clflushopt clwb intel_p
t sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_det
ect user_shstk avx_vnni lam wbnoinvd dtherm ida arat p
ln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_r
eq hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqd
q rdpid bus_lock_detect movdiri movdir64b fsrm md_clea
r serialize pconfig arch_lbr ibt flush_l1d arch_capabi
lities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 320 KiB (8 instances)
L1i: 512 KiB (8 instances)
L2: 14 MiB (5 instances)
L3: 8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prct
l
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointe
r sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditiona
l; RSB filling; PBRSB-eIBRS Not affected; BHI Not affe
cted
Srbds: Not affected
Tsx async abort: Not affected
```
#### 套件管理
```bash!
sudo apt install -y linux-headers-$(uname -r) bpfcc-tools python3-bpfcc libbpfcc libbpfcc-dev clang
```
## 上學期專案
[Project Link](https://github.com/SuNsHiNe-75/ebpf-tcp-server)
### Ideas
- By sockops type program `BPF_PROG_TYPE_SOCK_OPS`, program will be called during lifetime of the socket: [REF](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_SOCK_OPS/)
- By stream_verdict type program `BPF_SK_SKB_VERDICT`, our program skips the kernel network stack, [REF](https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_SK_SKB/)
### stream_verdict
解釋: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_SK_SKB/?utm_source=chatgpt.com#as-bpf_sk_skb_stream_verdict-program
stream parser 流程過後會呼叫stream_verdict 來過濾封包
### 沒辦法從bpftrace找到sk_skb/stream_verdict
Can't find the entry point of `sk_skb/stream_verdict` from listing available ebpf call
```bash!
sudo bpftrace -l
```
[Linux source code Readme](https://github.com/torvalds/linux/blob/a5806cd506af5a7c19bcd596e4708b5c464bfd21/Documentation/bpf/map_sockmap.rst#L49)
### compilation error fix
```
/usr/include/linux/types.h:5:10: fatal error: 'asm/types.h' file not found
5 | #include <asm/types.h>
| ^~~~~~~~~~~~~
1 error generated.
```
```bash!
sudo apt-get install -y gcc-multilib
```
### 載入程式
```bash!
# load sockmap_ops to file system
sudo bpftool prog load bpf_sockops.o /sys/fs/bpf/bpf_sockops
sudo bpftool prog show pinned /sys/fs/bpf/bpf_sockops
127: sock_ops name bpf_sockmap tag d1bb5f447965262d gpl
loaded_at 2025-05-18T22:31:14-0500 uid 0
xlated 376B jited 216B memlock 4096B map_ids 5,7
btf_id 204
```
mount a bpf fs on bpffs folder
```bash!
mkdir bpffs
sudo mount -t bpf none bpffs
```
ping map to bpf filesystem
```bash!
sudo bpftool map show name sockmap_ops
5: sockhash name sockmap_ops flags 0x0
key 16B value 4B max_entries 65535 memlock 1048912B
sudo bpftool map pin name sockmap_ops bpffs/sockmap_ops
# dump the map by name
sudo bpftool map dump name sockmap_ops
```
attach bpf_socksops to cgroup
```bash!
sudo bpftool cgroup attach /sys/fs/cgroup/ sock_ops pinned /sys/fs/bpf/bpf_sockops
```
load bpf_redir.o program with pinned map named sockmap_ops
```bash!
sudo bpftool prog load bpf_redir.o /sys/fs/bpf/bpf_redir map name sockmap_ops pinned bpffs/sockmap_ops
```
check if map and bpf program are loaded
```bash!
sudo bpftool prog list
127: sock_ops name bpf_sockmap tag d1bb5f447965262d gpl
loaded_at 2025-05-18T22:31:14-0500 uid 0
xlated 376B jited 216B memlock 4096B map_ids 5,7
btf_id 204
136: sk_skb name bpf_redir tag 8aae03b571c7bc42 gpl
loaded_at 2025-05-18T23:20:44-0500 uid 0
xlated 536B jited 307B memlock 4096B map_ids 5,11
btf_id 215
```
attach the program to stream_verdict
```bash!
sudo bpftool prog attach pinned /sys/fs/bpf/bpf_redir stream_verdict pinned bpffs/sockmap_ops
```
run the program
```bash!
./ebpf-echo-server
ss -l | grep 12345
tcp LISTEN 0 1024 0.0.0.0:12345 0.0.0.0:*
```
test echo
```bash!
telnet 192.168.1.139 12345
Trying 192.168.1.139...
Connected to 192.168.1.139.
Escape character is '^]'.
xxx
xxx
```
#### output of bpf_sockops.o
##### List program sections
```bash!
readelf -S bpf_sockops.o
There are 28 section headers, starting at offset 0x2b68:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .strtab STRTAB 0000000000000000 00002a05
000000000000015c 0000000000000000 0 0 1
[ 2] .text PROGBITS 0000000000000000 00000040
0000000000000000 0000000000000000 AX 0 0 4
[ 3] sockops PROGBITS 0000000000000000 00000040
0000000000000148 0000000000000000 AX 0 0 8
[ 4] .relsockops REL 0000000000000000 00002010
0000000000000030 0000000000000010 I 27 3 8
[ 5] license PROGBITS 0000000000000000 00000188
0000000000000004 0000000000000000 A 0 0 1
[ 6] .maps PROGBITS 0000000000000000 00000190
0000000000000028 0000000000000000 WA 0 0 8
[ 7] .rodata PROGBITS 0000000000000000 000001b8
000000000000002c 0000000000000000 A 0 0 1
[ 8] .debug_loclists PROGBITS 0000000000000000 000001e4
000000000000008d 0000000000000000 0 0 1
[ 9] .debug_abbrev PROGBITS 0000000000000000 00000271
00000000000001c6 0000000000000000 0 0 1
[10] .debug_info PROGBITS 0000000000000000 00000437
00000000000004fa 0000000000000000 0 0 1
[11] .rel.debug_info REL 0000000000000000 00002040
0000000000000050 0000000000000010 I 27 10 8
[12] .debug_str_o[...] PROGBITS 0000000000000000 00000931
00000000000001b4 0000000000000000 0 0 1
[13] .rel.debug_s[...] REL 0000000000000000 00002090
00000000000006b0 0000000000000010 I 27 12 8
[14] .debug_str PROGBITS 0000000000000000 00000ae5
0000000000000574 0000000000000001 MS 0 0 1
[15] .debug_addr PROGBITS 0000000000000000 00001059
0000000000000038 0000000000000000 0 0 1
[16] .rel.debug_addr REL 0000000000000000 00002740
0000000000000060 0000000000000010 I 27 15 8
[17] .BTF PROGBITS 0000000000000000 00001094
0000000000000a53 0000000000000000 0 0 4
[18] .rel.BTF REL 0000000000000000 000027a0
0000000000000040 0000000000000010 I 27 17 8
[19] .BTF.ext PROGBITS 0000000000000000 00001ae8
0000000000000170 0000000000000000 0 0 4
[20] .rel.BTF.ext REL 0000000000000000 000027e0
0000000000000140 0000000000000010 I 27 19 8
[21] .debug_frame PROGBITS 0000000000000000 00001c58
0000000000000028 0000000000000000 0 0 8
[22] .rel.debug_frame REL 0000000000000000 00002920
0000000000000020 0000000000000010 I 27 21 8
[23] .debug_line PROGBITS 0000000000000000 00001c80
000000000000011a 0000000000000000 0 0 1
[24] .rel.debug_line REL 0000000000000000 00002940
00000000000000c0 0000000000000010 I 27 23 8
[25] .debug_line_str PROGBITS 0000000000000000 00001d9a
00000000000000ac 0000000000000001 MS 0 0 1
[26] .llvm_addrsig LOOS+0xfff4c03 0000000000000000 00002a00
0000000000000005 0000000000000000 E 27 0 1
[27] .symtab SYMTAB 0000000000000000 00001e48
00000000000001c8 0000000000000018 1 16 8
```
#### list symbol table & hex of .maps section
```bash!
readelf -s bpf_sockops.o
Symbol table '.symtab' contains 19 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS bpf_sockops.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 3 sockops
3: 0000000000000138 0 NOTYPE LOCAL DEFAULT 3 LBB0_5
4: 0000000000000118 0 NOTYPE LOCAL DEFAULT 3 LBB0_4
5: 0000000000000000 23 OBJECT LOCAL DEFAULT 7 update_sockmap_o[...]
6: 0000000000000017 21 OBJECT LOCAL DEFAULT 7 update_sockmap_o[...]
7: 0000000000000000 0 SECTION LOCAL DEFAULT 7 .rodata
8: 0000000000000000 0 SECTION LOCAL DEFAULT 8 .debug_loclists
9: 0000000000000000 0 SECTION LOCAL DEFAULT 9 .debug_abbrev
10: 0000000000000000 0 SECTION LOCAL DEFAULT 12 .debug_str_offsets
11: 0000000000000000 0 SECTION LOCAL DEFAULT 14 .debug_str
12: 0000000000000000 0 SECTION LOCAL DEFAULT 15 .debug_addr
13: 0000000000000000 0 SECTION LOCAL DEFAULT 21 .debug_frame
14: 0000000000000000 0 SECTION LOCAL DEFAULT 23 .debug_line
15: 0000000000000000 0 SECTION LOCAL DEFAULT 25 .debug_line_str
16: 0000000000000000 328 FUNC GLOBAL DEFAULT 3 bpf_sockmap
17: 0000000000000000 40 OBJECT GLOBAL DEFAULT 6 sockmap_ops
18: 0000000000000000 4 OBJECT GLOBAL DEFAULT 5 __license
readelf -x .maps bpf_sockops.o
Hex dump of section '.maps':
0x00000000 00000000 00000000 00000000 00000000 ................
0x00000010 00000000 00000000 00000000 00000000 ................
0x00000020 00000000 00000000 ........
```
### bpf_redir.o
can't objdump bpf_redir.o
```bash!
objdump -d bpf_redir.o
bpf_redir.o: file format elf64-little
objdump: can't disassemble for architecture UNKNOWN!
```
use llvm-objdump to retrieve the content
```bash!
llvm-objdump-18 -d --section=sk_skb/stream_verdict bpf_redir.o
bpf_redir.o: file format elf64-bpf
Disassembly of section sk_skb/stream_verdict:
0000000000000000 <bpf_redir>:
0: b7 00 00 00 01 00 00 00 r0 = 0x1
1: 61 12 58 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x58)
2: 55 02 37 00 02 00 00 00 if r2 != 0x2 goto +0x37 <LBB0_9>
3: 61 12 88 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x88)
4: 55 02 35 00 39 30 00 00 if r2 != 0x3039 goto +0x35 <LBB0_9>
5: 61 12 00 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x0)
6: 15 02 33 00 00 00 00 00 if r2 == 0x0 goto +0x33 <LBB0_9>
7: 61 13 60 00 00 00 00 00 r3 = *(u32 *)(r1 + 0x60)
8: 61 12 5c 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x5c)
9: 5d 32 0f 00 00 00 00 00 if r2 != r3 goto +0xf <LBB0_5>
10: b7 03 00 00 39 30 00 00 r3 = 0x3039
11: 6b 3a fc ff 00 00 00 00 *(u16 *)(r10 - 0x4) = r3
12: 63 2a f8 ff 00 00 00 00 *(u32 *)(r10 - 0x8) = r2
13: 63 2a f4 ff 00 00 00 00 *(u32 *)(r10 - 0xc) = r2
14: b7 02 00 00 02 00 00 00 r2 = 0x2
15: 63 2a f0 ff 00 00 00 00 *(u32 *)(r10 - 0x10) = r2
16: 61 12 84 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x84)
17: dc 02 00 00 20 00 00 00 r2 = be32 r2
18: 6b 2a fe ff 00 00 00 00 *(u16 *)(r10 - 0x2) = r2
19: bf a3 00 00 00 00 00 00 r3 = r10
20: 07 03 00 00 f0 ff ff ff r3 += -0x10
21: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
23: b7 04 00 00 01 00 00 00 r4 = 0x1
24: 05 00 0e 00 00 00 00 00 goto +0xe <LBB0_6>
00000000000000c8 <LBB0_5>:
25: 63 3a f8 ff 00 00 00 00 *(u32 *)(r10 - 0x8) = r3
26: 63 2a f4 ff 00 00 00 00 *(u32 *)(r10 - 0xc) = r2
27: b7 02 00 00 02 00 00 00 r2 = 0x2
28: 63 2a f0 ff 00 00 00 00 *(u32 *)(r10 - 0x10) = r2
29: 61 12 84 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x84)
30: b7 03 00 00 39 30 00 00 r3 = 0x3039
31: 6b 3a fe ff 00 00 00 00 *(u16 *)(r10 - 0x2) = r3
32: dc 02 00 00 20 00 00 00 r2 = be32 r2
33: 6b 2a fc ff 00 00 00 00 *(u16 *)(r10 - 0x4) = r2
34: bf a3 00 00 00 00 00 00 r3 = r10
35: 07 03 00 00 f0 ff ff ff r3 += -0x10
36: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0x0 ll
38: b7 04 00 00 00 00 00 00 r4 = 0x0
0000000000000138 <LBB0_6>:
39: 85 00 00 00 48 00 00 00 call 0x48
40: bf 01 00 00 00 00 00 00 r1 = r0
41: 67 01 00 00 20 00 00 00 r1 <<= 0x20
42: 77 01 00 00 20 00 00 00 r1 >>= 0x20
43: 15 01 09 00 01 00 00 00 if r1 == 0x1 goto +0x9 <LBB0_8>
44: bf 03 00 00 00 00 00 00 r3 = r0
45: 87 03 00 00 00 00 00 00 r3 = -r3
46: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
48: b7 02 00 00 2a 00 00 00 r2 = 0x2a
49: bf 06 00 00 00 00 00 00 r6 = r0
50: 85 00 00 00 06 00 00 00 call 0x6
51: bf 60 00 00 00 00 00 00 r0 = r6
52: 05 00 05 00 00 00 00 00 goto +0x5 <LBB0_9>
00000000000001a8 <LBB0_8>:
53: 18 01 00 00 2a 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x2a ll
55: b7 02 00 00 2c 00 00 00 r2 = 0x2c
56: 85 00 00 00 06 00 00 00 call 0x6
57: b7 00 00 00 01 00 00 00 r0 = 0x1
00000000000001d0 <LBB0_9>:
58: 95 00 00 00 00 00 00 00 exit
```
### eBPF 暫存器使用
- r0: 回傳值
- r1-r5: hold arguments from ebpf programs
assembly
```cpp!
1: 61 12 58 00 00 00 00 00 r2 = *(u32 *)(r1 + 0x58)
```
c code
```cpp!
skb->family
```
- r6-r10: callee saved registers that will be preserved on helper function call
`Callee-saved registers (AKA non-volatile registers, or call-preserved) are used to hold long-lived values that should be preserved across calls.`
- r10: Read-only with frame pointer address
## Experiments
### Check bpf_printk messages
```bash!
sudo cat /sys/kernel/debug/tracing/trace_pipe
```
### Running bench
Timeout from modified bench
```bash!
brianpan@brianpan-Aspire-A14-52MT:~/kernel/ebpf-tcp-server$ ./bench
Generating String...
Connecting...
Getting the socket name...
Send & Recv...
Finish Sending.
recv timeout occurred
```
Cat debugging pipe
```bash!
cat /sys/kernel/debug/tracing/trace_pipe
irq/181-iwlwifi-564 [004] ..s31 11110.660072: bpf_trace_printk: Update map success.
bench-16639 [000] ...11 11158.604765: bpf_trace_printk: Update map success.
bench-16639 [000] ..s31 11158.604794: bpf_trace_printk: Update map success.
bench-16639 [000] ..s31 11159.621702: bpf_trace_printk: bpf_sk_redirect_hash() failed 0, error
```
#### Debugging
Try finding the key by bpf_map_lookup_elem, but BPF verifier disables the action to conversion address to uintptr_t
```cpp!
uintptr_t r = (uintptr_t) bpf_map_lookup_elem(&sockmap_ops, &skm_key);
if (!r) {
bpf_printk("key found");
}
```
#### Failure tcpdump
```bash!
22:51:03.777244 IP brianpan-Aspire-A14-52MT.lan.39950 > brianpan-Aspire-A14-52MT.lan.12345: Flags [S], seq 3130719987, win 65495, options [mss 65495,sackOK,TS val 3153089712 ecr 0,nop,wscale 7], length 0
22:51:03.777275 IP brianpan-Aspire-A14-52MT.lan.12345 > brianpan-Aspire-A14-52MT.lan.39950: Flags [S.], seq 4067401096, ack 3130719988, win 65483, options [mss 65495,sackOK,TS val 3153089712 ecr 3153089712,nop,wscale 7], length 0
22:51:03.777314 IP brianpan-Aspire-A14-52MT.lan.39950 > brianpan-Aspire-A14-52MT.lan.12345: Flags [.], ack 1, win 512, options [nop,nop,TS val 3153089712 ecr 3153089712], length 0
22:51:03.777380 IP brianpan-Aspire-A14-52MT.lan.39950 > brianpan-Aspire-A14-52MT.lan.12345: Flags [P.], seq 1:50, ack 1, win 512, options [nop,nop,TS val 3153089712 ecr 3153089712], length 49
22:51:03.777387 IP brianpan-Aspire-A14-52MT.lan.12345 > brianpan-Aspire-A14-52MT.lan.39950: Flags [.], ack 50, win 512, options [nop,nop,TS val 3153089712 ecr 3153089712], length 0
22:51:04.824177 IP brianpan-Aspire-A14-52MT.lan.39950 > brianpan-Aspire-A14-52MT.lan.12345: Flags [F.], seq 50, ack 1, win 512, options [nop,nop,TS val 3153090759 ecr 3153089712], length 0
22:51:04.864823 IP brianpan-Aspire-A14-52MT.lan.12345 > brianpan-Aspire-A14-52MT.lan.39950: Flags [.], ack 51, win 512, options [nop,nop,TS val 3153090800 ecr 3153090759], length 0
```
Success call
```bash!
22:54:11.986839 IP brianpan-Aspire-A14-52MT.lan.39392 > brianpan-Aspire-A14-52MT.lan.12345: Flags [S], seq 4060196293, win 65495, options [mss 65495,sackOK,TS val 3153277922 ecr 0,nop,wscale 7], length 0
22:54:11.986882 IP brianpan-Aspire-A14-52MT.lan.12345 > brianpan-Aspire-A14-52MT.lan.39392: Flags [S.], seq 1682861897, ack 4060196294, win 65483, options [mss 65495,sackOK,TS val 3153277922 ecr 3153277922,nop,wscale 7], length 0
22:54:11.986949 IP brianpan-Aspire-A14-52MT.lan.39392 > brianpan-Aspire-A14-52MT.lan.12345: Flags [.], ack 1, win 512, options [nop,nop,TS val 3153277922 ecr 3153277922], length 0
22:54:11.987079 IP brianpan-Aspire-A14-52MT.lan.39392 > brianpan-Aspire-A14-52MT.lan.12345: Flags [P.], seq 1:50, ack 1, win 512, options [nop,nop,TS val 3153277922 ecr 3153277922], length 49
22:54:11.987115 IP brianpan-Aspire-A14-52MT.lan.12345 > brianpan-Aspire-A14-52MT.lan.39392: Flags [.], ack 50, win 512, options [nop,nop,TS val 3153277922 ecr 3153277922], length 0
22:54:11.987190 IP brianpan-Aspire-A14-52MT.lan.39392 > brianpan-Aspire-A14-52MT.lan.12345: Flags [F.], seq 50, ack 1, win 512, options [nop,nop,TS val 3153277922 ecr 3153277922], length 0
22:54:12.027846 IP brianpan-Aspire-A14-52MT.lan.12345 > brianpan-Aspire-A14-52MT.lan.39392: Flags [.], ack 51, win 512, options [nop,nop,TS val 3153277963 ecr 3153277922], length 0
```
#### Trace source
```cpp!
static struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
{
struct bpf_shtab *htab = container_of(map, struct bpf_shtab, map);
u32 key_size = map->key_size, hash;
struct bpf_shtab_bucket *bucket;
struct bpf_shtab_elem *elem;
WARN_ON_ONCE(!rcu_read_lock_held());
hash = sock_hash_bucket_hash(key, key_size);
bucket = sock_hash_select_bucket(htab, hash);
elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size);
return elem ? elem->sk : NULL;
}
BPF_CALL_4(bpf_sock_hash_update, struct bpf_sock_ops_kern *, sops,
struct bpf_map *, map, void *, key, u64, flags)
{
WARN_ON_ONCE(!rcu_read_lock_held());
if (likely(sock_map_sk_is_suitable(sops->sk) &&
sock_map_op_okay(sops)))
return sock_hash_update_common(map, key, sops->sk, flags);
return -EOPNOTSUPP;
}
```
[REF](https://github.com/torvalds/linux/blob/cd2e103d57e5615f9bb027d772f93b9efd567224/net/core/sock_map.c#L991)
```cpp!
static int sock_hash_update_common(struct bpf_map *map, void *key,
struct sock *sk, u64 flags)
{
struct bpf_shtab *htab = container_of(map, struct bpf_shtab, map);
u32 key_size = map->key_size, hash;
struct bpf_shtab_elem *elem, *elem_new;
struct bpf_shtab_bucket *bucket;
struct sk_psock_link *link;
struct sk_psock *psock;
int ret;
WARN_ON_ONCE(!rcu_read_lock_held());
if (unlikely(flags > BPF_EXIST))
return -EINVAL;
link = sk_psock_init_link();
if (!link)
return -ENOMEM;
ret = sock_map_link(map, sk);
if (ret < 0)
goto out_free;
psock = sk_psock(sk);
WARN_ON_ONCE(!psock);
hash = sock_hash_bucket_hash(key, key_size);
bucket = sock_hash_select_bucket(htab, hash);
spin_lock_bh(&bucket->lock);
elem = sock_hash_lookup_elem_raw(&bucket->head, hash, key, key_size);
if (elem && flags == BPF_NOEXIST) {
ret = -EEXIST;
goto out_unlock;
} else if (!elem && flags == BPF_EXIST) {
ret = -ENOENT;
goto out_unlock;
}
elem_new = sock_hash_alloc_elem(htab, key, key_size, hash, sk, elem);
if (IS_ERR(elem_new)) {
ret = PTR_ERR(elem_new);
goto out_unlock;
}
sock_map_add_link(psock, link, map, elem_new);
/* Add new element to the head of the list, so that
* concurrent search will find it before old elem.
*/
hlist_add_head_rcu(&elem_new->node, &bucket->head);
if (elem) {
hlist_del_rcu(&elem->node);
sock_map_unref(elem->sk, elem);
sock_hash_free_elem(htab, elem);
}
spin_unlock_bh(&bucket->lock);
return 0;
out_unlock:
spin_unlock_bh(&bucket->lock);
sk_psock_put(sk, psock);
out_free:
sk_psock_free_link(link);
return ret;
}
```
Guess if `sock_hash_lookup_elem` and `bpf_sock_hash_update` are running on different CPU, RCU took old data and failed to find the element
#### usleep() between connect(), send() call in bench.c
Tested 100 times and no more error
```cpp!
// add usleep to sleep 0.1s
usleep(100000);
printf("Send & Recv...\n");
```
### Benchmark
#### Setup
```bash!
sudo sysctl net.core.somaxconn=4096
sudo sysctl net.ipv4.tcp_max_syn_backlog=4096
ulimit -n 32768
```
#### Running benchmark between ebpf, kecho, user
https://github.com/Brianpan/ebpf-tcp-server/blob/main/benchmark/echo_bench.png
kecho performs better than ebpf echo
Guess if it is related to threads are contending ebpf map