# eBPF: The Berkeley Packet Filter, take RAID5 for example
###### tags: `ebpf` `raid` `raid5`
## :memo: What's bpf?
The Berkeley Packet Filter is a technology used in certain computer operating systems for programs that need to, among other things, analyze network traffic. It provides a raw interface to data link layers, permitting raw link-layer packets to be sent and received. It is available on most Unix-like operating systems.
## :memo: What's bcc?
BPF Compiler Collection (BCC). BCC is a toolkit for creating efficient kernel tracing and manipulation programs

references:
* https://blog.csdn.net/luckyapple1028/article/details/52972315
* https://cloud.tencent.com/developer/article/1634120
### Example: Summarize block device I/O latency as a histogram
```python
from __future__ import print_function
from bcc import BPF
...
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
typedef struct disk_key {
char disk[DISK_NAME_LEN];
u64 slot;
} disk_key_t;
typedef struct flag_key {
u64 flags;
u64 slot;
} flag_key_t;
BPF_HASH(start, struct request *);
STORAGE
// time block I/O
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
u64 ts = bpf_ktime_get_ns();
start.update(&req, &ts);
return 0;
}
// output
int trace_req_done(struct pt_regs *ctx, struct request *req)
{
u64 *tsp, delta;
// fetch timestamp and calculate delta
tsp = start.lookup(&req);
if (tsp == 0) {
return 0; // missed issue
}
delta = bpf_ktime_get_ns() - *tsp;
FACTOR
// store as histogram
STORE
start.delete(&req);
return 0;
}
"""
...
# load BPF program
b = BPF(text=bpf_text)
if args.queued:
b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
else:
if BPF.get_kprobe_functions(b'blk_start_request'):
b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_done",
fn_name="trace_req_done")
print("Tracing block device I/O... Hit Ctrl-C to end.")
```
### Example: hook kprobe to `md_make_request`
The example demonstrate how to trace bio (block device IO)
```python=
#!/usr/bin/env python
from bcc import BPF
# BPF Program
bpf_text = """
#include <bcc/proto.h>
#include <linux/blkdev.h>
#include "/bin/raid5.h"
BPF_HASH(handle_stripe_timer, sector_t);
BPF_HISTOGRAM(dist_stripe_active_time);
int kprobe__handle_stripe(struct pt_regs *ctx, struct stripe_head *sh)
{
sector_t sector = sh->sector;
u64 curr = bpf_ktime_get_ns();
u64 *plast = handle_stripe_timer.lookup(§or);
if (!plast) {
handle_stripe_timer.update(§or, &curr);
}
return 0;
}
int kprobe__do_release_stripe(struct pt_regs *ctx, struct r5conf *conf, struct stripe_head *sh)
{
sector_t sector = sh->sector;
unsigned long state = sh->state;
if (!test_bit(STRIPE_HANDLE, &state)) {
u64 *plast = handle_stripe_timer.lookup(§or);
if (plast) {
u64 curr = bpf_ktime_get_ns();
u64 delta = (curr - *plast) / 1000; //ms
dist_stripe_active_time.increment(bpf_log2l(delta));
handle_stripe_timer.delete(§or);
}
}
return 0;
}
"""
# Build and Inject BPF
b = BPF(text=bpf_text)
print("Tracing block device I/O... Hit Ctrl-C to end.")
while True:
try:
print(b.trace_readline())
except KeyboardInterrupt:
break
dist = b.get_table("dist_stripe_active_time")
print("\n dist_stripe_active_time :")
dist.print_log2_hist("ns", "")
```
## RAID
### :memo: What's RAID?
RAID ("Redundant Array of Inexpensive Disks" or "Redundant Array of Independent Disks") is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both.
There are different type of RAID:
* RAID-N, where N is a number presenting different usage. e.g: RAID1, RAID0, RAID5 ... etc




* RAID-AB, where A, B are number, A stands for `buttom` and B for `top`



* Special, eg. JBOD, [SHR (synology hybrid RAID)](https://www.synology.com/en-global/knowledgebase/DSM/tutorial/Storage/What_is_Synology_Hybrid_RAID_SHR)


### :memo: RAID5
#### RAID5 algorithm
RAID5 use XOR to ensure parity.

Take the picture above for example. There are five disk, and the last one is responsible for parity. The red one is broken but it can be recovered with parity and the others block. **1(parity) 1(block1) ^ 0(block2) ^ 1(block4) = 1(block3)**
And the distribution of RAID5 could be different, for example
Left symmetric

Right symmetric

#### RAID5 structure

* sector: basic IO unit, 512 bytes
* stripe head: In the linux kernel this is the real unit (fine grain), 4 KB. That is said that in each member disk system takes 4 KB to conduct XOR
* chunk
* stripe:

#### RAID5 kernel workflow


* `struct stripe_head` in the basic unit of IO handle
* `raid5_make_request`
* `handle_active_stripes`: handle the stripes data, which calls `handle_stripe` which calls `raid_run_ops`, `ops_run_io`
* `raid_run_ops`: is for check parity block
* `ops_run_io`: submit request to disk
### :memo: RAID in Linux kernel
Here are some references:
* [What is RAID and why should you want it?](https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F): explains different types of RAID
* [RAID superblock formats](https://raid.wiki.kernel.org/index.php/RAID_superblock_formats)
* [drivers defined in linux source code](https://elixir.bootlin.com/linux/latest/source/drivers/md)
* [Linux內核中RAID5源碼詳解之寫過程剖析(二)](https://www.itdaan.com/tw/c86b2ff4c4636e34cada04ecfa267758)
* [原理到实现,RAID5原理详解及代码实现浅析](https://zhuanlan.zhihu.com/p/80361528)
So, according to the examples above.
Now, what if we know there is a call name `ops_run_io`, and the structure defined in linux source code is `struct sh`.
How can we start to conduct a related experient?
#### Example: `linux/drivers/md/raid5.h`
```cpp
struct stripe_head {
struct hlist_node hash;
struct list_head lru; /* inactive_list or handle_list */
struct llist_node release_list;
struct r5conf *raid_conf;
short generation; /* increments with every
* reshape */
sector_t sector; /* sector of this row */
short pd_idx; /* parity disk index */
short qd_idx; /* 'Q' disk index for raid6 */
short ddf_layout;/* use DDF ordering to calculate Q */
short hash_lock_index;
unsigned long state; /* state flags */
atomic_t count; /* nr of active thread/requests */
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
int overwrite_disks; /* total overwrite disks in stripe,
* this is only checked when stripe
* has STRIPE_BATCH_READY
*/
enum check_states check_state;
enum reconstruct_states reconstruct_state;
spinlock_t stripe_lock;
int cpu;
struct r5worker_group *group;
struct stripe_head *batch_head; /* protected by stripe lock */
spinlock_t batch_lock; /* only header's lock is useful */
struct list_head batch_list; /* protected by head's batch lock*/
union {
struct r5l_io_unit *log_io;
struct ppl_io_unit *ppl_io;
};
struct list_head log_list;
sector_t log_start; /* first meta block on the journal */
struct list_head r5c; /* for r5c_cache->stripe_in_journal */
struct page *ppl_page; /* partial parity of this stripe */
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
* @target2 - 2nd compute target in the raid6 case
* @zero_sum_result - P and Q verification flags
* @request - async service request flags for raid_run_ops
*/
struct stripe_operations {
int target, target2;
enum sum_check_flags zero_sum_result;
} ops;
struct r5dev {
/* rreq and rvec are used for the replacement device when
* writing data to both devices.
*/
struct bio req, rreq;
struct bio_vec vec, rvec;
struct page *page, *orig_page;
struct bio *toread, *read, *towrite, *written;
sector_t sector; /* sector of this page */
unsigned long flags;
u32 log_checksum;
unsigned short write_hint;
} dev[1]; /* allocated with extra space depending of RAID geometry */
};
```
Q: How does raid request works in linux kernel?
A:

Q: Where to find codes related to `raid`?
A: `linux-4.4.x/drivers/md`, where you can see `raid5.h`
Q: How to start?
A: Think like this,
- [ ] hook a `kprobe` to `ops_run_io`
```c
static struct kprobe kp = {
.symbol_name = "ops_run_io",};
```
- [ ] traverse disks in a RAID
```c
// Get sh from register
struct stripe_head *sh = (struct stripe_head *) regs->di;
static void dump_dev_state(char *buf, unsigned int size, struct r5dev *dev){
static int r5dev_state_size = sizeof(r5dev_flags_strs) / sizeof(char*);
int i;
char *p;
p = &buf[0];
for (i = 0; i < r5dev_state_size; i++) {
if (test_bit(i, &dev->flags)) {
strcpy(p, r5dev_flags_strs[i]);
p += strlen(r5dev_flags_strs[i]);
*(p++) = ',';
*(p++) = ' ';
}
}
if (p != buf) {
p -= 2;
}
*p = '\0';}
static void dump_devs_state(char *buf, unsigned int size, struct stripe_head *sh){
int i;
char state[256];
char *p;
p = &buf[0];
for (i = 0; i < sh->disks; i++) {
dump_dev_state(state, sizeof(state), &sh->dev[i]);
*(p++) = 'd';
*(p++) = '0' + i / 10;
*(p++) = '0' + i % 10;
*(p++) = ':';
*(p++) = ' ';
strcpy(p, state);
p += strlen(state);
*(p++) = '\n';
}
*p = '\0';
}
```
Tip: user [kobject](https://www.kernel.org/doc/Documentation/kobject.txt).
```python=
#!/usr/bin/env python
from bcc import BPF
# BPF Program
bpf_text = """
#include <bcc/proto.h>
#include <linux/blkdev.h>
#include <linux/types.h>
#include "md.h"
int kprobe__raid5_make_request(struct pt_regs *ctx, struct mddev *mdv, struct bio *bio)
{
bpf_trace_printk("BIO start: %llu length: %lu(sectors)\\n",
bio->bi_iter.bi_sector, bio->bi_iter.bi_size >> 9);
bpf_trace_printk("Get the mddev structure of dev%d", mdv->md_minor);
/*
struct list_head *list;
list_for_each(list, &mdv->disks) {
struct mddev *md1 = list_entry(list,struct mddev, disks);
bpf_trace_printk("Get the mddev structure of dev%d", md1->md_minor);
}
*/
return 0;
};
"""
print("This is an example of kprobe hook of raid5_make_request().")
# Build and Inject BPF
b = BPF(text=bpf_text)
# Print debug output
while True:
print(b.trace_readline())
```
```bash=
length: 0(sectors)'
b' jbd2/dm-0-8-8197 [003] d... 87355.461591: : Get the mddev structure of dev2 kworker/2:0-9326 [002] d... 87355.484612: : BIO start: 1942298520 length: 8(sectors)'
b' kworker/2:0-9326 [002] d... 87355.484616: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.553970: : BIO start: 3556769920 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.553973: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586068: : BIO start: 3556770048 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.586072: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586093: : BIO start: 3556835608 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.586095: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586115: : BIO start: 0 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.586117: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586127: : BIO start: 8 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.586129: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586138: : BIO start: 616 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.586140: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586149: : BIO start: 1704 length: 8(sectors)'
b' kworker/u8:0-4422 [003] d... 87389.586151: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586161: : BIO start: 10064 length: 8(sectors)'
```