Try โ€‚โ€‰HackMD

eBPF: The Berkeley Packet Filter, take RAID5 for example

tags: ebpf raid raid5

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
What's bpf?

The Berkeley Packet Filter is a technology used in certain computer operating systems for programs that need to, among other things, analyze network traffic. It provides a raw interface to data link layers, permitting raw link-layer packets to be sent and received. It is available on most Unix-like operating systems.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
What's bcc?

BPF Compiler Collection (BCC). BCC is a toolkit for creating efficient kernel tracing and manipulation programs

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

references:

Example: Summarize block device I/O latency as a histogram

from __future__ import print_function
from bcc import BPF

...
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
typedef struct disk_key {
    char disk[DISK_NAME_LEN];
    u64 slot;
} disk_key_t;
typedef struct flag_key {
    u64 flags;
    u64 slot;
} flag_key_t;
BPF_HASH(start, struct request *);
STORAGE
// time block I/O
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}
// output
int trace_req_done(struct pt_regs *ctx, struct request *req)
{
    u64 *tsp, delta;
    // fetch timestamp and calculate delta
    tsp = start.lookup(&req);
    if (tsp == 0) {
        return 0;   // missed issue
    }
    delta = bpf_ktime_get_ns() - *tsp;
    FACTOR
    // store as histogram
    STORE
    start.delete(&req);
    return 0;
}
"""
...
# load BPF program
b = BPF(text=bpf_text)
if args.queued:
    b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
else:
    if BPF.get_kprobe_functions(b'blk_start_request'):
        b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
    b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_done",
    fn_name="trace_req_done")

print("Tracing block device I/O... Hit Ctrl-C to end.")

Example: hook kprobe to md_make_request

The example demonstrate how to trace bio (block device IO)

#!/usr/bin/env python from bcc import BPF # BPF Program bpf_text = """ #include <bcc/proto.h> #include <linux/blkdev.h> #include "/bin/raid5.h" BPF_HASH(handle_stripe_timer, sector_t); BPF_HISTOGRAM(dist_stripe_active_time); int kprobe__handle_stripe(struct pt_regs *ctx, struct stripe_head *sh) { sector_t sector = sh->sector; u64 curr = bpf_ktime_get_ns(); u64 *plast = handle_stripe_timer.lookup(&sector); if (!plast) { handle_stripe_timer.update(&sector, &curr); } return 0; } int kprobe__do_release_stripe(struct pt_regs *ctx, struct r5conf *conf, struct stripe_head *sh) { sector_t sector = sh->sector; unsigned long state = sh->state; if (!test_bit(STRIPE_HANDLE, &state)) { u64 *plast = handle_stripe_timer.lookup(&sector); if (plast) { u64 curr = bpf_ktime_get_ns(); u64 delta = (curr - *plast) / 1000; //ms dist_stripe_active_time.increment(bpf_log2l(delta)); handle_stripe_timer.delete(&sector); } } return 0; } """ # Build and Inject BPF b = BPF(text=bpf_text) print("Tracing block device I/O... Hit Ctrl-C to end.") while True: try: print(b.trace_readline()) except KeyboardInterrupt: break dist = b.get_table("dist_stripe_active_time") print("\n dist_stripe_active_time :") dist.print_log2_hist("ns", "")

RAID

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
What's RAID?

RAID ("Redundant Array of Inexpensive Disks" or "Redundant Array of Independent Disks") is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both.

There are different type of RAID:

  • RAID-N, where N is a number presenting different usage. e.g: RAID1, RAID0, RAID5 โ€ฆ etc

    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More โ†’

    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More โ†’

    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More โ†’

  • RAID-AB, where A, B are number, A stands for buttom and B for top



  • Special, eg. JBOD, SHR (synology hybrid RAID)


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
RAID5

RAID5 algorithm

RAID5 use XOR to ensure parity.

Take the picture above for example. There are five disk, and the last one is responsible for parity. The red one is broken but it can be recovered with parity and the others block. 1(parity) 1(block1) ^ 0(block2) ^ 1(block4) = 1(block3)

And the distribution of RAID5 could be different, for example
Left symmetric


Right symmetric

RAID5 structure

  • sector: basic IO unit, 512 bytes
  • stripe head: In the linux kernel this is the real unit (fine grain), 4 KB. That is said that in each member disk system takes 4 KB to conduct XOR
  • chunk
  • stripe:

RAID5 kernel workflow


  • struct stripe_head in the basic unit of IO handle
  • raid5_make_request
  • handle_active_stripes: handle the stripes data, which calls handle_stripe which calls raid_run_ops, ops_run_io
  • raid_run_ops: is for check parity block
  • ops_run_io: submit request to disk

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
RAID in Linux kernel

Here are some references:

So, according to the examples above.
Now, what if we know there is a call name ops_run_io, and the structure defined in linux source code is struct sh.
How can we start to conduct a related experient?

Example: linux/drivers/md/raid5.h

struct stripe_head {
	struct hlist_node	hash;
	struct list_head	lru;	      /* inactive_list or handle_list */
	struct llist_node	release_list;
	struct r5conf		*raid_conf;
	short			generation;	/* increments with every
						 * reshape */
	sector_t		sector;		/* sector of this row */
	short			pd_idx;		/* parity disk index */
	short			qd_idx;		/* 'Q' disk index for raid6 */
	short			ddf_layout;/* use DDF ordering to calculate Q */
	short			hash_lock_index;
	unsigned long		state;		/* state flags */
	atomic_t		count;	      /* nr of active thread/requests */
	int			bm_seq;	/* sequence number for bitmap flushes */
	int			disks;		/* disks in stripe */
	int			overwrite_disks; /* total overwrite disks in stripe,
						  * this is only checked when stripe
						  * has STRIPE_BATCH_READY
						  */
	enum check_states	check_state;
	enum reconstruct_states reconstruct_state;
	spinlock_t		stripe_lock;
	int			cpu;
	struct r5worker_group	*group;

	struct stripe_head	*batch_head; /* protected by stripe lock */
	spinlock_t		batch_lock; /* only header's lock is useful */
	struct list_head	batch_list; /* protected by head's batch lock*/

	union {
		struct r5l_io_unit	*log_io;
		struct ppl_io_unit	*ppl_io;
	};

	struct list_head	log_list;
	sector_t		log_start; /* first meta block on the journal */
	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */

	struct page		*ppl_page; /* partial parity of this stripe */
	/**
	 * struct stripe_operations
	 * @target - STRIPE_OP_COMPUTE_BLK target
	 * @target2 - 2nd compute target in the raid6 case
	 * @zero_sum_result - P and Q verification flags
	 * @request - async service request flags for raid_run_ops
	 */
	struct stripe_operations {
		int 		     target, target2;
		enum sum_check_flags zero_sum_result;
	} ops;
	struct r5dev {
		/* rreq and rvec are used for the replacement device when
		 * writing data to both devices.
		 */
		struct bio	req, rreq;
		struct bio_vec	vec, rvec;
		struct page	*page, *orig_page;
		struct bio	*toread, *read, *towrite, *written;
		sector_t	sector;			/* sector of this page */
		unsigned long	flags;
		u32		log_checksum;
		unsigned short	write_hint;
	} dev[1]; /* allocated with extra space depending of RAID geometry */
};

Q: How does raid request works in linux kernel?
A:

Q: Where to find codes related to raid?
A: linux-4.4.x/drivers/md, where you can see raid5.h

Q: How to start?
A: Think like this,

  • hook a kprobe to ops_run_io
    โ€‹โ€‹โ€‹โ€‹static struct kprobe kp = {
    โ€‹.symbol_name	= "ops_run_io",};
    
  • traverse disks in a RAID
    โ€‹โ€‹โ€‹โ€‹// Get sh from register
    โ€‹โ€‹โ€‹โ€‹struct stripe_head *sh = (struct stripe_head *) regs->di;
    โ€‹โ€‹โ€‹โ€‹
    โ€‹โ€‹โ€‹โ€‹static void dump_dev_state(char *buf, unsigned int size, struct r5dev *dev){
    โ€‹static int r5dev_state_size = sizeof(r5dev_flags_strs) / sizeof(char*);
    โ€‹int i;
    โ€‹char *p;
    
    โ€‹p = &buf[0];
    โ€‹for (i = 0; i < r5dev_state_size; i++) {
    โ€‹	if (test_bit(i, &dev->flags)) {
    โ€‹		strcpy(p, r5dev_flags_strs[i]);
    โ€‹		p += strlen(r5dev_flags_strs[i]);
    โ€‹		*(p++) = ',';
    โ€‹		*(p++) = ' ';
    โ€‹	}
    โ€‹}
    โ€‹if (p != buf) {
    โ€‹	p -= 2;
    โ€‹}
    โ€‹*p = '\0';}
    
    โ€‹โ€‹โ€‹โ€‹static void dump_devs_state(char *buf, unsigned int size, struct stripe_head *sh){
    โ€‹int i;
    โ€‹char state[256];
    โ€‹char *p;
    
    โ€‹p = &buf[0];
    โ€‹for (i = 0; i < sh->disks; i++) {
    โ€‹	dump_dev_state(state, sizeof(state), &sh->dev[i]);
    โ€‹	*(p++) = 'd';
    โ€‹	*(p++) = '0' + i / 10;
    โ€‹	*(p++) = '0' + i % 10;
    โ€‹	*(p++) = ':';
    โ€‹	*(p++) = ' ';
    โ€‹	strcpy(p, state);
    โ€‹	p += strlen(state);
    โ€‹	*(p++) = '\n';
    โ€‹}
    โ€‹*p = '\0';
    โ€‹โ€‹โ€‹โ€‹}
    

Tip: user kobject.

#!/usr/bin/env python from bcc import BPF # BPF Program bpf_text = """ #include <bcc/proto.h> #include <linux/blkdev.h> #include <linux/types.h> #include "md.h" int kprobe__raid5_make_request(struct pt_regs *ctx, struct mddev *mdv, struct bio *bio) { bpf_trace_printk("BIO start: %llu length: %lu(sectors)\\n", bio->bi_iter.bi_sector, bio->bi_iter.bi_size >> 9); bpf_trace_printk("Get the mddev structure of dev%d", mdv->md_minor); /* struct list_head *list; list_for_each(list, &mdv->disks) { struct mddev *md1 = list_entry(list,struct mddev, disks); bpf_trace_printk("Get the mddev structure of dev%d", md1->md_minor); } */ return 0; }; """ print("This is an example of kprobe hook of raid5_make_request().") # Build and Inject BPF b = BPF(text=bpf_text) # Print debug output while True: print(b.trace_readline())
length: 0(sectors)' b' jbd2/dm-0-8-8197 [003] d... 87355.461591: : Get the mddev structure of dev2 kworker/2:0-9326 [002] d... 87355.484612: : BIO start: 1942298520 length: 8(sectors)' b' kworker/2:0-9326 [002] d... 87355.484616: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.553970: : BIO start: 3556769920 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.553973: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586068: : BIO start: 3556770048 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.586072: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586093: : BIO start: 3556835608 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.586095: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586115: : BIO start: 0 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.586117: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586127: : BIO start: 8 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.586129: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586138: : BIO start: 616 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.586140: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586149: : BIO start: 1704 length: 8(sectors)' b' kworker/u8:0-4422 [003] d... 87389.586151: : Get the mddev structure of dev2 kworker/u8:0-4422 [003] d... 87389.586161: : BIO start: 10064 length: 8(sectors)'