# RDMA programming introduction
###### tags: `RDMA`

## RDMA introduction
* ### RDMA -> Remote Direct Memory Access
    * A client-server mechanism
    * Accesses remote memory directly, bypassing the kernel
* ### Needs NIC support
    * Mellanox, Broadcom, Intel...
* ### Can be executed in both kernel space and user space
    * Ceph
    * Serverless container platform

![](https://i.imgur.com/zjRDSZS.jpg)
* ### Three main kinds of RDMA implementation
    * InfiniBand: only works on an InfiniBand network, supported by Mellanox CX series NICs
    * iWARP: TCP-based RDMA over an Ethernet network, supported by Intel X722 and E810
    * RoCE: based on an Ethernet network, supported by Mellanox CX series and Intel E810 NICs
        * RoCEv1: the L2 header is an Ethernet header, but the L3 header is an InfiniBand header
        * RoCEv2: the L2 header is the same as in RoCEv1, but the L3 and L4 headers are IP and UDP (UDP port 4791)

![](https://i.imgur.com/ju5zeVa.png)
* ### Besides, there is a software implementation of RoCEv2, Soft-RoCE
    * kernel 4.9 and above

## RDMA programming knowledge
* ### Need to set a proper RDMA node ID on the NIC port
    * NIC port/node GID must be non-zero
    * In the Mellanox OFED driver
* ### Each RDMA program is based on two libraries:
    * libibverbs
        * The core of RDMA programming; in fact, an RDMA program can be built with libibverbs alone
        * Based on a "completion channel" mechanism, a little similar to io_uring
    * librdmacm
        * A higher-level library for RDMA programming
        * Some features are the same as in libibverbs
        * Also based on an "event channel" mechanism
        * Similar to TCP/IP socket programming
        * Need to use rdma_create_event_channel() to create the event channel
    * Originally provided by the MLNX OFED driver or by some Linux distributions
* ### The main point of RDMA programming:
    * Based on "events" and these three components:
        * QP, Queue Pair: a send queue and a receive queue for data
        * WR, Work Request: the job we want the NIC to do, e.g.
send data to the remote side
        * CQ, Completion Queue: the NIC creates an event when an action is done and enqueues it into a CQ
* ### Commonly, librdmacm is used to establish the connection, which acts as the control plane
* ### Then libibverbs is used for the transmission process

## RDMA programming essential
* ### Connection preparation
    * Use <font color=#0000FF>rdma_create_event_channel()</font> to create the event channel
    * Use <font color=#0000FF>rdma_get_cm_event()</font> and <font color=#0000FF>rdma_ack_cm_event()</font> to get and acknowledge events from the channel
    * Use <font color=#0000FF>rdma_resolve_addr()</font> and <font color=#0000FF>rdma_resolve_route()</font> to resolve the peer address
    * <font color=#0000FF>ibv_create_cq()</font>
      ```c=
      struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
                                   void *cq_context,
                                   struct ibv_comp_channel *channel,
                                   int comp_vector);
      ```
    * <font color=#0000FF>ibv_reg_mr()</font>
      ```c=
      // access is a bitwise OR of enum ibv_access_flags values
      struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
                                size_t length, int access);
      ```
        * IBV_ACCESS_LOCAL_WRITE, IBV_ACCESS_REMOTE_WRITE, IBV_ACCESS_REMOTE_READ, IBV_ACCESS_REMOTE_ATOMIC
          ```c=
          enum ibv_access_flags {
              IBV_ACCESS_LOCAL_WRITE      = 1,
              IBV_ACCESS_REMOTE_WRITE     = (1<<1),
              IBV_ACCESS_REMOTE_READ      = (1<<2),
              IBV_ACCESS_REMOTE_ATOMIC    = (1<<3),
              IBV_ACCESS_MW_BIND          = (1<<4),
              IBV_ACCESS_ZERO_BASED       = (1<<5),
              IBV_ACCESS_ON_DEMAND        = (1<<6),
              IBV_ACCESS_HUGETLB          = (1<<7),
              IBV_ACCESS_RELAXED_ORDERING = IBV_ACCESS_OPTIONAL_FIRST,
          };
          ```
    * <font color=#0000FF>ibv_create_qp()</font>
      ```c=
      struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
                                   struct ibv_qp_init_attr *qp_init_attr);
      ```
        * struct ibv_qp_init_attr
          ```c=
          struct ibv_qp_init_attr {
              void             *qp_context;
              struct ibv_cq    *send_cq;    // completion queue for send work requests
              struct ibv_cq    *recv_cq;    // completion queue for receive work requests
              struct ibv_srq   *srq;
              struct ibv_qp_cap cap;
              enum ibv_qp_type  qp_type;    // RC/UC/UD
              int               sq_sig_all;
          };
          ```
        * struct ibv_qp_cap
          ```c=
          struct ibv_qp_cap {
              uint32_t max_send_wr;
              uint32_t max_recv_wr;     // max outstanding work requests, limited by the device's max_qp_wr value
              uint32_t max_send_sge;
              uint32_t max_recv_sge;    // max scatter/gather entries, i.e. how many memory buffers the data of one WR can be spread over; limited by the device's max_sge value
              uint32_t max_inline_data;
          };
          ```
    * client: <font color=#0000FF>rdma_connect()</font> or <font color=#0000FF>ibv_modify_qp()</font>
      ```c=
      int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param);
      int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, int attr_mask);
      ```
        * When using ibv_modify_qp(), the QP state must be changed step by step, RESET->INIT->RTR->RTS, and the RDMA port info must be filled in
    * server: <font color=#0000FF>rdma_bind_addr()</font> and <font color=#0000FF>rdma_listen()</font>, or <font color=#0000FF>ibv_modify_qp()</font>
    * <font color=#0000FF>ibv_post_recv()</font>
      ```c=
      int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
                        struct ibv_recv_wr **bad_wr);
      ```
        * struct ibv_qp
        * struct ibv_recv_wr
          ```c=
          struct ibv_recv_wr {
              uint64_t            wr_id;    /* User-defined WR ID, reported back once this recv action completes */
              struct ibv_recv_wr *next;     /* Pointer to next WR in list, NULL if last WR */
              struct ibv_sge     *sg_list;  /* Pointer to the s/g array */
              int                 num_sge;  /* Size of the s/g array */
          };
          ```
        * struct ibv_sge *sg_list;
          ```c=
          struct ibv_sge {
              uint64_t addr;    // address of the buffer inside registered memory
              uint32_t length;  // length of the buffer
              uint32_t lkey;    // local key of the registered memory the buffer belongs to
          };
          ```
* ### Start transmission
    * <font color=#0000FF>ibv_post_send()</font>
      ```c=
      int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                        struct ibv_send_wr **bad_wr);
      ```
        * struct ibv_qp
        * struct ibv_send_wr
          ```c=
          struct ibv_send_wr {
              uint64_t            wr_id;      // user-defined WR ID, reported back once this send action completes
              struct ibv_send_wr *next;
              struct ibv_sge     *sg_list;
              int                 num_sge;
              enum ibv_wr_opcode  opcode;     // which RDMA action to perform
              int                 send_flags;
              uint32_t            imm_data;
              union {
                  struct {
                      uint64_t remote_addr;   // remote memory address
                      uint32_t rkey;          // remote key matching remote_addr
                  } rdma;
                  struct {
                      uint64_t remote_addr;
                      uint64_t compare_add;
                      uint64_t swap;
                      uint32_t rkey;
                  } atomic;
                  struct {
                      struct ibv_ah *ah;
                      uint32_t       remote_qpn;
                      uint32_t       remote_qkey;
                  } ud;
              } wr;
          };
          ```
          ```c=
          enum ibv_wr_opcode {
              IBV_WR_RDMA_WRITE,            // write data to a remote memory address
              IBV_WR_RDMA_WRITE_WITH_IMM,   // write data to a remote memory address, with immediate data
              IBV_WR_SEND,                  // send data without a remote memory address
              IBV_WR_SEND_WITH_IMM,         // send data without a remote memory address, with immediate data
              IBV_WR_RDMA_READ,             // read data from a remote memory address
              IBV_WR_ATOMIC_CMP_AND_SWP,
              IBV_WR_ATOMIC_FETCH_AND_ADD,
          };

          enum ibv_send_flags {
              IBV_SEND_FENCE     = 1 << 0,
              IBV_SEND_SIGNALED  = 1 << 1,
              IBV_SEND_SOLICITED = 1 << 2,
              IBV_SEND_INLINE    = 1 << 3,
              IBV_SEND_IP_CSUM   = 1 << 4
          };
          ```
    * <font color=#0000FF>ibv_get_cq_event()</font>, <font color=#0000FF>ibv_ack_cq_events()</font> // get an event from the completion channel and acknowledge it
      ```c=
      int ibv_get_cq_event(struct ibv_comp_channel *channel,
                           struct ibv_cq **cq, void **cq_context);
      void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents);
      ```
    * <font color=#0000FF>ibv_req_notify_cq()</font>
      ```c=
      int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only);
      ```
    * <font color=#0000FF>ibv_poll_cq()</font> // poll the CQ; keep polling until it is empty
      ```c=
      int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);

      struct ibv_wc {
          uint64_t            wr_id;      // wr_id of the work request that completed
          enum ibv_wc_status  status;     // IBV_WC_SUCCESS when the action succeeds
          enum ibv_wc_opcode  opcode;     // what kind of work completed
          uint32_t            vendor_err;
          uint32_t            byte_len;
          uint32_t            imm_data;
          uint32_t            qp_num;
          uint32_t            src_qp;
          int                 wc_flags;
          uint16_t            pkey_index;
          uint16_t            slid;
          uint8_t             sl;
          uint8_t             dlid_path_bits;
      };
      ```

## Debug
![](https://i.imgur.com/z6mSCz1.png)
* Check the NIC IP
* ```
  # cat /sys/bus/pci/devices/<pci address>/roce_enable
  ```
* ibv_rc_pingpong, rdma_client, rdma_server
* Check GID
* Use switch port mirroring
* Use Soft-RoCE with a dedicated Wireshark capture
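
The "Check GID" step above can be partly automated. The following is a minimal sketch, not part of libibverbs: `gid_is_zero()` and `gid_to_str()` are hypothetical helper names, and the 16 raw bytes are assumed to come from the `raw` field of a `union ibv_gid` filled in by `ibv_query_gid()`. The sketch only inspects the bytes, so it compiles without RDMA hardware or headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GID_LEN     16  /* a GID is 128 bits, like raw[] in union ibv_gid */
#define GID_STR_LEN 40  /* 8 groups of 4 hex digits + 7 colons + '\0' */

/* Hypothetical helper: returns 1 if every byte of the GID is zero,
 * which is the case the note above says must be avoided. */
int gid_is_zero(const uint8_t raw[GID_LEN])
{
    static const uint8_t zero[GID_LEN];  /* zero-initialized */

    assert(raw != NULL);
    return memcmp(raw, zero, GID_LEN) == 0;
}

/* Hypothetical helper: formats the GID as eight colon-separated
 * 16-bit hex groups. out must hold at least GID_STR_LEN bytes. */
void gid_to_str(const uint8_t raw[GID_LEN], char out[GID_STR_LEN])
{
    assert(raw != NULL && out != NULL);
    for (int i = 0; i < GID_LEN; i += 2) {
        size_t off = (size_t)(i / 2) * 5;  /* each group takes 5 chars */

        snprintf(out + off, GID_STR_LEN - off, "%02x%02x%s",
                 (unsigned)raw[i], (unsigned)raw[i + 1],
                 i < GID_LEN - 2 ? ":" : "");
    }
}
```

In a real program, these helpers would be called for each port and GID index after `ibv_query_gid()` returns 0, and a zero GID would be reported before any attempt to move the QP to RTR.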
## Demo
* Establish an RDMA client and server
* Send 2 numbers to the server and return the sum
* Capture RDMA packets

![](https://i.imgur.com/LKdOvG1.png)

## Reference
* ### Useful references:
    * https://www.rdmamojo.com
    * https://www.ibm.com/docs/en/aix/7.1?topic=subroutines-librdmacm-library
    * https://www.ibm.com/docs/en/aix/7.1?topic=subroutines-libibverbs-library
    * https://github.com/w180112/RDMA-example

## Appendix - SoftRoCE
* ### Refer to
    * https://community.mellanox.com/s/article/howto-configure-soft-roce
    * Kernel versions greater than 4.9 only need the userspace library
    * Remove the OFED driver first
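
Returning to the demo (send 2 numbers to the server and return the sum): the payload handling could look roughly like the sketch below. This is an assumption for illustration only and is not taken from the linked RDMA-example repository. The wire layout (two 32-bit big-endian operands), `SUM_REQ_LEN`, and all function names are hypothetical. The buffers would be the registered memory regions referenced by the `ibv_sge` entries of the send and receive work requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire layout: operand a in bytes 0..3, operand b in
 * bytes 4..7, both big-endian so both hosts agree on byte order. */
#define SUM_REQ_LEN 8

/* Store a 32-bit value in big-endian byte order. */
void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value. */
uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Client side: fill the send buffer with the two operands
 * before posting it with ibv_post_send(). */
void encode_sum_request(uint8_t buf[SUM_REQ_LEN], uint32_t a, uint32_t b)
{
    assert(buf != NULL);
    put_be32(buf, a);
    put_be32(buf + 4, b);
}

/* Server side: called after the receive completion is polled.
 * Returns the sum as 64 bits so it cannot overflow. */
uint64_t handle_sum_request(const uint8_t buf[SUM_REQ_LEN])
{
    assert(buf != NULL);
    return (uint64_t)get_be32(buf) + (uint64_t)get_be32(buf + 4);
}
```

Fixing the byte order explicitly keeps the example independent of the endianness of either host. A real implementation could just as well send native-endian structs if both sides are known to match.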