# RDMA programming introduction
###### tags: `RDMA`
## RDMA introduction
* ### RDMA -> Remote Direct Memory Access
* A client-server mechanism
* Accesses remote memory directly, bypassing the kernel on the data path
* ### Requires NIC support
* Mellanox, Broadcom, Intel...
* ### Can be used from both kernel space and user space
* Ceph
* Serverless container platform

* ### Three main kinds of RDMA implementation
* InfiniBand: works only on an InfiniBand network; supported by Mellanox CX series NICs
* iWARP: TCP-based RDMA over an Ethernet network; supported by Intel X722 and E810
* RoCE: RDMA over an Ethernet network; supported by Mellanox CX series and Intel E810 NICs
* RoCEv1: L2 is an Ethernet header, but L3 is an InfiniBand header
* RoCEv2: L2 is the same as in RoCEv1, but L3 and L4 are IP and UDP headers (UDP destination port 4791)
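
The difference between the two RoCE header layouts can be checked directly on a captured frame. The following is a minimal sketch, not a full parser: it assumes an untagged Ethernet frame (no VLAN) and IPv4 only, and the names `classify_frame` and `enum roce_kind` are made up for this note. The RoCEv1 EtherType `0x8915` and the RoCEv2 UDP destination port 4791 are the standard values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4   0x0800
#define ETHERTYPE_ROCEV1 0x8915  /* RoCEv1: InfiniBand headers directly after Ethernet */
#define IP_PROTO_UDP     17
#define ROCEV2_UDP_PORT  4791    /* RoCEv2: UDP destination port */

enum roce_kind { NOT_ROCE, ROCE_V1, ROCE_V2 };  /* hypothetical helper type */

static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Classify an untagged Ethernet frame (assumption: no VLAN tag, IPv4 only). */
enum roce_kind classify_frame(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < 14)                       /* too short for an Ethernet header */
        return NOT_ROCE;

    uint16_t ethertype = read_be16(frame + 12);
    if (ethertype == ETHERTYPE_ROCEV1)
        return ROCE_V1;
    if (ethertype != ETHERTYPE_IPV4)
        return NOT_ROCE;

    const uint8_t *ip = frame + 14;
    size_t ip_len = len - 14;
    if (ip_len < 20)                    /* too short for an IPv4 header */
        return NOT_ROCE;

    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;   /* IPv4 header length in bytes */
    if ((ip[0] >> 4) != 4 || ihl < 20 || ip_len < ihl + 8)
        return NOT_ROCE;
    if (ip[9] != IP_PROTO_UDP)
        return NOT_ROCE;

    const uint8_t *udp = ip + ihl;
    return read_be16(udp + 2) == ROCEV2_UDP_PORT ? ROCE_V2 : NOT_ROCE;
}
```

A capture tool would call `classify_frame()` on each raw frame; VLAN-tagged or IPv6 traffic would need extra cases that this sketch leaves out.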

* ### There is also a software implementation of RoCEv2: Soft-RoCE
* Requires Linux kernel 4.9 or above
## RDMA programming knowledge
* ### A proper RDMA node ID must be set on the NIC port
* The NIC port/node GID must be non-zero
* In the Mellanox OFED driver
* ### Each RDMA program is based on two libraries:
* libibverbs
* The core of RDMA programming; an RDMA program can be built with libibverbs alone
* Based on a "completion channel" mechanism, somewhat similar to io_uring
* librdmacm
* A high-level library for RDMA programming
* Some features overlap with libibverbs
* Based on an "event channel" mechanism
* Similar to TCP/IP socket programming
* Use rdma_create_event_channel() to create the event channel
* Provided by the MLNX OFED driver or by some Linux distributions
* ### The core of RDMA programming:
* Based on "events" and these three components:
* QP, Queue Pair: a send queue and a receive queue that hold the posted work requests
* WR, Work Request: the job we want the NIC to do, e.g. send data to the remote side
* CQ, Completion Queue: when a work request is done, the NIC enqueues a completion entry into a CQ
* ### Commonly, librdmacm is used to establish the connection, acting as the control plane
* ### libibverbs is then used for the actual data transfer
## RDMA programming essentials
* ### Connection preparation
* Use <font color=#0000FF>rdma_create_event_channel()</font> to create the event channel
* Use <font color=#0000FF>rdma_get_cm_event()</font> and <font color=#0000FF>rdma_ack_cm_event()</font> to retrieve and acknowledge events from the channel
* Use <font color=#0000FF>rdma_resolve_addr()</font> and <font color=#0000FF>rdma_resolve_route()</font> to resolve the peer address and the route to it
* <font color=#0000FF>ibv_create_cq()</font>
```c=
struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
                             void *cq_context,
                             struct ibv_comp_channel *channel,
                             int comp_vector);
```
* <font color=#0000FF>ibv_reg_mr()</font>
```c=
struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
                          size_t length, enum ibv_access_flags access);
```
* IBV_ACCESS_LOCAL_WRITE, IBV_ACCESS_REMOTE_WRITE, IBV_ACCESS_REMOTE_READ, IBV_ACCESS_REMOTE_ATOMIC
```c=
enum ibv_access_flags {
    IBV_ACCESS_LOCAL_WRITE      = 1,
    IBV_ACCESS_REMOTE_WRITE     = (1<<1),
    IBV_ACCESS_REMOTE_READ      = (1<<2),
    IBV_ACCESS_REMOTE_ATOMIC    = (1<<3),
    IBV_ACCESS_MW_BIND          = (1<<4),
    IBV_ACCESS_ZERO_BASED       = (1<<5),
    IBV_ACCESS_ON_DEMAND        = (1<<6),
    IBV_ACCESS_HUGETLB          = (1<<7),
    IBV_ACCESS_RELAXED_ORDERING = IBV_ACCESS_OPTIONAL_FIRST,
};
```
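
The access flags are bit values, so they are combined with bitwise OR before being passed to ibv_reg_mr(). The sketch below shows one way to build and sanity-check such a mask. It redefines the four common flag values locally (copied from the listing above) only so that it compiles without libibverbs; a real program would include `<infiniband/verbs.h>` instead. The helper names `access_flags_valid` and `remote_rw_access` are made up for this note. The rule it checks comes from the ibv_reg_mr() documentation: if REMOTE_WRITE or REMOTE_ATOMIC is set, LOCAL_WRITE must be set too.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the values listed above (assumption: normally taken
 * from <infiniband/verbs.h>; repeated here to keep the sketch standalone). */
enum ibv_access_flags {
    IBV_ACCESS_LOCAL_WRITE   = 1,
    IBV_ACCESS_REMOTE_WRITE  = (1 << 1),
    IBV_ACCESS_REMOTE_READ   = (1 << 2),
    IBV_ACCESS_REMOTE_ATOMIC = (1 << 3),
};

/* REMOTE_WRITE and REMOTE_ATOMIC are only valid together with LOCAL_WRITE. */
bool access_flags_valid(int access)
{
    int needs_local_write = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC;

    if ((access & needs_local_write) && !(access & IBV_ACCESS_LOCAL_WRITE))
        return false;
    return true;
}

/* Mask for a buffer that the peer may both RDMA-write into and RDMA-read from. */
int remote_rw_access(void)
{
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;

    assert(access_flags_valid(access));
    return access;
}
```

The returned mask would be passed as the `access` argument, e.g. `ibv_reg_mr(pd, buf, len, remote_rw_access())`.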
* <font color=#0000FF>ibv_create_qp()</font>
```c=
struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
                             struct ibv_qp_init_attr *qp_init_attr);
```
* struct ibv_qp_init_attr
```c=
struct ibv_qp_init_attr {
    void               *qp_context;
    struct ibv_cq      *send_cq;    // CQ that receives send completions
    struct ibv_cq      *recv_cq;    // CQ that receives recv completions
    struct ibv_srq     *srq;
    struct ibv_qp_cap   cap;
    enum ibv_qp_type    qp_type;    // IBV_QPT_RC / IBV_QPT_UC / IBV_QPT_UD
    int                 sq_sig_all;
};
```
* struct ibv_qp_cap
```c=
struct ibv_qp_cap {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;     // max outstanding work requests; must not exceed the device's max_qp_wr
    uint32_t max_send_sge;
    uint32_t max_recv_sge;    // max scatter/gather entries per WR, i.e. how many memory
                              // regions one WR can span; must not exceed the device's max_sge
    uint32_t max_inline_data;
};
```
* client: <font color=#0000FF>rdma_connect()</font> or <font color=#0000FF>ibv_modify_qp()</font>
```c=
int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param);
int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, int attr_mask);
```
* When using ibv_modify_qp() directly, the QP state must be moved through RESET -> INIT -> RTR -> RTS, and the RDMA port info must be filled in by hand
* server: <font color=#0000FF>rdma_bind_addr()</font> and <font color=#0000FF>rdma_listen()</font>, or <font color=#0000FF>ibv_modify_qp()</font>
* <font color=#0000FF>ibv_post_recv()</font>
```c=
int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
                  struct ibv_recv_wr **bad_wr);
```
* struct ibv_qp
* struct ibv_recv_wr
```c=
struct ibv_recv_wr {
    uint64_t            wr_id;    /* User-defined WR ID, returned in the work completion of this recv */
    struct ibv_recv_wr *next;     /* Pointer to next WR in list, NULL if last WR */
    struct ibv_sge     *sg_list;  /* Pointer to the s/g array */
    int                 num_sge;  /* Size of the s/g array */
};
```
* struct ibv_sge *sg_list;
```c=
struct ibv_sge {
    uint64_t addr;    // start address of the buffer, inside a registered memory region
    uint32_t length;  // length of the buffer in bytes
    uint32_t lkey;    // local key of the registered memory region
};
```
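
Receive work requests are usually posted in batches by linking them through `next`. The sketch below fills such a chain, with one scatter/gather entry per work request. It repeats the two struct definitions from the listings above only so that it compiles without libibverbs; a real program would include `<infiniband/verbs.h>` instead. The helper `build_recv_chain` and its parameters are hypothetical names for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the structs listed above (assumption: normally taken
 * from <infiniband/verbs.h>). */
struct ibv_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

struct ibv_recv_wr {
    uint64_t            wr_id;
    struct ibv_recv_wr *next;
    struct ibv_sge     *sg_list;
    int                 num_sge;
};

/*
 * Split one registered buffer into `count` slots of `slot_len` bytes and
 * describe each slot with one SGE and one receive WR.
 * Returns the head of the linked WR chain.
 */
struct ibv_recv_wr *build_recv_chain(struct ibv_recv_wr *wrs,
                                     struct ibv_sge *sges,
                                     void *buf, uint32_t slot_len,
                                     uint32_t lkey, int count)
{
    assert(wrs != NULL && sges != NULL && buf != NULL && count > 0);

    for (int i = 0; i < count; i++) {
        sges[i].addr   = (uint64_t)(uintptr_t)buf + (uint64_t)i * slot_len;
        sges[i].length = slot_len;
        sges[i].lkey   = lkey;

        wrs[i].wr_id   = (uint64_t)i;   /* slot index, echoed back in the completion */
        wrs[i].next    = (i + 1 < count) ? &wrs[i + 1] : NULL;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
    }
    return &wrs[0];
}
```

The returned head would be passed as `wr` to ibv_post_recv(), and `lkey` would come from the `lkey` field of the `struct ibv_mr` returned by ibv_reg_mr().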
* ### Start transmission
* <font color=#0000FF>ibv_post_send()</font>
```c=
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);
```
* struct ibv_qp
* struct ibv_send_wr
```c=
struct ibv_send_wr {
    uint64_t            wr_id;      // user-defined WR ID, returned in the work completion of this send
    struct ibv_send_wr *next;
    struct ibv_sge     *sg_list;
    int                 num_sge;
    enum ibv_wr_opcode  opcode;     // opcode of the RDMA action
    int                 send_flags;
    uint32_t            imm_data;
    union {
        struct {
            uint64_t remote_addr;   // remote memory address
            uint32_t rkey;          // remote key that matches remote_addr
        } rdma;
        struct {
            uint64_t remote_addr;
            uint64_t compare_add;
            uint64_t swap;
            uint32_t rkey;
        } atomic;
        struct {
            struct ibv_ah *ah;
            uint32_t       remote_qpn;
            uint32_t       remote_qkey;
        } ud;
    } wr;
};
```
```c=
enum ibv_wr_opcode {
    IBV_WR_RDMA_WRITE,           // write local data to a remote memory address
    IBV_WR_RDMA_WRITE_WITH_IMM,  // same, plus 32-bit immediate data for the remote side
    IBV_WR_SEND,                 // send data; no remote memory address is needed
    IBV_WR_SEND_WITH_IMM,        // same, plus 32-bit immediate data for the remote side
    IBV_WR_RDMA_READ,            // read data from a remote memory address
    IBV_WR_ATOMIC_CMP_AND_SWP,
    IBV_WR_ATOMIC_FETCH_AND_ADD,
};

enum ibv_send_flags {
    IBV_SEND_FENCE     = 1 << 0,
    IBV_SEND_SIGNALED  = 1 << 1,
    IBV_SEND_SOLICITED = 1 << 2,
    IBV_SEND_INLINE    = 1 << 3,
    IBV_SEND_IP_CSUM   = 1 << 4
};
```
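
The `send_flags` field is also a bit mask. A common pattern, sketched below under stated assumptions, is selective signaling: with `sq_sig_all` set to 0 at QP creation, only work requests carrying IBV_SEND_SIGNALED produce a completion, so a sender may signal only every Nth request. Small payloads can additionally be sent inline if they fit within the QP's `max_inline_data`. The flag values are copied from the listing above so the sketch compiles without libibverbs, and `choose_send_flags` and its parameters are hypothetical names for this note.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the values listed above (assumption: normally taken
 * from <infiniband/verbs.h>). */
enum ibv_send_flags {
    IBV_SEND_FENCE     = 1 << 0,
    IBV_SEND_SIGNALED  = 1 << 1,
    IBV_SEND_SOLICITED = 1 << 2,
    IBV_SEND_INLINE    = 1 << 3,
};

/*
 * Pick send_flags for the seq-th work request (seq starts at 0):
 *  - request a completion on every `signal_every`-th WR
 *  - copy the payload inline when it fits within max_inline_data
 */
int choose_send_flags(uint64_t seq, uint32_t signal_every,
                      uint32_t payload_len, uint32_t max_inline_data)
{
    assert(signal_every > 0);

    int flags = 0;
    if ((seq + 1) % signal_every == 0)
        flags |= IBV_SEND_SIGNALED;
    if (payload_len > 0 && payload_len <= max_inline_data)
        flags |= IBV_SEND_INLINE;
    return flags;
}
```

The result would be assigned to `send_flags` before calling ibv_post_send(). Unsignaled requests still occupy send queue slots until a later signaled request completes, so `signal_every` should stay well below `max_send_wr`.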
* <font color=#0000FF>ibv_get_cq_event()</font>, <font color=#0000FF>ibv_ack_cq_events()</font> // get and acknowledge events from the completion channel
```c=
int ibv_get_cq_event(struct ibv_comp_channel *channel,
                     struct ibv_cq **cq, void **cq_context);
void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents);
```
* <font color=#0000FF>ibv_req_notify_cq()</font> // request an event for the next completion added to the CQ
```c=
int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only);
```
* <font color=#0000FF>ibv_poll_cq()</font> // poll the CQ; keep polling until it is empty
```c=
int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);

struct ibv_wc {
    uint64_t            wr_id;      // wr_id of the work request that completed
    enum ibv_wc_status  status;     // IBV_WC_SUCCESS when the action succeeds
    enum ibv_wc_opcode  opcode;     // which kind of work completed
    uint32_t            vendor_err;
    uint32_t            byte_len;
    uint32_t            imm_data;
    uint32_t            qp_num;
    uint32_t            src_qp;
    int                 wc_flags;
    uint16_t            pkey_index;
    uint16_t            slid;
    uint8_t             sl;
    uint8_t             dlid_path_bits;
};
```
## Debug

* Check the NIC IP address
* ```
# cat /sys/bus/pci/devices/<pci address>/roce_enable
```
* Test with ibv_rc_pingpong, rdma_client, rdma_server
* Check GID
* Use switch port mirroring
* Soft-RoCE and a dedicated Wireshark capture
## Demo
* Set up an RDMA client and server
* Send two numbers to the server, which returns their sum
* Capture the RDMA packets
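
For the "send two numbers, return the sum" step, both sides need to agree on the byte layout of the buffers they exchange. The sketch below shows one possible layout with fixed-width fields in network byte order. It is a hypothetical wire format written for this note; the actual demo code may use a different one. The structs would live in registered memory and be transferred with IBV_WR_SEND.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout: all fields in network byte order. */
struct sum_request {
    uint32_t a_be;
    uint32_t b_be;
};

struct sum_reply {
    uint32_t sum_be;
};

/* Client side: fill the request buffer before posting the send. */
void fill_request(struct sum_request *req, uint32_t a, uint32_t b)
{
    assert(req != NULL);
    req->a_be = htonl(a);
    req->b_be = htonl(b);
}

/* Server side: runs after the recv completion; the sum wraps modulo 2^32. */
void handle_request(const struct sum_request *req, struct sum_reply *rep)
{
    assert(req != NULL && rep != NULL);
    rep->sum_be = htonl(ntohl(req->a_be) + ntohl(req->b_be));
}

/* Client side: decode the reply after its recv completion. */
uint32_t read_reply(const struct sum_reply *rep)
{
    assert(rep != NULL);
    return ntohl(rep->sum_be);
}
```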

## Reference
* ### Useful references:
* https://www.rdmamojo.com
* https://www.ibm.com/docs/en/aix/7.1?topic=subroutines-librdmacm-library
* https://www.ibm.com/docs/en/aix/7.1?topic=subroutines-libibverbs-library
* https://github.com/w180112/RDMA-example
## Appendix - SoftRoCE
* ### Refer to
* https://community.mellanox.com/s/article/howto-configure-soft-roce
* Kernel 4.9 and above only needs the userspace library
* Remove the OFED driver first