OpenSource For You

The Fundamentals of RDMA Programming

Computers in a network can exchange data in main memory without the involvement of the processor, cache or OS by using a technology called remote direct memory access (RDMA). This frees up resources, improving the throughput and performance while facilit

- By: Miren Karamta

The author is an IT systems manager at Bhaskaracharya Institute for Space Applications and Geo Informatics (BISAG) with over five years of system and network administration experience. He can be contacted at mirenkaramta@yahoo.com.

Remote direct memory access (RDMA) technology speeds up server-to-server data movement by making better use of the network infrastructure, without CPU intervention. The network adapter transfers data directly to or from application memory without interrupting other operations running in parallel on the system. RDMA is widely used in enterprise data centres and high performance computing (HPC) because it offers high-throughput, low-latency networking.

This article will enable app developers to start programming RDMA apps even if they have no prior experience with it. Before we start, here is a brief introduction to InfiniBand (IB) fabrics, their features and components.

InfiniBand (IB)

InfiniBand is an open industry-standard specification for data flow between server I/O and inter-server communication. IB supports RDMA and offers high speed, low latency, low CPU overhead, high efficiency and scalability. The transfer speed of InfiniBand ranges from 10Gbps (SDR) to 56Gbps (FDR) per port.
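To make these figures concrete: the 10Gbps and 56Gbps numbers are signalling rates for a four-lane (4x) link, and the usable data rate is a little lower because of line encoding. The sketch below works through that arithmetic. The per-lane rates and encodings it uses (2.5Gbps with 8b/10b for SDR, 14.0625Gbps with 64b/66b for FDR) are the commonly published values rather than anything stated in this article, and the function names are only illustrative.

```c
#include <stdio.h>

/* Usable data rate in Gbps for a link with `lanes` lanes, each signalling
 * at `lane_gbps`, where every `enc_total` bits on the wire carry
 * `enc_data` bits of payload (for example 8b/10b means 8 of 10). */
double ib_data_rate_gbps(double lane_gbps, int lanes,
                         int enc_data, int enc_total)
{
    return lane_gbps * lanes * enc_data / enc_total;
}

/* Print signalling and data rates for 4x SDR and 4x FDR links. */
void print_ib_rates(void)
{
    printf("4x SDR: %.2f Gbps signalling, %.2f Gbps data\n",
           2.5 * 4, ib_data_rate_gbps(2.5, 4, 8, 10));
    printf("4x FDR: %.2f Gbps signalling, %.2f Gbps data\n",
           14.0625 * 4, ib_data_rate_gbps(14.0625, 4, 64, 66));
}
```

Under these assumptions a 4x SDR link carries 8Gbps of data and a 4x FDR link roughly 54.5Gbps, which is why measured throughput never quite reaches the advertised port speed.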

Components of InfiniBand

Host channel adapter (HCA): This provides an address translation mechanism under the control of the operating system, which allows an application to access the HCA directly. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application refers to virtual addresses, while the HCA has the ability to translate these addresses into physical addresses in order to effect the actual message transfer.
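The translation step can be pictured with a small model. The following is only a sketch of the idea, not how any particular HCA implements it: it assumes a fixed 4096-byte page size, and the structure and function names (toy_region, toy_translate) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Toy stand-in for the per-region translation table an HCA consults. */
struct toy_region {
    uint64_t va_base;           /* first virtual address of the region */
    size_t num_pages;           /* number of pages in the region */
    const uint64_t *phys_page;  /* physical base address of each page */
};

/* Translate virtual address `va` into a physical address.
 * Returns 0 on success, -1 if `va` lies outside the region. */
int toy_translate(const struct toy_region *r, uint64_t va, uint64_t *pa)
{
    uint64_t off, page;

    if (va < r->va_base)
        return -1;
    off = va - r->va_base;
    page = off / TOY_PAGE_SIZE;
    if (page >= r->num_pages)
        return -1;
    *pa = r->phys_page[page] + off % TOY_PAGE_SIZE;
    return 0;
}
```

The point of the model is that a buffer which looks contiguous to the application may be scattered across physical pages, and that an address outside the described region is simply refused.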

Switches: IB switches are conceptually similar to standard networking switches but are designed to meet IB performance requirements. They implement the flow control of the IB Link Layer to prevent packet dropping and to avoid congestion. They also have adaptive routing capabilities and advanced quality of service. Many switches include a subnet manager, at least one of which is required to configure an IB fabric.

Range extenders: InfiniBand range extension is accomplished by encapsulating the InfiniBand traffic onto the WAN link and extending sufficient buffer credits to ensure full bandwidth across the WAN.

Subnet managers: The IB subnet manager is based on the concept of software defined networking (SDN), which eliminates interconnect complexity and enables the creation of very large scale compute and storage infrastructures. The IB subnet manager assigns local identifiers (LIDs) to each port connected to the InfiniBand fabric, and develops a routing table based on the assigned LIDs.
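As a rough illustration of what "assigns LIDs and develops a routing table" means, the sketch below models a single switch with one end node per port. It is a toy under stated assumptions, not the algorithm a real subnet manager uses: LIDs are handed out sequentially starting at 1, there is only one switch, and all names (toy_switch, toy_sm_sweep, toy_route) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX_PORTS 16
#define TOY_NO_ROUTE 0xFF

/* Toy model of one switch managed by a subnet manager. */
struct toy_switch {
    uint16_t port_lid[TOY_MAX_PORTS];  /* LID of the node on each port */
    uint8_t lft[TOY_MAX_PORTS + 1];    /* forwarding table: LID -> port */
    int num_ports;
};

/* Assign LIDs 1..num_ports to the attached nodes and fill the forwarding
 * table so that each LID is sent out of the port its node hangs off. */
void toy_sm_sweep(struct toy_switch *sw, int num_ports)
{
    int p;

    if (num_ports > TOY_MAX_PORTS)
        num_ports = TOY_MAX_PORTS;
    memset(sw, 0, sizeof(*sw));
    memset(sw->lft, TOY_NO_ROUTE, sizeof(sw->lft));
    sw->num_ports = num_ports;
    for (p = 0; p < num_ports; p++) {
        uint16_t lid = (uint16_t)(p + 1);  /* LID 0 is reserved */
        sw->port_lid[p] = lid;
        sw->lft[lid] = (uint8_t)p;
    }
}

/* Look up the output port for destination LID `dlid`, or TOY_NO_ROUTE. */
uint8_t toy_route(const struct toy_switch *sw, uint16_t dlid)
{
    if (dlid == 0 || dlid > sw->num_ports)
        return TOY_NO_ROUTE;
    return sw->lft[dlid];
}
```

In a real fabric the subnet manager discovers the topology first and computes such a table for every switch, but the lookup a switch performs per packet is essentially the indexed read shown in toy_route().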

Installing RDMA

First of all, connect two devices back to back or through a switch. Download and install the latest version of the OFED package from https://www.openfabrics.org/downloads/.

OpenFabrics Enterprise Distribution (OFED) is a package developed and released by the OpenFabrics Alliance (OFA), as a joint effort of many companies that are part of the RDMA scene. It contains the latest upstream software packages (both kernel modules and user-space code) to work with RDMA. This package supports most major Linux distributions and CPU architectures.

Extract the tgz file and type the following command to start the installation:

    [root@localhost]# ./install.pl

Next, choose '2' (Install OFED software). From the options displayed, choose '1' (OFED modules and basic user level libraries).

OFED packages will now be installed. Reboot the system to complete the installation.

The structure of a typical RDMA application is as follows:

1. Gets the device list
2. Opens the requested device
3. Queries the device's capabilities
4. Allocates a protection domain
5. Registers a memory region
6. Creates a completion queue
7. Creates a queue pair
8. Brings the queue pair to a ready-to-send state
9. Creates an address vector
10. Posts work requests
11. Polls for completion
12. Cleans up

To identify RDMA-capable devices in your system, type the following command:

    [root@localhost]# ibstat

You need to be aware of the medium you are planning to use for your RDMA connection: InfiniBand or Ethernet. Verify that each port you intend to use is reported with a state of Active and a physical state of LinkUp.

Getting the device list

ibv_get_device_list() returns a NULL-terminated array of the RDMA devices currently available.

An example of how this is done is given below:

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct ibv_device **dev_list;

    dev_list = ibv_get_device_list(NULL);
    if (!dev_list)
        exit(1);

Opening the requested device

ibv_open_device() opens the device and creates a context for further use.

An example is given below:

    struct ibv_device **device_list;  /* returned by ibv_get_device_list() */
    struct ibv_context *ctx;

    ctx = ibv_open_device(device_list[0]);
    if (!ctx) {
        fprintf(stderr, "Error, failed to open the device '%s'\n",
                ibv_get_device_name(device_list[0]));
        return -1;
    }
    printf("The device '%s' was opened\n", ibv_get_device_name(ctx->device));

Querying the device's capabilities

ibv_query_device() returns the attributes of the RDMA device that is associated with a context. These attributes are constant, so they can be queried once and reused later.

Here is an example:

    struct ibv_device_attr device_attr;
    int rc;

    rc = ibv_query_device(ctx, &device_attr);
    if (rc) {
        fprintf(stderr, "Error, failed to query the device '%s' attributes\n",
                ibv_get_device_name(ctx->device));
        return -1;
    }

Allocating a protection domain

ibv_alloc_pd() allocates a protection domain for an RDMA device context.

An example is given below:

    struct ibv_context *context;
    struct ibv_pd *pd;

    pd = ibv_alloc_pd(context);
    if (!pd) {
        fprintf(stderr, "Error, ibv_alloc_pd() failed\n");
        return -1;
    }

Registering a memory region

ibv_reg_mr() registers a memory region associated with the protection domain to allow the RDMA device to perform read/write operations.

Here is an example:

    struct ibv_pd *pd;
    struct ibv_mr *mr;

    /* buf points to size bytes of memory allocated by the application */
    mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        fprintf(stderr, "Error, ibv_reg_mr() failed\n");
        return -1;
    }

Creating a completion queue

ibv_create_cq() creates a completion queue for an RDMA device context.

An example is given below:

    struct ibv_cq *cq;

    /* Ask for a completion queue with room for at least 100 entries */
    cq = ibv_create_cq(context, 100, NULL, NULL, 0);
    if (!cq) {
        fprintf(stderr, "Error, ibv_create_cq() failed\n");
        return -1;
    }

Creating a queue pair

ibv_create_qp() creates a queue pair associated with a protection domain.

An example is given below:

    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_qp *qp;
    struct ibv_qp_init_attr qp_init_attr;

    memset(&qp_init_attr, 0, sizeof(qp_init_attr));
    qp_init_attr.send_cq = cq;
    qp_init_attr.recv_cq = cq;
    qp_init_attr.qp_type = IBV_QPT_RC;
    qp_init_attr.cap.max_send_wr = 2;
    qp_init_attr.cap.max_recv_wr = 2;
    qp_init_attr.cap.max_send_sge = 1;
    qp_init_attr.cap.max_recv_sge = 1;

    qp = ibv_create_qp(pd, &qp_init_attr);
    if (!qp) {
        fprintf(stderr, "Error, ibv_create_qp() failed\n");
        return -1;
    }

Creating an address vector

ibv_create_ah() creates an address handle associated with a protection domain. An address handle describes the path from the local port to a remote port.

Here is an example of how this is done:

    struct ibv_pd *pd;
    struct ibv_ah *ah;
    struct ibv_ah_attr ah_attr;

    /* dlid, sl and port are the destination LID, service level and
       local port number chosen by the application */
    memset(&ah_attr, 0, sizeof(ah_attr));
    ah_attr.is_global = 0;
    ah_attr.dlid = dlid;
    ah_attr.sl = sl;
    ah_attr.src_path_bits = 0;
    ah_attr.port_num = port;

    ah = ibv_create_ah(pd, &ah_attr);
    if (!ah) {
        fprintf(stderr, "Error, ibv_create_ah() failed\n");
        return -1;
    }

Posting work requests

ibv_post_send() posts a linked list of work requests to the send queue of a queue pair.

Here is an example:

    struct ibv_sge sg;
    struct ibv_send_wr wr;
    struct ibv_send_wr *bad_wr;

    /* Describe the buffer to send; it must lie inside the registered region */
    memset(&sg, 0, sizeof(sg));
    sg.addr = (uintptr_t)buf_addr;
    sg.length = buf_size;
    sg.lkey = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 0;
    wr.sg_list = &sg;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        fprintf(stderr, "Error, ibv_post_send() failed\n");
        return -1;
    }

Polling for completion

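ibv_poll_cq() polls work completions from a completion queue. A returned completion only says that the work request has finished, so its status must also be checked. An example is given below:

    struct ibv_wc wc;
    int num_comp;

    /* Busy-poll until one completion is available or polling fails */
    do {
        num_comp = ibv_poll_cq(cq, 1, &wc);
    } while (num_comp == 0);

    if (num_comp < 0) {
        fprintf(stderr, "ibv_poll_cq() failed\n");
        return -1;
    }

    /* Verify that the work request itself completed successfully */
    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "Work request %llu failed: %s\n",
                (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
        return -1;
    }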

