rds-rdma Subroutine

Purpose

Reliable Datagram Sockets (RDS) zerocopy provides an interface for remote direct memory access (RDMA) over RDS.

Description

The zerocopy interface of RDS was added in RDS version 3. With RDS zerocopy, the client initiates a direct transfer to or from a region of memory in its process address space. This memory need not be aligned.

The client obtains a handle for this region of memory, and passes it to the server. This handle is called the RDMA cookie. To the application, the cookie is an opaque 64-bit data type.

The client sends this handle to the server application, along with other details of the RDMA request, such as which data to transfer to the RDMA memory area. This message is called the RDMA request.

The server uses the RDMA cookie to initiate the requested RDMA transfer. The RDMA transfer is combined atomically with a normal RDS message, which is delivered to the client. This message is called the RDMA ACK. Atomic means that either the RDMA transfer succeeds and the RDMA ACK is delivered, or neither happens.

When the client receives the RDMA ACK, it means that the RDMA completed successfully. If required, it can then release the RDMA cookie for this memory region.

RDMA operations are not reliable. Unlike normal RDS messages, RDS RDMA operations can fail and be dropped.

Interface

The interface is based on control messages that are sent or received through the sendmsg and recvmsg system calls. Optionally, an older interface that is based on the setsockopt system call can be used. The control message interface is recommended because it reduces the number of system calls required.

Control Message Interface

With the control message interface, the RDMA cookie is passed to the server out-of-band, in an extension header that is attached to the RDS message.

The client initiates an operation by sending the RDMA request along with an RDS_CMSG_RDMA_MAP control message. The control message contains the address and length of the memory region for which to obtain a handle, flags, and a pointer to a memory location in the address space of the caller where the kernel stores the RDMA cookie.

Alternatively, if the application already has an RDMA cookie for the memory range that it wants to transfer to or from, it can give this cookie to the kernel by using the RDS_CMSG_RDMA_DEST control message.

The kernel includes the resulting RDMA cookie in an extension header that is transmitted as part of the RDMA request to the server.
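The client-side step described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name rds_attach_rdma_map is hypothetical, the numeric values of SOL_RDS and RDS_CMSG_RDMA_MAP are placeholders that must come from the platform's RDS header, and the struct layouts are copied from this page with uint64_t standing in for u_int64_t.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values, assumed for illustration only; the real
 * definitions come from the platform's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_MAP
#define RDS_CMSG_RDMA_MAP 3
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
        uint64_t addr;
        uint64_t bytes;
};

struct rds_get_mr_args {
        struct rds_iovec vec;
        uint64_t cookie_addr;
        uint64_t flags;
};

/*
 * Hypothetical helper: attach an RDS_CMSG_RDMA_MAP control message to msg.
 * ctrl must be suitably aligned for struct cmsghdr and provide at least
 * CMSG_SPACE(sizeof(struct rds_get_mr_args)) bytes.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
int rds_attach_rdma_map(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                        void *region, size_t region_len,
                        rds_rdma_cookie_t *cookie_out, uint64_t flags)
{
        struct rds_get_mr_args args;
        struct cmsghdr *cmsg;

        if (ctrl_len < CMSG_SPACE(sizeof(args)))
                return -1;

        memset(ctrl, 0, ctrl_len);
        msg->msg_control = ctrl;
        msg->msg_controllen = CMSG_SPACE(sizeof(args));

        /* Describe the region to map and where the kernel stores the cookie. */
        memset(&args, 0, sizeof(args));
        args.vec.addr = (uint64_t)(uintptr_t)region;
        args.vec.bytes = region_len;
        args.cookie_addr = (uint64_t)(uintptr_t)cookie_out;
        args.flags = flags;

        cmsg = CMSG_FIRSTHDR(msg);
        cmsg->cmsg_level = SOL_RDS;
        cmsg->cmsg_type = RDS_CMSG_RDMA_MAP;
        cmsg->cmsg_len = CMSG_LEN(sizeof(args));
        memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
        return 0;
}
```

The caller would still fill in msg_name and msg_iov with the destination address and the RDMA request payload before passing the message to sendmsg.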

When the server receives the RDMA request, the kernel delivers the cookie within an RDS_CMSG_RDMA_DEST message. The server initiates the data transfer by sending the RDMA ACK message along with an RDS_CMSG_RDMA_ARGS control message. This message contains the RDMA cookie and the local memory to copy to or from.

The server process can request a notification when an RDMA operation completes. The notifications are delivered as RDS_CMSG_RDMA_STATUS control messages. When an application calls recvmsg, it receives either a regular RDS message, possibly with other RDMA-related control messages, or an empty message with one or more status control messages. An application can also ask to be notified whenever an RDMA operation fails and is discarded, regardless of whether it requested a success notification for that individual operation.

To receive these failure notifications, set the RDS_RECVERR socket option.

Setsockopt Interface

A process can register and release memory ranges for RDMA with RDS through setsockopt calls.

RDS_GET_MR: To obtain an RDMA cookie for a memory range, the application can use the setsockopt call with the RDS_GET_MR option. This option works like the RDS_CMSG_RDMA_MAP control message. The argument contains the address and length of the memory range to be registered, and a pointer to an RDMA cookie variable where the system call stores the cookie for the registered range.
RDS_FREE_MR: Memory ranges are released by calling the setsockopt call with the RDS_FREE_MR option. The argument specifies the RDMA cookie and flags.
RDS_RECVERR: This is a Boolean option that can be set by using the setsockopt call and queried by using the getsockopt call. When enabled, RDS sends RDMA notification messages to the application for any RDMA operation that fails. This option is off by default.

For all the calls, the level argument to the setsockopt call is SOL_RDS.
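As a rough sketch of the setsockopt interface, the two helpers below register a memory range and turn on failure notifications. The helper names are hypothetical, the numeric option values are placeholders for those in the platform's RDS header, and passing RDS_RECVERR as an int-sized value is an assumption.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values, assumed for illustration only; the real
 * definitions come from the platform's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_GET_MR
#define RDS_GET_MR 2
#endif
#ifndef RDS_RECVERR
#define RDS_RECVERR 5
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
        uint64_t addr;
        uint64_t bytes;
};

struct rds_get_mr_args {
        struct rds_iovec vec;
        uint64_t cookie_addr;
        uint64_t flags;
};

/* Hypothetical helper: register [region, region + len) on RDS socket fd.
 * On success the kernel stores the RDMA cookie in *cookie_out.
 * Returns the setsockopt result (0 on success, -1 with errno set). */
int rds_register_region(int fd, void *region, size_t len,
                        rds_rdma_cookie_t *cookie_out)
{
        struct rds_get_mr_args args;

        memset(&args, 0, sizeof(args));
        args.vec.addr = (uint64_t)(uintptr_t)region;
        args.vec.bytes = len;
        args.cookie_addr = (uint64_t)(uintptr_t)cookie_out;
        args.flags = 0; /* no RDS_RDMA_USE_ONCE: release later with RDS_FREE_MR */

        return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}

/* Hypothetical helper: ask for notifications about failed RDMA operations.
 * Assumes the Boolean option is passed as an int. */
int rds_enable_recverr(int fd)
{
        int on = 1;

        return setsockopt(fd, SOL_RDS, RDS_RECVERR, &on, sizeof(on));
}
```

Both helpers expect fd to be an already bound RDS socket; with any other descriptor the calls simply fail and report the error through errno.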

RDMA Macros and types

RDMA cookie

typedef u_int64_t       rds_rdma_cookie_t;

The cookie encapsulates a memory location in the client process. It contains the R_Key of the remote memory region, and the offset into it so that alignment is not a concern for the application. The RDMA cookie is used in several struct types. The RDS_CMSG_RDMA_DEST control message contains an rds_rdma_cookie_t as payload.

Mapping arguments

The following data types are used with the RDS_CMSG_RDMA_MAP control message and with the RDS_GET_MR socket option:

struct rds_iovec {
        u_int64_t       addr;
        u_int64_t       bytes;
};

struct rds_get_mr_args {
        struct rds_iovec vec;
        u_int64_t       cookie_addr;
        uint64_t        flags;
};

The cookie_addr parameter specifies a memory location to store the RDMA cookie.

The flags value is a bitwise OR of any of the following flags:

RDS_RDMA_USE_ONCE

This flag tells the kernel that the allocated RDMA cookie is to be used exactly once. When the RDMA ACK message is received, the kernel automatically unbinds the memory area and releases any resources that are associated with the cookie. If this flag is not set, the application must release the memory region by using the RDS_FREE_MR socket option.

RDS_RDMA_INVALIDATE

Normally, RDMA memory mappings are not invalidated immediately when they are released, because invalidation requires synchronization with the HCA, which is costly. As a result, the server application can continue to access the registered memory for an indeterminate amount of time. When this flag is set, the RDS code invalidates the mapping at the time it is released, which can happen in two ways:

When the RDMA ACK is received and the RDS_RDMA_USE_ONCE flag was set.
When the application releases the memory by using the RDS_FREE_MR socket option.

RDMA Operation

RDMA operations are initiated by the server by using the RDS_CMSG_RDMA_ARGS control message, which takes the following data as payload:

struct rds_rdma_args {
        rds_rdma_cookie_t cookie;
        struct rds_iovec remote_vec;
        u_int64_t       local_vec_addr;
        u_int64_t       nr_local;
        u_int64_t       flags;
        u_int32_t       user_token;
};

The cookie argument contains the RDMA cookie received from the client. The local memory is given as an array of rds_iovec structures. The array address is specified in the local_vec_addr field, and its number of elements is specified in the nr_local field. The struct member remote_vec specifies a location relative to the memory area that is identified by the cookie: remote_vec.addr is an offset into that region, and remote_vec.bytes is the length of the memory window to copy to or from. This length must match the size of the local memory area, that is, the sum of bytes in all members of the local iovec array. The flags field contains the bitwise OR of the following flags:

RDS_RDMA_READWRITE: Performs an RDMA WRITE from the memory of the server to the client when the flag is set. If not set, RDS does an RDMA READ from the memory of the client to the memory of the server.
RDS_RDMA_FENCE: By default, InfiniBand does not guarantee the ordering of an RDMA READ with respect to subsequent SEND operations. When this flag is set, a fence is placed between the RDMA READ and the subsequent RDS ACK message. Setting this flag requires an additional round trip over the InfiniBand fabric. Even so, it is advisable to set this flag by default.
RDS_RDMA_NOTIFY_ME: This flag requests a notification on completion of the RDMA operation whether successful or otherwise. The notification contains the value of the user_token field that is passed by the application. This flag allows the application to release resources such as buffers that are associated with the RDMA transfer. The user_token can be used to pass an application-specific identifier to the kernel. This token is returned to the application when a status notification is generated.
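The length rule and the flags above can be illustrated with a small sketch that fills in a struct rds_rdma_args and rejects a request whose local and remote sizes disagree, mirroring the EINVAL case the kernel reports. The helper names are hypothetical, the flag values are placeholders for those in the platform's RDS header, and the struct layout follows this page with uint64_t and uint32_t standing in for the u_int types.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values, assumed for illustration only; the real
 * definitions come from the platform's RDS header. */
#ifndef RDS_RDMA_READWRITE
#define RDS_RDMA_READWRITE 0x0001
#endif
#ifndef RDS_RDMA_FENCE
#define RDS_RDMA_FENCE 0x0002
#endif
#ifndef RDS_RDMA_NOTIFY_ME
#define RDS_RDMA_NOTIFY_ME 0x0020
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
        uint64_t addr;
        uint64_t bytes;
};

struct rds_rdma_args {
        rds_rdma_cookie_t cookie;
        struct rds_iovec remote_vec;
        uint64_t local_vec_addr;
        uint64_t nr_local;
        uint64_t flags;
        uint32_t user_token;
};

/* Sum of the bytes members of all n local iovec entries. */
uint64_t rds_local_bytes(const struct rds_iovec *vec, size_t n)
{
        uint64_t total = 0;
        size_t i;

        for (i = 0; i < n; i++)
                total += vec[i].bytes;
        return total;
}

/* Hypothetical helper: fill *args for one RDMA operation.
 * Returns 0 on success, or -1 with errno set to EINVAL when the local
 * iovec total does not match the remote window length. */
int rds_prepare_rdma(struct rds_rdma_args *args, rds_rdma_cookie_t cookie,
                     uint64_t remote_offset, uint64_t remote_bytes,
                     const struct rds_iovec *local, size_t nr_local,
                     uint64_t flags, uint32_t user_token)
{
        if (rds_local_bytes(local, nr_local) != remote_bytes) {
                errno = EINVAL;
                return -1;
        }

        memset(args, 0, sizeof(*args));
        args->cookie = cookie;
        args->remote_vec.addr = remote_offset;  /* offset into the mapped region */
        args->remote_vec.bytes = remote_bytes;  /* length of the window */
        args->local_vec_addr = (uint64_t)(uintptr_t)local;
        args->nr_local = nr_local;
        args->flags = flags;
        args->user_token = user_token;
        return 0;
}
```

The filled structure would then be sent as the payload of an RDS_CMSG_RDMA_ARGS control message attached to the RDMA ACK, for example with flags set to RDS_RDMA_READWRITE | RDS_RDMA_NOTIFY_ME for a notified RDMA WRITE.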

RDMA Notification

The RDS kernel code can notify the server application when an RDMA operation completes. These notifications are delivered through RDS_CMSG_RDMA_STATUS control messages. By default, no notifications are generated. An application can request them in two ways. Status notifications can be enabled on a per-operation basis by setting the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. Alternatively, the application can request notifications for all RDMA operations that fail by setting the RDS_RECVERR socket option. In both cases, the format of the notification is the same, and only one notification is sent per completed operation. The format of the message is as shown:

struct rds_rdma_notify {
        u_int32_t       user_token;
        int32_t         status;
};

The user_token field contains the value that the application previously passed to the kernel in the RDS_CMSG_RDMA_ARGS control message. The status field contains a status value, with 0 indicating success, and non-zero indicating an error. The following status codes are defined:

RDS_RDMA_SUCCESS: The RDMA operation succeeded.
RDS_RDMA_REMOTE_ERROR: The RDMA operation failed due to a remote access error. This error can be caused by an invalid R_Key, offset, or transfer size.
RDS_RDMA_CANCELED: The RDMA operation was canceled by the application.
RDS_RDMA_DROPPED: The RDMA operation was discarded after the connection failed and was reestablished. The RDMA operation might have been processed partially.
RDS_RDMA_OTHER_ERROR: Any other failure.
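A receiver might pick the status notifications out of a message returned by recvmsg roughly as sketched below. The helper names are hypothetical, the numeric values of SOL_RDS, RDS_CMSG_RDMA_STATUS, and the status codes are placeholders for those in the platform's RDS header, and the notification layout is the one shown on this page.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values, assumed for illustration only; the real
 * definitions come from the platform's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_STATUS
#define RDS_CMSG_RDMA_STATUS 4
#endif
#ifndef RDS_RDMA_SUCCESS
#define RDS_RDMA_SUCCESS 0
#define RDS_RDMA_REMOTE_ERROR 1
#define RDS_RDMA_CANCELED 2
#define RDS_RDMA_DROPPED 3
#define RDS_RDMA_OTHER_ERROR 4
#endif

struct rds_rdma_notify {
        uint32_t user_token;
        int32_t status;
};

/* Map a status code to a short label for logging. */
const char *rds_rdma_status_name(int32_t status)
{
        switch (status) {
        case RDS_RDMA_SUCCESS:      return "success";
        case RDS_RDMA_REMOTE_ERROR: return "remote error";
        case RDS_RDMA_CANCELED:     return "canceled";
        case RDS_RDMA_DROPPED:      return "dropped";
        default:                    return "other error";
        }
}

/* Hypothetical helper: copy up to max status notifications from the
 * control data of msg into out and return how many were found. */
size_t rds_collect_notifications(struct msghdr *msg,
                                 struct rds_rdma_notify *out, size_t max)
{
        struct cmsghdr *cmsg;
        size_t n = 0;

        for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL && n < max;
             cmsg = CMSG_NXTHDR(msg, cmsg)) {
                if (cmsg->cmsg_level != SOL_RDS ||
                    cmsg->cmsg_type != RDS_CMSG_RDMA_STATUS)
                        continue;       /* some other control message */
                if (cmsg->cmsg_len < CMSG_LEN(sizeof(out[n])))
                        continue;       /* too short to hold a notification */
                memcpy(&out[n], CMSG_DATA(cmsg), sizeof(out[n]));
                n++;
        }
        return n;
}
```

The user_token in each collected entry lets the application match the notification to the buffers it associated with that RDMA operation.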

RDMA setsockopt arguments

When the application uses the RDS_GET_MR socket option to register a memory range, it passes a pointer to a struct rds_get_mr_args variable. The RDS_FREE_MR call accepts an argument of type struct rds_free_mr_args:

struct rds_free_mr_args {
        rds_rdma_cookie_t cookie;
        u_int64_t       flags;
};

Where cookie specifies the RDMA cookie to be released. RDMA access to the memory range might not be revoked instantly because the operation is costly. However, if the flags argument contains RDS_RDMA_INVALIDATE, RDS invalidates the mapping immediately. If the cookie argument is 0, and RDS_RDMA_INVALIDATE is set, RDS invalidates old memory mappings on all devices.
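The release cases described above could be wrapped as in the following sketch. The helper names are hypothetical, and the numeric values of SOL_RDS, RDS_FREE_MR, and RDS_RDMA_INVALIDATE are placeholders for those in the platform's RDS header.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values, assumed for illustration only; the real
 * definitions come from the platform's RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_FREE_MR
#define RDS_FREE_MR 3
#endif
#ifndef RDS_RDMA_INVALIDATE
#define RDS_RDMA_INVALIDATE 0x0004
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_free_mr_args {
        rds_rdma_cookie_t cookie;
        uint64_t flags;
};

/* Build the RDS_FREE_MR argument; invalidate_now selects immediate
 * invalidation instead of the default deferred behavior. */
struct rds_free_mr_args rds_make_free_args(rds_rdma_cookie_t cookie,
                                           int invalidate_now)
{
        struct rds_free_mr_args args;

        memset(&args, 0, sizeof(args));
        args.cookie = cookie;
        args.flags = invalidate_now ? RDS_RDMA_INVALIDATE : 0;
        return args;
}

/* Hypothetical helper: release the range identified by cookie. */
int rds_release_region(int fd, rds_rdma_cookie_t cookie, int invalidate_now)
{
        struct rds_free_mr_args args = rds_make_free_args(cookie, invalidate_now);

        return setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
}

/* Hypothetical helper: cookie 0 plus RDS_RDMA_INVALIDATE asks RDS to
 * invalidate old memory mappings on all devices. */
int rds_flush_old_mappings(int fd)
{
        return rds_release_region(fd, 0, 1);
}
```

Whether to pay for immediate invalidation is a trade-off: it closes the window in which the peer can still reach the memory, at the cost of the synchronization described above.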

Errors

In addition to the usual error codes returned by sendmsg, recvmsg and setsockopt system calls, RDS returns the following error codes:

EAGAIN: RDS was unable to map a memory range because the limit was exceeded (returned by RDS_CMSG_RDMA_MAP and RDS_GET_MR).
EINVAL: When a message was sent, there were conflicting control messages (for example, two RDMA_MAP messages, or an RDMA_MAP and an RDMA_DEST message). In an RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the application specified a memory range that is greater than the maximum size supported. When an RDMA operation was set up with RDS_CMSG_RDMA_ARGS, the size of the local memory specified in the rds_iovec array did not match the size of the remote memory range.
EBUSY: RDS was unable to obtain a DMA mapping for the indicated memory.
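For logging or diagnostics, an application might translate these RDS-specific error codes into short hints, as in the sketch below. The helper name rds_rdma_errno_hint is hypothetical, and the hint texts only paraphrase the descriptions above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: return a short hint for the RDS RDMA meaning of
 * err, or NULL when err is not one of the codes listed above. */
const char *rds_rdma_errno_hint(int err)
{
        switch (err) {
        case EAGAIN:
                return "memory range could not be mapped: limit exceeded";
        case EINVAL:
                return "conflicting control messages, oversized memory range, "
                       "or local/remote size mismatch";
        case EBUSY:
                return "no DMA mapping could be obtained for the memory";
        default:
                return NULL;
        }
}
```

A NULL result means the error is one of the usual sendmsg, recvmsg, or setsockopt codes and can be reported with strerror instead.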