rds-rdma Subroutine
Purpose
Reliable Datagram Sockets (RDS) zerocopy provides an interface for remote direct memory access (RDMA) over RDS.
Description
The zerocopy interface of RDS was added in RDS Version 3. With RDS zerocopy, the client initiates a direct transfer to or from an area of memory in its process address space. This memory does not need to be aligned.
The client obtains a handle for this region of memory and passes it to the server. This handle is called the RDMA cookie. To the application, the cookie is an opaque 64-bit data type.
The client sends this handle to the server application, along with other details of the RDMA request such as the data to transfer to the RDMA memory area. This message is called the RDMA request.
The server uses the RDMA cookie to initiate the requested RDMA transfer. The RDMA transfer is combined atomically with a normal RDS message, which is delivered to the client. This message is called the RDMA ACK. Atomic means that either the RDMA succeeds and the RDMA ACK is delivered, or neither happens.
When the client receives the RDMA ACK, it means that the RDMA completed successfully. If required, it can then release the RDMA cookie for this memory region.
RDMA operations are not reliable. Unlike normal RDS messages, RDS RDMA operations can fail and be dropped.
Interface
The interface is based on control messages that are sent or received through the sendmsg and recvmsg system calls. Optionally, an older interface that is based on the setsockopt system call can be used. The control message interface is recommended because it reduces the number of system calls required.
Control Message Interface
With the control message interface, the RDMA cookie is passed to the server out-of-band, in an extension header that is attached to the RDS message.
Initially, the client sends RDMA requests to the server along with an RDS_CMSG_RDMA_MAP control message. The control message contains the address and length of the memory region for which to obtain a handle, flags, and a pointer to a memory location in the address space of the caller where the kernel stores the RDMA cookie.
Alternatively, if the application already has an RDMA cookie for the memory range that is the source or target of the RDMA request, it can give this cookie to the kernel by using the RDS_CMSG_RDMA_DEST control message.
The kernel includes the resulting RDMA cookie in an extension header that is transmitted as part of the RDMA request to the server.
When the server receives the RDMA request, the kernel delivers the cookie within an RDS_CMSG_RDMA_DEST control message.
The server initiates the data transfer by sending the RDMA ACK message along with an RDS_CMSG_RDMA_ARGS control message. This message contains the RDMA cookie and the local memory to copy to or from.
The server process can request a notification when an RDMA operation completes. The notifications are delivered as RDS_CMSG_RDMA_STATUS control messages. When an application calls the recvmsg call, it receives either a regular RDS message, possibly with other RDMA-related control messages, or an empty message with one or more status control messages.
When an RDMA operation fails and is discarded, the application can ask to receive notifications for failed messages, regardless of whether it requested a success notification for that individual message.
To enable notifications for failed operations, set the RDS_RECVERR socket option.
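The server-side step of receiving the cookie can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name extract_rdma_cookie is hypothetical, and the numeric values given for SOL_RDS and RDS_CMSG_RDMA_DEST are placeholders that are assumed only so that the sketch compiles on its own. Take the real definitions from the RDS header of your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values: assumptions for illustration only. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_DEST
#define RDS_CMSG_RDMA_DEST 2
#endif

typedef uint64_t rds_rdma_cookie_t;

/*
 * Hypothetical helper: scan the control messages of a message that was
 * filled in by recvmsg and copy out the cookie carried by the first
 * RDS_CMSG_RDMA_DEST control message.
 * Returns 1 if a cookie was found, 0 otherwise.
 */
int extract_rdma_cookie(struct msghdr *msg, rds_rdma_cookie_t *cookie)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && cookie != NULL);

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_RDS &&
            cmsg->cmsg_type == RDS_CMSG_RDMA_DEST &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(*cookie))) {
            /* CMSG_DATA may be unaligned for a 64-bit load, so copy. */
            memcpy(cookie, CMSG_DATA(cmsg), sizeof(*cookie));
            return 1;
        }
    }
    return 0;
}
```

A server would typically call this helper on the msghdr returned by recvmsg, before it builds the RDS_CMSG_RDMA_ARGS control message for the RDMA ACK.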
Setsockopt Interface
- RDS_GET_MR
- To obtain an RDMA cookie for a memory range, the application can use the setsockopt call with the RDS_GET_MR option. This option works like the RDS_CMSG_RDMA_MAP control message. The argument contains the address and length of the memory range to be registered, and a pointer to an RDMA cookie variable where the system call stores the cookie for the registered range.
- RDS_FREE_MR
- Memory ranges are released by calling the setsockopt call with the RDS_FREE_MR option. The argument specifies the RDMA cookie and flags.
- RDS_RECVERR
- This is a Boolean option that can be set by using the setsockopt call and queried by using the getsockopt call. When enabled, RDS sends RDMA notification messages to the application for any RDMA operation that fails. This option is off by default.
For all of these calls, the level argument is SOL_RDS.
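The three socket options can be wrapped as shown in the following sketch. The wrapper names (rds_register_range, rds_release_range, rds_set_recverr) are hypothetical, the struct layouts are copied from the definitions given in this document, and the numeric option values are placeholders assumed for illustration. Use the values from the RDS header of your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Placeholder values: assumptions for illustration only. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_GET_MR
#define RDS_GET_MR 2
#endif
#ifndef RDS_FREE_MR
#define RDS_FREE_MR 3
#endif
#ifndef RDS_RECVERR
#define RDS_RECVERR 5
#endif

typedef uint64_t rds_rdma_cookie_t;

/* Layouts mirror the structs documented for this interface. */
struct rds_iovec { uint64_t addr; uint64_t bytes; };
struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};
struct rds_free_mr_args { rds_rdma_cookie_t cookie; uint64_t flags; };

/* Register buf[0..len) and let the kernel store the cookie in *cookie. */
int rds_register_range(int fd, void *buf, size_t len, uint64_t flags,
                       rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args = {
        .vec = { .addr = (uint64_t)(uintptr_t)buf, .bytes = len },
        .cookie_addr = (uint64_t)(uintptr_t)cookie,
        .flags = flags,
    };

    assert(cookie != NULL);
    return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}

/* Release a previously registered range identified by its cookie. */
int rds_release_range(int fd, rds_rdma_cookie_t cookie, uint64_t flags)
{
    struct rds_free_mr_args args = { .cookie = cookie, .flags = flags };

    return setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
}

/* Turn failure notifications on (enabled != 0) or off (enabled == 0). */
int rds_set_recverr(int fd, int enabled)
{
    return setsockopt(fd, SOL_RDS, RDS_RECVERR, &enabled, sizeof(enabled));
}
```

Each wrapper returns the result of setsockopt unchanged, so the caller can inspect errno for the error codes that are listed in the Errors section.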
RDMA Macros and types
typedef u_int64_t rds_rdma_cookie_t;
This cookie encapsulates a memory location in the client process. The cookie contains the R_Key of the remote memory region and the offset into it, so that alignment is not a concern for the application. The RDMA cookie is used in several struct types. The RDS_CMSG_RDMA_DEST control message contains an rds_rdma_cookie_t as payload.
Mapping arguments
The following data types are used with RDS_CMSG_RDMA_MAP control messages and with the RDS_GET_MR socket option:
struct rds_iovec {
    u_int64_t addr;
    u_int64_t bytes;
};
struct rds_get_mr_args {
    struct rds_iovec vec;
    u_int64_t cookie_addr;
    uint64_t flags;
};
The cookie_addr field specifies a memory location in which to store the RDMA cookie. The flags field is a bitwise OR of the following flags:
- RDS_RDMA_USE_ONCE
- This flag tells the kernel that the allocated RDMA cookie is to be used only one time. When the RDMA ACK message is received, the kernel automatically unbinds the memory area and releases any resources that are associated with the cookie. If this flag is not set, the application must release the memory region by using the RDS_FREE_MR socket option.
- RDS_RDMA_INVALIDATE
- By default, RDMA memory mappings are not invalidated immediately when they are released, because invalidation requires costly synchronization with the HCA. As a result, the server application might still be able to access the registered memory for some time after the release. When this flag is set, the RDS code invalidates the mapping at the time it is released, which can happen in two ways:
  - When an RDMA ACK is received and the RDS_RDMA_USE_ONCE flag is set
  - When the application releases the memory by using the RDS_FREE_MR socket option
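The mapping arguments can be attached to an outgoing message as sketched here. The helper name attach_rdma_map is hypothetical, the struct layouts follow the definitions in this document, and the numeric values of SOL_RDS, RDS_CMSG_RDMA_MAP, and RDS_RDMA_USE_ONCE are placeholders assumed for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values: assumptions for illustration only. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_MAP
#define RDS_CMSG_RDMA_MAP 3
#endif
#ifndef RDS_RDMA_USE_ONCE
#define RDS_RDMA_USE_ONCE 0x0008
#endif

typedef uint64_t rds_rdma_cookie_t;
struct rds_iovec { uint64_t addr; uint64_t bytes; };
struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Hypothetical helper: place one RDS_CMSG_RDMA_MAP control message in
 * ctrl and hook it into msg. The kernel is expected to store the cookie
 * in *cookie_out when the message is sent. The mapping is marked
 * RDS_RDMA_USE_ONCE so that it is released when the RDMA ACK arrives.
 * Returns 0 on success, or -1 if ctrl is too small.
 */
int attach_rdma_map(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                    void *region, size_t region_len,
                    rds_rdma_cookie_t *cookie_out)
{
    struct rds_get_mr_args args;
    struct cmsghdr *cmsg;

    assert(msg != NULL && ctrl != NULL && cookie_out != NULL);
    if (ctrl_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(ctrl, 0, ctrl_len);
    msg->msg_control = ctrl;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)region;
    args.vec.bytes = region_len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie_out;
    args.flags = RDS_RDMA_USE_ONCE;

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type = RDS_CMSG_RDMA_MAP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

The control buffer should be suitably aligned for struct cmsghdr, for example by declaring it inside a union with a struct cmsghdr member. The cookie variable must stay valid until sendmsg returns.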
The server initiates RDMA operations by using the RDS_CMSG_RDMA_ARGS control message, which takes the following data as payload:
struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    u_int64_t local_vec_addr;
    u_int64_t nr_local;
    u_int64_t flags;
    u_int32_t user_token;
};
The cookie field contains the RDMA cookie. The local memory is specified as an array of rds_iovecs. The address of the array is specified in the local_vec_addr field, and its number of elements is specified in the nr_local field.
The struct member remote_vec specifies a location relative to the memory area that is identified by the cookie: remote_vec.addr is an offset into that region, and remote_vec.bytes is the length of the memory window to copy. This length must match the size of the local memory area, that is, the sum of bytes in all members of the local iovec array. The flags field contains the bitwise OR of the following flags:
- RDS_RDMA_READWRITE
- Performs an RDMA WRITE from the memory of the server to the client when the flag is set. If not set, RDS does an RDMA READ from the memory of the client to the memory of the server.
- RDS_RDMA_FENCE
- InfiniBand does not guarantee the ordering of an RDMA READ relative to subsequent SEND operations. When this flag is set, the RDMA READ is fenced off from the subsequent RDS ACK message. Setting this flag requires an additional round trip over the InfiniBand fabric, but setting it is nevertheless recommended.
- RDS_RDMA_NOTIFY_ME
- This flag requests a notification on completion of the RDMA operation, whether successful or otherwise. The notification contains the value of the user_token field that was passed by the application. The notification allows the application to release resources, such as buffers, that are associated with the RDMA transfer.
The user_token field can be used to pass an application-specific identifier to the kernel. This token is returned to the application when a status notification is generated.
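The length rule and the flags can be illustrated with the following sketch, which fills in a struct rds_rdma_args and rejects a request whose local and remote sizes differ. The helper names local_total_bytes and prepare_rdma_args are hypothetical, and the flag values are placeholders assumed for illustration only. The local check mirrors the mismatch that the document lists under EINVAL, but it is not the kernel's own validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values: assumptions for illustration only. */
#ifndef RDS_RDMA_READWRITE
#define RDS_RDMA_READWRITE 0x0001
#endif
#ifndef RDS_RDMA_FENCE
#define RDS_RDMA_FENCE 0x0002
#endif
#ifndef RDS_RDMA_NOTIFY_ME
#define RDS_RDMA_NOTIFY_ME 0x0020
#endif

typedef uint64_t rds_rdma_cookie_t;
struct rds_iovec { uint64_t addr; uint64_t bytes; };
struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    uint64_t local_vec_addr;
    uint64_t nr_local;
    uint64_t flags;
    uint32_t user_token;
};

/* Sum of the bytes fields of all elements of the local iovec array. */
uint64_t local_total_bytes(const struct rds_iovec *local, size_t nr_local)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < nr_local; i++)
        total += local[i].bytes;
    return total;
}

/*
 * Hypothetical helper: describe a transfer of remote_len bytes starting
 * at remote_offset inside the region named by cookie. If write_to_client
 * is nonzero, an RDMA WRITE is requested; otherwise an RDMA READ.
 * Returns 0 on success, or -1 if the local and remote sizes differ.
 */
int prepare_rdma_args(struct rds_rdma_args *args, rds_rdma_cookie_t cookie,
                      uint64_t remote_offset, uint64_t remote_len,
                      const struct rds_iovec *local, size_t nr_local,
                      int write_to_client, uint32_t token)
{
    assert(args != NULL && local != NULL);
    if (local_total_bytes(local, nr_local) != remote_len)
        return -1;

    memset(args, 0, sizeof(*args));
    args->cookie = cookie;
    args->remote_vec.addr = remote_offset;   /* offset into the region */
    args->remote_vec.bytes = remote_len;     /* window length */
    args->local_vec_addr = (uint64_t)(uintptr_t)local;
    args->nr_local = nr_local;
    args->flags = RDS_RDMA_FENCE | RDS_RDMA_NOTIFY_ME;
    if (write_to_client)
        args->flags |= RDS_RDMA_READWRITE;
    args->user_token = token;                /* echoed in the notification */
    return 0;
}
```

The filled structure would then be copied into an RDS_CMSG_RDMA_ARGS control message that accompanies the RDMA ACK. The local iovec array must remain valid until sendmsg returns.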
RDMA completion notifications are delivered as RDS_CMSG_RDMA_STATUS control messages. By default, no notifications are generated. An application can request them in two ways. Status notifications can be enabled on a per-operation basis by setting the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. Alternatively, the application can request notifications for all RDMA operations that fail by setting the RDS_RECVERR socket option. In both cases, the format of the notification is the same and one notification is sent per completed operation. The format of the message is as shown:
struct rds_rdma_notify {
    u_int32_t user_token;
    int32_t status;
};
The user_token field contains the value that was previously passed to the kernel in the RDS_CMSG_RDMA_ARGS control message. The status field contains a status value, with 0 indicating success and a nonzero value indicating an error. The following status codes are defined:
- RDS_RDMA_SUCCESS
- The RDMA operation succeeded.
- RDS_RDMA_REMOTE_ERROR
- The RDMA operation failed due to a remote access error. This error is typically caused by an invalid R_Key, offset, or transfer size.
- RDS_RDMA_CANCELED
- The RDMA operation was canceled by the application.
- RDS_RDMA_DROPPED
- The RDMA operation was discarded after the connection failed and was reestablished. The RDMA operation might have been processed partially.
- RDS_RDMA_OTHER_ERROR
- Any other failure.
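Status notifications can be collected from a received message as sketched here. The helper name collect_rdma_status is hypothetical, the struct layout follows the definition in this document, and the numeric values of SOL_RDS and RDS_CMSG_RDMA_STATUS are placeholders assumed for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values: assumptions for illustration only. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_STATUS
#define RDS_CMSG_RDMA_STATUS 4
#endif

struct rds_rdma_notify {
    uint32_t user_token;
    int32_t status;        /* 0 indicates success, nonzero an error */
};

/*
 * Hypothetical helper: copy up to max_out status notifications from the
 * control messages of msg (as filled in by recvmsg) into out.
 * Returns the number of notifications copied.
 */
size_t collect_rdma_status(struct msghdr *msg, struct rds_rdma_notify *out,
                           size_t max_out)
{
    struct cmsghdr *cmsg;
    size_t n = 0;

    assert(msg != NULL && out != NULL);

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL && n < max_out;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS ||
            cmsg->cmsg_type != RDS_CMSG_RDMA_STATUS)
            continue;               /* not a status notification */
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(out[n])))
            continue;               /* payload too short, skip it */
        memcpy(&out[n], CMSG_DATA(cmsg), sizeof(out[n]));
        n++;
    }
    return n;
}
```

The caller can match each user_token against the value that it passed in the RDMA arguments, and release the associated buffers once the status for that token has been seen.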
When using the RDS_GET_MR socket option to register a memory range, the application passes a pointer to a struct rds_get_mr_args variable. The RDS_FREE_MR call accepts an argument of type struct rds_free_mr_args:
struct rds_free_mr_args {
    rds_rdma_cookie_t cookie;
    u_int64_t flags;
};
The cookie field specifies the RDMA cookie to be released.
RDMA access to the memory range might not be revoked instantly because the operation is costly. However, if the flags argument contains RDS_RDMA_INVALIDATE, RDS invalidates the mapping immediately. If the cookie argument is 0 and RDS_RDMA_INVALIDATE is set, RDS invalidates old memory mappings on all devices.
Errors
- EAGAIN
- RDS was unable to map a memory range because the limit was exceeded (returned by RDS_CMSG_RDMA_MAP and RDS_GET_MR).
- EINVAL
- One of the following conditions occurred: when a message was sent, it contained conflicting control messages (for example, two RDMA_MAP messages, or an RDMA_MAP and an RDMA_DEST message); in an RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the memory range specified by the application was larger than the maximum size supported; or, when an RDMA operation was set up with RDS_CMSG_RDMA_ARGS, the size of the local memory specified in the rds_iovec array did not match the size of the remote memory range.
- EBUSY
- RDS was unable to obtain a DMA mapping for the indicated memory.