rds Subroutine

Purpose

Reliable Datagram Sockets (RDS) provides reliable, in-order datagram delivery between sockets across various network transports.

Library

#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/bypass.h>
#include <net/rds_rdma.h>

Description

RDS is an implementation of the RDS Application Programming Interface (API). RDS can be transported through InfiniBand and loopback. RDS through TCP is disabled. RDS uses the standard AF_INET addresses to identify the endpoints.

Socket Creation

RDS sockets are created as follows:

rds_socket = socket(AF_BYPASS, SOCK_SEQPACKET, BYPASSPROTO_RDS);

Socket Options

RDS supports multiple socket options through the setsockopt and getsockopt calls. The following options with the SOL_SOCKET socket level are important.

SO_RCVBUF: Specifies the size of the receive buffer. See Congestion Control.

SO_SNDBUF: Specifies the size of the send buffer. See Message Transmission.

SO_SNDTIMEO: Specifies the send timeout of the socket when you enqueue a message on a socket with a full queue in the blocking mode.

RDS also supports multiple protocol-specific options with the SOL_RDS socket level.

Binding

A new RDS socket has no local address when it is initially returned from the socket call. The socket must be bound to a local address by running the bind system call before any messages are sent or received. The bind call attaches the socket to a specific network transport, which is based on the type of interface the local address is attached to. From that point on, the socket can reach the destinations that are available through this network transport.

For instance, when binding to the address of an InfiniBand interface, such as ib0, the socket uses the InfiniBand transport system. If RDS is not able to associate a transport system with the specific address, it returns the EADDRNOTAVAIL value.

An RDS socket can be bound to only one address, and only one socket can be bound to a specific address and port pair. If no port is specified in the binding address, an unbound port is selected at random.

RDS does not permit the application to bind a previously bound socket to another address. Binding to the INADDR_ANY wildcard address is not allowed.
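The binding rules above can be checked before the bind call is made. The following is a minimal sketch, not part of the RDS API: rds_make_bind_addr is a hypothetical helper name, and it assumes that the local address is supplied as dotted-decimal text.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: fill *out with an AF_INET address for bind().
 * Returns 0 on success, or -1 if the text is not a valid IPv4 address
 * or is the INADDR_ANY wildcard, which RDS does not accept.
 * Pass port 0 to let RDS select an unbound port.
 */
int rds_make_bind_addr(const char *ipv4, uint16_t port,
                       struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &out->sin_addr) != 1)
        return -1;                      /* not a dotted-decimal address */
    if (out->sin_addr.s_addr == htonl(INADDR_ANY))
        return -1;                      /* wildcard binding is not allowed */
    return 0;
}
```

The resulting structure can then be passed to bind as (struct sockaddr *)&addr with sizeof(addr). A failure with EADDRNOTAVAIL at that point indicates that no transport could be associated with the address.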

Connecting

In the default mode of operation, RDS uses unconnected sockets, and the destination address is specified as an argument to the sendmsg subroutine. However, RDS allows sockets to be connected to a remote end point by using the connect subroutine. If a socket is connected, you can call the sendmsg subroutine without specifying a destination address and the subroutine uses the remote address that was previously provided.
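The two addressing modes differ only in whether the msg_name field is filled in. The following sketch shows one way to wrap both cases; rds_send_one is a hypothetical helper name, and only standard sendmsg fields are used.

```c
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: send one datagram with sendmsg().
 * dest == NULL: connected mode, the address given to connect() is used.
 * dest != NULL: unconnected mode, the destination is supplied per call.
 * Returns the number of bytes queued, or -1 with errno set.
 */
ssize_t rds_send_one(int fd, const struct sockaddr_in *dest,
                     const void *payload, size_t len, int flags)
{
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = (void *)payload;     /* sendmsg() does not modify it */
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (dest != NULL) {
        msg.msg_name = (void *)dest;
        msg.msg_namelen = sizeof(*dest);
    }
    return sendmsg(fd, &msg, flags);
}
```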

Congestion Control

RDS does not have an explicit congestion control mechanism like the common streaming protocols such as TCP. Instead, each socket has two queue limits: the send queue size and the receive queue size. Messages are accounted based on the number of bytes of payload.

The send queue size limits the data that the local processes can queue on a local socket. If the limit is exceeded, the kernel does not accept further messages until space is freed in the queue, which happens when messages are delivered and acknowledged by the remote host.

The receive queue size limits the data that RDS stores on the receive queue of a socket before marking the socket as congested. When a socket becomes congested, RDS sends a congestion map update to the other participating hosts, which are then expected to stop sending more messages to this port.

There is a timing window during which a remote host can continue to send messages to a congested port. RDS resolves the timing window by accepting messages even when the receive queue of the socket exceeds the limit.

When the application removes incoming messages from the receive queue by using the recvmsg system call, the number of bytes on the receive queue eventually drops below the receive queue size. At that point the port is marked as uncongested, and a congestion update is sent to all the participating hosts.
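The receive-side accounting described above can be pictured with a small model. This is a toy illustration only, not the kernel's data structure: the type and function names are hypothetical, and the exact comparison at the limit (strictly greater than to become congested, strictly less than to recover) is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the receive queue of one port. */
struct rx_queue_model {
    size_t limit;       /* receive queue size (SO_RCVBUF) */
    size_t queued;      /* payload bytes waiting for recvmsg */
    bool congested;     /* state last advertised to the peers */
};

/*
 * A message arrives. It is always accepted, even past the limit, to cover
 * senders that have not yet seen the congestion update.
 * Returns true if a congestion update would be sent.
 */
bool rx_model_enqueue(struct rx_queue_model *q, size_t payload)
{
    q->queued += payload;
    if (!q->congested && q->queued > q->limit) {
        q->congested = true;
        return true;
    }
    return false;
}

/*
 * The application consumes a message with recvmsg.
 * Returns true if the port becomes uncongested and an update would be sent.
 */
bool rx_model_dequeue(struct rx_queue_model *q, size_t payload)
{
    q->queued = payload > q->queued ? 0 : q->queued - payload;
    if (q->congested && q->queued < q->limit) {
        q->congested = false;
        return true;
    }
    return false;
}
```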

The values for the send buffer size and receive buffer size can be tuned by the application through the SO_SNDBUF and SO_RCVBUF socket options.
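As a sketch of how an application might tune both limits, the following uses only the standard setsockopt call with the SOL_SOCKET level. The helper name rds_set_queue_limits is hypothetical, and suitable sizes depend on the workload.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: set the send and receive queue limits, in bytes.
 * Returns 0 on success, or -1 with errno set by setsockopt().
 */
int rds_set_queue_limits(int fd, int snd_bytes, int rcv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &snd_bytes, sizeof(snd_bytes)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcv_bytes, sizeof(rcv_bytes)) != 0)
        return -1;
    return 0;
}
```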

Blocking Behavior

The sendmsg and recvmsg calls can block in various situations. Whether a call blocks or returns with an error depends on the non-blocking setting of the file descriptor and the MSG_DONTWAIT message flag. If the file descriptor is set to blocking mode (which is the default), and the MSG_DONTWAIT flag is not specified, the call blocks.

The SO_SNDTIMEO and SO_RCVTIMEO socket options are used to specify a timeout (in seconds) after which the call ends and returns an error. The default timeout is 0, which allows RDS to block indefinitely.
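A send timeout might be installed as follows. This sketch assumes that the option value is passed as a struct timeval, as it is for SO_SNDTIMEO on other socket types; rds_set_send_timeout is a hypothetical helper name.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: limit how long a blocking sendmsg() may wait.
 * A value of 0 seconds restores the default of blocking indefinitely.
 * Returns 0 on success, or -1 with errno set.
 */
int rds_set_send_timeout(int fd, long seconds)
{
    struct timeval tv;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

The receive timeout can be set in the same way with the SO_RCVTIMEO option.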

Message Transmission

Messages can be sent by using the sendmsg call after the RDS socket is bound. Message length cannot exceed 4 GB as the wire protocol uses an unsigned 32-bit integer to express the message length.

RDS does not support out-of-band data. Applications can send data to unicast addresses only; broadcast and multicast are not supported.

A successful sendmsg call places the message in the transmit queue of the socket where it remains until the destination acknowledges that the message is no longer in the network or the application removes the message from the send queue.

Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO socket option.

While a message is in the transmit queue, its payload bytes are counted against the send queue size. If an attempt is made to send a message when the transmit queue is full, the call blocks or returns the EAGAIN value.

When messages are sent to a destination that is marked as congested, the call blocks or returns the ENOBUFS value.
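Because the two error values call for different reactions, an application may want to tell them apart explicitly. The following is a minimal sketch; the enumeration and the function name are hypothetical and are not part of the RDS headers.

```c
#include <errno.h>

/* Hypothetical classification of the result of a sendmsg() call. */
enum rds_send_outcome {
    RDS_SEND_OK,            /* message was placed in the transmit queue */
    RDS_SEND_QUEUE_FULL,    /* local transmit queue is full: EAGAIN */
    RDS_SEND_CONGESTED,     /* destination port is congested: ENOBUFS */
    RDS_SEND_FAILED         /* any other error */
};

/* rc is the sendmsg() return value, err is errno captured right after it. */
enum rds_send_outcome rds_classify_send(long rc, int err)
{
    if (rc >= 0)
        return RDS_SEND_OK;
    if (err == EAGAIN)
        return RDS_SEND_QUEUE_FULL;
    if (err == ENOBUFS)
        return RDS_SEND_CONGESTED;
    return RDS_SEND_FAILED;
}
```

A full local queue drains as the remote host acknowledges messages, while a congested destination recovers only when the remote application reads from its socket.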

A message that is sent with no payload bytes does not consume any space in the send buffer of the destination, but it is still delivered to the destination as a message. The receiver gets no payload data but can see the address of the sender.

Messages sent to a port to which no socket is bound are discarded by the destination host. No error messages are reported to the sender.

Message Receipt

Messages can be received with the recvmsg call on an RDS socket after it is bound to a source address. RDS returns messages in the same order that the sender sent the messages.

The address of the sender is returned in the sockaddr_in structure pointed to by the msg_name field, if the field is set.

If the MSG_PEEK flag is set, the first message on the receive queue is returned without removing the message from the queue.

The memory that is used by messages waiting to be delivered does not limit the number of messages that can be queued to be received. RDS attempts to control congestion.

If the length of the message exceeds the size of the buffer that is provided to the recvmsg call, the remaining bytes in the message are discarded and the MSG_TRUNC flag is set in the msg_flags field. In this case, the recvmsg call returns the number of bytes copied, not the length of the entire message. If MSG_TRUNC is set in the flags argument to recvmsg, the call returns the number of bytes in the entire message instead. You can therefore view the size of the next message in the receive queue, without a truncating copy, by providing a zero-length buffer and setting the MSG_PEEK and MSG_TRUNC options in the flags argument.
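The size query described above, which combines a zero-length buffer with the MSG_PEEK and MSG_TRUNC flags, might be wrapped as follows. This is a sketch under the stated behavior; rds_next_message_size is a hypothetical helper name.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: return the full length of the next queued message
 * without removing it from the queue, or -1 with errno set.
 * The call blocks on an empty queue unless extra_flags has MSG_DONTWAIT.
 */
ssize_t rds_next_message_size(int fd, int extra_flags)
{
    char dummy;
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = &dummy;
    iov.iov_len = 0;                /* zero-length buffer: copy nothing */

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC | extra_flags);
}
```

The returned length can be used to allocate a buffer of the right size before the message is actually received.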

The sending address of a zero-length message is provided in the msg_name field.

Control Messages

RDS uses control messages, that is, ancillary data passed through the msg_control and msg_controllen fields in the sendmsg and recvmsg calls. Control messages that are generated by RDS have a cmsg_level value of SOL_RDS. Most control messages are related to the zerocopy interface added in RDS version 3, and are described in the rds-rdma subroutine.

The only exception is the RDS_CMSG_CONG_UPDATE message.

Polling

Support for the poll interface is limited. POLLIN is returned when there is an RDS message, or a control message waiting in the receive queue of the socket. POLLOUT is returned when there is space on the send queue of the socket.

Sending messages to congested ports requires special handling. When an application tries to send a message to a congested destination, the system call returns the ENOBUFS value. The application cannot poll for POLLOUT in this case because the transmit queue can still accommodate messages, so the call to the poll interface might return immediately even though the destination is congested.

Use one of the following methods to handle the congestion:

Poll for the POLLIN option. By default, a process sleeping in the poll interface is activated when the congestion map is updated. The application can retry any previously congested send operation.
Use explicit congestion monitoring, which gives the application greater control.
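The first method, polling for POLLIN and then retrying the send, can be sketched as follows. Both function names are hypothetical, and the retry limit and timeout are illustrative parameters rather than values prescribed by RDS.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: wait until fd reports POLLIN or the timeout expires.
 * Returns 1 if POLLIN was reported, 0 on timeout, -1 on error.
 */
int rds_wait_pollin(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Hypothetical helper: non-blocking send that waits for a wakeup and
 * retries while the destination is congested (ENOBUFS), at most
 * max_retries times. Returns the sendmsg() result of the last attempt.
 */
ssize_t rds_send_retry(int fd, const struct msghdr *msg,
                       int max_retries, int timeout_ms)
{
    int attempt;

    for (attempt = 0; ; attempt++) {
        ssize_t n = sendmsg(fd, msg, MSG_DONTWAIT);

        if (n >= 0 || errno != ENOBUFS || attempt >= max_retries)
            return n;
        if (rds_wait_pollin(fd, timeout_ms) < 0)
            return -1;
    }
}
```

POLLIN is also reported when a data message is waiting, so a wakeup does not prove that the destination has recovered. The retry count is bounded for that reason.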

With explicit monitoring, the application polls for the POLLIN option as before, and additionally uses the RDS_CONG_MONITOR socket option to install a 64-bit mask value in the socket, where each bit corresponds to a group of ports. When a congestion update is received, RDS checks the set of ports that became uncongested against the bit mask that is installed in the socket. If they overlap, a control message is enqueued on the socket, and the application is activated. When the recvmsg call is made, RDS returns the control message that contains the bitmap.

The congestion monitor bitmask can be set and queried by using the setsockopt call with the RDS_CONG_MONITOR option, and a pointer to the 64-bit mask variable.

Congestion updates are delivered to the application through the RDS_CMSG_CONG_UPDATE control messages. These control messages are always delivered on their own, never together with an RDS data message. The cmsg_data field of the control message is an 8-byte value that contains the 64-bit mask.
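Extracting the mask from a received msghdr structure follows the usual ancillary-data walk. In the sketch below, rds_find_u64_cmsg is a hypothetical helper that takes the level and type as parameters, so that the caller can pass SOL_RDS and RDS_CMSG_CONG_UPDATE from the RDS headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: look for a control message with the given level
 * and type that carries a 64-bit value.
 * Returns 1 and stores the value in *out if found, 0 otherwise.
 */
int rds_find_u64_cmsg(struct msghdr *msg, int level, int type, uint64_t *out)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == level && c->cmsg_type == type &&
            c->cmsg_len >= CMSG_LEN(sizeof(uint64_t))) {
            /* copy instead of casting: CMSG_DATA may be unaligned */
            memcpy(out, CMSG_DATA(c), sizeof(uint64_t));
            return 1;
        }
    }
    return 0;
}
```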

Applications can use the following macros to test for and set bits in the bitmask:

#define RDS_CONG_MONITOR_SIZE   64
#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
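The mapping from ports to mask bits can be illustrated with plain functions that perform the same arithmetic as the macros. The function names are hypothetical. The shift is done on a 64-bit constant so that bits 32 to 63 can be represented.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit position for a port: ports are folded into 64 groups. */
unsigned int cong_bit(uint16_t port)
{
    return (unsigned int)port % 64u;
}

/* Build the mask that covers every port in the array. */
uint64_t cong_mask_for_ports(const uint16_t *ports, size_t n)
{
    uint64_t mask = 0;
    size_t i;

    for (i = 0; i < n; i++)
        mask |= (uint64_t)1 << cong_bit(ports[i]);
    return mask;
}

/* Nonzero if a received update mask covers the group of the given port. */
int cong_update_mentions(uint64_t update, uint16_t port)
{
    return (update & ((uint64_t)1 << cong_bit(port))) != 0;
}
```

Because 65536 ports share 64 bits, a set bit means only that some port in the group became uncongested. For example, ports 4000 and 4064 both map to bit 32, so an update for one also wakes an application that monitors the other.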

Canceling Messages

An application can cancel messages from the send queue by using the RDS_CANCEL_SENT_TO socket option with the setsockopt call. The setsockopt call uses an optional sockaddr_in address structure as an argument. Only messages to the destination address that is specified by the sockaddr_in address are discarded. If no address is provided, all pending messages are discarded.

Note: This call affects messages that are not yet transmitted as well as messages that are transmitted but not yet acknowledged by the remote host.

Reliability

If the sendmsg succeeds, RDS guarantees that the message is visible to recvmsg on a socket that is bound to the destination address as long as that destination socket remains open.

If there is no socket bound on the destination, the message is dropped. If the sending RDS cannot be sure that no socket is bound, it tries to send the message indefinitely until it is sure or the sent message is canceled.

If a socket is closed, the pending sent messages on the socket are canceled and might or might not be seen by the receiver.

The RDS_CANCEL_SENT_TO socket option can be used to cancel all the pending messages to a given destination.

If a receiving socket is closed with pending messages, then the sender considers those messages as having left the network and will not retransmit them.

A message is seen by the recvmsg call only once, unless the MSG_PEEK flag is specified. When the message is delivered, it is removed from the transmit queue of the sending socket.

All messages sent from the same socket to the same destination are delivered in the order they are sent. Messages sent from different sockets, or to different destinations, can be delivered in any order.