rds Subroutine
Purpose
Reliable Datagram Sockets (RDS) provides reliable, in-order datagram delivery between sockets across a variety of network transports.
Library
#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/bypass.h>
#include <net/rds_rdma.h>
Description
RDS is an implementation of the RDS Application Programming Interface (API). RDS can be transported through InfiniBand and loopback. RDS through TCP is disabled. RDS uses the standard AF_INET addresses to identify the endpoints.
Socket Creation
rds_socket = socket(AF_BYPASS, SOCK_SEQPACKET, BYPASSPROTO_RDS);
Socket Options
- SO_RCVBUF
- Specifies the size of the receive buffer. See Congestion Control.
- SO_SNDBUF
- Specifies the size of the send buffer. See Message Transmission.
- SO_SNDTIMEO
- Specifies the send timeout of the socket when you enqueue a message on a socket with a full queue in the blocking mode.
Protocol-specific options, such as RDS_CANCEL_SENT_TO and RDS_CONG_MONITOR, are set at the SOL_RDS socket level.
Binding
A new RDS socket has no local address when it is first returned from the socket call. The socket must be bound to a local address with the bind system call before any messages can be sent or received. The bind call also attaches the socket to a specific network transport, based on the type of interface that the local address is attached to. From that point on, the socket can reach only the destinations that are available through this transport.
For instance, when the socket is bound to the address of an InfiniBand interface, such as ib0, it uses the InfiniBand transport. If RDS cannot associate a transport with the given address, the bind call fails with the EADDRNOTAVAIL error.
An RDS socket can be bound to only one address, and only one socket can be bound to a specific address and port pair. If no port is specified in the binding address, an unbound port is selected at random.
RDS does not permit the application to bind a previously bound socket to another address. Binding to the INADDR_ANY wildcard address is not allowed.
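As a rough sketch of the binding rules above, the following C fragment prepares a sockaddr_in structure for the bind call and rejects the INADDR_ANY wildcard before the call is made. The helper names rds_fill_bind_addr and rds_open_bound are hypothetical. The part that opens the socket is compiled only on AIX, where sys/bypass.h is assumed to define AF_BYPASS and BYPASSPROTO_RDS as shown in the Library and Socket Creation sections. Treating port 0 as "no port specified" is also an assumption.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#ifdef _AIX
#include <sys/bypass.h>   /* assumed to provide AF_BYPASS and BYPASSPROTO_RDS */
#endif

/* Hypothetical helper: fill *out for bind(). Returns 0 on success, or -1
 * with errno set to EINVAL if the text is not an IPv4 address or is the
 * INADDR_ANY wildcard, which RDS does not accept. */
int rds_fill_bind_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);   /* assumption: 0 means "no port specified" */
    if (inet_pton(AF_INET, ip, &out->sin_addr) != 1 ||
        out->sin_addr.s_addr == htonl(INADDR_ANY)) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

#ifdef _AIX
/* Hypothetical helper: create an RDS socket and bind it exactly once.
 * Returns the descriptor, or -1 with errno set (for example EADDRNOTAVAIL
 * when no transport matches the address). */
int rds_open_bound(const char *ip, uint16_t port)
{
    struct sockaddr_in sa;
    int fd, saved;

    if (rds_fill_bind_addr(ip, port, &sa) != 0)
        return -1;
    fd = socket(AF_BYPASS, SOCK_SEQPACKET, BYPASSPROTO_RDS);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
        saved = errno;             /* keep the bind error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
#endif
```

Because a bound socket cannot be rebound, a caller would typically keep the returned descriptor for the lifetime of that local address.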
Connecting
In its default mode of operation, RDS uses unconnected sockets, and the destination address is specified as an argument to the sendmsg subroutine. However, RDS also allows a socket to be connected to a remote endpoint by using the connect subroutine. If a socket is connected, you can call the sendmsg subroutine without specifying a destination address, and the subroutine uses the remote address that was previously provided.
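The difference between the two modes comes down to whether the msg_name field is filled in. The sketch below shows one way to build the msghdr structure for both cases. The function names are hypothetical, and the descriptor is assumed to be an RDS socket that is already bound.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: describe one datagram for sendmsg().
 * dest == NULL means "use the address given earlier to connect()". */
void rds_prepare_msg(struct msghdr *mh, struct iovec *iov,
                     void *payload, size_t len, struct sockaddr_in *dest)
{
    memset(mh, 0, sizeof(*mh));
    iov->iov_base = payload;
    iov->iov_len = len;
    mh->msg_iov = iov;
    mh->msg_iovlen = 1;
    if (dest != NULL) {
        mh->msg_name = dest;
        mh->msg_namelen = sizeof(*dest);
    }
}

/* Unconnected use: every call names its destination. */
ssize_t rds_send_to(int fd, void *payload, size_t len,
                    struct sockaddr_in *dest)
{
    struct msghdr mh;
    struct iovec iov;

    rds_prepare_msg(&mh, &iov, payload, len, dest);
    return sendmsg(fd, &mh, 0);
}

/* Connected use: connect() was called once, so no address is passed. */
ssize_t rds_send_connected(int fd, void *payload, size_t len)
{
    struct msghdr mh;
    struct iovec iov;

    rds_prepare_msg(&mh, &iov, payload, len, NULL);
    return sendmsg(fd, &mh, 0);
}
```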
Congestion Control
RDS does not have an explicit congestion control mechanism like common streaming protocols such as TCP. Instead, each socket has two queue limits: the send queue size and the receive queue size. Messages are accounted for based on the number of payload bytes.
The send queue size limits the amount of data that local processes can queue on a local socket. If the limit is exceeded, the kernel does not accept further messages until the queue drains, that is, until queued messages are delivered to and acknowledged by the remote host.
The receive queue size limits the amount of data that RDS stores on the receive queue of a socket before marking the socket as congested. When a socket becomes congested, RDS sends a congestion map update to the other participating hosts, which are then expected to stop sending more messages to this port.
There is a timing window during which a remote host can continue to send messages to a congested port. RDS resolves the timing window by accepting messages even when the receive queue of the socket exceeds the limit.
When the application removes incoming messages from the receive queue by using the recvmsg system call and the number of bytes on the receive queue drops below the receive queue size, the port is marked as uncongested and a congestion update is sent to all the participating hosts.
The send buffer size and receive buffer size can be tuned by the application through the SO_SNDBUF and SO_RCVBUF socket options.
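Tuning the two queue limits might look like the following sketch. It assumes that, as with other socket types, both options are set at the SOL_SOCKET level and take an int byte count. The helper name is hypothetical.

```c
#include <sys/socket.h>

/* Hypothetical helper: set the send and receive queue limits, in bytes.
 * Returns 0 on success, or -1 with errno set by setsockopt(). */
int rds_set_queue_limits(int fd, int send_bytes, int recv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &send_bytes, sizeof(send_bytes)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &recv_bytes, sizeof(recv_bytes)) != 0)
        return -1;
    return 0;
}
```

A larger receive limit delays the point at which the port is marked as congested, at the cost of more memory held for undelivered messages.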
Blocking Behavior
The sendmsg and recvmsg calls can block in various situations. Whether a call blocks or returns with an error depends on the non-blocking setting of the file descriptor and on the MSG_DONTWAIT message flag. If the file descriptor is in blocking mode (the default) and the MSG_DONTWAIT flag is not specified, the call blocks.
The SO_SNDTIMEO and SO_RCVTIMEO socket options are used to specify a timeout (in seconds) after which the call ends and returns an error. The default timeout is 0, which allows RDS to block indefinitely.
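A minimal sketch of bounding how long a blocking call can wait is shown below. It assumes that the timeout options follow the usual sockets convention of taking a struct timeval at the SOL_SOCKET level. The helper name is hypothetical, and a value of 0 keeps the default of blocking indefinitely.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: set send and receive timeouts in whole seconds.
 * Returns 0 on success, or -1 with errno set by setsockopt(). */
int rds_set_timeouts(int fd, long send_secs, long recv_secs)
{
    struct timeval snd = { .tv_sec = send_secs, .tv_usec = 0 };
    struct timeval rcv = { .tv_sec = recv_secs, .tv_usec = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &snd, sizeof(snd)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &rcv, sizeof(rcv)) != 0)
        return -1;
    return 0;
}
```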
Message Transmission
Messages can be sent by using the sendmsg call after the RDS socket is bound. Message length cannot exceed 4 GB as the wire protocol uses an unsigned 32-bit integer to express the message length.
RDS does not support out-of-band data. Applications can send data to unicast addresses only; broadcast and multicast are not supported.
A successful sendmsg call places the message in the transmit queue of the socket where it remains until the destination acknowledges that the message is no longer in the network or the application removes the message from the send queue.
Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO socket option.
While a message is in the transmit queue, its payload bytes are counted against the send queue size. If an attempt is made to send a message when the transmit queue is full, the call blocks or returns the EAGAIN value.
When a message is sent to a destination that is marked as congested, the call blocks or returns the ENOBUFS value.
A message that is sent with no payload bytes does not require any space in the send buffer of the destination, but the destination still receives the message. The receiver gets no payload data but can see the address of the sender.
Messages sent to a port to which no socket is bound are discarded by the destination host. No error is reported to the sender.
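The two non-blocking failure cases described in this section call for different reactions: a full local queue (EAGAIN) clears as the peer acknowledges messages, while a congested destination (ENOBUFS) clears only after a congestion update. The sketch below separates them. The enum and function names are hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>

enum rds_send_result {
    RDS_SEND_OK,              /* message was queued                   */
    RDS_SEND_QUEUE_FULL,      /* EAGAIN: local transmit queue is full */
    RDS_SEND_DEST_CONGESTED,  /* ENOBUFS: destination port congested  */
    RDS_SEND_FAILED           /* any other error                      */
};

/* Hypothetical helper: map a sendmsg() return value and errno to a result. */
enum rds_send_result rds_classify_send(ssize_t rc, int err)
{
    if (rc >= 0)
        return RDS_SEND_OK;
    if (err == EAGAIN)
        return RDS_SEND_QUEUE_FULL;
    if (err == ENOBUFS)
        return RDS_SEND_DEST_CONGESTED;
    return RDS_SEND_FAILED;
}

/* Hypothetical helper: one non-blocking send attempt on a bound socket. */
enum rds_send_result rds_try_send(int fd, const struct msghdr *mh)
{
    ssize_t rc = sendmsg(fd, mh, MSG_DONTWAIT);

    return rds_classify_send(rc, rc < 0 ? errno : 0);
}
```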
Message Receipt
Messages can be received with the recvmsg call on an RDS socket after it is bound to a source address. RDS returns messages in the same order in which the sender sent them.
The address of the sender is returned in the sockaddr_in structure pointed to by the msg_name field, if the field is set.
If the MSG_PEEK flag is set, the first message on the receive queue is returned without removing the message from the queue.
The memory that is used by messages waiting for delivery does not limit the number of messages that can be queued for receipt. Instead, RDS attempts to limit the amount of queued data through the mechanism described in Congestion Control.
If the length of the message exceeds the size of the buffer that is provided to the recvmsg call, the remaining bytes in the message are discarded and the MSG_TRUNC flag is set in the msg_flags field. In this case, the recvmsg call returns the number of bytes copied, not the length of the entire message.
If MSG_TRUNC is set in the flags argument to recvmsg, the call returns the number of bytes in the entire message. You can therefore check the size of the next message in the receive queue without copying any data by providing a zero-length buffer and setting the MSG_PEEK and MSG_TRUNC flags in the flags argument.
The sending address of a zero-length message is provided in the msg_name field.
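The size probe described above can be sketched as a single recvmsg call with no buffer at all. The helper name is hypothetical, and the descriptor is assumed to be a bound RDS socket. In blocking mode, the call waits if the receive queue is empty.

```c
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: return the full length of the next queued message
 * without removing it or copying any payload, or -1 with errno set. */
ssize_t rds_peek_next_size(int fd)
{
    struct msghdr mh;

    memset(&mh, 0, sizeof(mh));   /* no iovec: a zero-length buffer */
    return recvmsg(fd, &mh, MSG_PEEK | MSG_TRUNC);
}
```

A caller could use the returned length to allocate a buffer of the right size before issuing the real recvmsg call, which avoids truncation.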
Control Messages
RDS uses control messages (ancillary data), which are passed through the msg_control and msg_controllen fields in the sendmsg and recvmsg calls. Control messages that are generated by RDS have a cmsg_level value of SOL_RDS. Most control messages are related to the zerocopy interface added in RDS version 3, and are described in the rds-rdma subroutine.
The only exception is the RDS_CMSG_CONG_UPDATE message.
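Ancillary data is usually walked with the standard CMSG_FIRSTHDR and CMSG_NXTHDR macros. The sketch below looks for one control message and copies out a 64-bit payload, which is the shape of the RDS_CMSG_CONG_UPDATE message. The helper name is hypothetical. The level and type are passed in by the caller because the SOL_RDS and RDS_CMSG_CONG_UPDATE constants are assumed to come from net/rds_rdma.h.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: scan the ancillary data of a received message for a
 * control message with the given level and type, and copy its 64-bit
 * payload to *value. Returns 1 if found, 0 otherwise. */
int rds_find_cmsg_u64(struct msghdr *mh, int level, int type, uint64_t *value)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(mh); c != NULL; c = CMSG_NXTHDR(mh, c)) {
        if (c->cmsg_level == level && c->cmsg_type == type &&
            c->cmsg_len >= CMSG_LEN(sizeof(*value))) {
            /* memcpy avoids any alignment assumption about CMSG_DATA */
            memcpy(value, CMSG_DATA(c), sizeof(*value));
            return 1;
        }
    }
    return 0;
}
```

On a system that provides the RDS headers, the call would presumably take the form rds_find_cmsg_u64(&mh, SOL_RDS, RDS_CMSG_CONG_UPDATE, &mask) after a successful recvmsg call.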
Polling
Support for the poll interface is limited. POLLIN is returned when there is an RDS message or a control message waiting in the receive queue of the socket. POLLOUT is returned when there is space on the send queue of the socket.
Sending messages to congested ports requires special handling. When an application tries to send a message to a congested destination, the system call returns the ENOBUFS value. However, the application cannot poll for POLLOUT, because the transmit queue probably still has space, so the call to the poll interface might return immediately even though the destination is still congested. The application can handle this situation in either of the following ways:
- Poll for the POLLIN option. By default, a process sleeping in the poll interface is woken up when the congestion map is updated. The application can then retry any previously congested send operation.
- Monitor congestion explicitly, which gives the application finer control. With explicit monitoring, the application polls for the POLLIN option as before, and additionally uses the RDS_CONG_MONITOR socket option to install a 64-bit mask value in the socket, where each bit corresponds to a group of ports. When a congestion update is received, RDS checks the set of ports that became uncongested against the bit mask that is installed in the socket. If they overlap, a control message is enqueued on the socket, and the application is woken up. When the application calls recvmsg, RDS returns the control message that contains the bitmap.
The congestion monitor bitmask can be set and queried by using the setsockopt call with the RDS_CONG_MONITOR option, and a pointer to the 64-bit mask variable.
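The first approach, waiting on POLLIN and then retrying the send, can be sketched as follows. The function names are hypothetical. The retry is attempted only once because POLLIN can also signal ordinary data, which the caller is expected to drain with recvmsg before waiting again.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: wait until the socket reports POLLIN.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int rds_wait_pollin(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&p, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Hypothetical helper: try to send; if the destination is congested, wait
 * for a wake-up on POLLIN and retry a single time. */
ssize_t rds_send_retry_once(int fd, const struct msghdr *mh, int timeout_ms)
{
    ssize_t rc = sendmsg(fd, mh, MSG_DONTWAIT);

    if (rc >= 0 || errno != ENOBUFS)
        return rc;
    if (rds_wait_pollin(fd, timeout_ms) != 1) {
        errno = ENOBUFS;   /* no wake-up seen: treat as still congested */
        return -1;
    }
    return sendmsg(fd, mh, MSG_DONTWAIT);
}
```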
Congestion updates are delivered to the application through RDS_CMSG_CONG_UPDATE control messages. These control messages are always delivered by themselves and never together with an RDS data message. The cmsg_data field of the control message is an 8-byte value that contains the 64-bit mask value.
#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port) (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
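To illustrate how ports map onto the 64-bit mask, the sketch below builds a mask for a set of ports and checks whether a received update overlaps it. The function names are hypothetical and are defined locally so that the fragment stands alone. The shift is done on a 64-bit operand so that bit positions 32 through 63 are well defined. Passing ports in host byte order is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define CONG_GROUPS 64u   /* mirrors RDS_CONG_MONITOR_SIZE */

/* Hypothetical helper: mask bit for the group that a port belongs to. */
uint64_t rds_cong_bit(uint16_t port)
{
    return (uint64_t)1 << (port % CONG_GROUPS);
}

/* Hypothetical helper: combined mask for every port of interest. */
uint64_t rds_cong_mask(const uint16_t *ports, size_t n)
{
    uint64_t mask = 0;
    size_t i;

    for (i = 0; i < n; i++)
        mask |= rds_cong_bit(ports[i]);
    return mask;
}

/* Hypothetical helper: nonzero if a congestion update touches any group
 * that is present in the installed mask. */
int rds_cong_update_matches(uint64_t installed, uint64_t update)
{
    return (installed & update) != 0;
}
```

Because ports share groups (for example, ports 1, 65, and 129 all map to bit 1), a matching update means only that some port in the group became uncongested. The application still has to retry the send to find out whether its own destination is clear.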
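Canceling Messages
An application can cancel pending messages from the send queue by using the RDS_CANCEL_SENT_TO socket option with the setsockopt call. The call takes an optional sockaddr_in address structure as an argument. Only messages to the destination address that is specified by the sockaddr_in address are discarded. If no address is provided, all pending messages are discarded.
Reliability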
If the sendmsg call succeeds, RDS guarantees that the message is visible to the recvmsg call on a socket that is bound to the destination address, as long as that destination socket remains open.
If no socket is bound on the destination, the message is dropped. If the sending RDS cannot be sure that no socket is bound, it keeps trying to send the message until it can be sure or until the sent message is canceled.
If a socket is closed, the pending sent messages on the socket are canceled and might or might not be seen by the receiver.
The RDS_CANCEL_SENT_TO socket option can be used to cancel all the pending messages to a given destination.
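Canceling the pending messages to one destination might look like the sketch below. The helper name is hypothetical. The option level and name are passed in by the caller because SOL_RDS and RDS_CANCEL_SENT_TO are assumed to come from net/rds_rdma.h. How the "no address" form is expressed is not shown, because this page does not specify it.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: discard pending sends to one destination.
 * On a system with the RDS headers, level and optname would presumably be
 * SOL_RDS and RDS_CANCEL_SENT_TO. Returns 0, or -1 with errno set. */
int rds_cancel_to(int fd, int level, int optname,
                  const struct sockaddr_in *dest)
{
    return setsockopt(fd, level, optname, dest, sizeof(*dest));
}
```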
If a receiving socket is closed with pending messages, then the sender considers those messages as having left the network and will not retransmit them.
A message is seen by the recvmsg call only once, unless the MSG_PEEK flag is specified. After the message is delivered, it is removed from the transmit queue of the sending socket.
All messages sent from the same socket to the same destination are delivered in the order in which they are sent. Messages sent from different sockets, or to different destinations, can be delivered in any order.