Transmission Control Protocol

Introduction

Transmission Control Protocol (TCP) is a means for building a reliable communications stream on top of the unreliable packet Internet Protocol (IP). TCP is the protocol that supports nearly all Internet applications. The combination of TCP and IP is referred to as TCP/IP and many people imagine, incorrectly, that TCP/IP is a single protocol.
The basic method of operation involves:

wrapping higher level application data in segments
wrapping the segments into IP datagrams
associating port numbers with particular applications
associating a sequence number with every byte in the data stream
exchanging special segments to start up and close down a data flow between two hosts
using acknowledgments and timeouts to ensure the integrity of the data flow


Segment Format

TCP segments are constructed from 32 bit words and include a 20 byte (5 word) header. The header fields are described below in the order in which they appear in the segment.




source port number
The source (and destination) port numbers are used for demultiplexing the data stream to applications. It is entirely possible for there to be multiple simultaneous TCP data streams between two hosts. A TCP data stream is uniquely identified by a group of four numbers: the two host addresses and the two port numbers.

The source port number is the one to be used as destination in any reply to the segment.

destination port number
This is the "target" port number on the remote system.


sequence number
This 32 bit number identifies the first byte of the data in the segment.


acknowledgment number
This 32 bit number is the byte number of the next byte that the sender expects to receive from the remote host. The remote host can infer that all bytes up to this number minus one have been safely received and the remote host's local copies can be discarded.


header length
This 4-bit field specifies the header length in 32 bit words. Clearly the maximum value is 15 words (60 bytes), allowing for 10 words (40 bytes) of options.


flag bits
This group of 6 bits identifies various special states in the protocol. Several of the bits may be set simultaneously. The bits are discussed in more detail later.


URG Indicates that the Urgent pointer is valid, i.e. there is urgent data.
ACK The acknowledgment number is valid. This will usually be set.
PSH The data should be passed to the application as soon as possible. This will typically involve flushing buffers.
RST Reset the connection. This involves marking the sequence numbers as invalid.
SYN The synchronize bit is used to establish initial agreement on the sequence numbers.
FIN The sender has finished sending data. This fact will normally be passed on to the application as a close (end-of-file) indication.



window size
This is, basically, the amount of space that the receiver has available for the storage of unacknowledged data. The units are bytes unless the window scale factor option is used. The maximum value is 65535.


checksum
This covers both the header and the data. It is calculated by prepending a pseudo-header to the TCP segment. The pseudo-header consists of three 32 bit words (12 bytes) which contain the source and destination IP addresses, a byte set to 0, a byte set to 6 (the protocol number for TCP in an IP datagram header) and the segment length (in bytes). The checksum field of the TCP segment is set to zero and, if the segment length is odd, a zero byte is appended for the purposes of the calculation. The following algorithm is then applied to the prepended segment treated as a sequence of 16 bit (unsigned) words.


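/* Sum the prepended segment, taken as nwords unsigned 16 bit words. */
unsigned short tcp_cksum(const unsigned short *sptr, unsigned long nwords)
{
    unsigned long cksum = 0;

    while (nwords-- > 0)
        cksum += *sptr++;
    cksum = (cksum >> 16) + (cksum & 0xffff);   /* fold the carries back in */
    cksum += (cksum >> 16);                     /* fold any carry from the fold */
    return (unsigned short)(~cksum & 0xffff);   /* one's complement of the sum */
}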

Note: The inclusion of the IP addresses in the checksum calculation means that TCP cannot easily be carried on lower level protocols other than IP (version 4).

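As a rough illustration of the pseudo-header described above, the following sketch assembles the 12 pseudo-header bytes and sums them together with the segment. It works byte by byte so that the result does not depend on the byte order of the host. The names sum16 and tcp_checksum_ipv4 are hypothetical and are not taken from any particular implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add a byte buffer to a running sum as big-endian 16 bit words.
 * An odd trailing byte is treated as if followed by a zero byte. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* Hypothetical helper: checksum of a TCP segment carried over IPv4.
 * The caller is assumed to have zeroed the checksum field (bytes 16-17). */
uint16_t tcp_checksum_ipv4(const uint8_t src[4], const uint8_t dst[4],
                           const uint8_t *segment, uint16_t seg_len)
{
    uint8_t pseudo[12];
    uint32_t sum;

    memcpy(pseudo, src, 4);                 /* source IP address */
    memcpy(pseudo + 4, dst, 4);             /* destination IP address */
    pseudo[8] = 0;                          /* byte set to 0 */
    pseudo[9] = 6;                          /* protocol number for TCP */
    pseudo[10] = (uint8_t)(seg_len >> 8);   /* segment length in bytes */
    pseudo[11] = (uint8_t)(seg_len & 0xff);

    sum = sum16(pseudo, sizeof pseudo, 0);
    sum = sum16(segment, seg_len, sum);
    while (sum >> 16)                       /* fold carries into 16 bits */
        sum = (sum >> 16) + (sum & 0xffff);
    return (uint16_t)(~sum & 0xffff);
}
```

A receiver can apply the same routine to an incoming segment without clearing the checksum field; for an undamaged segment the result is zero.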
urgent pointer
This is part of TCP's mechanism for sending urgent data that will overtake the normal data stream. If the URG flag bit is set this field indicates the position within the data of the last byte of the urgent data. There is no way of indicating where the urgent data starts.


options
There are a number of options defined in various RFCs. The most useful is the Maximum Segment Size (MSS) specification facility. The format of options will be discussed later.
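The field descriptions above can be summarised by a small parsing routine. This is only a sketch: the structure tcp_header, the helper names and the TCP_* constants are invented for illustration, and options are not decoded. Fields are read one byte at a time because they are transmitted most significant byte first.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits as they sit in the low 6 bits of header byte 13. */
enum {
    TCP_FIN = 0x01, TCP_SYN = 0x02, TCP_RST = 0x04,
    TCP_PSH = 0x08, TCP_ACK = 0x10, TCP_URG = 0x20
};

/* Hypothetical host-side view of the fixed 20 byte header. */
struct tcp_header {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
    unsigned hdr_len;    /* header length converted to bytes */
    unsigned flags;      /* OR of the TCP_* bits above */
    uint16_t window;
    uint16_t checksum;
    uint16_t urgent;
};

static uint16_t get16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer cannot hold a valid header. */
int tcp_parse_header(const uint8_t *buf, size_t len, struct tcp_header *h)
{
    if (len < 20)
        return -1;
    h->src_port = get16(buf);
    h->dst_port = get16(buf + 2);
    h->seq      = get32(buf + 4);
    h->ack      = get32(buf + 8);
    h->hdr_len  = (unsigned)(buf[12] >> 4) * 4;   /* 4-bit count of words */
    h->flags    = buf[13] & 0x3f;                 /* the 6 flag bits */
    h->window   = get16(buf + 14);
    h->checksum = get16(buf + 16);
    h->urgent   = get16(buf + 18);
    if (h->hdr_len < 20 || h->hdr_len > len)
        return -1;
    return 0;
}
```

Any bytes between offset 20 and hdr_len are options; the data starts at offset hdr_len.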


Flow Control
Flow control is associated with the current byte sequence numbers at each end of the data flow. Every segment carries the sequence number of the first byte of its data, from which the number of the last byte sent follows. A segment will also include the sequence number of the next byte that the sending host expects to receive; this is called the acknowledgment number (ACK). A host receiving a segment can assume that the remote host has safely received all bytes up to and including byte ACK-1, so local copies of those bytes may now be discarded.


The difference between the number of the last byte sent and the acknowledgment number is known as the window. The maximum size of the window is advertised by a host as part of every TCP segment the host sends, so a host can quench the flow of data from a remote host by advertising a window size of zero. Once a zero window size advertisement has been received a host can no longer send data until the window is reopened. A host may not, under any circumstances, send data bytes numbered at or beyond the sum of the remote acknowledgment number and the remote window. Under normal circumstances the remote window can be thought of as a buffer where out-of-sequence segments are held temporarily, awaiting the filling in of gaps in the sequence when delayed data turns up.


Window size is not negotiated. It is up to the sender not to over-run the receiver's buffers. A small sender buffer constraint will only mean that the sender cannot take full advantage of the receiver's buffering capabilities.
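The sending limit described above can be sketched as a pair of small functions. The names usable_window and may_send are hypothetical. Unsigned 32 bit arithmetic is used so that the calculation still works when sequence numbers wrap round.

```c
#include <stdint.h>

/* Bytes the sender may still transmit, given the latest acknowledgment
 * number and window from the receiver and the sequence number of the
 * next byte the sender would send. */
uint32_t usable_window(uint32_t ack, uint32_t win, uint32_t next_seq)
{
    uint32_t in_flight = next_seq - ack;   /* sent but not yet acknowledged */

    return in_flight >= win ? 0 : win - in_flight;
}

/* Non-zero if a segment of len bytes fits in what is left of the window. */
int may_send(uint32_t ack, uint32_t win, uint32_t next_seq, uint32_t len)
{
    return len > 0 && len <= usable_window(ack, win, next_seq);
}
```

For instance, with ACK=1000 and WIN=1600 a sender that has already sent up to byte 2399 has 200 bytes of window left, so a 400 byte segment must wait. A window of zero always gives a usable window of zero.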


The following portion of a byte stream is broken into segments as indicated below. It is here assumed that sequence numbering starts at zero.



The building up of a copy of the byte stream by the receiver is shown below. The dotted box shows the receiver window.



The following steps are shown:

1. Segment 1 has arrived at the receiver, which acknowledges it by sending an ACK segment with ACK=1000 and WIN=1600. Since ACK+WIN=2600 the sender can send segments 2, 3 and 4. Segments 2 and 3 are sent.
2. Segment 3 has arrived but segment 2 has been delayed. The receiver sends another ACK segment with ACK=1000 and WIN=1600. The sender sends segment 4.
3. Segment 4 has arrived but segment 2 is still outstanding. Again the receiver sends ACK=1000, WIN=1600. However, the sender cannot send segment 5 since it would take the sent sequence number to 2800, which is greater than ACK+WIN. The sender cannot do anything further unless one of the following happens:
   - Segment 5 is split into smaller segments to use up the remaining part of the window. This leads to the silly window syndrome.
   - There is a time out.
   - Segment 2 reaches the receiver and the receiver announces a new ACK value.
4. The delayed segment 2 now reaches the receiver, which sends ACK=2400 and WIN=1600, allowing the sender to send up to 4000.

The initial values of the sequence numbers are exchanged during the connection establishment sequence.
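The way the receiver arrives at its ACK values in the steps above can be sketched as follows. The acknowledgment number only advances over a contiguous run of received bytes, so a single missing segment holds it back. The structure and function names are hypothetical, and a real receiver would track byte ranges in its buffer rather than a list of whole segments.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one segment of the byte stream. */
struct seg {
    uint32_t start;   /* sequence number of the first byte */
    uint32_t len;     /* number of data bytes */
    int arrived;      /* non-zero once the receiver holds it */
};

/* Return the acknowledgment number: the first byte, counting from
 * first_seq, that the receiver has not yet got in sequence. */
uint32_t cumulative_ack(const struct seg *segs, size_t n, uint32_t first_seq)
{
    uint32_t next = first_seq;
    int progressed = 1;
    size_t i;

    while (progressed) {
        progressed = 0;
        for (i = 0; i < n; i++) {
            if (segs[i].arrived && segs[i].len > 0 && segs[i].start == next) {
                next += segs[i].len;    /* extend the contiguous run */
                progressed = 1;
            }
        }
    }
    return next;
}
```

Assuming, purely for illustration, that segments 2, 3 and 4 cover bytes 1000-1499, 1500-1999 and 2000-2399, the function returns 1000 while segment 2 is missing and 2400 once it arrives.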

The flow control mechanism shown above is usually called a sliding window protocol. The numbered packets used by X.25 are also handled in this fashion; in X.25, however, the numbering is applied on a per packet basis rather than a per byte basis.

A receiver is not required to explicitly and separately acknowledge every incoming segment. A receiver may typically wait up to 200 ms before sending an acknowledgment, which can be troublesome for interactive applications. Acknowledgments can, of course, be included in the return data flow.

If a receiver has advertised a window size of zero as a flow quenching mechanism, it will subsequently "open" the window by sending a further ACK with the updated window value. This is known as a window update and it need not necessarily carry any data. A particular problem arises if the window update is lost. This problem is handled by the sender sending periodic probes as determined by the persist timer. Each probe carries the next byte of data, which the receiver can discard simply by not acknowledging it.

Opening a TCP connection
A TCP connection is opened by a three-way handshake to establish a common view of the sequence numbers. A connection will be initiated by an active client; the other end of the connection is described as the passive client, although in terms of the client/server software model this is likely to be a server. The passive client should be in a state known as LISTEN, which simply means that it is expecting an incoming connection request.

The three-way exchange involves the active client sending a SYN segment with the sequence number set to an arbitrary value (J). The passive client responds with a segment that has both SYN and ACK set, with the acknowledgment number set to J+1 and the sequence number set to a further arbitrary value (K). The active client responds to this segment by sending an ACK segment with the acknowledgment number set to K+1.
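The sequence and acknowledgment numbers in the exchange above can be traced with a short sketch. The structure hs_seg and the three functions are hypothetical and model only the numbers and flag bits, not real segments on a network.

```c
#include <stdint.h>

#define HS_SYN 0x02u
#define HS_ACK 0x10u

/* Hypothetical record of the fields that matter during the handshake. */
struct hs_seg {
    uint32_t seq;
    uint32_t ack;
    unsigned flags;
};

/* Active end: SYN carrying the initial sequence number J. */
struct hs_seg make_syn(uint32_t j)
{
    struct hs_seg s = { j, 0, HS_SYN };
    return s;
}

/* Passive end: SYN carrying K, acknowledging J+1. */
struct hs_seg make_syn_ack(const struct hs_seg *syn, uint32_t k)
{
    struct hs_seg s = { k, syn->seq + 1, HS_SYN | HS_ACK };
    return s;
}

/* Active end: final ACK acknowledging K+1. */
struct hs_seg make_final_ack(const struct hs_seg *syn_ack)
{
    struct hs_seg s = { syn_ack->ack, syn_ack->seq + 1, HS_ACK };
    return s;
}
```

Each SYN uses up one sequence number, which is why the acknowledgments are J+1 and K+1 even though no data has been sent. The arithmetic is modulo 2^32, so an initial value of 0xffffffff is acknowledged with 0.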



Closing a TCP connection
The orderly close down of a TCP connection requires the four way exchange illustrated in the diagram below.




At the active end the application initiates the closure sequence, possibly by a close() system call on a socket. At the passive end receipt of the FIN segment causes the software to pass an "end-of-file" indication to the server software.

It should be noted that the exchange is really two independent exchanges and it is possible to close the connection in one direction but not the other. This is known as a half close. The following example (due to Stevens) demonstrates the use of the half-close.

Consider the Unix command

rsh remote sort < datafile

The effect of this is that the local file datafile is sorted on the remote host and the results transferred back to the local host. The data flow is shown in the following diagram.



The problem here is that the sort program on the remote host cannot produce any output until it has read all of its input. The end of the input is signalled by the local host closing its side of the connection, which the sort program sees as an EOF indication. However, the "back" connection must remain open for the return of data.

Stevens suggests that the library call shutdown() be used with sockets programming to achieve a half close.
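A minimal sketch of a half close using shutdown() follows. To keep it self-contained it runs both ends in one process over a Unix-domain socketpair() rather than a real TCP connection; this is an assumption made purely for illustration, and the function name half_close_demo and the message contents are hypothetical. The call that matters is shutdown(fd, SHUT_WR), which closes the sending direction only.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of bytes the "remote" end read before EOF, or -1.
 * The reply sent back on the still-open direction is stored in reply. */
int half_close_demo(char *reply, size_t reply_size)
{
    static const char data[] = "pear\napple\n";
    char buf[64], msg[32];
    ssize_t n = 0;
    size_t seen = 0;
    int sv[2], rc = -1;

    if (reply_size == 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Local end: send the data, then close only the sending direction. */
    if (write(sv[0], data, strlen(data)) != (ssize_t)strlen(data) ||
        shutdown(sv[0], SHUT_WR) < 0)
        goto out;

    /* Remote end: read until the half close shows up as EOF (read() == 0). */
    while (seen < sizeof buf &&
           (n = read(sv[1], buf + seen, sizeof buf - seen)) > 0)
        seen += (size_t)n;
    if (n < 0)
        goto out;

    /* The reverse direction is still open, so a reply can be returned. */
    snprintf(msg, sizeof msg, "read %zu bytes", seen);
    if (write(sv[1], msg, strlen(msg)) != (ssize_t)strlen(msg))
        goto out;
    close(sv[1]);
    sv[1] = -1;

    n = read(sv[0], reply, reply_size - 1);
    if (n < 0)
        goto out;
    reply[n] = '\0';
    rc = (int)seen;
out:
    close(sv[0]);
    if (sv[1] >= 0)
        close(sv[1]);
    return rc;
}
```

Had the local end called close() instead of shutdown(), it would have had no descriptor left on which to read the reply.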

Once the final ACK has been sent on an active close, the port/connection cannot be released and re-used for the time period 2MSL. This is twice the maximum segment lifetime, and the constraint is imposed in case the final ACK is lost. If the final ACK is lost then the passive closing host will time out awaiting an ACK in response to its closing FIN and will resend the FIN. If this arrives before the 2MSL time has expired there is no problem; after this time the FIN would not appear to belong to whatever connection might then exist between the two clients.



RFC 793 defines MSL (Maximum Segment Lifetime) as 120 seconds but some implementations use 30 or 60 seconds. It is, basically, the maximum time for which it is reasonable to wait for a segment, i.e. if a segment doesn't reach its destination in MSL, it probably won't get there at all and it can be assumed that it has been lost.

TCP link states
The behaviour of a TCP connection can be shown using a state transition diagram such as that shown below.




The dashed lines show the normal transitions for a server and the heavy lines show the normal states for a client. The TIME_WAIT state is also sometimes known as the 2MSL state.
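The normal client and server paths through the state diagram can also be written down as a transition function. The sketch below is deliberately incomplete: it covers only the usual open and close sequences described in the earlier sections and ignores resets, simultaneous opens and closes, and retransmissions. The event names and the ST_ prefix are invented for the example.

```c
enum tcp_state {
    ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RCVD, ST_ESTABLISHED,
    ST_FIN_WAIT_1, ST_FIN_WAIT_2, ST_TIME_WAIT, ST_CLOSE_WAIT, ST_LAST_ACK
};

enum tcp_event {
    EV_ACTIVE_OPEN, EV_PASSIVE_OPEN, EV_RECV_SYN, EV_RECV_SYN_ACK,
    EV_RECV_ACK, EV_RECV_FIN, EV_APP_CLOSE, EV_2MSL_TIMEOUT
};

/* Return the state that follows s when event e occurs.
 * Events with no entry here leave the state unchanged. */
enum tcp_state tcp_next_state(enum tcp_state s, enum tcp_event e)
{
    switch (s) {
    case ST_CLOSED:
        if (e == EV_ACTIVE_OPEN)  return ST_SYN_SENT;     /* send SYN */
        if (e == EV_PASSIVE_OPEN) return ST_LISTEN;
        break;
    case ST_LISTEN:
        if (e == EV_RECV_SYN)     return ST_SYN_RCVD;     /* send SYN+ACK */
        break;
    case ST_SYN_SENT:
        if (e == EV_RECV_SYN_ACK) return ST_ESTABLISHED;  /* send ACK */
        break;
    case ST_SYN_RCVD:
        if (e == EV_RECV_ACK)     return ST_ESTABLISHED;
        break;
    case ST_ESTABLISHED:
        if (e == EV_APP_CLOSE)    return ST_FIN_WAIT_1;   /* send FIN */
        if (e == EV_RECV_FIN)     return ST_CLOSE_WAIT;   /* send ACK */
        break;
    case ST_FIN_WAIT_1:
        if (e == EV_RECV_ACK)     return ST_FIN_WAIT_2;
        break;
    case ST_FIN_WAIT_2:
        if (e == EV_RECV_FIN)     return ST_TIME_WAIT;    /* send ACK */
        break;
    case ST_TIME_WAIT:
        if (e == EV_2MSL_TIMEOUT) return ST_CLOSED;
        break;
    case ST_CLOSE_WAIT:
        if (e == EV_APP_CLOSE)    return ST_LAST_ACK;     /* send FIN */
        break;
    case ST_LAST_ACK:
        if (e == EV_RECV_ACK)     return ST_CLOSED;
        break;
    }
    return s;
}
```

Feeding in the events for an active open followed by an active close takes a client through SYN_SENT, ESTABLISHED, FIN_WAIT_1 and FIN_WAIT_2 to TIME_WAIT, where it stays until the 2MSL timer expires.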

Nagle algorithm
As explained in the discussion of flow control, it is not required or necessary that every segment be explicitly acknowledged. Given the overheads associated with both TCP and IP, explicit acknowledgment of every segment could generate significant extra traffic.

A general rule, known as the Nagle algorithm, says that a sender must not have more than one unacknowledged "small" segment outstanding. Any further data from the application is held by the sender until the outstanding segment is acknowledged. Here "small" means smaller than the maximum segment size.

This is not desirable in highly interactive environments such as an X Window client/server interaction or a screen based terminal session. Under these circumstances Nagle can be turned off by using the TCP_NODELAY socket option. Nagle may also sometimes be turned off anyway for a connection with both hosts on the same LAN.
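On systems with the BSD sockets interface, TCP_NODELAY is set with setsockopt() at the IPPROTO_TCP level. The sketch below assumes a POSIX system; the wrapper names disable_nagle and nagle_disabled are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn the Nagle algorithm off for a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int disable_nagle(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}

/* Returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int nagle_disabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

The option applies to a single socket, so it can be set for an interactive connection without affecting bulk transfers made by the same program.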

Silly window syndrome
If a receiver advertises a small window, perhaps due to congestion between the receiver's buffer and the application, then the sender may split its data into small segments and send them, resulting in an even smaller receiver window and the sender sending data in still smaller segments.

The overheads associated with this behaviour, known as the silly window syndrome, are significant. It is avoided by a rule which states that a receiver cannot increase its advertised window size unless the increase is either a full segment size or half the receiver's buffer space, whichever is smaller. The sender should also refrain from sending segments unless one of the following holds:

A full size segment can be sent
A segment can be sent that is at least half the size of the largest window ever advertised by the receiver
All outstanding data can be sent and either no ACK is expected or Nagle is disabled
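The three sender-side conditions above can be collected into a single predicate. This is a sketch only; the function and parameter names are hypothetical, and "no ACK is expected" is taken to mean that there is no unacknowledged data outstanding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether the sender should transmit now.
 *   queued          bytes waiting to be sent
 *   usable_window   bytes the receiver's window currently allows
 *   mss             maximum segment size
 *   max_window_seen largest window the receiver has ever advertised
 *   unacked         bytes sent but not yet acknowledged
 *   nagle_disabled  true if Nagle has been turned off */
bool sender_may_send(uint32_t queued, uint32_t usable_window, uint32_t mss,
                     uint32_t max_window_seen, uint32_t unacked,
                     bool nagle_disabled)
{
    uint32_t can_send = queued < usable_window ? queued : usable_window;

    if (can_send == 0)
        return false;
    if (can_send >= mss)                              /* full size segment */
        return true;
    if (2 * (uint64_t)can_send >= max_window_seen)    /* at least half the
                                                         largest window */
        return true;
    if (can_send == queued && (unacked == 0 || nagle_disabled))
        return true;                                  /* everything goes and
                                                         nothing holds it back */
    return false;
}
```

With 100 bytes queued, a usable window of 200 and 500 bytes still unacknowledged, the predicate is false while Nagle is enabled and true once it is disabled.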


Timers

There are a number of timers associated with a TCP connection. These are


The 2MSL timer
This is associated with closing connections. It was discussed earlier.
The persist timer
This is associated with periodic window probes generated in case a window update opening a zero sized window has been lost. It normally has a minimum value of 5 seconds and increases exponentially to a maximum of 60 seconds.
The keepalive timer
An idle TCP connection is just that, idle, i.e. no data flows. The keepalive timer is an optional but common feature of TCP implementations. When the timer expires, a segment containing no data and carrying a sequence number one less than the last sequence number sent is transmitted as a probe. The keepalive timer is normally set to about 2 hours. Once it expires, several (typically 10) probe segments are sent at 75 second intervals; if none of these elicits a response the connection is assumed to be dead.
The RTT timer
The RTT (Round Trip Time) timer maintains a smoothed average of the time between sending a segment and its being acknowledged. If an RTT measurement of M has been made then a new value of R (the RTT estimator) can be calculated using the formula

R = a × R + (1 - a) × M

RFC 793 then recommends an acknowledgment time out of

RTO = b × R

Typical values are 0.9 for a and 2 for b. More recent implementations have been more sensitive to the variability of RTTs and have used the expression

RTO = R + 4 × D

where D is the smoothed mean deviation of the RTTs.
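Both ways of calculating the time out can be sketched in a few lines. The constants RTT_A and RTT_B are the typical values quoted above. The gains RTT_G and RTT_H used to smooth R and D in the second function are not given in the text; the values 1/8 and 1/4 are an assumption based on commonly used settings. All names are hypothetical and times are in seconds.

```c
#define RTT_A 0.9     /* smoothing factor a */
#define RTT_B 2.0     /* variance factor b */
#define RTT_G 0.125   /* assumed gain for the RTT average */
#define RTT_H 0.25    /* assumed gain for the mean deviation */

struct rtt_state {
    double r;   /* smoothed RTT estimator R */
    double d;   /* smoothed mean deviation D */
};

/* Update R with measurement m; return RTO = b * R. */
double rto_simple(struct rtt_state *s, double m)
{
    s->r = RTT_A * s->r + (1.0 - RTT_A) * m;
    return RTT_B * s->r;
}

/* Update R and D with measurement m; return RTO = R + 4 * D. */
double rto_with_deviation(struct rtt_state *s, double m)
{
    double err = m - s->r;

    s->r += RTT_G * err;
    if (err < 0.0)
        err = -err;
    s->d += RTT_H * (err - s->d);
    return s->r + 4.0 * s->d;
}
```

Starting from R = 1 and D = 0, a single measurement of M = 2 gives R = 1.1 and a time out of 2.2 with the first method, and R = 1.125, D = 0.25 and a time out of 2.125 with the second.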

 
