For many of us, TCP just hums in the background, sending our data efficiently and effectively without another thought. And what exactly does TCP do for us anyway? Let's recap:
- reliable, ordered delivery via independent, bidirectional sequence numbers
- ports so multiple connections between systems can occur simultaneously
- connection setup and teardown
- flow control via a sliding window advertised by the receiver
On this topic, a curiosity recently arose at work: some of our intercontinental WAN links just weren't living up to their advertised capacity. What was going on? The links should be 2Mbps, but we were barely getting 600-700Kbps over them with our FTP transfers. And this was between two Solaris 8 systems, which should have a decent, if not one of the best, TCP/IP stacks in the industry.
So after a bit of investigation into performance problems on WAN links, I came across a plethora of writeups on TCP window sizes, the bandwidth-delay product (BDP), and networks with high BDPs, or "Long Fat Networks" (LFNs, read "elephants"). Bottom line: it turns out most TCP stacks in the industry just aren't set up by default for use over today's WAN links, satellite links, or even gigabit Ethernet--anything with high bandwidth, high delay, or both.
Let's consider what makes an LFN an LFN. We'll use the above WAN link as an example. Say you measure the round trip time (RTT) to be 300ms using ping. This means it's taking about 150ms for the data you're sending to get to the other side, and another 150ms for the other side's acknowledgment to get back to you. 300ms is a pretty long delay, almost a third of a second. Click the screenshot of Figure 1 below to watch a flash animation I created demonstrating a ping over this LFN.
So let's consider the BDP--what is it? The BDP tells us the optimal TCP window size needed to fully utilize the line. To keep the pipe fully utilized you must push data onto the wire, at your given bandwidth, for as long as an entire RTT. That is, the receiver must advertise a window size big enough to let the sender keep sending data right up until it begins receiving acknowledgments. Say the WAN link is 2 megabits per second (Mbps). In our case, we need a window size big enough to hold 300ms worth of data. Let's compute the BDP:
    Bandwidth * Delay = Product
    2,000,000 bits/s * 0.3 s = 600,000 bits
    600,000 bits * 1 byte/8 bits = 75,000 bytes
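If you'd rather let the computer do the arithmetic, here's a minimal Python sketch of the same calculation (the bandwidth and RTT are just our example's numbers; substitute your own measurements):

    # Bandwidth-delay product: the amount of data "in flight" over one RTT.
    def bdp_bytes(bandwidth_bps, rtt_seconds):
        bits_in_flight = bandwidth_bps * rtt_seconds
        return bits_in_flight / 8

    # Our example WAN link: 2Mbps with a 300ms round trip time.
    print(bdp_bytes(2000000, 0.3))  # 75000.0 bytes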
So in this example our TCP window size should be a minimum of 75,000 bytes. But let's say that our systems haven't been properly tuned for this kind of bandwidth and delay, and their TCP stacks have a maximum TCP window size of 24,000 bytes. Oh no! Click and watch the simulated behavior in Figure 2.
You can clearly visualize in the animation above how inefficient this is. The sender can only send so much data before it's completely filled the window and has to stop and wait for some acknowledgments to come back. As soon as that first ACK comes back it can resume sending data--until it prematurely fills the window again, and so on.
As it turns out, I didn't just pick 24,000 bytes out of the air--that is Solaris 8's default maximum TCP window size (24576 to be exact). And this is the reason we were only getting 600-700Kbps on a 2Mbps line. The fact is Solaris 8, by default, just isn't tuned for performance on a high bandwidth-delay network. But this problem doesn't just affect the fairly antiquated Solaris 8--most OSes today have a default maximum TCP window size of at most 64KB, which would still be insufficient.
Fortunately, the TCP window size is easily adjustable on Solaris or pretty much any OS. Instructions on how to do this are widely available (see links at the end of this article). It's probably worth noting that the adjustments you make are actually to the default socket buffer size for each application, which indirectly allows the system to advertise larger TCP windows. The following commands are for Solaris:
    ndd -set /dev/tcp tcp_xmit_hiwat 75000
    ndd -set /dev/tcp tcp_recv_hiwat 75000
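Incidentally, an application that knows it will be talking over an LFN can also ask for bigger buffers itself, rather than relying on the system-wide default. Here's a minimal Python sketch of the idea, reusing our 75,000-byte figure (the OS may round or cap whatever you request, so it's worth reading the value back):

    import socket

    # Ask for socket buffers big enough to cover our 75,000-byte BDP. The
    # kernel may round or cap these values, so verify what you actually got.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 75000)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 75000)

    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))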
So we adjusted our systems to use 75,000 byte TCP window sizes. Click and watch Figure 3 for a demonstration of an optimized TCP flow.
Much better. And in fact now we're getting a full 1.9-2Mbps on our WAN link.
The next logical questions would be: what are the dangers of increasing the TCP window size, and how much is too much? The only real downside to turning up the default socket buffer sizes is increased memory usage, which in this age of cheap and plentiful memory doesn't seem like a big deal. The other argument that comes up is that with more unacknowledged data in flight, the risk of clogging your network increases, since more data might be transmitted, lost, and retransmitted. That could happen if your network is having errors, in which case you should probably fix that problem anyway. But if you don't increase the window size you're underutilizing your network. So which is worse?
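For a sense of scale on the memory point, here's a rough back-of-envelope sketch in Python (the connection count is purely hypothetical, and actual per-socket overhead varies by OS):

    # Worst case: each connection holds a full send buffer plus a full receive buffer.
    buffer_bytes = 75000   # our tuned buffer size
    connections = 500      # hypothetical number of simultaneous connections

    total_bytes = connections * 2 * buffer_bytes
    print(total_bytes / (1024 * 1024))  # roughly 71.5 MB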
As a side note, we found that turning up the TCP window size on our gigabit Ethernet-capable systems gave a significant boost to LAN throughput as well. Using Iperf (see links at end of article), a great tool for testing different TCP window sizes and their effect on throughput, we saw an increase from about 300Mbps to 850-900Mbps.
Let's take one last look at the BDP formula and observe the following, essentially the point of this article: TCP bandwidth is limited by the round trip time of the line and the size of the TCP window. While the former is out of your control, the latter is not. You could in fact prove, using the BDP formula, that with a TCP window of 24,576 bytes, it was impossible for me to get any more than 655Kbps on a line with a RTT of 300ms.
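Here's a quick sanity check of that claim in Python, with the numbers plugged in:

    # Rearranging the BDP formula: throughput can't exceed window size / RTT.
    window_bytes = 24576   # Solaris 8's default maximum TCP window
    rtt_seconds = 0.3

    max_throughput_bps = window_bytes * 8 / rtt_seconds
    print(max_throughput_bps)  # 655360.0 bits/s -- about 655Kbps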
So while TCP is normally a well-oiled machine, there are still opportunities for performance tuning and considerable payoff for understanding how things work under the hood. Of all the networking settings you can tweak on your machine, tuning the TCP window size is hands down the biggest bang-for-buck optimization you can make for improving throughput over high bandwidth-delay networks.
While that's the end of the discussion on sizing and tuning TCP windows, there's actually quite a bit more for anyone interested. Here are some tidbits: The original designers of TCP made the window size 16 bits, which allows a maximum of 65,535 bytes. After all, who would ever need a bigger window than that? :-). To accommodate larger window sizes, both systems must support TCP window scaling (RFC1323), which basically just specifies a power-of-two multiplier (2, 4, 8, 16, etc.) to apply to the advertised window size. Also, when window sizes start getting huge, and thus the amount of unacknowledged data in flight also gets huge, the ability to do selective ACKs (SACKs) becomes important. This lets a receiver acknowledge exactly which sequence ranges it has received, so the sender retransmits only the missing segments instead of everything past the first loss, as the original TCP spec required. Most modern OSes should be capable of both of the above (I know Linux, Solaris, and FreeBSD 5.x all are), yet surprisingly, in almost all cases these are not enabled by default.
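As a rough illustration of how the scaling works, here's a small Python sketch (my own, not from any RFC) that finds the smallest shift count needed to advertise a given window within the 16-bit field; the effective multiplier is 2 raised to that shift, and the 250MB figure is borrowed from the land speed record link below:

    # RFC1323 carries the scale as a shift count; the multiplier is 2**shift.
    def window_scale_shift(window_bytes):
        shift = 0
        while (window_bytes >> shift) > 65535:
            shift += 1
        return shift

    print(window_scale_shift(75000))        # 1  -> advertise in units of 2 bytes
    print(window_scale_shift(250 * 2**20))  # 12 -> units of 4096 bytes, for 250MB windows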
TCP is a complicated yet fascinating protocol; here are some links which may be of interest:
- A User's Guide to TCP Windows (excellent)
- TCP Window calculator
- Iperf: measure maximum TCP bandwidth
- Enabling High Performance Data Transfers
- RFC1072 (scaled windows, selective acknowledgments, and round-trip timing, in order to provide efficient operation over large-bandwidth*delay-product paths)
- RFC1323 (supersedes 1072)
- SUNET Internet2 Land Speed Record Broken (18,000 miles, 4.311 Gbit/sec, RTT of 436ms, 250MB TCP windows used)