Tuning TCP for High Bandwidth-Delay Networks
Tue Nov 9th 2004, 8:23pm
Like the Energizer bunny, TCP just keeps on going without any need for tuning or maintenance. Right?

For many of us, TCP just hums in the background, sending our data efficiently and effectively without another thought. And what exactly does TCP do for us anyway? Let's recap:
  1. reliable, ordered delivery via independent, bidirectional sequence numbers
  2. ports so multiple connections between systems can occur simultaneously
  3. connection setup and teardown
  4. flow control via a sliding window
Let's focus for a moment on TCP's sliding window. It's the feature that allows a sender to put a certain amount of data on the wire before having received an acknowledgment from the receiver. The window "slides" open and shut as the receiver lets the sender know how much data can safely be transmitted at any given moment, based on network conditions, free memory on the receiver, etc. Without this windowing feature, we'd have to send a byte, wait for an acknowledgment, send another byte, and so on. You can imagine that this "Stop and Wait" style of protocol would be a performance nightmare. But who picks the size of this window? How big is big enough, or too big?
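
To get a feel for just how bad stop-and-wait would be, here's a rough back-of-the-envelope figure (assuming a typical ~1460-byte segment and the 300ms round trip we'll see on the WAN link below):
     # one segment per round trip caps throughput at segment size / RTT,
     # no matter how fast the link is (shell arithmetic, illustrative only)
     echo $(( 1460 * 8 * 1000 / 300 ))     # ~38933 bits/s, i.e. about 39Kbps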

On this topic, a curiosity arose recently at work: some of our intercontinental WAN links just weren't living up to their advertised capacity. What was going on? The links should be 2Mbps, but we were barely getting 600-700Kbps over them with our FTP transfers. And this was between two Solaris 8 systems, which should have a decent, if not one of the best, TCP/IP stacks in the industry.

So after a bit of digging into performance problems on WAN links, I came across a plethora of writeups on TCP window sizes, the bandwidth-delay product (BDP), and networks with high BDPs, or "Long Fat Networks" (LFNs, read "elephants"). Bottom line: it turns out most TCP stacks in the industry just aren't set up by default for use over today's WAN links, satellite links, or even gigabit ethernet--anything with high bandwidth, high delay, or both.

Let's consider what makes an LFN an LFN. We'll use the above WAN link as an example. Say you measure the round trip time (RTT) to be 300ms using ping. This means it's taking about 150ms for the data you're sending to get to the other side, and another 150ms for the other side's acknowledgment to get back to you. 300ms is a pretty long delay, almost a third of a second. Click the screenshot of Figure 1 below to watch a flash animation I created demonstrating a ping over this LFN.


Figure 1: Ping behavior over an LFN (click to watch)


So let's consider the BDP--what is it? The BDP tells us the optimal TCP window size needed to fully utilize the line. To keep the pipe fully utilized you must push data onto the wire, at your given bandwidth, for as long as an entire RTT. That is, the receiver must advertise a window size big enough to allow the sender to keep sending data right up until it begins receiving acknowledgments. Say the WAN link is 2 megabits per second (Mbps). In our case, we need a window size big enough to allow for 300ms worth of data. Let's compute the BDP:
Bandwidth        * Delay  = Product
2,000,000 bits/s * 0.3 s  = 600,000 bits * (1 byte/8 bits) = 75,000 bytes
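
If you'd rather let the computer do the arithmetic, the same figure falls out of a quick shell one-liner (a sketch only; plug in your own bandwidth and RTT):
     # BDP in bytes = bandwidth (bits/s) * RTT (ms) / 1000 / 8
     echo $(( 2000000 * 300 / 1000 / 8 ))     # prints 75000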

So in this example our TCP window size should be a minimum of 75,000 bytes. But let's say that our systems haven't been properly tuned for this kind of bandwidth and delay, and their TCP stacks have a maximum TCP window size of 24,000 bytes. Oh no! Click and watch the simulated behavior in Figure 2.

Figure 2: Untuned TCP session over an LFN (click to watch)

You can see clearly in the animation above how inefficient this is. The sender can only send so much data before it has completely filled the window and has to stop and wait for some acknowledgments to come back. As soon as that first ACK arrives it can resume sending data--until it prematurely fills the window again, and so on.

As it turns out, I didn't just pick 24,000 bytes out of the air--that is Solaris 8's default maximum TCP window size (24576 to be exact). And this is the reason we were only getting 600-700Kbps on a 2Mbps line. The fact is Solaris 8, by default, just isn't tuned for performance on a high bandwidth-delay network. But this problem doesn't just affect the fairly antiquated Solaris 8--most OSes today have a default maximum TCP window size of at most 64KB, which would still be insufficient.

Fortunately, the TCP window size is easily adjustable on Solaris or pretty much any OS. Instructions on how to do this are widely available (see links at the end of this article). It's probably worth noting that the adjustments you make are actually to the default socket buffer size for each application, which indirectly allows the system to advertise larger TCP windows. The following commands are for Solaris:
     ndd -set /dev/tcp tcp_xmit_hiwat 75000     # default send buffer, in bytes
     ndd -set /dev/tcp tcp_recv_hiwat 75000     # default receive buffer, in bytes
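
For comparison, on Linux the rough equivalents are sysctls rather than ndd--something along these lines, where tcp_rmem/tcp_wmem take a min/default/max triple (treat this as a sketch and check your kernel's documentation; on newer kernels the defaults may already be larger than this):
     sysctl -w net.core.rmem_max=75000
     sysctl -w net.core.wmem_max=75000
     sysctl -w net.ipv4.tcp_rmem="4096 75000 75000"
     sysctl -w net.ipv4.tcp_wmem="4096 75000 75000"
An individual application can also ask for bigger buffers itself via setsockopt() with SO_SNDBUF/SO_RCVBUF, which is handy when you only want one transfer tool to use the larger windows.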

So we adjusted our systems to use 75,000 byte TCP window sizes. Click and watch Figure 3 for a demonstration of an optimized TCP flow.

Figure 3: Optimal TCP session over an LFN (click to watch)

Much better. And in fact now we're getting a full 1.9-2Mbps on our WAN link.

The next logical questions would be: what are the dangers of increasing the TCP window size, and how much is too much? The only real downside to turning up the default socket buffer sizes is increased memory usage (a server juggling a thousand connections with 75KB buffers in each direction could tie up on the order of 150MB), which in this age of cheap and plentiful memory doesn't seem like a big deal. There's also the argument that with more unacknowledged data in flight, the risk of clogging your network increases, since more data might be transmitted, lost, and retransmitted. That could be the case if your network is dropping packets, in which case you should probably fix that problem anyway. But if you don't increase the window size you're underutilizing your network. So which is worse?

As a side note, we found that turning up the TCP window size on our gigabit ethernet-capable systems gave a significant boost to LAN throughput as well. Using Iperf (see links at end of article), a great tool for testing different TCP window sizes and their effect on throughput, we saw an increase from about 300Mbps to 850-900Mbps.
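
If you want to try this yourself, iperf's -w option sets the TCP window (socket buffer) size on each end; something along these lines, with the window size and host name purely illustrative (flags can vary a bit between iperf versions):
     iperf -s -w 300k                        # on the receiving host
     iperf -c server.example.com -w 300k     # on the sending host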

Let's take one last look at the BDP formula and observe the following, which is essentially the point of this article: TCP throughput is limited by the round trip time of the line and the size of the TCP window. While the former is out of your control, the latter is not. You could in fact prove, using the BDP formula, that with a TCP window of 24,576 bytes, it was impossible for me to get any more than about 655Kbps on a line with an RTT of 300ms.
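
Here's that upper bound worked out (throughput can be no better than window size divided by RTT):
     # max throughput (bits/s) = window (bytes) * 8 / RTT (s)
     echo $(( 24576 * 8 * 1000 / 300 ))     # ~655360 bits/s, i.e. about 655Kbps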

So while TCP is normally a well-oiled machine, there are still opportunities for performance tuning and considerable payoff for understanding how things work under the hood. Of all the networking settings you can tweak on your machine, tuning the TCP window size is hands down the biggest bang-for-buck optimization you can make for improving throughput over high bandwidth-delay networks.



While that's the end of the discussion on sizing and tuning TCP windows, there's actually quite a bit more for anyone interested. Here are some tidbits: the original designers of TCP made the window size field 16 bits, which allows a maximum of 65,535 bytes. After all, who would ever need a bigger window than that? :-). To accommodate larger window sizes, both systems must support TCP window scaling (RFC1323), which basically just specifies a power-of-two multiplier (2, 4, 8, 16, etc.) applied to the advertised window size. Also, when window sizes start getting huge, and thus the amount of unacknowledged data in flight also gets huge, the ability to do selective ACKs (SACKs) becomes important. SACK lets the receiver acknowledge the ranges it did get, so the sender retransmits only the missing segments instead of everything past the first lost one, as the original TCP spec required. Most modern OSes should be capable of both of the above--I know Linux, Solaris, and FreeBSD 5.x all are--yet surprisingly, in almost all cases these are not enabled by default.
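
For the curious, here's roughly how you'd check and enable these on a Solaris box of that vintage; parameter names and defaults vary by OS and release, so treat this as a sketch and check your vendor's docs (on Linux the analogous knobs are the net.ipv4.tcp_window_scaling and net.ipv4.tcp_sack sysctls):
     ndd -get /dev/tcp tcp_wscale_always      # 1 = always offer RFC1323 window scaling
     ndd -get /dev/tcp tcp_sack_permitted     # 2 = actively use selective ACKs
     ndd -set /dev/tcp tcp_wscale_always 1
     ndd -set /dev/tcp tcp_sack_permitted 2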

TCP is a complicated yet fascinating protocol; here are some links which may be of interest:



Visitor comments
On Fri Jul 14th 2006, 1:35pm, Visitor posted:
Very informative.


On Fri Jul 14th 2006, 3:46pm, Steve Kehlet posted:
Thanks Visitor! It's nice to know that someone finds this article informative.


On Mon Feb 12th 2007, 9:46pm, Visitor posted:
Yay TCP windows


On Thu Oct 11th 2007, 5:22pm, Visitor posted:
Great article. Much easier to understand than most of the other ones out there.


On Wed Oct 24th 2007, 9:48am, Visitor posted:
Very clear. With satellites, the BDP is of the order of 10's of MB's. Would you know what settings and tuning people use? Thx


On Wed Oct 24th 2007, 12:07pm, Steve Kehlet posted:
Hi Visitor, thanks for posting. I'd be curious to know too, but no, I haven't done any work with TCP over satellite links. I'd guess, just like any other link, you'd want to measure the RTT and then just use a TCP window calculator (like the one I have linked above) to determine what the optimal TCP window size would be. That should be a ballpark figure at least.


On Fri Dec 14th 2007, 7:24am, Rohit posted:
the best i have found on net ..thanks
can u tell me what shud we do when network bandwidth is low and sender sytem buffer gets full ?? should we increase delay (RTT)


On Wed Sep 10th 2008, 12:51am, Dharmendra Tripathi posted:
Great article. Thanks much.
I want a suggestion. We have a web application (deployed on Tomcat)on Sun Sparc m/c with Solaris 9 and 2 GB of RAM. Sometimes in excessive load, response times becomes too slow. How can we improve the server response time by setting TCP config parameters? Please suggest.


On Thu Nov 6th 2008, 5:15pm, Mike posted:
Good article...

So, taken into consideration the BDP (RCV Window, Latency) what is the calculation needed to answer questions like:
a) how long it would take to send 1GBYTES over a 100Mbit Link?
b) What size link is required to send 1GBYTE in 10 seconds?

Thanks,


On Thu Jan 8th 2009, 5:54pm, Visitor (Glenn) posted:
This is a fantastic, well-written article on a very important subject. Well done!


On Thu Mar 26th 2009, 5:20pm, Jackie posted:
Good article!


On Thu Apr 16th 2009, 12:08pm, Visitor posted:
Your article, Tuning TCP for High Bandwidth-Delay Networks, is really good! I send this article to clients that think throwing bandwidth at a TCP transmission issue is the answer. Once they see this and optimize their TCP window size, things get much better. Some opt for WAN acceleration which does this and much more.

Thanks again! This truly is a great resource!!!!


On Thu Jul 2nd 2009, 3:15pm, Mohan posted:
Excellent article on this subject. Nice flash animations.

I have one question. Available link bandwidth and delay of a link can vary depending on how many others are using the link. In that case, how do we calculate the BDP?

Thanks again for the wonderful article. I have forwarded it to my colleagues.


On Tue Sep 29th 2009, 12:05pm, Visitor posted:
This is by the far the best and simplest article I found that gives an excellent explanation about TCP windows Size and throughput ...


On Sun Nov 29th 2009, 12:31pm, Visitor posted:
Thanks for a very lucid, simple, and compelling explanation!


On Wed Mar 24th 2010, 8:14am, Visitor posted:
Great article. Thank you!


On Fri Mar 26th 2010, 8:02pm, Visitor posted:
Great article!; However, manual tuning will only work in certain environments; consider this topology: http://i39.tinypic.com/34g7ij6.jpg

if you manually adjust the RWIN for ftp01.lax01 and ftp01.nap01 for the 2Mbps link: RWIN: 75K, you will indeed see a performance increase; however, once you try to establish a link to ftp01.ber01 from ftp01.lax01 and transfer a file, you will see a decrease in performance: as the RWIN for the connection will need to be at least 250K for 200ms and a 10Mbps VPN pipe (assuming 0 percent utilization on both Internet connections): 10000000/8 * .2 = 250K, however, as you manually configured the RWIN on ftp01.lax01, the maximum theoretical throughput you can achieve between ftp01.lax01 and ftp01.ber01 would be 3Mbps (RWIN/RTT=Throughput); 75000/.2= 375KB/sec or 3Mbps. Check this out, if you started the same file transfer from ftp02.lax01 to ftp01.ber01 with TCP RWIN Auto Tuning, your performance would be much better as the RWIN would be calculated on the fly.

Hope this clears things up a bit.

Evilbit


On Fri Mar 26th 2010, 11:29pm, Steve Kehlet posted:
@Evilbit: Glad you liked the article. Yes, when sizing your tcp windows you need to consider the path with the largest bandwidth-delay product, or you could unintentionally limit your throughput. In your example, simply pick 250KB and you're covered in both cases. Regarding auto tuning of the tcp window size, good point, it's probably best to let the OS handle this, if it's capable--note I wrote this article six years ago, when neither Windows nor Linux (and certainly not Solaris) had this ability.


On Sat Mar 27th 2010, 9:48am, Visitor posted:
Steve, exactly; six years ago is a long time; this is why I added it in case someone else had some confusion. Take care

EB


On Mon Apr 19th 2010, 5:53pm, Visitor posted:
Hi - really appreciate the excellent & clear writeup. By chance, do you have a new link for "A User's Guide to TCP Windows (excellent)" The page can't be displayed. thanks!


On Mon Apr 19th 2010, 6:32pm, Steve Kehlet posted:
Sadly, no. Looks like the project wrapped up, google doesn't have a cached copy. Oh wait! It occurred to me to try the Wayback machine, here it is:

http://web.archive.org/web/20080803082218/http://dast.nlanr.net/Guides/GettingStarted/TCP_window_size.html

(here is the search page from the wayback machine: http://web.archive.org/web/*/http://dast.nlanr.net/Guides/GettingStarted/TCP_window_size.html)


On Wed Apr 21st 2010, 4:43pm, Visitor posted:
thanks for the tip on the Wayback machine and the link to a copy of the User's Guide! cool.


On Thu Jan 27th 2011, 9:45am, Visitor posted:
Hi Steve, I think your message (and the same message from other authors) is getting out. It's brought up another learning: setting the window size too large may fill buffers in network elements and ruin your latency and jitter. If you have a home network and want to use different types of applications, it's a good idea not to overdo the window size. Have a look at more information by searching for bufferbloat on Wikipedia. Best Regards.


On Thu Jan 27th 2011, 10:39am, Steve Kehlet posted:
Thanks Visitor for bringing up the point of too large window sizes.


On Fri Jun 24th 2011, 1:45am, ThumbsUp posted:
Excellent, thank you very much !


On Fri Feb 10th 2012, 3:51pm, Sandeep posted:
awesome post. Thanks.


On Wed Mar 7th 2012, 6:52am, Visitor posted:
Thanks. It is great


On Mon May 13th 2013, 8:40pm, Visitor posted:
Nearly 9 years later, and still this article sits ready to serve up great information. Thanks Steve!


On Mon Jul 29th 2013, 4:42am, Visitor posted:
Well-written article, thanks!