I think this is a good one. I'll try to reserve my diagnosis until the end.
We have two endpoints (both Linux) that are 25ms apart (50ms RTT). They are transferring files via an SSH tunnel - actually they are tar-ing to a pipe (|) which feeds an SSH session that executes another tar command on the other side - suboptimal IMHO, but they're unix "gurus", what're you going to do. Anyway... The customer is complaining about the throughput. The endpoints negotiate a window scale of 4 (256k window). The sites are connected via OC192 at <10% utilization - bandwidth ain't an issue. It looks like they go through the normal TCP slow start until they hit about 40Mbps, then we see a big drop in throughput and slow start kicks in again. The window doesn't drop - well, it goes from 261k to ~255k at various points in the transfer, but those slight drops in window size don't really align with the drops in throughput. We're capturing with a probe/SPAN on one side, and a capture agent on the other. We don't see any packets dropped - what is sent is being received. NOW, what we DO see are trip-dup-ACKs, which I assume are triggering the remote stack's drop in its congestion window. One of my fellow monkeys traced back up 20 packets or so and found a single (1) out-of-order packet. And this is where we are.
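As a back-of-the-envelope check on those numbers: a sender can have at most one receive window of data in flight per round trip, so window / RTT puts a ceiling on throughput. The sketch below assumes "256k" means 256 KiB and that the receive window is the only limit (no loss, no congestion window in the way); the function name is just for illustration.

```python
def window_limited_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput when only the receive window limits the sender."""
    # At most one full window can be in flight per round trip.
    return window_bytes * 8 / rtt_seconds


window = 256 * 1024   # assumed: "256k window" = 256 KiB
rtt = 0.050           # 50 ms round-trip time from the question

ceiling = window_limited_throughput_bps(window, rtt)
print(f"{ceiling / 1e6:.1f} Mbit/s")   # 41.9 Mbit/s
```

Under those assumptions the ceiling works out to roughly 42 Mbit/s, close to the ~40Mbps plateau described above. That only bounds the peak rate; it says nothing about why throughput drops afterwards.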
OPINION: Because of the amount of data flowing, I think that one OOO packet is causing the buffers to get blown out of the water. The server AS on both sides is supported by a highly redundant network; it's possible that the OOO packet is the result of one leg in the redundant path having a temporarily slightly higher latency. It seems odd that a smoothly flowing 256k window would suddenly fill up so quickly. This may point to an I/O issue on the receiving server - but if that were the case I would expect to see much more severe drops in TCP window size. I have NOT verified whether or not the switch is seeing output queue drops on the server interfaces - it's possible that there's a drop going on at the interface level that the probe isn't seeing. Aren't trip-dup-ACKs special though? Is that the receiver's way of telling the sender to start slow start? It seems that pushing a TCP Zero-Window would have the same effect - but the recovery from a 0 window includes up to a 3 second period of non-communication. Is the trip-dup-ACK the big RED slowdown button?
asked 03 Feb '11, 10:44
The trip-dup-ACK is meant to trigger TCP Fast Retransmit and, with it, Fast Recovery instead of the classical slow start... well, not completely instead, but CWND should grow back much faster than it would under slow start. That also depends on which OS the sender is using -> the RFC states that Fast Retransmit triggers on the third duplicate ACK (4th ACK w/ the same ACK number), while Microsoft speeds up the process and does not keep to the RFC, triggering after the 2nd duplicate ACK (3rd total ACK w/ the same number).
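To make that sequence concrete, here is a minimal sketch of the duplicate-ACK counting on the sender side, using the RFC 5681 values (threshold of 3 duplicates, ssthresh set to half the flight size, cwnd set to ssthresh plus 3 segments). It is deliberately simplified: a real stack also checks that the ACK carries no data, that the advertised window is unchanged and that data is outstanding before counting it as a duplicate. The function name, the MSS of 1460 and the sample ACK numbers are made up for illustration.

```python
MSS = 1460             # assumed segment size in bytes
DUP_ACK_THRESHOLD = 3  # RFC 5681: fast retransmit on the third duplicate ACK


def fast_retransmit_events(ack_numbers, flight_size):
    """Return (index, ack, ssthresh, cwnd) for each point where fast retransmit fires."""
    events = []
    last_ack = None
    dup_count = 0
    for i, ack in enumerate(ack_numbers):
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                # Halve the load instead of collapsing to one segment as an RTO would.
                ssthresh = max(flight_size // 2, 2 * MSS)
                # Inflate by the three segments the duplicate ACKs say have left the network.
                cwnd = ssthresh + 3 * MSS
                events.append((i, ack, ssthresh, cwnd))
        else:
            last_ack = ack
            dup_count = 0
    return events


# One original ACK for 2460 followed by three duplicates (hypothetical numbers).
acks = [1000, 2460, 2460, 2460, 2460, 3920]
events = fast_retransmit_events(acks, flight_size=256 * 1024)
print(events)   # [(4, 2460, 131072, 135452)]
```

The event fires at index 4, i.e. on the 4th ACK carrying the same number, and the window is cut to about half rather than reset to a single segment.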
Your case kind of reminds me of my earlier question here about TCP sender behaviour, remember? :)
Do you have any speed step-downs between those two stations, e.g. from Gig to 100M?
Also, you might want to take a close look at the timings, because I had several cases where Wireshark was talking about OOO but those were in fact fast retransmits, and vice versa. But you know that - just commenting for others reading this question.
answered 18 Feb '11, 01:14
Interesting one. Could it be that the duplicate ACKs have SACK options, telling the sender about the missing packet for each received packet until the OOO packet arrives?
Could you post the relevant part of the trace somewhere?
Or the output of the following:
answered 03 Feb '11, 12:48
Sorry for the delay, I've been MIA. I'm working on scrubbing the captures and getting them posted. After digging through IBM's KB we found an article loosely related to this issue. Lo and behold, a simple reboot seems to make the problem disappear for a few days. The SysAds are STILL asking us to figure out how the network performance is improved by a server reboot. Silly admins.
(15 Feb '11, 08:14) GeonJay
The endpoints are Gig, switch uplinks are 10G, and the interconnects between the sites are OC192s.
I've read and reread Comer's opinion on the trip-dup-ACK, and I get lost in the RFC about fast retransmit. Why isn't there a "Simple" button for this stuff? Thanks for putting the sequence of events more plainly.