Anything I’m missing in my analysis of retransmits?

Question

We have a source server 10.235.3.53(local) transferring to 10.240.44.9(remote), and throughput is quite slow. All local links/devices have been vetted as clean so we moved to a capture.

I see a lot of DUP ACKs followed by retransmits, which would be why we're seeing reduced performance.

https://www.cloudshark.org/captures/00208d87b26d This particular capture was taken on the inside interface of the local firewall. The duplicate ACKs I see are sourced from the remote 10.240.44.9 to the local 10.235.3.53, so it would appear the packet is getting lost somewhere between our local firewall and local server. Then we see 10.235.3.53 do a fast retransmit since it never saw an ACK for the packet. Correct?

Is there anything I'm missing from the capture which can give more details?

Thanks all!

Accepted Answer

Packets are being lost between your capture point and the receiver, 10.240.44.9, not between your capture point and the sender. To verify this, select any of the six packets that Wireshark has identified as retransmissions. Open the TCP portion of the packet, right-click on the sequence number, and select "Apply as filter > Selected." You will see both the original packet and the retransmission, meaning that the original packet made it from the sender to your capture point; it was dropped somewhere downstream from your capture point.
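The same check can be sketched outside Wireshark. This is a minimal Python illustration, not a tool the answer used: it assumes you have already exported the sender's data segments as (sequence number, payload length) pairs, and the function name and sample numbers are made up. Any sequence number that shows up more than once at the capture point means the original segment reached that point and was lost further downstream.

```python
from collections import Counter


def seen_more_than_once(segments):
    """Return sequence numbers of data segments captured more than once.

    segments: iterable of (seq, payload_len) tuples for ONE direction of ONE
    TCP connection (hypothetical input format, e.g. exported from a capture).
    """
    # Ignore pure ACKs (no payload); they legitimately repeat a sequence number.
    counts = Counter(seq for seq, payload_len in segments if payload_len > 0)
    return sorted(seq for seq, n in counts.items() if n > 1)


# Made-up sequence numbers: 2449 is seen twice, so the original copy
# passed the capture point before the retransmission did.
capture = [(1001, 1448), (2449, 1448), (3897, 1448), (2449, 1448)]
print(seen_more_than_once(capture))
```

If a retransmitted sequence number appeared only once, the original would have been lost before the capture point instead.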

No, the Fast Retransmissions are not because the sender never got an ACK for the data packet. Fast Retransmissions are triggered when the sender gets three Duplicate ACKs from the receiver.

When a packet is transmitted, the sender starts the Retransmission Timeout (RTO) timer. If the RTO timer expires and no ACK has been received for that packet, the sender will retransmit it. However, the sender will also retransmit the packet if it receives three Duplicate ACKs from the receiver, and this usually happens before the RTO timer expires, hence the name "Fast Retransmission."
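The duplicate-ACK trigger can be sketched in a few lines of Python. This is a simplified illustration, not a real TCP stack: it only compares ACK numbers, whereas the full definition of a duplicate ACK (RFC 5681) also looks at payload, flags and the advertised window. The function name and the sample ACK numbers are hypothetical.

```python
DUP_ACK_THRESHOLD = 3  # three duplicate ACKs trigger a fast retransmission


def fast_retransmit_points(acks):
    """Return the ACK numbers at which a fast retransmission would fire.

    acks: ACK numbers in the order the sender receives them (made-up input).
    """
    triggered = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            # Same ACK number again: the receiver is still waiting for this byte.
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                triggered.append(ack)
        else:
            # New data was acknowledged; start counting from scratch.
            last_ack = ack
            dup_count = 0
    return triggered


# One original ACK for 2000 followed by three duplicates.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 5000]))
```

With only two duplicates the function returns nothing, and the sender would have to wait for the RTO timer instead.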

This communication was already underway when the capture started, so the TCP three-way handshake is missing. Whenever possible, try to start capturing before the TCP connection is established so that the three-way handshake will be in the capture file. There are certain things that are only seen in the three-way handshake. For example, it is very likely that window scaling is used on this connection, but without the handshake, we don't know what the window scaling factors are so we don't know what the true TCP window sizes are.
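To show why the missing handshake matters, here is a small Python sketch of how the true window is derived. The scale factor is a shift count that each side announces only in its SYN; the values below are invented for illustration and are not taken from this capture.

```python
def effective_window(advertised_window, shift_count):
    """Return the real receive window in bytes.

    advertised_window: 16-bit window field from the TCP header.
    shift_count: window scale option from that side's SYN (0 to 14).
    """
    if not 0 <= shift_count <= 14:
        raise ValueError("window scale shift count must be between 0 and 14")
    return advertised_window << shift_count


# Hypothetical values: the same header field means very different windows
# depending on a shift count we cannot see without the handshake.
print(effective_window(1024, 0))
print(effective_window(1024, 7))
```

Without the SYN packets, Wireshark can only show the raw 16-bit value, so any window-based conclusion from this capture is a guess.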

Answer 2

I think the slow performance is caused by the fact that the SSH server seems to be running in a virtual machine and doesn't get dispatched fast enough (at 28ms intervals).  
The TSVAL at the server increments at 1ms intervals and jumps in the TSVAL go together with a high RTT, so the latency is imposed by the server itself.  
For that high latency the number of bytes in flight is certainly not enough to achieve a satisfying throughput. This is probably due to the congestion window shrinking at the client because of the retransmissions. 
My bet is that the missing packets are dropped in the VM itself and not in the network.
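The point about bytes in flight can be put into rough numbers. The sketch below assumes the 28 ms delay acts as the round-trip time and uses an invented figure of 16 KiB in flight; neither the function name nor that byte count comes from the capture.

```python
def max_throughput_bps(bytes_in_flight, rtt_seconds):
    """Upper bound on throughput: one window of data per round trip."""
    return bytes_in_flight * 8 / rtt_seconds


# Assumed values: 16 KiB in flight (hypothetical), 28 ms round trip.
bps = max_throughput_bps(16 * 1024, 0.028)
print(f"{bps / 1e6:.2f} Mbit/s")
```

Every retransmission that shrinks the congestion window lowers the first argument, which is why a few lost packets cost so much on a path with this latency.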

After looking at the traces from the Juniper, I must correct my statement.
The packets are dropped in the Juniper()==VPN==()Juniper tunnel, obviously when more than 10-12 packets are sent in a single batch (due to increasing window sizes).
The delay of 28 ms is caused by the WAN latency between the 2 Junipers.
Notes to the example below:
ip.id==0x2752 didn't make it to the server in this timeframe.
ip.id==0x2747 and ip.id==0x2753 are delayed by 28 ms

I reduced the files based on ip_ids and uploaded an example here:

https://www.cloudshark.org/captures/a43693daae83

https://www.cloudshark.org/captures/cce154b812a1

Regards Matthias
