I'm debugging an embedded device which is sending TCP packets to a mobile device. I get decent throughput and everything works fine, but every once in a while I get long hangs (1-2 seconds). When I check Wireshark, I see a string of DUP ACKs, then the delay happens. I would really appreciate any opinions or advice. The delay in this capture happens at packet 34066.
asked 20 Dec '16, 14:06
retagged 20 Dec '16, 20:48
It is not always the case, but in this example the IP ID field of the data packets is helpful.
You experienced packet losses followed by a retransmission timeout of 1.2 seconds.
Your "fix" is to identify the reason for the losses and try to stop them.
Possible "workarounds" would be to:
Option 1 would give you a smaller time gap when you have these events. Option 2 might reduce the time gap altogether.
There were two packets "lost" after #34011 (IP ID = 55059). The next data packet was #34014 (IP ID = 55062) which was the start of a burst of 36 packets, ending in #34065 (IP ID 55097).
Every one of those packets caused the receiver to send a Dup-ACK, indicating that it had only received up to #34011. In response to all those Dup-ACKs (we only need 3 Dup-ACKs for a Fast Retransmission), the sender retransmitted the first of the lost data packets as #34093 (IP ID 55098). The IP IDs tell us that this was a real retransmission from the sender, not just the original (which would have been IP ID 55060) arriving late.
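To make the IP ID arithmetic concrete, here is a minimal Python sketch. It assumes the sender increments the IP ID by one for every packet on this flow (which, as noted, is not always the case). The helper names `missing_ip_ids` and `is_true_retransmission` are made up for illustration, and the numbers are typed in from the capture rather than read from a pcap file.

```python
# Hypothetical helpers: the IP IDs below are copied from the capture
# described in this answer; nothing here parses a real pcap file.
IP_ID_MODULUS = 65536  # the IPv4 Identification field is 16 bits wide


def missing_ip_ids(prev_id, next_id):
    """Return the IP IDs skipped between two consecutive data packets.

    Assumes the sender bumps the IP ID by exactly one per packet.
    """
    gap = (next_id - prev_id) % IP_ID_MODULUS
    return [(prev_id + step) % IP_ID_MODULUS for step in range(1, gap)]


def is_true_retransmission(retrans_id, original_id):
    """A late original would carry its old IP ID; a resend gets a new one."""
    return retrans_id != original_id


lost = missing_ip_ids(55059, 55062)               # between #34011 and #34014
burst_len = (55097 - 55062) % IP_ID_MODULUS + 1   # #34014 through #34065
resent = is_true_retransmission(55098, lost[0])   # #34093 vs first lost ID

print(lost, burst_len, resent)
```

With the values from this capture, the sketch reports the two skipped IDs (55060 and 55061), a burst of 36 packets, and that #34093 is a genuine retransmission.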
After this, there is only one more small data packet, #34102 (IP ID 55099) with a payload of 328 bytes. This is new data which belongs after #34065.
Packet #34104 is the first ACK to #34093 and then #34105 is a Dup-ACK, triggered by the receipt of the small #34102. It is a Dup-ACK because we're still missing one data packet (what would have been IP ID 55061).
However, there are no more data packets to trigger further Dup-ACKs, so the sender now needs to wait for a retransmission timeout before it can retransmit the remaining lost data. The missing data is sent in #34117, but only after waiting 1.19 seconds for the timer to expire.
Packet #34118 is the ACK for #34117 and after that, the flow continues as normal.
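The reason the second lost packet had to wait for the timer can be shown with a very simplified sender model. This is only a sketch that matches what the capture shows, not a description of the embedded device's actual TCP stack: it ignores SACK and any partial-ACK recovery that a real stack might implement. The function name `recovery_actions` and the use of segment indexes instead of byte sequence numbers are my own assumptions for illustration.

```python
DUP_ACK_THRESHOLD = 3  # dup-ACKs needed to trigger a fast retransmission


def recovery_actions(ack_numbers, next_unsent):
    """Walk a list of ACK numbers and report how each loss gets repaired.

    Simplified model: whenever the ACK number advances, the dup-ACK
    counter starts again from zero. ACK numbers are segment indexes
    (the next segment the receiver expects), not byte offsets.
    """
    actions = []
    last_ack = None
    dup_count = 0
    for ack in ack_numbers:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                actions.append(("fast_retransmit", ack))
        else:
            last_ack = ack
            dup_count = 0
    # Data is still unacknowledged, but too few dup-ACKs arrived to
    # trigger another fast retransmission: only the timer is left.
    if (last_ack is not None and last_ack < next_unsent
            and dup_count < DUP_ACK_THRESHOLD):
        actions.append(("wait_for_rto", last_ack))
    return actions


# Segment 0 is #34011; segments 1 and 2 were lost; segments 3..38 are the
# 36-packet burst; segment 39 is the small #34102.
acks = [1] + [1] * 36 + [2, 2]  # ACK, 36 dup-ACKs, partial ACK, 1 dup-ACK
print(recovery_actions(acks, next_unsent=40))
# prints [('fast_retransmit', 1), ('wait_for_rto', 2)]
```

The 36 Dup-ACKs easily pass the threshold of 3 for the first hole, but after the partial ACK there is only a single Dup-ACK for the second hole, so in this model the sender has nothing left but the retransmission timeout.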
The Stream Graphs help to make this easier to understand.
In the first graph, we are zoomed in on the first part of the long 1.2 second time gap. The lost packets are in the yellow circle, the first retransmission in blue and that small #34102 in red. Notice that the ACK line remains horizontal, but stepped up, because we are still missing data that should have been in the yellow circle.
In the second graph, we zoom out to see the full 1.2 second time gap. The blue circle contains the packets we saw in the first graph. The yellow circle highlights the retransmitted second lost packet - after the 1.19 second retransmission timeout. After that, we see the flow continue normally.
Here are some more observations that may shed some light on the reason for the packet losses.
Looking at the first graph below, we see the flow is usually sent in bursts of 38 packets per round trip. However, the burst (circled in blue) prior to the one with the losses was only 27 packets (because the receiving iPhone reduced its Receive Window from 64 KB to 40 KB).
There was a longer time gap until the next 64 KB burst (circled in yellow, notice the steeper slope) which was at a faster rate than usual. The two lost packets occurred at the end of that faster burst. Could this faster rate have caused a buffer over-run in an intermediate device?
Also notice that after the faster burst, we have "caught up" and revert to the normal flow rate, resulting in no loss of time overall.
In the second graph below, the whole flow is plotted. Notice that there is also a "catch up" flow rate after the 1.2 second time gap. The overall time does not appear to be affected by the large gap because by the end of the whole flow, we're where we would have been anyway.
answered 20 Dec '16, 17:59
edited 22 Dec '16, 20:48