Hi all,

I'm debugging an embedded device which is sending TCP packets to a mobile device. I get decent throughput and everything works fine, but every once in a while I get long hangs (1-2 seconds). When I check Wireshark, I see a string of Dup-ACKs, then the delay happens. I would really appreciate any opinions or advice. The delay in this capture happens at packet 34066.

https://www.dropbox.com/s/8cj5kepbp73glnl/delay.pcapng.zip?dl=0

Thanks!

asked 20 Dec '16, 14:06

rimb05

retagged 20 Dec '16, 20:48

Philst


It is not always the case, but in this example the IP ID field of the data packets is helpful.

SHORT ANSWER

You experienced packet losses followed by a retransmission timeout of 1.2 seconds.

Your "fix" is to identify the reason for the losses and try to stop them.

Possible "workarounds" would be to:

  • reduce the retransmission timeout value of your sending device.
  • increase the "aggressiveness" of the sending device to send more than one retransmission at once after receiving many Dup-ACKs.

Option 1 would give you a smaller time gap when you have these events. Option 2 might eliminate the time gap altogether.

LONG ANSWER

There were two packets "lost" after #34011 (IP ID = 55059). The next data packet was #34014 (IP ID = 55062) which was the start of a burst of 36 packets, ending in #34065 (IP ID 55097).
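If you want to check for this kind of gap yourself, here is a minimal Python sketch of the idea. It assumes the sender increments the IP ID by one per packet and that this flow is the only traffic using up IDs, which happens to hold in this capture but is not true in general. The function name `find_ip_id_gaps` is just made up for illustration, and the input list is built from the IP IDs quoted above.

```python
def find_ip_id_gaps(ip_ids):
    """Return the IP IDs missing between consecutive observed values.

    Assumption: the sender increments the IP ID by one per packet.
    IP IDs are 16-bit, so differences are taken modulo 65536.
    """
    missing = []
    for prev, cur in zip(ip_ids, ip_ids[1:]):
        step = (cur - prev) % 65536
        if step >= 32768:
            # Looks like reordering (ID went backwards), not a loss; skip it.
            continue
        # step == 1 means consecutive; anything larger means skipped IDs.
        for k in range(1, step):
            missing.append((prev + k) % 65536)
    return missing


# IP IDs from the capture: #34011, then the burst #34014 to #34065.
observed = [55059] + list(range(55062, 55098))
print(find_ip_id_gaps(observed))
```

With the IDs above, the only values reported as missing are 55060 and 55061, which are the two lost packets.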

Every one of those packets caused the receiver to send a Dup-ACK, indicating that it had only received up to #34011. In response to all those Dup-ACKs (we only need 3 Dup-ACKs for a Fast Retransmission), the sender retransmitted the first of the lost data packets as #34093 (IP ID 55098). The IP IDs tell us that this was a real retransmission from the sender, not just the original (which would have been IP ID 55060) arriving late.
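To illustrate why a long run of Dup-ACKs produces only one Fast Retransmission, here is a simplified Python sketch of the usual "third Dup-ACK" trigger. This is only a model of the general rule, not lwIP's actual code: it ignores SACK, window updates and ACKs that carry data, and the ACK number 1000 is a made-up placeholder.

```python
def fast_retransmit_points(ack_numbers, threshold=3):
    """Return the indexes at which a fast retransmit would be triggered.

    Simplified model: a Dup-ACK is any ACK repeating the previous ACK number.
    The trigger fires once, when the duplicate count reaches the threshold.
    """
    triggers = []
    last_ack = None
    dup_count = 0
    for i, ack in enumerate(ack_numbers):
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                triggers.append(i)
        else:
            # A new ACK number resets the duplicate counter.
            last_ack = ack
            dup_count = 0
    return triggers


# Hypothetical ACK stream: the original ACK followed by 36 Dup-ACKs.
acks = [1000] * 37
print(fast_retransmit_points(acks))
```

In this model the trigger fires once, on the third Dup-ACK, and the remaining 33 Dup-ACKs do nothing further. That matches what we see here: one retransmission for the first lost packet and nothing for the second.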

After this, there is only one more small data packet, #34102 (IP ID 55099) with a payload of 328 bytes. This is new data which belongs after #34065.

Packet #34104 is the first ACK to #34093 and then #34105 is a Dup-ACK, triggered by the receipt of the small #34102. It is a Dup-ACK because we're still missing one data packet (what would have been IP ID 55061).

However, there are no more data packets to trigger further Dup-ACKs, so the sender now needs to wait for a retransmission timeout before it can retransmit the remaining lost data. The missing data is sent in #34117, but only after waiting 1.19 seconds for the timer to expire.
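For background on where a retransmission timeout value typically comes from, here is a rough Python sketch of the standard calculation in RFC 6298. I am not claiming your sender does exactly this. The class name, the 0.5 second clock granularity and the sample RTT are assumptions for illustration only.

```python
class RtoEstimator:
    """Retransmission timeout estimator following the RFC 6298 formulas."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, granularity=0.5, min_rto=1.0):
        self.granularity = granularity  # clock granularity in seconds (assumed)
        self.min_rto = min_rto          # lower bound on the RTO in seconds
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                  # initial RTO before any RTT sample

    def sample(self, rtt):
        """Feed one RTT measurement (seconds) and return the updated RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR must be updated before SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.min_rto,
                       self.srtt + max(self.granularity, self.K * self.rttvar))
        return self.rto


# Hypothetical 100 ms round trip, with and without the 1 second floor.
print(RtoEstimator(min_rto=1.0).sample(0.1))
print(RtoEstimator(min_rto=0.2).sample(0.1))
```

The point of the sketch is that the floor and the clock granularity can dominate the result on a short round trip, so those are the knobs to look for if you want a smaller RTO.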

Packet #34118 is the ACK for #34117 and after that, the flow continues as normal.

The Stream Graphs help to make this easier to understand.

In the first graph, we are zoomed in on the first part of the long 1.2 second time gap. The lost packets are in the yellow circle, the first retransmission is in blue and that small #34102 is in red. Notice that the ACK line steps up once and then remains horizontal, because we are still missing data that should have been in the yellow circle.

In the second graph, we zoom out to see the full 1.2 second time gap. The blue circle contains the packets we saw in the first graph. The yellow circle highlights the retransmission of the second lost packet, which comes after the 1.19 second retransmission timeout. After that, we see the flow continue normally.

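[Graph: Stream Graph zoomed in on the first part of the 1.2 second time gap]

[Graph: Stream Graph zoomed out to show the full 1.2 second time gap]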

Here are some more observations that may shed some light on the reason for the packet losses.

Looking at the first graph below, we see the flow is usually sent in bursts of 38 packets per round trip. However, the burst (circled in blue) prior to the one with the losses was only 27 packets (because the receiving iPhone reduced its Receive Window from 64 KB to 40 KB).

There was a longer time gap until the next 64 KB burst (circled in yellow, notice the steeper slope) which was at a faster rate than usual. The two lost packets occurred at the end of that faster burst. Could this faster rate have caused a buffer over-run in an intermediate device?

Also notice that after the faster burst, we have "caught up" and revert to the normal flow rate, resulting in no loss of time overall.

In the second graph below, the whole flow is plotted. Notice that there is also a "catch up" flow rate after the 1.2 second time gap. The overall time does not appear to be affected by the large gap because by the end of the whole flow, we're where we would have been anyway.

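[Graph: Stream Graph showing the bursts per round trip before and during the packet losses]

[Graph: Stream Graph of the whole flow]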

answered 20 Dec '16, 17:59

Philst

edited 22 Dec '16, 20:48

Thanks for the detailed answer. I'm going to look into your suggestions. I'm using LWIP for the sender, and the receiver is an iPhone. I've just looked in the LWIP code and I can't seem to find an option to reduce the RTO (unless it's buried in the code itself)... Would you be familiar with LWIP by chance?

(20 Dec '16, 19:11) rimb05

No, I don't know much about LWIP.

I just did a bit of Googling though. Here is one article that seems to discuss the right topic.

http://lwip.100.n7.nabble.com/Reduce-retransmission-timeout-td15223.html

(20 Dec '16, 19:54) Philst