
TCP ZeroWindow loop


My NFS client and NetApp filer got stuck in this loop of ACKs and ZeroWindows. It repeated over and over until I finally dropped the connection with tcpdrop. I'm thinking this is a bug on the NetApp filer; can someone help me break down exactly what is happening? It seems like my client (10.231.96.85) is waiting for an acknowledgment of about 55 KB of data, but the filer (10.231.96.105) just keeps sending ZeroWindow back forever:

  1   0.000000 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  2   0.000004 10.231.96.85 -> 10.231.96.105 TCP [TCP ACKed lost segment] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  3   0.000006 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  4   0.000009 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  5   0.000011 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  6   0.000015 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  7   0.000017 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  8   0.000021 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  9   0.000022 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
 10   0.000026 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
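
For anyone trying to isolate this pattern in a larger capture, a display filter like the following should pull out just these frames (all three are standard Wireshark TCP analysis flags):

    tcp.analysis.zero_window || tcp.analysis.keep_alive || tcp.analysis.keep_alive_ack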

What could cause this to happen? I thought it might be the behavior described in Section 2.17 of RFC 2525, "Known TCP Implementation Problems" (http://www.ietf.org/rfc/rfc2525.txt):

Name of Problem: Failure to RST on close with data pending

Description: When an application closes a connection in such a way that it can no longer read any received data, the TCP SHOULD, per section 4.2.2.13 of RFC 1122, send a RST if there is any unread received data, or if any new data is received. A TCP that fails to do so exhibits "Failure to RST on close with data pending".

Note that, for some TCPs, this situation can be caused by an application "crashing" while a peer is sending data.

We have observed a number of TCPs that exhibit this problem. The problem is less serious if any subsequent data sent to the now-closed connection endpoint elicits a RST (see illustration below).

Significance: This problem is most significant for endpoints that engage in large numbers of connections, as their ability to do so will be curtailed as they leak away resources.

Implications: Failure to reset the connection can lead to permanently hung connections, in which the remote endpoint takes no further action to tear down the connection because it is waiting on the local TCP to first take some action. This is particularly the case if the local TCP also allows the advertised window to go to zero, and fails to tear down the connection when the remote TCP engages in "persist" probes (see example below).
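
To see the RFC's point on the wire: on Linux and most BSD-derived stacks, closing a socket that still holds unread data in its receive buffer elicits a RST instead of the normal FIN. A minimal sketch you can run against loopback while capturing (Python; port 9999 is an arbitrary choice):

    import socket
    import time

    # Listener that accepts one connection and then closes it
    # without ever reading the data the peer sent.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 9999))
    srv.listen(1)

    cli = socket.create_connection(("127.0.0.1", 9999))
    conn, _ = srv.accept()

    cli.sendall(b"x" * 4096)   # data the accepting side never reads
    time.sleep(0.2)            # let it land in conn's receive buffer

    conn.close()               # unread data pending -> per RFC 1122
                               # sec. 4.2.2.13 this should go out as a RST

Capture on lo while running it and the teardown should show an R flag rather than the usual F.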

asked 16 Jan '12, 10:25

administraitor

edited 21 Sep '12, 08:42

cmaynard ♦♦

Any chance you can post the actual pcap somewhere? And can you post the real seq#'s as opposed to relative numbers? (Edit → Preferences → Protocols → TCP → "Relative sequence numbers")

Also, what is your window scaling factor?
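
(Equivalently, the preference can be overridden from the command line; tshark takes preference overrides via -o, so something like this should print absolute sequence numbers, with capture.pcap standing in for the trace file:)

    tshark -o tcp.relative_sequence_numbers:FALSE -r capture.pcap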

(21 Sep '12, 14:05) hansangb


One Answer:


The snippet you've included does seem to match up with the behavior you're seeing... but the packet timestamps are confusing me. The endpoint that receives the ZeroWindow advertisement is supposed to wait a while before sending a zero-window probe, and that wait period is supposed to increase as more ZeroWindow advertisements are received. In your trace the two hosts are ping-ponging within microseconds, with no backoff at all.
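
For a sense of what "supposed to" looks like: a persist timer typically starts at a few seconds, roughly doubles after each unanswered probe, and is clamped at a ceiling (the maxima in the paper's results table below are 60 sec.). A toy sketch with made-up constants:

    import itertools

    def persist_intervals(initial=5.0, ceiling=60.0):
        """Illustrative persist-timer backoff: start small, double after
        each unanswered zero-window probe, clamp at a ceiling."""
        interval = initial
        while True:
            yield interval
            interval = min(interval * 2.0, ceiling)

    print(list(itertools.islice(persist_intervals(), 6)))
    # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]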

From http://www.usenix.org/publications/library/proceedings/bos94/full_papers/lin.a

  1. Keep sending data to the echo port without reading the echoed data.

As Figure 6 shows, because the probe program sends data without reading the echo, the receive buffer of TCP A eventually becomes full, causing it to send a zero-window ACK segment to TCP B. Because TCP B cannot send data to TCP A, the send buffer of TCP B will become full of echoed data. When the echo server on B cannot send more data, the receive buffer of TCP B will become full. Once the receive buffer of TCP B becomes full, it advertises a zero window to TCP A. After the zero-window condition exists for more than a threshold time period, both sides begin sending zero-window probes.

4.2 Results

    Operating System   Data size in 0-win probe seg.   Min. probe interval   Max. probe interval
    Solaris 2.1        1 MSS octets                    200 ms                60 sec.
    SunOS 4.1.1        1 octet                         5 sec.                60 sec.
    SunOS 4.0.3        1 octet                         5 sec.                60 sec.
    HP-UX 9.0          1 octet                         4 sec.                60 sec.
    IRIX 5.1.1         1 octet                         5 sec.                60 sec.
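
The probe program in step 1 above boils down to something like this sketch (Python; "remote-host" is a placeholder for a machine running the classic inetd echo service on port 7):

    import socket

    # Write to an echo service and deliberately never read the echoed
    # data. Our receive buffer fills, so we advertise a zero window to
    # the echo server; its send buffer then backs up, it stops reading,
    # its own receive buffer fills, and it advertises zero back to us.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # small buffer fills fast
    s.connect(("remote-host", 7))
    chunk = b"x" * 1024
    while True:
        s.sendall(chunk)    # keep writing; never call recv()

Once both windows are closed, whatever each stack does next is exactly the probe behavior measured in the table above.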

If I were a guessing man, and I am, I'd say that you're looking at some kind of stack implementation bug.

answered 17 Jan '12, 06:23

GeonJay