TCP ZeroWindow loop

Question

My NFS client and NetApp filer got stuck in a loop of ACKs and ZeroWindows. It repeated over and over until I finally dropped the connection with tcpdrop. I think this is a bug on the NetApp filer. Can someone help me break down exactly what is happening? It seems like my client (10.231.96.85) is waiting for an acknowledgment of about 55k of data, but the filer (10.231.96.105) just keeps sending ZeroWindow back forever:

  1   0.000000 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  2   0.000004 10.231.96.85 -> 10.231.96.105 TCP [TCP ACKed lost segment] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  3   0.000006 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  4   0.000009 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  5   0.000011 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  6   0.000015 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  7   0.000017 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
  8   0.000021 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
  9   0.000022 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
 10   0.000026 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063

What could cause this to happen? I thought it might've been described in Section 2.17 of RFC 2525 "Known TCP Implementation Problems" - http://www.ietf.org/rfc/rfc2525.txt:

Name of Problem
  Failure to RST on close with data pending

Description
  When an application closes a connection in such a way that it can no longer read any received data, the TCP SHOULD, per section 4.2.2.13 of RFC 1122, send a RST if there is any unread received data, or if any new data is received. A TCP that fails to do so exhibits "Failure to RST on close with data pending".

  Note that, for some TCPs, this situation can be caused by an application "crashing" while a peer is sending data.

  We have observed a number of TCPs that exhibit this problem. The problem is less serious if any subsequent data sent to the now-closed connection endpoint elicits a RST (see illustration below).

Significance
  This problem is most significant for endpoints that engage in large numbers of connections, as their ability to do so will be curtailed as they leak away resources.

Implications
  Failure to reset the connection can lead to permanently hung connections, in which the remote endpoint takes no further action to tear down the connection because it is waiting on the local TCP to first take some action. This is particularly the case if the local TCP also allows the advertised window to go to zero, and fails to tear down the connection when the remote TCP engages in “persist” probes (see example below).

Answer 1

The snippet you've included does seem to match the behavior you're seeing, but the packet timestamps are confusing me: the packets in your trace are only a few microseconds apart. The endpoint that receives the ZeroWindow advertisement is supposed to wait for a while before sending a zero-window probe, and that wait is supposed to increase as more ZeroWindow advertisements are received.
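To make the expected behavior concrete, here is a minimal sketch of a persist timer schedule with backoff. The function name, the doubling factor, and the default values are my assumptions for illustration (the 5 sec. start and 60 sec. cap are borrowed from the measurements quoted below); it is not the code of any particular stack.

```python
def persist_intervals(initial=5.0, maximum=60.0, probes=6):
    """Return the wait in seconds before each successive zero-window probe.

    Assumption: the wait doubles after every probe and is capped at
    `maximum`. Real stacks differ in the exact factor and bounds.
    """
    intervals = []
    wait = initial
    for _ in range(probes):
        intervals.append(min(wait, maximum))  # never wait longer than the cap
        wait *= 2                             # back off before the next probe
    return intervals


if __name__ == "__main__":
    print(persist_intervals())  # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

With a schedule like this, the first probes would be seconds apart and later ones up to a minute apart, which is nothing like the spacing in your capture.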

From http://www.usenix.org/publications/library/proceedings/bos94/full_papers/lin.a

Keep sending data to the echo port without reading the echoed data.
As Figure 6 shows, because the probe program sends data without reading the echo, the receive buffer of TCP A eventually becomes full, causing it to send a zero-window ACK segment to TCP B. Because TCP B cannot send data to TCP A, the send buffer of TCP B will become full of echoed data. When the echo server on B cannot send more data, the receive buffer of TCP B will become full. Once the receive buffer of TCP B becomes full, it advertises a zero window to TCP A. After the zero-window condition exists for more than a threshold time period, both sides begin sending zero-window probes.
4.2 Results
Operating System   Data size in 0-win probe seg.   Min. probe interval   Max. probe interval
Solaris 2.1        1 MSS octets                    200 ms                60 sec.
SunOS 4.1.1        1 octet                         5 sec.                60 sec.
SunOS 4.0.3        1 octet                         5 sec.                60 sec.
HP-UX 9.0          1 octet                         4 sec.                60 sec.
IRIX 5.1.1         1 octet                         5 sec.                60 sec.

If I were a guessing man, and I am, I'd say that you're looking at some kind of stack implementation bug.
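As a rough sanity check on that guess, the sketch below compares the spacing of the packets in your trace with the smallest minimum probe interval in the quoted results (200 ms for Solaris 2.1). The timestamps are copied by hand from the ten packets in the question and converted to microseconds; the variable and function names are mine, chosen only for illustration.

```python
# Capture timestamps from the question's trace, in microseconds.
# Odd-numbered packets come from the filer, even-numbered from the client.
TIMESTAMPS_US = [0, 4, 6, 9, 11, 15, 17, 21, 22, 26]

# Smallest "Min. probe interval" in the quoted table: 200 ms.
MIN_PROBE_INTERVAL_US = 200 * 1000


def same_sender_gaps(timestamps, first_index):
    """Gaps between consecutive packets from one sender.

    Assumption: the two senders strictly alternate, as they do in the
    ten packets shown, so every second timestamp belongs to one sender.
    """
    own = timestamps[first_index::2]
    return [later - earlier for earlier, later in zip(own, own[1:])]


filer_gaps = same_sender_gaps(TIMESTAMPS_US, 0)
client_gaps = same_sender_gaps(TIMESTAMPS_US, 1)

if __name__ == "__main__":
    print("filer gaps (us): ", filer_gaps)
    print("client gaps (us):", client_gaps)
    worst = max(filer_gaps + client_gaps)
    print("largest gap is below the 200 ms minimum:", worst < MIN_PROBE_INTERVAL_US)
```

Every gap comes out at 5 or 6 microseconds, several orders of magnitude shorter than any probe interval in the table, so neither side appears to be running a normal persist timer.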