My NFS client and NetApp filer got stuck in a loop of ACKs and ZeroWindows. It repeated over and over until I finally dropped the connection with tcpdrop. I suspect this is a bug on the NetApp filer. Can someone help me break down exactly what is happening? It looks like my client (10.231.96.85) is waiting for an acknowledgment of about 55k of data, but the filer (10.231.96.105) just keeps sending ZeroWindow back:
1 0.000000 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
2 0.000004 10.231.96.85 -> 10.231.96.105 TCP [TCP ACKed lost segment] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
3 0.000006 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
4 0.000009 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
5 0.000011 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
6 0.000015 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
7 0.000017 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
8 0.000021 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
9 0.000022 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
10 0.000026 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
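To put a number on the "55k" above, here is a minimal sketch that parses a few of the pasted lines and compares the client's sequence number with the filer's ACK number. It assumes the text layout shown in this trace and relative sequence numbers; the names `parse_trace` and `summarize` are made up for illustration and are not part of any tool.

```python
import re

# First four lines copied from the trace above (relative sequence numbers).
TRACE = """\
1 0.000000 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
2 0.000004 10.231.96.85 -> 10.231.96.105 TCP [TCP ACKed lost segment] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
3 0.000006 10.231.96.105 -> 10.231.96.85 TCP [TCP ZeroWindow] [TCP Keep-Alive] nfs > oob-ws-http [ACK] Seq=1 Ack=1 Win=0 Len=0 TSV=122735999 TSER=828368381
4 0.000009 10.231.96.85 -> 10.231.96.105 TCP [TCP Keep-Alive ACK] oob-ws-http > nfs [ACK] Seq=55773 Ack=2 Win=1029 Len=0 TSV=953767461 TSER=122735063
"""

# Matches the one-line-per-packet layout used in this post only.
LINE_RE = re.compile(
    r"^\d+\s+[\d.]+\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s+TCP\s+.*?"
    r"Seq=(?P<seq>\d+)\s+Ack=(?P<ack>\d+)\s+Win=(?P<win>\d+)\s+Len=(?P<len>\d+)"
)


def parse_trace(text):
    """Return one dict per packet line that matches LINE_RE."""
    packets = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            packets.append({
                "src": m.group("src"),
                "dst": m.group("dst"),
                "seq": int(m.group("seq")),
                "ack": int(m.group("ack")),
                "win": int(m.group("win")),
                "len": int(m.group("len")),
            })
    return packets


def summarize(packets, client, filer):
    """Summarize what each side is claiming in the captured packets."""
    from_client = [p for p in packets if p["src"] == client]
    from_filer = [p for p in packets if p["src"] == filer]
    if not from_client or not from_filer:
        raise ValueError("need at least one packet from each side")
    return {
        # True if every packet from the filer advertises a zero window.
        "filer_always_zero_window": all(p["win"] == 0 for p in from_filer),
        # Bytes the client has sent beyond what the filer has acknowledged.
        "unacked_bytes": max(p["seq"] for p in from_client)
        - max(p["ack"] for p in from_filer),
        # Total payload carried by the captured packets.
        "payload_bytes": sum(p["len"] for p in packets),
    }


summary = summarize(parse_trace(TRACE), "10.231.96.85", "10.231.96.105")
print(summary)
```

On these four lines, the gap between the client's Seq and the filer's Ack is 55772 bytes, every filer packet advertises Win=0, and no packet carries any payload, which is consistent with neither side making progress.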
What could cause this? I thought it might be the problem described in Section 2.17 of RFC 2525, "Known TCP Implementation Problems" (http://www.ietf.org/rfc/rfc2525.txt):
Name of Problem: Failure to RST on close with data pending
Description: When an application closes a connection in such a way that it can no longer read any received data, the TCP SHOULD, per section 4.2.2.13 of RFC 1122, send a RST if there is any unread received data, or if any new data is received. A TCP that fails to do so exhibits "Failure to RST on close with data pending".
Note that, for some TCPs, this situation can be caused by an application "crashing" while a peer is sending data.
We have observed a number of TCPs that exhibit this problem. The problem is less serious if any subsequent data sent to the now-closed connection endpoint elicits a RST (see illustration below).
Significance: This problem is most significant for endpoints that engage in large numbers of connections, as their ability to do so will be curtailed as they leak away resources.
Implications: Failure to reset the connection can lead to permanently hung connections, in which the remote endpoint takes no further action to tear down the connection because it is waiting on the local TCP to first take some action. This is particularly the case if the local TCP also allows the advertised window to go to zero, and fails to tear down the connection when the remote TCP engages in “persist” probes (see example below).
asked 16 Jan ‘12, 10:25
administraitor
edited 21 Sep ‘12, 08:42
cmaynard ♦♦
Any chance you can post the actual pcap somewhere? And can you post the real sequence numbers as opposed to relative ones? To switch, go to Edit > Preferences > Protocols > TCP and uncheck "Relative sequence numbers".
Also, what is your window scaling factor?
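For context on why the scaling factor matters, here is a small sketch of how the advertised window is derived from the raw field and the shift count negotiated in the SYN. It assumes the Win values shown above are the raw, unscaled field (which tends to be the case when the handshake was not captured); the shift of 7 is purely hypothetical, and `effective_window` is an illustrative name.

```python
def effective_window(raw_window: int, shift: int) -> int:
    """Scale the raw 16-bit window field by the negotiated shift count."""
    # The TCP window scale option allows a shift count of at most 14.
    if not 0 <= shift <= 14:
        raise ValueError("window scale shift must be between 0 and 14")
    return raw_window << shift


# Hypothetical shift of 7; the real value is only visible in the SYN packets.
HYPOTHETICAL_SHIFT = 7

client_window = effective_window(1029, HYPOTHETICAL_SHIFT)
filer_window = effective_window(0, HYPOTHETICAL_SHIFT)
print(client_window, filer_window)
```

Whatever the real shift is, a raw window of 0 stays 0, so the scaling factor changes how much the client is offering (Win=1029) but not the fact that the filer is advertising no receive space.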