This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

Connections to remote server just stall - Seq mismatch?

I have an issue I can’t quite figure out. Clients on my network connect to a terminal server on a remote network via a terminal emulator for Windows and perform the normal operations end-users do, but their sessions stall after seemingly random amounts of time (anywhere from 30 seconds to 4 hours). They are left staring at a blank terminal emulator screen and have to force-close the application instance. I checked for any issues with our firewall concerning ACLs or IPS dropping connections, I checked the local Windows firewalls running on the clients’ workstations, I disabled TCP offloading for Windows 7 to check for compatibility, and then I set up Windows XP Pro workstations with local firewalls disabled, and all the logs are empty. I even checked the terminal emulator application for issues, but again the logs are empty, and the same problem still occurs.

I ran a Wireshark packet capture and see the same pattern on every workstation right before the connection stalls: the server sends the client a packet, the client receives it, correctly increments its Ack number and sends the Ack to the server, and then the server replies with a Seq that is 100 lower than the value the client just acknowledged. The client responds with an Ack carrying the correct number, and the server again responds with a Seq that is 100 too low. This repeats a few times, and then the connection stalls. Sometimes it is reset automatically, and sometimes nothing happens until the user hits the Disconnect button in the terminal emulator and a FIN packet is successfully sent. No other data packets are sent in the meantime.

Here’s output from a real capture (IPs are modified):

372 2011-11-22 10:39:14.350000 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 944 orbixd > 16936 [PSH, ACK] Seq=25305 Ack=1675 Win=36060 Len=890

373 2011-11-22 10:39:14.550000 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 54 16936 > orbixd [ACK] Seq=1675 Ack=26195 Win=63970 Len=0

374 2011-11-22 10:46:10.833333 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 60 orbixd > 16936 [ACK] Seq=26095 Ack=1675 Win=36060 Len=0

375 2011-11-22 10:46:10.833333 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 54 [TCP Dup ACK 373#1] 16936 > orbixd [ACK] Seq=1675 Ack=26195 Win=63970 Len=0

376 2011-11-22 10:56:19.883333 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 60 orbixd > 16936 [ACK] Seq=26095 Ack=1675 Win=36060 Len=0

377 2011-11-22 10:56:19.883333 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 54 [TCP Dup ACK 373#2] 16936 > orbixd [ACK] Seq=1675 Ack=26195 Win=63970 Len=0

378 2011-11-22 11:06:28.950000 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 60 orbixd > 16936 [ACK] Seq=26095 Ack=1675 Win=36060 Len=0

379 2011-11-22 11:06:28.950000 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 54 [TCP Dup ACK 373#3] 16936 > orbixd [ACK] Seq=1675 Ack=26195 Win=63970 Len=0

380 2011-11-22 11:16:38.350000 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 60 orbixd > 16936 [ACK] Seq=26095 Ack=1675 Win=36060 Len=0

381 2011-11-22 11:16:38.350000 201.xxx.xxx.xxx 201.xxx.xxx.xxx TCP 54 [TCP Dup ACK 373#4] 16936 > orbixd [ACK] Seq=1675 Ack=26195 Win=63970 Len=0

And then it stalls.
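
The arithmetic behind the "100 bytes less" observation can be checked directly from the summary lines. The following is only a small sketch using packets 372 to 374 as posted above, with the timestamps and masked IPs left out; the helper name `parse` is made up for this illustration.

```python
import re

# Summary lines for packets 372-374, copied from the capture above
# (timestamps and masked IP addresses omitted).
LINES = [
    "372 orbixd > 16936 [PSH, ACK] Seq=25305 Ack=1675 Win=36060 Len=890",
    "373 16936 > orbixd [ACK] Seq=1675 Ack=26195 Win=63970 Len=0",
    "374 orbixd > 16936 [ACK] Seq=26095 Ack=1675 Win=36060 Len=0",
]

FIELD = re.compile(r"(Seq|Ack|Len)=(\d+)")


def parse(line):
    """Return the Seq, Ack and Len fields of one summary line as ints."""
    return {name: int(value) for name, value in FIELD.findall(line)}


data, client_ack, server_next = (parse(line) for line in LINES)

# First sequence number after the 890 bytes the server sent in packet 372.
expected_next_seq = data["Seq"] + data["Len"]

# How far below that point the server's next segment (packet 374) starts.
seq_gap = expected_next_seq - server_next["Seq"]

print("expected next server Seq:", expected_next_seq)
print("client Ack in packet 373:", client_ack["Ack"])
print("server Seq in packet 374:", server_next["Seq"])
print("gap in bytes:", seq_gap)
```

The client's Ack in packet 373 matches the expected next Seq, and the server's Seq in packet 374 sits 100 below it.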

Sometimes this pattern occurs but the server eventually responds with the correct Seq number and transmission resumes; the program is slow while this is going on.

I do not have access to the remote server or anything on its network; it is our ISP’s, so I can only troubleshoot on my end. Of course I got the obligatory “The problem is on your end” answer when I brought up the issue. If it is, fine; I just want it fixed.

Does anyone know the cause of this or have a hunch? Can a Sequence mismatch do this, or am I overlooking something? Is it a problem on the remote server side? Is it a buffering issue? An MTU mismatch? I do know this has a pattern of happening a lot in the morning and then lessens substantially in the afternoon.

Thanks in advance for any input!

asked 24 Nov '11, 00:05

Yellowninja

The first thing I notice is that there is a nearly constant 10-minute delta between those packets. It looks to me like a customized TCP keepalive, where the machine sends this minus-100 Seq number to trigger an Ack from the other side, perhaps as its way of implementing a connection keepalive.
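
The spacing can be read off the posted timestamps. This is a quick sketch that only looks at the server's packets 374, 376, 378 and 380 from the capture in the question:

```python
from datetime import datetime

# Timestamps of the server's packets 374, 376, 378 and 380 from the capture.
STAMPS = [
    "2011-11-22 10:46:10.833333",
    "2011-11-22 10:56:19.883333",
    "2011-11-22 11:06:28.950000",
    "2011-11-22 11:16:38.350000",
]

times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in STAMPS]

# Seconds between each consecutive pair of server packets.
deltas = [(later - earlier).total_seconds()
          for earlier, later in zip(times, times[1:])]

for delta in deltas:
    print(f"{delta:.2f} s")
```

Each interval comes out at a little over 609 seconds, i.e. just over ten minutes.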

What's not so clear is why tcp.len is zero in those packets. Normally I'd send at least one byte to really trigger an ACK from the other side, but in your case the other side's stack answers even zero-byte TCP packets with Acks.

What are your thoughts about that?

(24 Nov '11, 00:21) Landi

These clients always send back an Ack after receiving a packet, even if it has no payload. The Len of 0 is odd, but such packets occur quite frequently and always contain the correct Seq until right before the stall.

I also noticed the remote server sends out Window Update packets every minute or so even though every packet contains the PSH flag and it appears buffers are never full.

(24 Nov '11, 00:45) Yellowninja

Do the connections only stall when clients are idle, as in the trace snippet you posted, or also when the clients are actively communicating?

(24 Nov '11, 01:31) Landi

They stall while actively communicating, too. I have dozens of captured sessions, and a good half are during active sessions, and I've seen the issue in real time from the end-user's perspective. They load up a screen, hit a command, and then it stalls.

(24 Nov '11, 09:52) Yellowninja

Well, now things get really interesting... I'm just looking for patterns, like:

  • Do those minus-100-byte Seq packets ONLY occur right before a stalling session?
  • Could you check the IP headers of the packets in question to see whether any of them has the DF bit unset?
  • Do you have any idea why the timestamps of so many packets match exactly? Where did you capture the traffic?
  • Could you provide absolute sequence numbers, to check whether this occurs on special events (like a sequence number overflow)?

If you can provide another anonymized sample trace, e.g. on cloudshark, I'd like to look at it.
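
On the overflow question: TCP sequence numbers are 32-bit values that wrap back to 0, so absolute numbers have to be compared modulo 2^32. Below is a rough sketch of such a check; the function names `wraps` and `seq_diff` and the sample values are invented for illustration and are not taken from the capture.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap to 0


def wraps(seq, length):
    """True if the sequence number following this segment has wrapped."""
    return seq + length >= SEQ_SPACE


def seq_diff(a, b):
    """Signed distance from b to a in sequence space, allowing for wrap."""
    diff = (a - b) % SEQ_SPACE
    return diff - SEQ_SPACE if diff >= SEQ_SPACE // 2 else diff


# Hypothetical absolute values close to the wrap point.
print(wraps(SEQ_SPACE - 10, 20))      # segment runs past the wrap point
print(wraps(1000, 890))               # nowhere near the wrap point
print(seq_diff(50, SEQ_SPACE - 50))   # 100 bytes ahead, across the wrap
print(seq_diff(SEQ_SPACE - 50, 50))   # 100 bytes behind, across the wrap
```

If the absolute Seq values of the problem packets are nowhere near 2^32, an overflow can be ruled out as the trigger.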

(24 Nov '11, 10:17) Landi

Ok, in response to your points:

1) Those minus 100 byte Seq. packets occur when the connection stalls. A couple of times they occurred in the same pattern but the server eventually responded with the right Seq number and transmission resumed. The vast majority of the time the pattern ends in a stall. The minus 100 byte Seq. packets do not occur at any other time.

2) The DF flag is set on all packets originating from the client but not on any originating from the remote server.

3) I don't know about the timestamps--seems to be some kind of burst.

4) I'll post on cloudshark as soon as I can.

(25 Nov '11, 12:32) Yellowninja

One Answer:

Sounds to me like there is a device that performs sequence number translation on the packets. Most firewalls do "Initial Sequence Number Randomization" to protect old TCP/IP implementations that did not choose a random initial sequence number.
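
To illustrate what such a translation involves, here is a toy model. It assumes the simplest possible scheme, a fixed per-connection offset added modulo 2^32; real firewalls may work differently, and the offset value and the function names `translate_seq` and `translate_ack` are made up for this sketch.

```python
SEQ_SPACE = 2 ** 32

# Hypothetical per-connection offset chosen by the translating device.
OFFSET = 0x5A17C0DE


def translate_seq(seq, offset=OFFSET):
    """Rewrite the Seq of a packet leaving the protected host."""
    return (seq + offset) % SEQ_SPACE


def translate_ack(ack, offset=OFFSET):
    """Rewrite the Ack of a packet returning to the protected host."""
    return (ack - offset) % SEQ_SPACE


# Values from the capture, treated here as the host's own numbering.
sent_seq, sent_len = 25305, 890

on_wire_seq = translate_seq(sent_seq)
peer_ack = (on_wire_seq + sent_len) % SEQ_SPACE  # what the far side acknowledges
seen_by_host = translate_ack(peer_ack)

print(seen_by_host)
```

In this model the host sees an Ack of 26195, exactly Seq plus Len, because the same offset is applied in both directions. A device that applied its offset inconsistently would instead show up as a fixed discrepancy between the two sides.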

Do you have traces at both sides (client and server) of the connection? How do the sequence numbers look on the server side? (I believe from the trace snippet that it was made on the client side.) If it is not possible to capture on the server side, are you able to capture on the public side of your own FW?

Also, what TCP options are used? Is SACK used? Or maybe it is initiated, but masked by an intermediate FW. The packets should say it all, so if you could post a tracefile on www.cloudshark.org with the IP addresses anonymized and the TCP payload stripped (you can use Bittwiste for that), that would be really helpful :-)

answered 24 Nov '11, 13:53

SYN-bit ♦♦

I only have a trace from the client side since I don't have access to the server side (different company). I turned off relative sequence numbers, and the real sequence numbers show the same trend: the server's Seq is 100 bytes off.

I stripped the payload and uploaded it here: http://www.cloudshark.org/captures/ed43d82e8a41

Let me know if you have trouble with it, now that all packets are Len=0 and Wireshark has added all sorts of Dup Packet messages. The last ones are, of course, the problem packets.

Thanks again :)

(27 Nov '11, 23:14) Yellowninja

Thank you for posting the tracefile. It shows that SACK was not used for this TCP stream. Bittwiste has done a more thorough job than I expected: not only did it strip the payload, it also changed the length field in the IP header, resulting in a TCP len of 0, which makes the file quite difficult to read. Sorry, my fault; I should have checked how the -L option works myself first.

(28 Nov '11, 01:45) SYN-bit ♦♦

However, combined with the output you already posted, it makes it clearer that something goes wrong somewhere along the path to the server. Are you able to make a connection from another network too, so you can determine whether something in your own network is messing things up? Also, can you make a trace on the outside of your FW? The best would be to make a trace at the demarcation point of your network. That way you can determine whether the problem is caused by a device under your control or by a device out of your control.

(28 Nov '11, 01:47) SYN-bit ♦♦

I do not have the ability to try it from a different network since this server is housed at our ISP, and only public IPs from their block are allowed. We are connected with AT&T CSME, so the only hops are my router on this side, then their router, then the server. I can't see it being something on AT&T's side, since I suspect I would then see this type of issue on all traffic. I will hook up a switch beyond our router and see what happens.

Thanks again.

(28 Nov '11, 07:49) Yellowninja