This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

TCP Window Size Allocation and TCP Zero Window Errors


Greetings Everyone!

My mind is about to explode as everyone is always blaming the network guys for disconnections and slowness in the network, but Solarwinds reports that utilization on all network devices and links is okay. So I tried sniffing and got "Zero Window" errors. So if I may ask:

1.) How is TCP Window Size Allocated? (Is it per TCP conversation like if you have 1 Application(Mozilla Firefox) and 5 tabs open, then the OS allocates Window Size per tab?)

2.) What causes and How do you fix a "TCP Zero - Window" issue? (The Stock Trading Server is the one who is having a hard time processing burst traffic and sending the TCP Zero window messages to the traders but based on the Network Utilization(CPU, Memory and Link Utilization of Network Devices) in Solarwinds and Performance Monitoring(CPU, Disk Space, Memory, NIC Utilization) in both the Stock Trading Server and Database Server, it shows that is Perfectly Normal and even under-utilized!.)

3.) Is it perhaps in the Trading Server's Settings? (32 GB Memory but only uses the default tcp window allocation size of 64 KB)

4.) Or is there something wrong with how slowly the Trading Application processes the data? (I am planning to increase the TCP Buffer Size from 64KB to probably 256KB but it might not help if the Trading Application Server itself processes the data slowly.)

5.) Also, all the traders are experiencing "Unable to Connect to Trading Server" and "Intermittent Connections" errors. (but there's no report of network problem like "down links" or "fully utilized links". I've even tried to change the polling data to every 1 minute to capture short disconnections but I still see no problem) So I think that there might be a latency problem

6.) How do you measure Latency of Network Communication efficiently? What Free and Paid Software Solutions do you recommend? (Traceroute reports 4ms and even if i increase the ping packet to 1mb, it also shows 1-3ms delay so I don't think that's helpful)

7.) How do you sort out each TCP thread/conversations if the source port and destination port are the same and the data is encrypted? (Like if the Stock Trading Server and the SQL Server talks on the same port numbers but has multiple transactions going on.)

Sorry, I'm just new to the networking world so there are a lot of stuff I don't know and can't find in books and other resources. I think this kind of things are learned through experience so please share your wisdom.

Thank you and Have a Good Day! :)

asked 08 Sep '14, 07:26


Sharknado

edited 08 Sep '14, 07:30

If I may say so:

It appears there are issues in a Production Trading system with real $$$ impacts.

I would suggest strongly that one or more experts should be found (hired or whatever) who can do a complete analysis and recommend solutions.

IOW: "Do it yourself" is not a good approach. :)

(08 Sep '14, 07:54) Bill Meier ♦♦

One Answer:


1.) How is TCP Window Size Allocated? (Is it per TCP conversation like if you have 1 application(Mozilla Firefox) and 5 tabs open, then the OS allocates Window Size per tab?)

It is per TCP conversation. For example, 10 conversations with a 64k window each can need up to 640k of RAM to store incoming segments. It is not allocated per Firefox tab: if you open a web page in a tab that opens, say, 5 connections (1 for the web page, 4 for some pictures), you end up with 5 TCP connections from that one tab. It depends on how the communication is performed - some web servers allow pulling all 5 elements (still as an example) over one connection, others require the browser to open a separate one for each element.
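As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name is made up for this example, and it assumes every connection fills its whole advertised window at the same time, which is the worst case rather than the typical one.

```python
def receive_buffer_bytes(connections: int, window_bytes: int) -> int:
    """Worst-case RAM needed if every connection fills its advertised window."""
    return connections * window_bytes


# 10 conversations, each with a 64 KiB window (the figures used above).
total = receive_buffer_bytes(10, 64 * 1024)
print(total // 1024, "KiB")  # prints "640 KiB"
```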

2.) What causes and How do you fix a "TCP Zero - Window" issue? (The Stock Trading Server is the one who is having a hard time processing burst traffic and sending the TCP Zero window messages to the traders but based on the Network Utilization(CPU, Memory and Link Utilization of Network Devices) in Solarwinds and Performance Monitoring(CPU, Disk Space, Memory, NIC Utilization) in both the Stock Trading Server and Database Server, it shows that is Perfectly Normal and even under-utilized!.)

Zero Window means that the receiver of the packets waves a "white flag" towards the sender, telling it to stop sending because there is no more buffer space for incoming packets. This is in almost all cases a sign of the receiver being too slow to process the incoming packets in time. If performance monitoring on the receiver doesn't show any problems it is a bit hard to say why the Zero Window problem occurs, but you should try to update the NIC drivers, or change network cards. Things get really messy if the server is a VM - if so, please specify what kind of virtualization you're using.
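To make the mechanism more concrete, here is a toy model in Python. It is not how any real TCP stack is implemented; the function name, the tick-based timing and all the numbers are assumptions chosen for illustration. It only shows that the advertised window is whatever buffer space is left, and that it reaches zero when the application reads more slowly than data arrives.

```python
def simulate_window(buffer_size, arrivals, drain_per_tick):
    """Return the window the receiver would advertise after each tick.

    buffer_size    -- receive buffer capacity in bytes
    arrivals       -- bytes the sender tries to deliver in each tick
    drain_per_tick -- bytes the application reads from the buffer per tick
    """
    used = 0
    windows = []
    for offered in arrivals:
        # The application reads what it can from the buffer first.
        used -= min(used, drain_per_tick)
        # The sender can never put more in flight than the free space.
        used += min(offered, buffer_size - used)
        # Advertised window = free buffer space left.
        windows.append(buffer_size - used)
    return windows


# Slow application: 30000 bytes arrive per tick, only 10000 are read.
print(simulate_window(65536, [30000] * 5, 10000))
# prints [35536, 15536, 0, 0, 0]

# Fast application: it reads more than arrives, the window never closes.
print(simulate_window(65536, [30000] * 5, 40000))
# prints [35536, 35536, 35536, 35536, 35536]
```

Note that the link is never the bottleneck in this model, which matches the point that link utilization can look fine while Zero Windows still occur.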

3.) Is it perhaps in the Trading Server's Settings? (32 GB Memory but only uses the default tcp window allocation size of 64 KB)

It depends on the latency and bandwidth of the conversation. Generally, the higher the latency and the higher the bandwidth on the link, the more window size you need. But if you're running into zero window problems, a bigger window will in most cases only delay the problem: once a zero window has happened, it will often happen again.
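The usual rule of thumb behind "more latency and more bandwidth need more window" is the bandwidth-delay product. The sketch below is only an illustration: the 1 Gbit/s link speed is an assumption (it is not stated in the question), and the 4 ms round trip is taken from the traceroute figure mentioned in question 6.

```python
def bandwidth_delay_product(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full (window needed)."""
    return bandwidth_bits_per_s / 8 * rtt_s


def max_throughput_bits_per_s(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput of ONE connection with a given window."""
    return window_bytes * 8 / rtt_s


# Assumed 1 Gbit/s link with a 4 ms round trip: about 500 kB of window.
print(round(bandwidth_delay_product(1e9, 0.004)))

# A 64 KiB window over the same 4 ms: roughly 131 Mbit/s per connection.
print(round(max_throughput_bits_per_s(64 * 1024, 0.004) / 1e6), "Mbit/s")
```

On a low-latency LAN a 64 KiB window is therefore rarely the limit by itself, which fits the advice that a bigger window does not cure a receiver that cannot keep up.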

4.) Or is there something wrong with how slow the Trading Application process the data? (I am planning to increase the TCP Buffer Size from 64KB to probably 256KB but it might not help if the Trading Server itself is slow.)

Possible. Maybe the Trading Application is not picking up the data the TCP stack offers it quickly enough. More memory could help, but also have the developers check whether they can process new data coming from the stack faster. Maybe the program isn't using enough threads.

5.) Also, all the traders are experiencing "Unable to Connect to Trading Server" and "Intermittent Connections" errors. (but there's no report of network problem like "down links" or "fully utilized links". I've even tried to change the polling data to every 1 minute to capture short disconnections but I still see no problem) So I think that there might be a latency problem

Latency as in timeouts? Often, if an application is swamped, it will not react to more TCP stack messages about nodes trying to connect. You should check if you see SYN - SYN/ACK messages, or if the stack doesn't react at all when a connection fails. It will tell you if the stack or the application is the problem - if you see a SYN/ACK as a reaction to the SYN from the client, the stack works.
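The check described above can be written down as a small decision function. This is a sketch under assumptions: the packet representation (a time-ordered list of `(source_ip, flags)` tuples for one connection attempt), the function name and the IP addresses are all invented for the example, and you would still have to extract the flags from your capture yourself.

```python
def classify_attempt(packets, client_ip):
    """Classify one connection attempt from the server's first reaction.

    packets   -- time-ordered list of (src_ip, flags) for ONE connection,
                 where flags is a set such as {"SYN"} or {"SYN", "ACK"}
    client_ip -- address of the host that sent the SYN
    """
    saw_syn = False
    for src, flags in packets:
        if src == client_ip and flags == {"SYN"}:
            saw_syn = True
        elif saw_syn and src != client_ip:
            if flags == {"SYN", "ACK"}:
                return "stack answered"    # look at the application next
            if "RST" in flags:
                return "refused by stack"  # the stack rejected the connection
    return "no reply" if saw_syn else "no SYN seen"


# Hypothetical addresses: client 10.0.0.5, server 10.0.0.9.
ok = [("10.0.0.5", {"SYN"}), ("10.0.0.9", {"SYN", "ACK"})]
refused = [("10.0.0.5", {"SYN"}), ("10.0.0.9", {"RST", "ACK"})]
silent = [("10.0.0.5", {"SYN"}), ("10.0.0.5", {"SYN"})]  # retransmitted SYN

print(classify_attempt(ok, "10.0.0.5"))       # prints "stack answered"
print(classify_attempt(refused, "10.0.0.5"))  # prints "refused by stack"
print(classify_attempt(silent, "10.0.0.5"))   # prints "no reply"
```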

6.) How do you measure Latency of Network Communication efficiently? What Free and Paid Software Solutions do you recommend? (Traceroute reports 4ms and even if i increase the ping packet to 1mb, it also shows 1-3ms delay so I don't think that's helpful)

Use Wireshark. Find the three way handshake and determine initial RTT from there. See http://blog.packet-foo.com/2014/07/determining-tcp-initial-round-trip-time/
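The calculation itself is just timestamp arithmetic on the three handshake packets. A minimal sketch, assuming you have read the three capture timestamps off the packet list by hand; the function name and the example timestamps are made up.

```python
def initial_rtt(syn_time: float, synack_time: float, ack_time: float) -> dict:
    """Split the handshake into round-trip components (seconds).

    All three timestamps come from the same capture point, so:
      SYN -> SYN/ACK  = capture point to server and back
      SYN/ACK -> ACK  = capture point to client and back
      SYN -> ACK      = full initial round-trip time
    """
    return {
        "server_side": synack_time - syn_time,
        "client_side": ack_time - synack_time,
        "initial_rtt": ack_time - syn_time,
    }


# Hypothetical timestamps in seconds since the start of the capture.
parts = initial_rtt(syn_time=12.0000, synack_time=12.0031, ack_time=12.0040)
for name, seconds in parts.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Which of the two components is large tells you on which side of the capture point the delay sits; the sum is the same wherever you capture.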

7.) How do you sort out each TCP thread/conversations if the source port and destination port are the same and the data is encrypted? (Like if the Stock Trading Server and the SQL Server talks on the same port numbers but has multiple transactions going on.)

Two simultaneous connections between the same two hosts cannot use the same source and destination port pair. The client selects a new source port for each TCP connection, so if the Stock Trading Server connects to the SQL server with multiple TCP connections it will use a new source port each time. If a database connection uses only one TCP connection to issue multiple requests, you need to treat them as separate request/answer pairs by finding each request and determining the size of its answer. If there is encryption you're in trouble, because you'll have to decrypt the communication to see where the request/answer pairs begin and end. It is usually better to turn off encryption for the duration of the analysis.
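To illustrate how conversations are told apart even when the payload is encrypted: each one is identified by its address/port 4-tuple, regardless of direction. The sketch below assumes you already have the addresses and ports per packet; the helper names, the IP addresses and the client ports are hypothetical, only port 1433 comes from the discussion.

```python
from collections import Counter


def conversation_key(src_ip, src_port, dst_ip, dst_port):
    """Direction-independent key: both directions map to the same tuple."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)


def count_conversations(packets):
    """Count packets per TCP conversation."""
    return Counter(conversation_key(*p) for p in packets)


# Hypothetical capture: trading server 10.0.0.5, SQL server 10.0.0.9:1433.
packets = [
    ("10.0.0.5", 50001, "10.0.0.9", 1433),  # connection 1, request
    ("10.0.0.9", 1433, "10.0.0.5", 50001),  # connection 1, answer
    ("10.0.0.5", 50002, "10.0.0.9", 1433),  # connection 2, new source port
]

conversations = count_conversations(packets)
print(len(conversations))  # prints 2
```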

answered 08 Sep '14, 07:48


Jasper ♦♦

Hi Jasper!

Thank you and I appreciate that you replied to my inquiries.

1.) Thanks, now I understand that.

2.) Yes, it is under a Vsphere. I'm not on the Server Team but as far as I know, the Trading Application is under 1 VM and is using a Virtual NIC. The VSphere is connected to 4 different cisco switches but I only saw it using two of the switches. Also there might be resource sharing configured in case one VM needs more memory.

3.) True, increasing the window size might only delay the problem that's why I still haven't tried that.

4.) Well, the vendor for the Trading Application always claims that they have multiple clients and we are the only one who is having this issue. They also claim that it's because we don't have QoS implemented in our LAN and we don't have a separate physical network infrastructure for that certain application. So we really have to prove first that the problem is caused by the Application.

5.) Yes, the latency might be too high to the point that timeout happens. I never get to see the SYN Ack unless I see a "RST Flag" and I only see that once or twice a day and it doesn't happen everyday.

6.) Ok, I'll go ahead and check that.

7.) The port the sql server uses is always 1433(well because it's fixed) and the port used by the trading server is at 8693(made-up number) but they always use those 2 port numbers through the entire day, everyday. I think it's configured to be encrypted because it's all financial data so I don't think there's a way to decrypt that.

Thanks for all your help Jasper!

(08 Sep '14, 08:12) Sharknado

First of all, a couple of things about how this Q&A site works ;-)

  1. please use comments, not answers, if you reply
  2. if you like something, vote it up (thumbs up button to the left of the answer)
  3. if an answer helped, accept it (checkmark button)

Now, to your points:

  1. have the VMware admins check if the VM uses VMXNET3 adapters. Most run VMs with emulated AMD or Intel cards - VMXNET adapters are way faster and may solve your problem. They do not work on all OSes, though, so have them check if it is an option

  2. QoS will not help at all in a Zero Window situation. It is a host performance problem, not a latency/delay problem. Tell the vendor to understand Zero Window symptoms first before going through their excuse calendar ;-)

  3. If you see a RST and not a SYN/ACK then the TCP stack on the server refuses further connections. Maybe the connection table is exhausted - you might want to do a capture at the server to find out how many concurrent connections can be established before it refuses any more.

  4. Sure, SQL uses 1433 for MS-SQL, and it should always be the same. If the trading server uses only one port as well it should only be one connection - otherwise Wireshark will diagnose "TCP Port Reused". If you don't see that message (check with filter "tcp.analysis.reused_ports") you don't have a problem like that.

(08 Sep '14, 08:23) Jasper ♦♦

1.) I'll ask them about that tomorrow. So can I get back to you on this? :)

2.) You're right. I'll construct a documented report regarding all my findings and slap it right to their faces the next time we meet.

3.) How can I see the number of concurrent connections if the trade server only uses 1 port number to converse with the database server throughout the day?

4.) I checked and I didn't see anything with the filter you posted so I don't think that's a problem.

Thanks again Jasper! :)

(08 Sep '14, 08:39) Sharknado
  1. sure
  2. open up the Conversation Statistics and select the TCP tab. Sort by IP addresses to find your application server. Check how many rows there are where it talks to the SQL IP. According to your assumption, there should be only one.
(08 Sep '14, 11:58) Jasper ♦♦