Greetings Everyone! My mind is about to explode as everyone is always blaming the network guys for disconnections and slowness in the network, but SolarWinds reports that utilization on all network devices and links is okay. So I tried sniffing and got "Zero Window" errors. So if I may ask:

1.) How is TCP Window Size allocated? (Is it per TCP conversation? For example, if you have 1 application (Mozilla Firefox) with 5 tabs open, does the OS allocate a window size per tab?)

2.) What causes a "TCP Zero Window" issue, and how do you fix it? (The Stock Trading Server is the one having a hard time processing burst traffic and sending the TCP Zero Window messages to the traders, but the network utilization (CPU, memory and link utilization of network devices) in SolarWinds and the performance monitoring (CPU, disk space, memory, NIC utilization) on both the Stock Trading Server and the Database Server show that everything is perfectly normal and even under-utilized!)

3.) Is it perhaps in the Trading Server's settings? (32 GB memory, but it only uses the default TCP window allocation size of 64 KB)

4.) Or is there something wrong with how slowly the Trading Application processes the data? (I am planning to increase the TCP buffer size from 64 KB to probably 256 KB, but it might not help if the Trading Application Server itself processes the data slowly.)

5.) Also, all the traders are experiencing "Unable to Connect to Trading Server" and "Intermittent Connections" errors. (But there's no report of a network problem like "down links" or "fully utilized links". I've even changed the polling interval to every 1 minute to capture short disconnections, but I still see no problem.) So I think there might be a latency problem.

6.) How do you measure latency of network communication efficiently? What free and paid software solutions do you recommend? (Traceroute reports 4 ms, and even if I increase the ping packet to 1mb it still shows 1-3 ms delay, so I don't think that's helpful.)

7.) How do you sort out each TCP thread/conversation if the source port and destination port are the same and the data is encrypted? (Like if the Stock Trading Server and the SQL Server talk on the same port numbers but have multiple transactions going on.)

Sorry, I'm new to the networking world, so there is a lot of stuff I don't know and can't find in books and other resources. I think these kinds of things are learned through experience, so please share your wisdom. Thank you and have a good day! :)

asked 08 Sep '14, 07:26 Sharknado
edited 08 Sep '14, 07:30
One Answer:
1.) How is TCP Window Size Allocated? (Is it per TCP conversation like if you have 1 application (Mozilla Firefox) and 5 tabs open, then the OS allocates Window Size per tab?)

It is per TCP conversation. 10 conversations with a 64k window require 640k of RAM to store incoming segments. It is not allocated per Firefox tab, because if you open a web page in a tab that e.g. opens 5 connections (1 for the web page, 4 for some pictures) you may end up with 5 TCP connections. It depends on how the communication is performed - some web servers allow pulling all 5 elements (still as an example) in one session, others require the browser to open a separate connection for each element.

2.) What causes and How do you fix a "TCP Zero - Window" issue? (The Stock Trading Server is the one who is having a hard time processing burst traffic and sending the TCP Zero window messages to the traders but based on the Network Utilization (CPU, Memory and Link Utilization of Network Devices) in Solarwinds and Performance Monitoring (CPU, Disk Space, Memory, NIC Utilization) in both the Stock Trading Server and Database Server, it shows that is Perfectly Normal and even under-utilized!)

Zero Window means that the receiver of the packets waves a "white flag" towards the sender, telling it to stop sending because there is no more buffer space for incoming packets. This is in almost all cases a sign of the receiver being too slow to process the incoming packets in time. If performance monitoring on the receiver doesn't show any problems it is a bit hard to say why the Zero Window problem occurs, but you should try to update the NIC drivers, or change network cards. Things get really messy if the server is a VM - if so, please specify what kind of virtualization you're using.

3.) Is it perhaps in the Trading Server's Settings? (32 GB Memory but only uses the default tcp window allocation size of 64 KB)

It depends on the latency and bandwidth of the conversation. Generally, the higher the latency and the higher the bandwidth on the link, the larger the window you need. But if you're running into Zero Window problems, a larger window will in most cases only delay the problem; once Zero Window starts happening, it tends to recur.

4.) Or is there something wrong with how slow the Trading Application process the data? (I am planning to increase the TCP Buffer Size from 64KB to probably 256KB but it might not help if the Trading Server itself is slow.)

Possible. Maybe the Trading Application is not reacting quickly enough when the TCP stack offers it more packets. More memory could help, but also have the developers check if they can do anything to process new data coming from the stack faster. Maybe there aren't enough threads in use by the program.

5.) Also, all the traders are experiencing "Unable to Connect to Trading Server" and "Intermittent Connections" errors. (but there's no report of network problem like "down links" or "fully utilized links". I've even tried to change the polling data to every 1 minute to capture short disconnections but I still see no problem) So I think that there might be a latency problem

Latency as in timeouts? Often, if an application is swamped, it will not react to more TCP stack messages about nodes trying to connect. You should check if you see SYN - SYN/ACK messages, or if the stack doesn't react at all when a connection fails. It will tell you if the stack or the application is the problem - if you see a SYN/ACK as a reaction to the SYN from the client, the stack works.

6.) How do you measure Latency of Network Communication efficiently? What Free and Paid Software Solutions do you recommend? (Traceroute reports 4ms and even if i increase the ping packet to 1mb, it also shows 1-3ms delay so I don't think that's helpful)

Use Wireshark. Find the three way handshake and determine initial RTT from there. See http://blog.packet-foo.com/2014/07/determining-tcp-initial-round-trip-time/

7.) How do you sort out each TCP thread/conversations if the source port and destination port are the same and the data is encrypted? (Like if the Stock Trading Server and the SQL Server talks on the same port numbers but has multiple transactions going on.)

Source and destination port cannot be the same. The client selects a new source port for each TCP connection. If the Stock Trading Server connects to the SQL server with multiple TCP connections it will use a new source port each time. If a database connection only uses one TCP connection to issue multiple requests, you need to treat them as separate request/answer pairs by finding each request and determining the size of its answer. If there is encryption you're in trouble, because you'll have to decrypt the communication to see where the request/answer pairs begin and end. It is usually better to turn off encryption for the duration of the analysis.

answered 08 Sep '14, 07:48 Jasper ♦♦

Hi Jasper! Thank you, I appreciate that you replied to my inquiries.

1.) Thanks, now I understand that.

2.) Yes, it is running under vSphere. I'm not on the Server Team, but as far as I know the Trading Application is on 1 VM and is using a virtual NIC. The vSphere is connected to 4 different Cisco switches, but I only saw it using two of the switches. Also, there might be resource sharing configured in case one VM needs more memory.

3.) True, increasing the window size might only delay the problem, which is why I still haven't tried that.

4.) Well, the vendor of the Trading Application always claims that they have multiple clients and we are the only one having this issue. They also claim that it's because we don't have QoS implemented in our LAN and we don't have a separate physical network infrastructure for that application. So we really have to prove first that the problem is caused by the application.

5.) Yes, the latency might be so high that timeouts happen. I never get to see the SYN/ACK unless I see a "RST Flag", and I only see that once or twice a day, and not every day.

6.) Ok, I'll go ahead and check that.

7.) The port the SQL server uses is always 1433 (well, because it's fixed) and the port used by the trading server is 8693 (made-up number), but they always use those 2 port numbers through the entire day, every day. I think it's configured to be encrypted because it's all financial data, so I don't think there's a way to decrypt that.

Thanks for all your help Jasper!

(08 Sep '14, 08:12) Sharknado

First of all, a couple of things about how this Q&A site works ;-)
Now, to your points:
(08 Sep '14, 08:23) Jasper ♦♦

1.) I'll ask them about that tomorrow. So can I get back to you on this? :)

2.) You're right. I'll construct a documented report regarding all my findings and slap it right to their faces the next time we meet.

3.) How can I see the number of concurrent connections if the trade server only uses 1 port number to converse with the database server throughout the day?

4.) I checked and I didn't see anything with the filter you posted so I don't think that's a problem.

Thanks again Jasper! :)

(08 Sep '14, 08:39) Sharknado
(08 Sep '14, 11:58) Jasper ♦♦
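
To put rough numbers on point 3 of the answer above (the higher the latency and bandwidth, the more window you need), here is a small sketch. It computes the bandwidth-delay product and the throughput ceiling that a given window imposes on a single connection. The 4 ms round-trip time and the 64 KB / 256 KB sizes are taken from the question; the 1 Gbit/s link speed is only an assumption, and the function names are made up for illustration.

```python
def required_window_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return bandwidth_bps / 8 * rtt_s


def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound for one connection: at most one full window per round trip."""
    return window_bytes * 8 / rtt_s


RTT_S = 0.004               # 4 ms, the traceroute figure quoted in the question
LINK_BPS = 1_000_000_000    # ASSUMPTION: 1 Gbit/s LAN link (not stated in the thread)

if __name__ == "__main__":
    for window in (64 * 1024, 256 * 1024):
        mbps = max_throughput_bps(window, RTT_S) / 1e6
        print(f"{window // 1024} KB window: at most {mbps:.0f} Mbit/s per connection")

    bdp = required_window_bytes(LINK_BPS, RTT_S)
    print(f"Window needed to fill the assumed link: {bdp / 1024:.0f} KB")
```

With these inputs a 64 KB window caps a single connection at roughly 131 Mbit/s. This only describes the ceiling set by the window size itself. It does not explain Zero Window, which still comes down to the receiving application not draining its buffer fast enough, so a bigger window alone may just delay the symptom.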
If I may say so:
It appears there are issues in a Production Trading system with real $$$ impacts.
I would suggest strongly that one or more experts should be found (hired or whatever) who can do a complete analysis and recommend solutions.
IOW: "Do it yourself" is not a good approach. :)