Hi everyone, I'm currently having a problem troubleshooting a trading application. Here is a simple diagram of the current network setup:

(Gov't Stock Exchange Network Router) <--(256 kbps leased line)--> (Telco Router) <--(100 Mbps Fast Ethernet link)--> Our Network Devices (5 switches, 1 firewall) <--> Trading Server

Our users report that they experience slowness at around 9:30 to 9:45 am. I checked the CPU, memory, response time and link utilization of all our network devices and interfaces, and all of them report normal levels. Part of the trading process is the communication between the Stock Exchange network and our trading servers, so any slowness on that 256 kbps leased line would surely contribute to the problem. Unfortunately, the telco router is not being monitored by the telco, and we're still asking for permission to add their device to our SolarWinds. So the closest link I can look at is the 100 Mbps link from our switch to the leased line router on our side.

When the traders are experiencing 3 ms to 5 ms latency in trading, it shows this:
Transmit: 1500 bps - 1900 bps
Receive: 2000 bps - 2400 bps
Bytes transferred per minute: 44 KB - 60 KB
Wireshark reports no problem at this time.

Special note on 9:34 - 9:37 every day, when they experience 10 ms - 15 ms latency in trading:
Transmit: 1900 bps - 2400 bps
Receive: 2400 bps - 3200 bps
Bytes transferred per minute: 90 KB - 170 KB
Wireshark reports TCP Zero Window errors (the trading server sending the zero window alert to the stock exchange server), but they only last for a few milliseconds and only happen two or three times a day.

There was even one incident when our traders were experiencing crazy latencies of 1 to 3 minutes in trading:
Transmit: 4000 bps
Receive: 5600 bps
Wireshark reports that we were getting TCP Zero Window errors (the trading server sending the zero window alert to the stock exchange server) for the whole trading period of that day. This only happened once, and I still have not been able to resolve the issue.

The trading server team reports that their CPU, memory and NIC utilization is normal and, of course, everyone is blaming the network guys. So here are my questions:
Thanks a lot for all your help guys! :)

asked 13 Aug '14, 20:59 Sharknado edited 13 Aug '14, 21:01
2 Answers:
Zero window usually means: "Give me a break. Don't send me any more data, as I cannot handle it anyway." So, if your trading server is sending a zero window message, it's more likely that there is a problem on the server and/or with the trading application. Even if the values for CPU, memory and NIC look OK on that server, there could still be a problem if the application is waiting for a resource (network share, database) and is thus unable to process the data fast enough.
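To illustrate why a slow application leads to zero window messages even when CPU, memory and NIC look fine, here is a minimal Python sketch of a receive buffer. It is a toy model, not real TCP: the function name, the buffer size and the per-tick rates are all made up for illustration. The advertised window is simply the free space left in the buffer at the moment the data is acknowledged.

```python
def simulate_receive_window(buffer_size, arrival_per_tick, read_per_tick, ticks):
    """Toy model (hypothetical, not real TCP): return the window advertised
    by the receiver after each tick."""
    buffered = 0      # bytes waiting in the receive buffer
    advertised = []   # window advertised in the ACK of each tick
    for _ in range(ticks):
        # The sender may never send more than the free buffer space.
        free_space = buffer_size - buffered
        buffered += min(arrival_per_tick, free_space)
        # The ACK advertises whatever space is left right now.
        advertised.append(buffer_size - buffered)
        # Only afterwards does the application read from the buffer.
        buffered -= min(read_per_tick, buffered)
    return advertised


# Application reads slower than data arrives: the window shrinks to zero.
print(simulate_receive_window(buffer_size=8, arrival_per_tick=4, read_per_tick=1, ticks=5))
# Application keeps up: the window never reaches zero.
print(simulate_receive_window(buffer_size=8, arrival_per_tick=4, read_per_tick=4, ticks=5))
```

In the first call the application drains the buffer more slowly than data arrives, so the advertised window drops to 0 and stays there. In the second call the application keeps up and the window never closes. That is the pattern to look for: a zero window points at the receiving side not reading fast enough.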
In the GUI
or
Please read the docs for an explanation of those graphs.
If you see the zero window messages directly in front of the server (captured on a mirror port of the switch), it's not a network problem. You should then hand the blame back to the server or application guys ;-)) See my explanation above.

Regards

answered 14 Aug '14, 00:00 Kurt Knochner ♦
Hi,

The key here is correlation: to the nearest second, do the slow trades always coincide with the zero window events? The thing that strikes me about the figures you have given is that they could easily be the result of a single TCP retransmission. If the network round trip time is approx. 3 ms, a lost trade request packet would be detected after 6 ms, and the retransmitted trade request would be responded to after a further 3 ms (perhaps a bit more, allowing for compute time), giving a trading latency of about 10 ms.

Let's set aside the 1 to 3 minute issue for the moment. What you need to do is look at the time between a trade request leaving the trading server and the response coming back from the exchange (I'm assuming the complaint here is time to trade and not freshness of prices). Most trading protocols like this are very simple: a packet to the exchange with the request and a packet back with the response. Identify the TCP port(s) that the exchange trading process is using, and then filter the capture down to traffic to and from those ports. You could export the Packet List data to a CSV file and study the response times in Excel.

A simpler way would be to use the TRANSUM plugin (see http://www.tribelabzero.com/resources), which is freely available. TRANSUM will show you the response times from the exchange; just remember to add the exchange TCP port numbers to the list of Service Ports in the TRANSUM Preferences (see the TRANSUM Manual for details).

Time sync your capture units to the trading server as best you can, and ask the trading app support people for the precise times of slow trades (there is bound to be a time-stamped log; be careful of timezone differences). Once you have these times, look at the response times for the exchange requests at those times. If you find a slow one, check whether there were any retransmissions or zero window events around it. If there were none, the latency is between your trace point and the exchange, or in the exchange itself.
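As a rough illustration of the CSV approach, here is a minimal Python sketch that pairs each request with the next packet coming back and reports the response time. Everything specific in it is an assumption: the column names `Time`, `Source` and `Destination` depend on how your Packet List columns are set up, the IP addresses are placeholders from the documentation ranges, and the capture is assumed to be already filtered down to data-carrying packets on the exchange port(s) (for example with `tcp.len > 0`), so that bare ACKs are not mistaken for responses.

```python
import csv
import io

# Hypothetical export: time in seconds, one request and one response per trade.
SAMPLE = """Time,Source,Destination
0.000,192.0.2.10,198.51.100.5
0.003,198.51.100.5,192.0.2.10
1.000,192.0.2.10,198.51.100.5
1.006,192.0.2.10,198.51.100.5
1.009,198.51.100.5,192.0.2.10
"""


def response_times(csv_text, server_ip, exchange_ip):
    """Return a list of (request_time, response_delay) tuples in seconds."""
    pending = None  # send time of the oldest unanswered request
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = float(row["Time"])
        if row["Source"] == server_ip and row["Destination"] == exchange_ip:
            # Keep the first send time, so a retransmitted request is
            # measured from the original transmission.
            if pending is None:
                pending = t
        elif row["Source"] == exchange_ip and row["Destination"] == server_ip:
            if pending is not None:
                results.append((pending, t - pending))
                pending = None
    return results


for sent, delay in response_times(SAMPLE, "192.0.2.10", "198.51.100.5"):
    print(f"request at {sent:.3f}s answered after {delay * 1000:.1f} ms")
```

In the made-up sample, the second request is sent twice (at 1.000 s and again at 1.006 s), mimicking the single-retransmission case described above. Because the delay is measured from the first send, that trade shows roughly 9 ms instead of 3 ms.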
One final point: bear in mind that your trading server may be using TCP Segmentation Offload (or other offload functions), so what you see at the NIC interface may not be what the trading app is seeing.

Best regards...Paul

answered 14 Aug '14, 15:12 PaulOfford