Hi,

We are currently having a recurring, intermittent problem with an in-house application connecting to a server on our local network. The process works fine for a while (sometimes hours, sometimes days, but never more than 48 hours) and then suddenly starts failing, with the application hosted on the workstation throwing error messages.

The "process" works as follows:
- The client creates a TCP connection to the server on a specific port, then passes some data through.
- On the server, an application listening on that port uses the data to "do stuff". The application then closes the TCP connection.

During normal operations, we see the following at network level:
What we are seeing when the process fails is that the TCP connection is established between the client and the server but the server resets the connection [RST] prior to the [PSH, ACK] message coming through:
The [RST] message contains the following error description: "Acknowledgment number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set".

Once this happens, the server effectively refuses any new connection on that specific port, regardless of the source. To work around the problem, I have to change the application on the client to use another port and set the "listener" on the server to listen on the new port.

We've already established that the [RST] is not triggered by the application, as we don't see any connection being created at application level when the process is failing. We've also established that all previous TCP connections are being closed correctly by the server.

We've recently rebuilt the server with the latest firmware and Windows Security Updates to see if that would fix the issue - no joy there.

The other interesting point is that we ran this process from another server for 2 weeks and did not have any issue. Both servers were on the same Hardware / OS / Windows SU / Application version ... The difference is that they sit in 2 different offices (2 different VLANs), connected via a WAN link (Metro Ethernet).

I've gone through a series of websites about this problem but couldn't find anything relevant. Could you please help identify what is causing the server / network to "block" the port? I really don't have any idea where to start.

Thanks.
CalaoSHP

asked 19 Jan '12, 05:05 CalaoSHP (edited 19 Jan '12, 06:56)
One Answer:
First of all, I'm not sure that the message "Acknowledgment number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set" within the RST packet is the actual cause of the connection failing. The RST is already the result of the failure, so I'd guess the message is just a side effect. More interesting is the reason why the server would suddenly refuse a session after it has already been established on the TCP level.

I've seen this kind of thing happen when an application decides that it doesn't like the client, for example because its IP address isn't in the pool of allowed nodes, and shuts the connection down right after the three-way handshake. In your case this should not be the reason, since I'm sure you'd already know if the application had a feature like that.

Since you said it happens after a while, I'd start monitoring the concurrent session count and try to find out whether some kind of resource exhaustion is taking place. You could either capture all connections until the problem shows up (which might be tons of data, and a lot of work to process), or maybe a simple netstat -an can help. It will tell you how many sessions there are, and since you tear down your "good" communication with FINs, you might have a problem with too many TIME_WAIT states still being kept. You might want to look into setting TcpTimedWaitDelay in the Windows Registry to the lowest possible value (30 seconds); take a look at the "TCP/IP implementation details" from Microsoft TechNet for details.

The curious thing is that the three-way handshake works, which it wouldn't in the first place if the port were not open and listened on by an application. The application learns about the new connection after the handshake is completed, and it looks to me like the application is saying "New Connection? Naw...", which results in the unfriendly RST packet going out when the application closes the socket.
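To make the netstat suggestion concrete, here is a minimal sketch that tallies TCP states for one listening port from netstat -an text. The function name, the sample rows, and port 5000 are assumptions for illustration only; the real port and addresses are not given in this thread.

```python
from collections import Counter

# Hypothetical excerpt of Windows `netstat -an` output; port 5000 is made up.
SAMPLE = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:5000           0.0.0.0:0              LISTENING
  TCP    192.168.1.10:5000      192.168.1.20:49152     TIME_WAIT
  TCP    192.168.1.10:5000      192.168.1.21:49153     TIME_WAIT
  TCP    192.168.1.10:5000      192.168.1.22:49154     ESTABLISHED
  TCP    192.168.1.10:445       192.168.1.23:49155     ESTABLISHED
  UDP    0.0.0.0:5000           *:*
"""

def count_tcp_states(netstat_output: str, local_port: int) -> Counter:
    """Count TCP connection states for one local port in `netstat -an` text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have four columns: proto, local address, foreign address, state.
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        # The port is whatever follows the last colon (also works for [::]:5000).
        port_text = fields[1].rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) == local_port:
            states[fields[3]] += 1
    return states

counts = count_tcp_states(SAMPLE, 5000)
for state, number in sorted(counts.items()):
    print(f"{state}: {number}")
```

On the server itself, the text could come from something like subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout, logged every few minutes, so that a TIME_WAIT count that keeps climbing before a failure would stand out.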
answered 19 Jan '12, 07:27 Jasper ♦♦ (edited 19 Jan '12, 07:28)

The "we use a different port" piece piques my interest. And you're right, the immediate RST looks like a firewall denying access. The servers are identical, but I wonder if their usage is identical.
(19 Jan '12, 07:54) GeonJay

Hi Jasper, thanks for your response. I know it's bizarre behaviour. I've been on the case for a few months now and am still nowhere near finding out what is happening, especially since the same application works on a different server with the same hardware / OS / application, but in a different office and on a different VLAN. We've monitored the concurrent sessions on that specific port, but the count hardly ever exceeds 30, so we discounted that theory. I hear what you're saying about the "New Connection? Nah..." but we don't even see the connection reaching the application. Frustrating!
(19 Jan '12, 08:04) CalaoSHP

Hi GeonJay, if it was firewall related, wouldn't it only impact people going through the firewall? That is not the case here, as people on the same LAN as the server (therefore not going through the firewall) are also affected. The server in the other office is a full backup of the first one, so once it is in use, it does everything that the primary server does.
(19 Jan '12, 08:49) CalaoSHP

The message "Acknowledgment number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set" can be regarded as cosmetic. Per the RFCs, any time the ACK flag is not set, the acknowledgment number field should be set to zero. However, recent Windows versions don't bother to zero the acknowledgment number if the RST bit is set. As @Jasper says, this is a side effect, not a cause. This is not RFC-compliant behavior, but I've never seen it cause a problem.
(19 Jan '12, 09:12) Jim Aragon

@CalaoSHP: if the application doesn't get the connection but the TCP stack allows it to be created in the first place, I can only think of the Windows OS having some sort of trouble talking to the application. The OS has to know that the application has the port open, otherwise it wouldn't SYN/ACK at all. It looks like the OS and the application have communication issues, especially if you say it keeps the old port open when you change it... By the way, what's the timing between SYN-SYN/ACK-ACK and the RST? Is there a delay, or is it instant?
(19 Jan '12, 09:32) Jasper ♦♦

One other thing that just came to mind: can you monitor the application's resource usage, maybe with Process Explorer? It is possible that the application has some sort of memory leak that grows over time, and at some point, when the failing connection comes in, the application fails to allocate resources for it, so it gets dropped. A telltale sign could be that the consumed memory goes up or stays the same, but never goes down. On a final note: I doubt you have a network issue here, it really looks like "OS vs. Application".
(19 Jan '12, 09:35) Jasper ♦♦

@Jasper - we're talking milliseconds between [SYN] - [SYN,ACK] - [ACK].
(20 Jan '12, 01:18) CalaoSHP

What about the RST? The handshake isn't that interesting, but does the RST take a while, or is it just as fast?
(20 Jan '12, 03:35) Jasper ♦♦

@Jasper - the whole handshake takes milliseconds.
(20 Jan '12, 04:18) CalaoSHP

Sorry - got my knickers in a twist ... the RST is also instant.
(20 Jan '12, 04:32) CalaoSHP

Okay... so I think you'll have to focus on troubleshooting the OS/application parts on the bad server, or you could go for a differential analysis to see what the differences are between the bad and the good server. I doubt it's the VLAN or the physical location, but maybe things like concurrent sessions, requests per minute etc. are different and give you a hint.
(20 Jan '12, 04:38) Jasper ♦♦

Hi, I'm glad to say we found out what the problem was and fixed it :-) A little by accident, I admit, but it's now been working for 2 days without a glitch.

While troubleshooting this problem (port blocking / connection reset), we were also troubleshooting another one on the same server, whereby processes were failing to initialise. The error message we were getting in the system event log was something like "Application Error : The application failed to initialize properly (0xc0000142)". I then found this article, http://support.microsoft.com/kb/824422, which suggested that the server was running out of Desktop Heap Memory. I applied the recommended changes (changing the registry value from 512 to 1024) and we've not had the problem since.

Conclusion: the O/S was not able to manage the non-interactive processes on the server and was therefore rejecting new requests (RST - Acknowledgment number: Broken TCP. The acknowledge field is nonzero while the ACK flag is not set), which subsequently caused the port to be "corrupted".

If you want to monitor the Desktop Heap activity on your server, you can download the Desktop Heap Monitor tool from http://www.microsoft.com/download/en/confirmation.aspx?displayLang=en&id=17782
It's a little fiddly to install & run, but once installed you can get detailed information.

I hope this helps anybody with similar issues.
(27 Jan '12, 02:41) CalaoSHP
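For readers wondering which "registry value" went from 512 to 1024: the thread does not quote it. The sketch below assumes it is the third number of the SharedSection setting inside the "Windows" value under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems, which is generally described as the desktop heap size (in KB) for non-interactive window stations. The sample string and both helper names are illustrative, not taken from the affected server.

```python
import re

# Illustrative, shortened example of the "Windows" subsystem string (assumed values).
SAMPLE_VALUE = (
    r"%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows "
    r"SharedSection=1024,3072,512 Windows=On SubSystemType=Windows"
)

def parse_shared_section(windows_value: str) -> dict:
    """Extract the SharedSection sizes (in KB) from the subsystem string."""
    match = re.search(r"SharedSection=(\d+),(\d+)(?:,(\d+))?", windows_value)
    if match is None:
        raise ValueError("no SharedSection setting found")
    system_wide, interactive, non_interactive = match.groups()
    return {
        "system_wide_kb": int(system_wide),
        "interactive_desktop_kb": int(interactive),
        # The third number may be absent on some systems.
        "non_interactive_desktop_kb": int(non_interactive) if non_interactive else None,
    }

def with_non_interactive_heap(windows_value: str, new_kb: int) -> str:
    """Return the string with the third SharedSection number replaced."""
    return re.sub(
        r"(SharedSection=\d+,\d+),\d+", rf"\g<1>,{new_kb}", windows_value, count=1
    )

print(parse_shared_section(SAMPLE_VALUE))
print(with_non_interactive_heap(SAMPLE_VALUE, 1024))
```

This only parses and rewrites the text. Reading or writing the live value (for example with Python's winreg module) needs administrator rights, and a change of this kind normally takes effect only after a reboot, so treat it as a way to check what the server is currently configured with rather than as a fix script.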
Good write-up! Where are you taking the captures from? When you look at the captures of the failed transactions, can you verify that the ACK flag is not set? Do other communications from the server still work? Does this application rely on the Windoze TCP stack, or does it have its own implementation? Does the server's switchport show any interesting / unusual errors? Does simply killing/restarting the daemon process fix the problem? Is there any kind of firewall in place, either hardware or software, on the server?
RESPONSE PART 1
Hi GeonJay,
Thanks for replying so quickly - I'll do my best to answer your questions.
RESPONSE PART 2
Looking forward to hearing from you.