This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

RST - tracing

0

Hi folks:

We have a Gateway server. It's a Windows 2003 O/S running a Java JBOSS application. It uses TCP/IP to communicate with remote medical devices that are connected wirelessly.

These devices exchange basic XML messages over TCP/IP with our server, which listens on a single port (51244).

At several of our client sites, we've noticed these messages aren't being processed properly. We performed a packet capture using Wireshark and noticed frequent RSTs occurring that appear to be generated by the server. When this occurs, the client senses it and opens another connection to attempt to send messages. This creates a problem: since messages aren't being processed properly, the server receives too many requests from clients (up to several hundred XML message transmission requests per second).

Our 3rd party application provider says it's not his Java application that is causing the RSTs; it's either the server or the VMware environment the server is running under (our Gateway is sometimes installed in a VMware solution).

So my question is this: how can I determine whether it's the application issuing the RST, or whether it's somehow a generic O/S or Java socket issue?

Supposedly, the JBOSS application uses basic Java sockets. There's a thread pool that listens for incoming connection attempts on port 51244 and then handles the socket communications to those devices.
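
As I understand it, the accept loop would look something like this (my own sketch; we can't see the vendor's code, so the pool size and structure are guesses):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class GatewaySketch {
        public static void main(String[] args) throws IOException {
            // Pool size is a guess - we have no insight into the real code.
            ExecutorService pool = Executors.newFixedThreadPool(8);
            ServerSocket server = new ServerSocket(51244); // port from our config
            while (true) {
                final Socket device = server.accept();
                pool.submit(new Runnable() {
                    public void run() {
                        handle(device);
                    }
                });
            }
        }

        private static void handle(Socket device) {
            try {
                // read the device's XML request, process it, write a response
            } finally {
                try {
                    device.close();
                } catch (IOException ignored) {
                }
            }
        }
    }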

I am looking for some advice on how to determine whether the app issued the RST, whether a thread bombed abnormally and caused the RST, or whether the O/S issued the RST.

It's clear it's coming from the server, but I need some help understanding whether there are ways to track down WHAT COMPONENT on the server is causing the RST.

So I apologize that this isn't necessarily a specific Wireshark question, but I'm wondering if others have advice on further root-cause analysis.

BTW... this is a 3rd party developed application. We have zero insight into the code.

asked 20 Dec '10, 09:39


bubbawny69
accept rate: 0%

I'm just glad Wireshark is Open Source.

(20 Dec '10, 15:01) Jaap ♦

6 Answers:

1

A couple of things:

  1. It sounds like you really need to set up a decent test environment to evaluate your architecture. Being able to simulate clients "attacking" your servers is going to help you understand more objectively what the limits are. Hopefully your developer can assist you with specifying this - hopefully they have one as well.

  2. VMware vSphere or ESX is pretty sound now. From my experience, VMware networking is usually the least of your concerns. Assuming you are using qualified enterprise-grade servers and are using the VMware NIC driver, VMware should be quite transparent to your app. We have many deployments with hundreds of virtualised guests. Many guests serve many hundreds of clients. Network, CPU and memory are almost never of concern - only disk IO can be an issue at times. VMware provides good management tools that should help you see what is going on.

  3. The TIME_WAITs are pretty much normal, as Jasper said. Unless you have thousands (literally) you shouldn't be concerned.

  4. The RSTs are almost certainly because of your 8-connection limit in the app. The first RST simply means the app doesn't want to service the client. The subsequent RSTs are the result of packets from your client, still in the pipe, hitting the server TCP stack on a closed connection.

My suggestions are:

  1. Beat up/work with your developer more. Either they should be providing you more insight into an appropriate architecture that they will support, or work out how to move to another developer that will provide the support or access to the source code you can work with.
  2. Why is there an 8-connection limit? On what basis is this limit defined (in other words, what is the bottleneck it is trying to fix)? I have seen many Apache/Tomcat as well as IIS servers handling hundreds of simultaneous connections. There is a big mismatch between your hundreds of clients and the server application architecture.

  3. You might want to consider fronting your server with a load-balancer and redeploying your servers as a farm. This way you can enforce your 8-connection limit at the load-balancer. Each of your servers could still be a virtual guest - a decent VMware server should be able to handle dozens of such 8-connection guests. You will find that boxes like F5 and NetScaler will be able to do this job to a tee.

answered 29 Dec '10, 15:48


martyvis
accept rate: 7%

edited 29 Dec '10, 15:59

0

Well, this is not exactly easy to investigate. In my experience there are a couple of reasons for RSTs being sent:

  1. A SYN packet is sent to a TCP port that has no process listening on it, so the SYN request is denied with an RST packet (I think we can rule this one out, 1 down, a few to go)
  2. A successful data transfer is terminated using an RST packet instead of FIN/ACK/FIN/ACK, because it is supposed to be faster and block fewer resources (stuff like TIME_WAIT). Not a nice way to finish a communication but pretty common (thx Microsoft :-)). We should be able to rule this one out, too.
  3. The application drops an active TCP communication port, for example because it encountered a timeout, which forces the OS to shut down the socket by sending out an RST packet. For timeouts this is usually something very obvious because you can see very specific delta times in the flow just before the RST is issued, like 10, 15, 30, 60 or 90 seconds. This depends on what the programmer told the application to use as a timeout value for a TCP socket.
  4. The OS (more specifically the TCP stack) drops the connection with an RST because it ran into an unacceptable state, for example if it receives a sequence number that is way outside the receive window, or other funny TCP malfunctions - I have seen stuff like that happen when firewalls mess around with lost packets and try to repackage data into new packets. Investigating problems like that can take a lot of hard work tracking down the validity of sequence numbers.
  5. The application is somehow unhappy with the application data it got handed over from the TCP stack and decides to terminate the session, resulting in a socket close and an RST packet on the wire. Very hard to determine unless you can choke the programmer for some details about his program... (a small demonstration of this one follows below).
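
To make number 5 tangible: when an application closes a socket while unread data is still sitting in its receive buffer, most stacks abort with an RST instead of a graceful FIN. A small self-contained Java demonstration (loopback and port 9999 are arbitrary choices, not anything from your setup):

    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class RstOnUnreadData {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9999);
            Socket client = new Socket("127.0.0.1", 9999);
            Socket accepted = server.accept();

            OutputStream out = client.getOutputStream();
            out.write("<msg>hello</msg>".getBytes("UTF-8"));
            out.flush();
            Thread.sleep(200); // let the data land in the server's receive buffer

            // Close WITHOUT reading: the pending unread data makes the
            // stack send an RST instead of a FIN.
            accepted.close();

            Thread.sleep(200);
            try {
                out.write('x'); // writing into the reset connection...
                out.flush();
                out.write('x');
                out.flush();    // ...surfaces "Connection reset" here or above
            } catch (Exception e) {
                System.out.println("Client sees: " + e);
            }
            server.close();
        }
    }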

I may have forgotten one or the other additional reason for RSTs being sent, but this is just what I could think of at the moment... hope it helps.

answered 20 Dec '10, 17:03


Jasper ♦♦
accept rate: 18%

0

Applications don't actually issue RSTs - that's the job of the TCP stack. Normally, for instance, if an app cleanly close()s a socket, it will cause a FIN to be sent. You can, however, force a quick shutdown when you close a socket (by setting the SO_LINGER timer to 0), which it seems will cause an RST. (I looked at http://tangentsoft.net/wskfaq/articles/debugging-tcp.html amongst other references.) RSTs are also normally sent when a packet arrives on an unestablished socket - or at least one that the receiver has no resources to process. One might also be sent by an intervening device like a firewall refusing a connection.
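
In Java the SO_LINGER trick looks like this - purely a sketch of the mechanism, I have no idea whether your vendor's code does this (loopback and port 9999 are arbitrary):

    import java.net.ServerSocket;
    import java.net.Socket;

    public class AbortiveClose {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9999);
            Socket client = new Socket("127.0.0.1", 9999);
            Socket accepted = server.accept();

            // Linger enabled with a timeout of 0: close() now aborts the
            // connection, and the stack emits an RST instead of a FIN.
            accepted.setSoLinger(true, 0);
            accepted.close();

            client.close();
            server.close();
        }
    }

Capture this (run the two halves on separate hosts if your capture setup can't see loopback traffic) and you'll see the RST come from the closing side.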

Assuming a firewall isn't the issue, your best bet is going to be to try turning up the level of logging on your application/server and see if it provides information about state issues or resource exhaustion.

Also, you may want to invoke "perfmon" on your server and monitor some of the TCPv4 counters around connections.
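
For example, to sample a couple of the relevant counters every 5 seconds (counter names assume an English-language install):

    typeperf "\TCPv4\Connections Reset" "\TCPv4\Connections Established" -si 5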

On Linux I would suggest using "strace" as well to monitor system calls. There do appear to be strace-like tools for Windows that might also be useful.

Microsoft also have some Winsock tracing tools that might be useful - http://msdn.microsoft.com/en-us/library/bb892103%28v=vs.85%29.aspx

answered 20 Dec '10, 17:11


martyvis
accept rate: 7%

edited 20 Dec '10, 17:14

0

Can you use editcap to keep just 128 bytes or so per frame and post the trace? The thing you have to look out for is the advertised window size from the server. TCP RST basically means something went horribly wrong and the stack is just going to give up. However, IE can and does use RST to quickly tear down sessions, so IE browser clients sending RSTs to close out SSL sessions is not unusual (nor is it a problem).
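
Something like this should do the truncation (file names are placeholders):

    editcap -s 128 original.pcap truncated.pcap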

So look at the window size coming from the server to see if you see something odd. For example, do you see some time passing (more than 200ms) while the window size STILL has not incremented? This basically means the server application is not accepting the data from the stack fast enough. In fact, this will be in one of my Sharkfest sessions this year. Also, do you have the SYN and the SYN-ACK? What window sizes are being negotiated? Could someone have jacked up the window scaling so much that the server is running out of memory? Finally, are the sequence/ack numbers within the expected range? It could be that something in the path (load balancers, WAN accelerators, etc.) may be getting confused. Good luck.
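
A few display filters that may help narrow it down (the server IP here is a placeholder):

    tcp.flags.reset == 1 and ip.src == 192.168.1.10    (RSTs sent by the server)
    tcp.analysis.zero_window                           (server's window shrunk to 0)
    tcp.analysis.window_update                         (window opening back up)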

answered 20 Dec '10, 18:57


hansangb
accept rate: 12%

0

Guys, I appreciate the feedback.

I've gone down the rabbit hole on this one. I've read RFC 793 to understand TCP better and am getting a handle on things.

From the packet trace, the RSTs are coming from the server. From what I've read about the Java ServerSocket class, it is possible for the application to initiate a forced close, triggering an RST. Unfortunately, we do not have access to the code to look inside and see if the mechanism for closing a socket is of a type which forces a reset.

We see plenty of TIME_WAITs on the server, and believe that the application is overloaded. What's peculiar is that this is a VMware installation, and we've noted a performance delta (decrease) between our physical gateway and our virtual one.

Our application's lower-level socket/connection processor for our medical devices is definitely overwhelmed, perhaps from poor design, but it is also likely that VMware is somehow interfering with the normal processing of TCP socket connections.

What we noticed is that when the server is down for maintenance, our devices queue up messages. After some time, when the application is restored, 200+ devices instantly try to open connections to the server to hand off messages. The server application apparently has a design to handle eight (8) socket connections simultaneously. So all the other devices don't get a connection to the application socket; after a 1-second timeout, they try again.

After some time, the application finally catches up with the overload (flood) of messages and normal order is restored.

Our developer does not let us look inside the code, so we have no means of understanding whether this is efficient or not. It does NOT seem like a good architecture by today's standards.

However, this does not change the fact that the same code (application) works OK in a physical environment. Somehow, there's something causing interference. Perhaps it's the VMware NIC driver, the VM switch, an overwhelmed VM host, or VM networking; I'm not sure. So even though we doubt the application architecture is solid, we do note that there's a difference between physical and virtualized instances.

Regarding the VMware virtualized instance of our application: even under reduced load (few devices), the application can't seem to keep up. What our developer is calling a "device denial of service" is noticeably more common on VMware. But it's not all virtualized implementations: some work well on VMware, some work well on Hyper-V, and all work well on physical hardware.

I think our issue is that all of our customers control their own VMware environments. So we have no control over, or knowledge of, their VMware configurations. We are just a VM inside their VMware infrastructure.

So I'm not saying it's a VMware issue for sure. It is likely a combination of poor architecture and perhaps some related VMware configuration or provisioning issue.

What I'd like to learn is how to determine or measure whether significant packet loss is happening on the interface. Why are we getting so many RSTs and TIME_WAITs? Is it because normal socket connection closes are losing the FIN/ACK, FIN/ACK exchange and then the O/S RSTs the connection?
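
From what I've read, Wireshark's expert analysis flags should show loss if it's there (assuming I understand them correctly):

    tcp.analysis.lost_segment      (a gap in sequence numbers was seen)
    tcp.analysis.retransmission    (data had to be sent again)
    tcp.analysis.duplicate_ack     (the receiver asking again for missing data)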

I've learned more about this than I really intended to, but it's still not enough to root-cause why so many server RSTs and TIME_WAITs are occurring. Clearly the application's connection limits aren't keeping up with demand.

But it's so odd that when a connection gets established (SYN, SYN/ACK, ACK) and a PUSH/ACK starts to deliver data, there's an immediate RST. The device gets an RST right after the three-way handshake while it's in the middle of PUSH/ACKing some data. The RST comes in twice: once after the three-way handshake, then the PUSH/ACK is seen in the packet capture, then the RST comes in again.

So in the packet capture it goes like this:

  1. device sends SYN
  2. Server responds with SYN/ACK
  3. device sends ACK
  4. RST from server
  5. push/ack of data from device
  6. RST from server

So my thought is this:

The device initiates a connection. The Windows 2003 O/S processes that connection at the WINSOCK level. The application has no way to handle the socket connection. So WINSOCK knows that the buffer is full and the application is backlogged, and the WINSOCK (O/S) issues an immediate RST. Before the device sees that RST, it is attempting to PUSH data to the server. WINSOCK (O/S) sees this data and sends an RST again.
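
From the Java docs, the listen backlog is the second argument to the ServerSocket constructor, so my guess is the vendor's code does something like this (pure speculation on my part - we can't see the code, and Windows' exact behavior when the queue overflows may differ from my theory):

    import java.io.IOException;
    import java.net.ServerSocket;

    public class BacklogGuess {
        public static void main(String[] args) throws IOException {
            // Backlog of 8: once 8 completed handshakes are queued and the
            // application is too busy to accept() them, the stack starts
            // refusing further attempts on its own - the application never
            // even sees those devices.
            ServerSocket server = new ServerSocket(51244, 8);
            server.accept(); // worker threads would take over from here
        }
    }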

That's my theory.

Is there a way, that you know of, to limit how many TCP connections an incoming server can handle? Our application is constrained in its socket connection handling and is overwhelmed. I wonder if there's a way to throttle the connections somehow, kind of like a governor, to reduce the stress on the application as a short-term remediation while they investigate why the application can't handle the load.

Also, if an RST is issued, does it therefore mean that that socket connection will go into TIME_WAIT because it never properly closed (FIN/ACK, FIN/ACK)?

answered 29 Dec '10, 06:04


bubbawny69
accept rate: 0%

Your theory sounds sound (no pun intended) to me. But as Jasper already answered, TIME_WAIT is not caused by the RSTs. HOWEVER, because you have a limited number of connections, TIME_WAIT may be impacting how many connections you can handle. One way to see this is by lining up all successful connections by time (Statistics > Conversations > TCP, then sort by "Rel Start" time). You may see that eight connections line up perfectly. Refer to the Sharkfest 2010 Session A-8 "Another (unusual) Hidden Danger" presentation.
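
If you prefer the command line, tshark will dump the same conversation table (the file name is a placeholder):

    tshark -q -z conv,tcp -r capture.pcap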

(29 Dec '10, 08:41) hansangb

In that presentation, the throughput was limited by having just 10 concurrent TCP connections. Once you export the Rel Start times to Excel, it's very clear what the limitation is. Finally, the VM plays middleman to your server, so the handshake may be happening despite the fact that the application can't handle it. I don't know how to sync up the VM NIC with your server, though. One long shot might be disabling TCP offloading on your Windows server.

(29 Dec '10, 08:43) hansangb

0

Yikes, looks like a big mess. I'd say your primary problem is that you don't seem to have any kind of leverage on the developers to get their design fixed to scale better. You should really look into this and see if there's anything you can do to get them to cooperate on this - from what you tell us I'm pretty sure your trouble is caused by the application design/application scalability.

Regarding the SYN - SYN/ACK - ACK - RST sequence: I've seen that happen when an application is listening on a port, and after the TCP stack has handled the connection establishment and tells the application that there is a new communication partner, the application denies the new connection (for whatever reason the programmer chose). I don't think it's just a simple buffer issue; it's the application forcefully denying a new connection. For example, in one of my latest cases it was an FTP server that would deny connections from any IP except those coming from a specific range. Sometimes the client is happy that the three-way handshake worked and sends data right away, which is why you get two resets: one caused by the application denying the new connection, and one from the TCP stack that receives the client data after the first reset and resets again. So you got that one right.

And no, as far as I know there is no way to limit TCP connections except by deploying a firewall in front that will only let a certain number of connections through to the server. In my eyes this is a bad workaround and won't do much good.

Regarding the TIME_WAIT: this is NOT a result of an RST to a connection - those are shut down immediately and don't go through TIME_WAIT. TIME_WAIT happens after a graceful shutdown (FIN/ACK/FIN/ACK), and on Windows blocks resources for 240 seconds by default, if I recall correctly. That is an ancient mechanism (regarding the 240s) to cope with late-arriving network packets. Since you say the application only allows 8 concurrent connections, your problems may be caused by waiting too long for TIME_WAIT to complete, because it will block further connections from being accepted. You should configure the TIME_WAIT delay as low as possible (30 seconds on Windows), which is done through a registry parameter:

Set the parameter "TcpTimedWaitDelay" at HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters to 30.
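
For example, from a command prompt (note that TCP/IP registry changes usually need a reboot to take effect):

    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f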

The other action you should take is to find a good way to handle more concurrent connections at the application level, which is where you need to talk to the developers. They could implement some sort of broker/agent architecture (or improve the existing one, if there is any), or maybe a load balancer can spread the incoming transactions across multiple application servers.

Regarding VMware: if you think the problem is with the virtualized hosts, you need to investigate what kind of environment it is. Is it an enterprise setup, or someone running VMs on a free virtualization solution on cheap hardware? How many VMs are there? Is resource management in place, and does it grant enough resources to the application VMs? How about network bandwidth, NIC teaming, traffic shaping, etc.? So far I haven't seen bad problems like yours being caused by VMware alone.

Hope this helps a bit.

answered 29 Dec '10, 07:44


Jasper ♦♦
accept rate: 18%

edited 29 Dec '10, 07:48