Hello Wiresharkers! We have a problem that has been going on for quite some time, and we still can't figure out its cause. A little more information on the issue: we have a server and an application that needs to send information to the server, get a response from it, work with that information and then end the session. After all this is done, it should start the whole process over again. A couple of strange things are happening. First, when the problem occurs, the user who is logged in to the application is logged out, and an error like "Connection timed out" appears in the logs. I've been using Wireshark, and later tcpdump and netstat, to get as much information as possible. The Wireshark capture shows that in almost every session (about 80% of them) something is wrong. Mostly there are 3 parts that are constantly appearing :
Another interesting thing is that the session never ends gracefully. It's always the server that sends the FIN,ACK packet; the client then responds with RST,ACK and finally with a bare RST. There are two captures that I've made with tcpdump that can be downloaded from this link: https://drive.google.com/drive/folders/0B7NLgVAbc9Gjblo2anQ4cVMtYXc?usp=sharing One was taken when there is no problem (the application is working "properly"), and in the other there is obviously some rainbow between frame numbers 5251 and 5850. After using netstat I noticed that the application is actually sending SYN packets, but it looks like the server never responds. In the captures I can see that the server actually does respond, but not with a SYN,ACK packet. There is a timeout of 2 seconds in the application, after which it logs out the user. I've looked through many blogs and articles, but could not find anything that helps resolve the issue (and I've tried lots of suggestions, such as turning window scaling off because the ACK packet was not received after the SYN was sent). I forgot to mention that we have 2 different routers that provide two totally different networks, and we put a test device on both networks hoping that something might change; unfortunately the issue still occurs. For sure there is something wrong, but I can't really see it, so I'm asking for your help. Whatever you can suggest will be welcome, as I've hit rock bottom looking at those logs. THANK YOU!
asked 11 Sep '16, 07:37 Raxter edited 11 Sep '16, 22:02
Are you sure you haven't filtered the "Something_wrong..." capture before posting it? You say that the Bad Thing happened between frames 5251 and 5850, but when I use a display filter
tcp.flags.syn == 1 and tcp.analysis.spurious_retransmission
, in order to see repeated SYN packets whose existence means that the previous SYN for the same session has not been responded to, I get a range from 4740 (first re-sent SYN) to 5103 (last re-sent SYN). Each SYN is re-sent only once.
What is clear is that "something" changes the order of packet delivery. In almost all sessions that make it past the initial handshake, the third packet from server to client (third by Seq and Len) is captured first, followed by the first and second ones by Seq and Len. This causes Wireshark to mark the first by time with
Previous segment not captured
and the following two by time as Out-of-Order
, and not dissect the payload. Among other things, this prevents Wireshark's SSL/TLS dissector from dissecting the payload of these packets. It also makes the client respond to the first one by time with a packet carrying zero payload and an Ack number already sent before, which Wireshark marks as just a Window size update.
Note that this effect does not prevent the application at the client from working, as it receives all the data it expects in the proper order - the TCP stack handles that. So this behaviour only makes the analysis harder by adding visual noise and interfering with the operation of the SSL/TLS dissector.
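To make the marking described above concrete, here is a minimal sketch of the idea, not Wireshark's actual algorithm (which also considers timing and retransmissions). The function name `classify_segments` and the sample sequence numbers are made up for illustration; segments are assumed to be listed as (Seq, Len) in capture order for one direction of one session.

```python
def classify_segments(segments, initial_seq):
    """Label TCP segments of one direction in capture order.

    segments: list of (seq, length) tuples, in the order they were captured.
    initial_seq: the first data sequence number expected after the handshake.
    This is a simplification of what a TCP analyser does.
    """
    labels = []
    next_expected = initial_seq  # highest seq + len seen so far
    for seq, length in segments:
        if seq == next_expected:
            labels.append("in order")
        elif seq > next_expected:
            # A gap: data before this segment has not been seen yet.
            labels.append("previous segment not captured")
        else:
            # Data that fills in behind something already seen.
            labels.append("out-of-order")
        next_expected = max(next_expected, seq + length)
    return labels


# Hypothetical session: the third segment (by Seq) is captured first.
reordered = [(2001, 1000), (1, 1000), (1001, 1000)]
print(classify_segments(reordered, initial_seq=1))
```

With the third segment captured first, the sketch labels it as a gap and the two that follow as out-of-order, which is the pattern seen in the capture.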
Before trying to understand the real issue - can you capture simultaneously at the server end, please? This should tell us whether something in the network is responsible for the reordering issue and, possibly, also for the fact that the server responds with a RST packet to an incoming SYN packet.
Hey Sindy,
Thank you for the fast response! I only used a filter to show the server (87.121.90.189) and the client (192.168.7.114). I uploaded the original capture to Drive: https://drive.google.com/drive/folders/0B7NLgVAbc9Gjblo2anQ4cVMtYXc?usp=sharing
I'm a new member, so I was sure that I had missed something in my original question. Thanks a lot for clarifying Wireshark's behaviour; I was not sure whether something was actually wrong or Wireshark just marked those packets to warn me.
Now that I have looked at the "Something wrong" capture, I noticed that it was a different capture from the one I wanted to upload, but it catches the behaviour that I was trying to explain. The new one does not have any filters applied and contains everything.
I will be able to run Wireshark on the server tomorrow so I will upload another file in Drive and post an update about it.
Thank you again!
Please run Wireshark (or tcpdump) on the server and the client simultaneously, to allow comparison. You say that you have two different routers, but it is not clear whether they are different units or different models, and even if they are different models, they may incidentally share the same network stack. So until a capture taken next to the server (or, if that is not possible, directly on it) shows that the weird ordering happens already at the server itself, we cannot exclude that the router is responsible.
It would also be good if you could provide a description or drawing of the network structure between the client and the server. I suspect that the Mikrotik which is the default gateway for the Pi is doing NAT, which might explain part of the strange behaviour (the server sending RST in response to SYN). So it would be worth it to also capture at both the "LAN" and the "WAN" side of the Mikrotik. Rather than storing the files locally, you can stream the captured data to a machine capturing elsewhere using TZSP. I do have a suspicion but I'd like to see the hard data first.
Thanks again! I will run the tcpdump today and will upload it later. I have a capture from one of the LAN ports of the MikroTik that we connected the test device to and it's uploaded in the folder. In this file the client is 192.168.5.30. We also have a backup ISP and I ran a test on it, but the issue occurred again.
The NAT is configured on the MikroTik as well. Another thing, which I'm not sure will help, is that this issue occurs very randomly, but one day I decided to leave the test device running all night. It gave no errors for 12+ hours, and then, when there were a lot of devices working with the server, the issue started again.
Also, a drawing of how the client reaches the server :
https://s22.postimg.org/m47kt2t4x/Network.png
Unfortunately the server is not in our building and it has to go through some devices that we have no control over.
OK, as it seems really complex, I'll disclose my suspicion in advance.
What I think is that, due to many clients each establishing many short-lived TCP sessions, one of the NATs along the way is running out of ephemeral ports on its WAN interface and reuses them too early. Therefore, the SYN of a new TCP session sent by client X gets the same ephemeral port on the WAN side of that NAT device as a previous session from client Y did, and because the session from that WAN_IP:port is still in TIME_WAIT state at the server, the server responds to the SYN with a RST carrying a sequence number belonging to the previous session.
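A rough way to see when such early reuse would start is a toy simulation. Everything in it is an assumption made for illustration: the function name `count_time_wait_hits`, the pool size, the round-robin port allocation and the 60-second TIME_WAIT are not taken from the captures, and real NAT devices and servers behave in more nuanced ways (for example, a server may accept a new SYN in TIME_WAIT under some conditions).

```python
def count_time_wait_hits(pool_size, interval_ms, time_wait_ms, count):
    """Count new sessions whose NAT source port is still in TIME_WAIT at the server.

    Assumptions (all hypothetical): the NAT hands out WAN ports round-robin
    from a pool of pool_size ports, a new session starts every interval_ms,
    each session is closed by the server immediately, and the server keeps
    the closed session in TIME_WAIT for time_wait_ms.
    """
    closed_at = {}  # WAN port -> time (ms) the server closed the last session
    hits = 0
    for i in range(count):
        now = i * interval_ms
        port = i % pool_size
        previous = closed_at.get(port)
        if previous is not None and now - previous < time_wait_ms:
            hits += 1  # SYN arrives on a port still in TIME_WAIT
        else:
            closed_at[port] = now  # session accepted and closed by the server
    return hits


# One session per second over 100 ports: each port rests 100 s, longer than TIME_WAIT.
print(count_time_wait_hits(pool_size=100, interval_ms=1000, time_wait_ms=60000, count=300))
# Five sessions per second: each port comes back after 20 s, well inside TIME_WAIT.
print(count_time_wait_hits(pool_size=100, interval_ms=200, time_wait_ms=60000, count=1500))
```

In this model the problem appears only once the session rate exceeds the pool size divided by the TIME_WAIT duration, which would fit the observation that the errors start when many devices are working with the server.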
The normal behaviour of a client is to keep the session open for some time after use, unless it is 100% sure that it won't have to talk to the same server again, and to re-use it for subsequent communication with the same server. It would only close the session if it had nothing to send for some tens of seconds. Of course, if in your case the termination of the "successful" sessions is a consequence of an application layer error, this approach cannot be used - in that case, the application layer error needs to be addressed first.
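As an illustration of re-using one session for several request/response cycles, here is a minimal sketch using a throwaway local server. The helper names `echo_server` and `run_demo`, the upper-casing "protocol" and the loopback address are invented for the example and have nothing to do with your actual application; the 2-second timeout merely mirrors the one you mentioned.

```python
import socket
import threading


def echo_server(listener):
    """Accept one connection and answer every request on it until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:  # client closed the connection
                break
            conn.sendall(data.upper())


def run_demo(requests):
    """Send all requests over a single TCP connection and collect the replies."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()

    replies = []
    # One handshake, many request/response cycles, one teardown.
    with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
        for request in requests:
            client.sendall(request)
            # Short messages on loopback arrive whole; a real protocol needs framing.
            replies.append(client.recv(1024))
    server.join(timeout=2)
    listener.close()
    return replies


print(run_demo([b"ping", b"status"]))
```

With this pattern only one ephemeral port is consumed on the NAT regardless of how many cycles the client performs, instead of one port per cycle.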