Note that the timing is not synchronized between client and server.
Observations:
Any help would be appreciated.
[EDIT] Added capture files here:
Client: http://www.cloudshark.org/captures/5407bc01958e
Server: http://www.cloudshark.org/captures/3bf196dc6b3f
Note that there are several successful handshakes and ensuing traffic, but at some point something breaks as described above.
This question is marked “community wiki”.
asked 28 May '13, 14:53 RomanM
edited 29 May '13, 18:25 krishnayeddula
3 Answers:
The same issue was discussed in an Oracle Forum thread:
"The Solaris response, with just ACK instead of the typical SYN ACK, is good according to RFC 793. The FW obviously doesn't agree to this behaviour and RST's the connection towards both ends."
Note that the correct (= expected) SYN ACK from the server is sent out delayed. So it might be that the http server is having problems accepting the new connection in a timely manner, causing an unexpected ACK to flow out before the SYN_ACK.
answered 29 May '13, 23:29 mrEEde2
I think this article is most insightful and relevant to the issue. I guess the double SYN packet is the root cause and what makes things cascade; we will need to investigate it. (30 May '13, 11:38) RomanM
good find!
I doubt that! That's just a comment of one forum user, without any explanation why that would be O.K. It's rather a bug, as also mentioned in the forum article. Unfortunately the link is dead. But if you search for the BUG ID, you'll find some information.
Anyway, is there any Solaris system involved? (30 May '13, 12:11) Kurt Knochner ♦
The command to start the trace at the server was snoop, so this must be a Solaris system. (30 May '13, 22:57) mrEEde2
The double SYN packet is not the root cause; it is a retransmission because the Linux client didn't see a SYN_ACK within 3 secs. In total the http server didn't accept the new connection for 3.3 secs. This is the problem that needs to be investigated. (30 May '13, 23:12) mrEEde2
So you are saying that the double SYN is normal and is per the implementation of the Linux stack? It was odd to me because I know that a lot of network equipment will be suspicious of multiple SYN packets 'flooding', hence the RST... (31 May '13, 07:56) RomanM
No device I know of will block a second SYN (after a few seconds) as 'flooding', as that's just the regular TCP retry mechanism. (31 May '13, 07:59) Kurt Knochner ♦
That's standard TCP behaviour: retransmit when you don't get an ACK within your retransmission timer. Later in the flow we can use the RTT measurement to adjust this, but initially we have to take a guess. Linux uses 3 secs, which is far away from 'flooding'. (31 May '13, 08:03) mrEEde2
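The retry behaviour described in these comments can be sketched roughly as follows. This is only an illustration, not the Linux implementation: it assumes an initial timeout of 3 seconds (the figure quoted above) that doubles after every unanswered SYN, and the function name and retry count are made up for the example.

```python
def syn_retransmit_times(initial_rto=3.0, retries=5):
    """Return the offsets (seconds after the first SYN) at which
    retransmitted SYNs would go out, ASSUMING the timeout doubles
    after every unanswered attempt (exponential backoff)."""
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto          # wait one full timeout, then retransmit
        times.append(elapsed)
        rto *= 2                # back off before the next attempt
    return times


if __name__ == "__main__":
    print(syn_retransmit_times())
```

Under these assumptions the first retransmission goes out 3 seconds after the original SYN, which fits a second SYN arriving while the server took about 3.3 seconds to accept the connection.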
It looks like the load balancer is not translating the server IP address to the virtual IP address when replying to the client. Assuming 47.29.0.122 is the client, check the ACK-RST packet (136): it has SIP 47.29.0.122 and DIP 192.168.140.26. In the normal case the DIP should be the load balancer IP, a.k.a. the VIP, which is 24.114.118.166. The ACK-RST is generated because the client somehow didn't like the previous packet it got and failed to process it. One possible case: it opened the connection to the LB IP but got a response directly from the server instead of from the LB IP. Is an asymmetric flow being triggered (forward traffic hits LB-A, reverse traffic hits LB-B, and LB-B does plain routing instead of translation, which will break the session)?
"Then the client received a ACK (and not a SYN ACK as expected) from the server, which the server sent in replay to a SYN"
I didn't get this part. Why is a SYN-ACK not expected? SYN, SYN-ACK and ACK are mandatory for any TCP-based communication, right? How can the client send an ACK without seeing a SYN-ACK from the server?
Better to wait for some expert analysis here. I am sure you will get it.
answered 28 May '13, 15:08 krishnayeddula
edited 28 May '13, 15:45
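One way to test the translation theory above is to look for packets where the client's peer is anything other than the VIP. The sketch below assumes the packets have already been exported into plain dictionaries with `src` and `dst` keys (a hypothetical format, not a Wireshark API), and it is only meaningful for a capture taken on the client side of the load balancer; behind the LB, seeing the real server address is expected.

```python
CLIENT_IP = "47.29.0.122"   # client address as assumed in the answer above
VIP = "24.114.118.166"      # load balancer virtual IP


def untranslated_packets(packets, client_ip=CLIENT_IP, vip=VIP):
    """Return every packet to or from the client whose other
    endpoint is not the VIP."""
    suspicious = []
    for pkt in packets:
        if pkt["src"] == client_ip:
            peer = pkt["dst"]
        elif pkt["dst"] == client_ip:
            peer = pkt["src"]
        else:
            continue            # packet does not involve the client
        if peer != vip:
            suspicious.append(pkt)
    return suspicious


# Illustrative packets only, not taken from the capture files.
sample = [
    {"src": CLIENT_IP, "dst": VIP},               # request to the VIP
    {"src": VIP, "dst": CLIENT_IP},               # translated reply
    {"src": "192.168.140.26", "dst": CLIENT_IP},  # reply straight from the server
]

if __name__ == "__main__":
    for pkt in untranslated_packets(sample):
        print("not translated:", pkt)
```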
It looks as if the server ignores the first SYN packet and then answers with an ACK (Frame #134).
Possible reason: asymmetric routing. You see only one half of the communication, and the other half is routed through a different interface/path. This usually happens in cluster environments (firewalls, load balancers, etc.), hence the different IP addresses. They are either NATed (firewall) or balanced (load balancer).
Questions:
If the capture was taken on the server itself, then there must be two interfaces in the server (possibly with an IP address in the same subnet), and the OS does not send the replies to the same interface where the requests came into the system. You don't see the SYN-ACK for the SYN in Frame #132, as it may have taken a route you did not monitor. The strange thing here is Frame #134, which should not exist in this conversation at all, as it is an ACK from the server to the client.
If the capture was taken on a TAP or switch, there is most certainly a cluster tool involved. You see the requests coming from one cluster node and you don't see the answers, as they are sent to the second cluster node. The second node possibly drops those packets, and that's why you don't see the SYN-ACK at the client.
Suggestion: Check your environment for misconfigured clustered devices (firewalls, load balancers) and/or a misconfigured server with dual interfaces (possibly in the same subnet; some versions of Windows do allow that!).
Regards
answered 28 May '13, 23:35 Kurt Knochner ♦
edited 28 May '13, 23:45
I think I ought to better describe the environment. I have a client application sitting behind a FW and a LB. The LB has a VIP (24.114.118.166) and it balances the client traffic to either of two clustered servers. 192.168.140.26 is the payload IP on which a particular server receives from and transmits to the LB. I ran the capture on the client and on both servers, monitoring the payload IP on which each server communicates with the LB. One more comment: this problem does not occur 100% of the time; about 1 in 30 attempts to establish a 3-way handshake fails.
Like you, I am also puzzled why we are seeing frame 134. I am not sure why this ACK is sent (I can see that it has seq num=1, which is different, but I'm not sure what it means). The other thing I find weird is that the server receives the first SYN packet, but since I didn't start the capture on the client and server at synchronized times, I don't know how long after it was sent. After receiving the second SYN we see that odd ACK, then a SYN ACK, and 3 seconds later both client and server get a RST... (29 May '13, 09:00) RomanM
Was there any capture filter in place? If so, what was it? (29 May '13, 09:54) Kurt Knochner ♦
On the server (snoop): snoop -d e1000g1 port 8080 src 47.29.0.122 or dst 47.29.0.122 (anything leaving or coming to the payload IP on port 8080)
On the client (tcpdump): ./tcpdump -i dav0 tcp port 8080 (just anything on port 8080, since traffic is light) (29 May '13, 09:59) RomanM
Are there several interfaces in the server, possibly with interface bonding? If so, can you please capture the traffic on all interfaces in parallel? I still suspect that the SYN-ACK for the first SYN was sent through another interface and was then blocked/dropped somewhere (firewall, LB, etc.). (31 May '13, 08:11) Kurt Knochner ♦
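The anomaly discussed in this thread (a bare ACK from the server before its SYN-ACK) can be searched for mechanically once a connection's frames are reduced to sender and flags. The following is a minimal sketch under that assumption; the tuple format, the function name and the frame numbers in the sample trace are invented for illustration and do not come from the capture files.

```python
def find_unexpected_acks(frames):
    """frames: list of (number, sender, flags) tuples for ONE
    connection, in capture order; flags is a set such as {"SYN"},
    {"SYN", "ACK"} or {"ACK"}.

    Returns the numbers of frames in which the server sent a bare
    ACK before it had sent any SYN-ACK."""
    synack_seen = False
    unexpected = []
    for number, sender, flags in frames:
        if sender != "server":
            continue
        if flags == {"SYN", "ACK"}:
            synack_seen = True
        elif flags == {"ACK"} and not synack_seen:
            unexpected.append(number)
    return unexpected


# Simplified trace modelled on the sequence described above:
# SYN, retransmitted SYN, odd ACK, delayed SYN-ACK.
trace = [
    (1, "client", {"SYN"}),
    (2, "client", {"SYN"}),
    (3, "server", {"ACK"}),
    (4, "server", {"SYN", "ACK"}),
]

if __name__ == "__main__":
    print(find_unexpected_acks(trace))
```

Running a check like this over every failed attempt (about 1 in 30 here) would show whether the odd ACK appears each time or only occasionally.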
It would really help to see these packets in full, for example on www.cloudshark.org. I suspect the MAC addresses, IP IDs and IP TTLs can tell a lot more to pinpoint the problem.
done, added
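The idea of comparing MAC addresses and IP TTLs can be sketched as below: if packets claiming the same source IP arrive with more than one (MAC, TTL) combination, they may be coming from different devices or over different paths. The dictionary keys and the sample values are hypothetical and only illustrate the check; they are not read from the uploaded captures.

```python
from collections import defaultdict


def fingerprints_by_source(packets):
    """Collect the set of (source MAC, IP TTL) pairs seen per source IP."""
    seen = defaultdict(set)
    for pkt in packets:
        seen[pkt["src_ip"]].add((pkt["src_mac"], pkt["ttl"]))
    return seen


def sources_with_multiple_fingerprints(packets):
    """Return only the source IPs that show more than one (MAC, TTL) pair."""
    return {
        ip: pairs
        for ip, pairs in fingerprints_by_source(packets).items()
        if len(pairs) > 1
    }


# Made-up packets: the same source IP shows up with two different
# MAC/TTL combinations, which would deserve a closer look.
packets = [
    {"src_ip": "192.168.140.26", "src_mac": "00:00:5e:00:53:01", "ttl": 64},
    {"src_ip": "192.168.140.26", "src_mac": "00:00:5e:00:53:02", "ttl": 63},
    {"src_ip": "47.29.0.122", "src_mac": "00:00:5e:00:53:10", "ttl": 58},
]

if __name__ == "__main__":
    for ip, pairs in sources_with_multiple_fingerprints(packets).items():
        print(ip, sorted(pairs))
```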