Hi, I am trying to troubleshoot some issues on a Linux Apache web server (very slow response several times during the day). For testing, I am trying to load a page on my laptop (192.168.249.2) from the web server (172.18.26.41). I used Wireshark and noticed some TCP retransmissions and TCP Dup ACKs, see below. I ran the trace on both ends and saw the same results, so no packets are going missing. I can't see anything out of sequence here, and I don't understand why 26.41 (web) is sending a TCP retransmission (22461) if it has already received an ACK in packet 21453. It only does this when there is a heavy load on the server. Could this be an application (Apache) issue where port 80 becomes unavailable? Could Wireshark have seen packet 21453 come in, but because Apache is busy and port 80 is unavailable, 26.41 issues a retransmission? There doesn't appear to be any congestion on the network: we have a 100 Mb link into the MPLS and a 1 Gb link at the other end, and the highest utilisation is approx. 30%.
Any help would be appreciated.
asked 12 Jun '14, 03:18 navs_123456 edited 12 Jun '14, 03:20 grahamb ♦
4 Answers:
After some googling, I found that this could be caused by the TCP_DEFER_ACCEPT option. Please check the similar behaviour described at this link: http://webcache.googleusercontent.com/search?q=cache:zoA_A0cao28J:https://labs.ripe.net/Members/gih/the-curious-case-of-the-crooked-tcp-handshake+&cd=9&hl=en&ct=clnk&gl=in
answered 14 Jun '14, 00:35 kishan pandey

Very good find!! Thanks a lot for that link for several reasons
BTW: How did you find that article? What did you search for? Hint: I converted your comment to an answer, as it is an answer in itself. It does not explain what is causing the problem, but it might help to do so! Please read the FAQ for how this site works. (15 Jun '14, 03:06) Kurt Knochner ♦

Thanks Kurt for the encouraging words for a learner like me. I googled for "retransmit syn ack even after receiving ack" and it worked. (15 Jun '14, 05:55) kishan pandey

Cool. Keep going ;-) (15 Jun '14, 06:18) Kurt Knochner ♦
Based on the answer of @kishan pandey (TCP_DEFER_ACCEPT), here is a new attempt to explain what could have happened: TCP_DEFER_ACCEPT makes the server wait for a data packet from the client, which never arrives, so the SYN-ACK is sent again. Now, the interesting question is: Why is there no data packet (HTTP GET, POST, HEAD) from the client? A possible answer: Because it's not the client who ACKs the SYN-ACK, but 'something' between the client and the server. And now we should have a look at your Fortinet Firewall (concluded from the MAC address used in the capture file). Maybe the Fortinet acts as a TCP or SYN proxy, because there is a content scanning feature enabled on that firewall and that's causing problems.
Maybe the client sends a SYN and the firewall forwards that SYN to the server. The server answers with a SYN-ACK (wrong checksum in the capture file), which gets answered by the firewall (SYN proxy, TCP proxy) with an ACK. But then the firewall does not forward the SYN-ACK to the client and also drops any further SYN (retransmit) from the client, for whatever reason (Hint: the SYN-ACK from the server shows a wrong checksum in the capture file, which could be real or caused by TCP offloading on the server). As the client has no chance to send the data packet (it has never seen the SYN-ACK), the server will resend its SYN-ACK due to the TCP_DEFER_ACCEPT option. So, the key to analyzing this problem is a capture file taken at the client and the server in parallel, to check who is doing what.
@navs_123456: Can you please post both capture files (client and server), so we can check my new theory :-)
Regards
answered 15 Jun '14, 03:24 Kurt Knochner ♦ edited 15 Jun '14, 03:32
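To make the TCP_DEFER_ACCEPT theory above more concrete, here is a minimal sketch of how a server process asks for that behaviour on a listening socket. It assumes Linux and Python's standard `socket` module. The function name `make_deferred_listener` and the 5 second value are made up for illustration and are not taken from the Apache configuration in this thread. With the option set, the kernel is expected to hold back `accept()` until the client sends data, so a bare handshake ACK does not complete the connection from the application's point of view.

```python
import socket


def make_deferred_listener(host="127.0.0.1", port=0, defer_seconds=5):
    """Create a listening TCP socket and enable TCP_DEFER_ACCEPT where available.

    Returns (socket, deferred), where deferred tells whether the option was set.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    deferred = False
    # TCP_DEFER_ACCEPT is Linux-specific; the constant is missing on other platforms.
    opt = getattr(socket, "TCP_DEFER_ACCEPT", None)
    if opt is not None:
        try:
            # The value is a timeout in seconds; the kernel waits for client data
            # before handing the connection to accept().
            srv.setsockopt(socket.IPPROTO_TCP, opt, defer_seconds)
            deferred = True
        except OSError:
            # Option known to Python but refused by the running kernel.
            deferred = False

    srv.bind((host, port))  # port=0 lets the OS pick a free port
    srv.listen(16)
    return srv, deferred
```

Connecting to such a listener with a client that never sends data (for example a plain `nc` session left idle) while capturing on the server should, if the theory holds, show the same pattern of repeated SYN-ACKs after a valid ACK.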
Looks like the ACK for the SYN-ACK does not get through to the server, so the SYN-ACK is sent again and again. You have created the capture file on (or near) the client. What do you see in the capture file on (or near) the server at the same time?
Regards
answered 12 Jun '14, 05:19 Kurt Knochner ♦ edited 12 Jun '14, 07:10

Hi Kurt, this is the capture on the Linux web server 172.18.26.41 (taken with tcpdump). This is what puzzles me: the server is actually seeing the ACK for the SYN-ACK, or at least Wireshark can see it. (12 Jun '14, 08:02) navs_123456
Ah, you're right. It's on the server. I messed up the delta times of SYN-ACK and ACK ;-)
Yes, that's 'strange'. Is there any firewall (iptables) on that server? What is the output of the following commands?
and
Furthermore:
Do you see an 'increasing' number of the following?
Do you see an 'increasing' number of the following?
(12 Jun '14, 08:18) Kurt Knochner ♦

BTW: One thing I found in the capture file. The TSval values of the SYN and ACK frames are identical although the time delta is ~4 ms. I've never heard about a Linux kernel feature that drops frames like these, but you never know. What do you see in the capture files from the same client while a connection works? Is the TSval value identical as well (maybe set/updated by your Fortinet)? (12 Jun '14, 08:38) Kurt Knochner ♦

One more thing, as I have seen similar things recently. How many entries do you see in the output of the following command, while the system behaves like you described it?
How many of them are in TIME_WAIT and how many in ESTABLISHED and SYN_SENT state?
(12 Jun '14, 08:50) Kurt Knochner ♦

Hi Kurt, I don't believe any firewall is enabled.

iptables -L -nv
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination

conntrack -L doesn't appear to exist.

netstat -ns -w: this looks OK to me.
Ip:
netstat -ns -t Tcp: looks like a few issues here.
(13 Jun '14, 02:43) navs_123456

Hi, out of the 900 entries in netstat -nat I had the following (this was during the busy period)
(13 Jun '14, 04:48) navs_123456

Hi, also my comment about the TSval seems to have disappeared. Basically it said that I ran the test twice during a time when the connection was working normally. Both times the delta between the SYN and ACK was 3 ms; once the TSval incremented by 1, and the second time it didn't increment. (13 Jun '14, 04:56) navs_123456

O.K. let's sum it up:
There is no firewall on the server, so nothing that could block the (valid) ACKs.
There is no problem with the number of sessions, regardless of their state, as even 510 connections in TIME_WAIT is nothing for a busy web server.
There are some 'interesting' TCP counters (like 5100 failed connection attempts), but the reason for those is unknown. Could be related to your problem.
You are capturing on the server, so the ACKs should get through to the OS, as Wireshark sees them as well.
There seems to be no problem with the TSval time stamps, as you have observed working connections with the same TSval value for the SYN and the ACK frame (at least that's how I understand your comment).

Unfortunately I'm running out of ideas. You could try to enable debug logging in Apache and check if you get any results. However, as long as the OS does not accept the 3-way-handshake ACK, I doubt that Apache will ever get 'notified' about the new connection. So, to me it looks like a problem in the OS (kernel), but I don't have any explanation to offer, as everything looks good. One last thing: You can try to disable TCP offloading, if it is enabled, and then check if things are getting better.
(13 Jun '14, 06:30) Kurt Knochner ♦

Hi Kurt, yes, you have understood everything correctly. I am wondering about the 126 SYN_RECV entries, which suggest the OS hasn't seen an ACK for these SYNs. I will do some more troubleshooting and try to marry up a new Wireshark capture with the netstat -nat output, to see whether Wireshark sees the ACKs come in from the client while netstat -nat still shows the connection sitting in SYN_RECV. This, I guess, would confirm it's some sort of OS issue. Thanks for your help. (13 Jun '14, 08:27) navs_123456
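Counting connection states by hand gets tedious when the check has to be repeated alongside a capture. Below is a minimal sketch that tallies the state column of `netstat -nat` output. It assumes the usual Linux net-tools layout with six whitespace-separated columns; the function name `count_tcp_states` and the sample rows are invented for illustration and are not the poster's data.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connection states in the text printed by `netstat -nat`."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Connection rows start with "tcp"/"tcp6"; the sixth column is the state.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# Invented sample rows, only to show the expected input shape.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 172.18.26.41:80         192.168.249.2:51001     SYN_RECV
tcp        0      0 172.18.26.41:80         192.168.249.2:51002     TIME_WAIT
tcp        0      0 172.18.26.41:80         192.168.249.2:51003     SYN_RECV
"""

sample_counts = count_tcp_states(SAMPLE)
```

Feeding it the saved output of `netstat -nat` taken at the same moment as a capture gives a SYN_RECV count that can be lined up against the handshake ACKs visible in Wireshark.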
Yes, that could be another sign for your problem. As I cannot find any problem within the ACK frame (checksum O.K., etc.) I have no explanation to offer why the OS might drop that ACK frame.
Agreed. BTW: What happens if you restart Apache while you see that many connections in SYN_RECV state? BTW: Did you check/disable TCP offloading? (13 Jun '14, 09:18) Kurt Knochner ♦

One more thing... Can you please run the following command and post the output for some of the SYN_RECV connections?
(13 Jun '14, 09:22) Kurt Knochner ♦
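The question in the comments above about 'increasing' counters can be answered by diffing two snapshots instead of comparing `netstat -s` output by eye. This is a minimal sketch, assuming the counters are read from `/proc/net/snmp`, where the `Tcp:` section is a header row followed by a value row. The function names are hypothetical, and `AttemptFails` is, as far as I know, the counter that `netstat -s` prints as "failed connection attempts".

```python
def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' rows of /proc/net/snmp into a {name: value} dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a 'Tcp:' header row and a 'Tcp:' value row")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] != before[name]}


def read_tcp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_tcp_counters(handle.read())
```

Calling `read_tcp_counters()` once before and once during a slow period, then passing both results to `counter_deltas`, shows which counters (for example `AttemptFails` or `RetransSegs`) actually move while the problem occurs.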
Maybe the TCP or IP checksum is incorrect. The trace file is gone again, so I couldn't check myself.
OK, found the trace by removing the '.' from the URL, and the checksums are correct.
answered 12 Jun '14, 13:13 mrEEde edited 13 Jun '14, 12:01

Isn't the idea of SYN cookies to have no queue? So, if SYN cookies are enabled, there should be no SYN_RECV at all, as the socket will be 're-created' only after a valid ACK has been received. At least that's how I understand it. I might be wrong... (13 Jun '14, 13:47) Kurt Knochner ♦

From http://lwn.net/Articles/277146/: "Due to this limitation, and the modest computation overhead of the cryptographic hash, the Linux stack only resorts to syncookie based connections when the number of half-open connection exceeds a high watermark controlled by the net.ipv4.tcp_max_syn_backlog sysctl." I was thinking that the syn_backlog might be sitting at 128 and only then the SYN cookie code would kick in. But ... I might be wrong ;-) It should be easy enough to try it out and feed back, though. (14 Jun '14, 05:51) mrEEde
Good point regarding the 'late use' of SYN cookies! But then, if the number of half-open connections is the trigger for using SYN cookies, how would disabling SYN cookies fix the problem that caused the half-open connections? ;-)) (15 Jun '14, 03:03) Kurt Knochner ♦
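To test the idea that the SYN backlog sits near its limit, the two sysctls mentioned above can be read and compared with the observed number of SYN_RECV entries. This is a minimal sketch, assuming the standard Linux paths under `/proc/sys/net/ipv4`. The function names and the 90% threshold are arbitrary choices for illustration, and the real per-socket queue limit may also depend on the `listen()` backlog, so treat the result as a hint only.

```python
from pathlib import Path


def read_syn_settings(root="/proc/sys/net/ipv4"):
    """Read tcp_syncookies and tcp_max_syn_backlog; None if a value is unreadable."""
    settings = {}
    for name in ("tcp_syncookies", "tcp_max_syn_backlog"):
        try:
            settings[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


def backlog_nearly_full(syn_recv_count, settings, threshold=0.9):
    """Return True when the SYN_RECV count is at or above threshold * backlog."""
    backlog = settings.get("tcp_max_syn_backlog")
    if not backlog:
        # Unknown or zero backlog: no basis for a comparison.
        return False
    return syn_recv_count >= threshold * backlog
```

With the 126 SYN_RECV entries reported in this thread and a hypothetical backlog of 128, `backlog_nearly_full(126, {"tcp_max_syn_backlog": 128})` returns True, which would fit the watermark theory; the actual value on the server still has to be read from the system.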
It’s time consuming to do an analysis based on text exports and screen shots. Can you please post a pcap file somewhere, like Google drive, dropbox or cloudshark.org?
Hi, try this: https://dl.dropboxusercontent.com/u/48120370/172.18.26.41-DuringSlow-11am-spec.pcap. I have cleaned the file, so the packet numbers have changed.