Hello. Many times my server responds the way described in this question's title (at other times things work just fine). My topology is something like this (simplified; I'm not the network admin):
According to Apache stats, there are still free workers/threads to receive requests, and CPU/RAM usage is quite normal. Please take a look at the following image (from a tcpdump capture on one of the web servers, opened in Wireshark) and share some ideas about what the issue might be (I've struggled with this for several weeks now).
Capture file uploaded here: https://www.cloudshark.org/captures/4a87072e66c5
asked 08 Nov '16, 14:05 diazdw
edited 11 Nov '16, 07:30
3 Answers:
There are a couple of things that look out of the ordinary to me in this trace:
- Retransmission of the SYN/ACK
- Long delay before first HTTP request
- Wrong ACK number on HTTP request
Next troubleshooting steps could be:
answered 11 Nov '16, 03:18 SYN-bit ♦♦

Great answer! There is an interesting phrase in the article Sake has mentioned: "...some configurations that involve server farms and front end load balancers assume that there is a clean separation on the initial TCP handshake and the subsequent transaction and there are complex failure modes that arise when this option is used in such a case..." This is very close to our case.
It looks like the handshake is made by the ACE itself, and then the ACE starts to forward requests directly from the client (look at the TTL behavior). Acting this way requires manipulating SEQ and ACK numbers so that the ACE's own packets and the client's packets (arriving on the server side) match up. Maybe the server's "out-of-order" SYN-ACK retransmissions somehow confuse the ACE's state machine, which leads to a bug in its SEQ/ACK manipulation mechanism.
If you can, just try to turn the TCP_DEFER_ACCEPT setting off and spot the difference. If you can't, try to spot the following pattern: a SYN-ACK retransmission seen BEFORE the GET request will cause RSTs from the server.
(11 Nov '16, 07:56) Packet_vlad

@SYN-bit Thanks for your deep review. Regarding the 3 issues you've pointed out:
I'll do some more troubleshooting as you suggest and see what I find out. Thanks again for your time and help.
(11 Nov '16, 08:01) diazdw
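
For readers unfamiliar with the TCP_DEFER_ACCEPT option mentioned above, here is a minimal Python sketch of how a listening socket turns it on or off on Linux. As far as I understand it, with the option enabled the kernel holds a connection back from accept() until data arrives, and a bare handshake ACK may not complete the connection by itself. The function names, host, port and backlog are hypothetical and only for illustration; this is not how Apache itself is configured.

```python
import socket

def make_listener(defer_seconds, host="127.0.0.1", port=0):
    """Create a listening TCP socket, optionally with TCP_DEFER_ACCEPT.

    defer_seconds > 0 asks the kernel to wake the listener only once data
    has arrived (bounded by roughly that many seconds); 0 leaves it off.
    host/port are illustrative defaults (port 0 = any free port).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    opt = getattr(socket, "TCP_DEFER_ACCEPT", None)  # Linux-only constant
    if opt is not None:
        sock.setsockopt(socket.IPPROTO_TCP, opt, defer_seconds)
    sock.bind((host, port))
    sock.listen(16)
    return sock

def defer_accept_value(sock):
    """Read the option back; returns None where the platform lacks it."""
    opt = getattr(socket, "TCP_DEFER_ACCEPT", None)
    if opt is None:
        return None
    return sock.getsockopt(socket.IPPROTO_TCP, opt)
```

Comparing captures taken with the option off (value 0) and on is one way to "spot the difference" suggested above.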
As the SEQ of the TCP/RST does not seem to match the SEQ and ACK of the 3-way handshake, I would like to see the real trace file to look at the SEQ and ACK of the HTTP request. I would also like to look at the IP TTL to see whether an intermediate device might be sending the TCP/RST, and at the ip.id to help in the analysis. In short, no good analysis can be done (at least not by me) based on the screenshot alone. Too much important information is missing...
answered 09 Nov '16, 08:12 SYN-bit ♦♦

Yes, I think so too.
(09 Nov '16, 08:18) Christian_R

@SYN-bit & @Christian_R I just uploaded the (original tcpdump) capture to Cloudshark: https://www.cloudshark.org/captures/4a87072e66c5 . Please take a look at it.
(09 Nov '16, 08:28) diazdw
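
The TTL check suggested in this answer can be sketched roughly as follows. The idea is that senders usually start from a small set of initial TTL values, so the observed TTL hints at how many hops away the sender is, and two packets of one connection with different hop estimates probably did not come from the same device. The list of initial TTLs (64, 128, 255) is an assumption based on common defaults, the function names are hypothetical, and the sample TTLs are illustrative.

```python
# Assumed common initial TTL values; real stacks may use others.
COMMON_INITIAL_TTLS = (64, 128, 255)

def estimate_hops(observed_ttl):
    """Return (assumed_initial_ttl, hops_travelled) for an observed TTL."""
    if not 0 < observed_ttl <= 255:
        raise ValueError("TTL must be in 1..255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl
    raise ValueError("unreachable for TTLs in 1..255")

def same_sender_likely(ttl_a, ttl_b):
    """True if both packets suggest the same initial TTL and hop count."""
    return estimate_hops(ttl_a) == estimate_hops(ttl_b)

# Illustrative values: a locally generated packet, a packet one hop away
# from a 128-TTL sender, and one five hops away from a 64-TTL sender.
for ttl in (64, 127, 59):
    print(ttl, estimate_hops(ttl))
```

This is only a heuristic: a proxy or load balancer that rewrites TTLs will distort the estimate, which is itself a useful hint that such a device is on the path.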
It looks like the server connection is never opened. That's why the server is sending those resets. In Frame 71, the server sends its SYN-ACK, but then resends it, according to the screenshot you provided, in Frame 73. But it also looks like the server did get the ACK from the client, according to the capture.

Was this capture taken directly on a specific web server? The SYN-ACK being resent and the client's ACK not being acknowledged by the server suggest that the capture was taken outside of the web server farm, likely on the FW or LB. And there seem to be some retransmissions as well. So if the capture was not taken on a specific web server and there are retransmissions, it's likely that Frame 72 is never received by any server and is actually dropped, but you just don't see it. Therefore, when the client sends its HTTP GET, the server doesn't OK it because the connection is not open yet. That's why in Frame 80, the server resends the SYN-ACK in another attempt to open the connection.

So I would verify that you're capturing data from a specific web server, if you can, and also look into those retransmissions. What is causing them? Since you're not the network admin, this is something you can probably bring up with that person. As Jaap mentioned, a capture you can share would help to take a better look at this, but I think that's what is happening here.
answered 09 Nov '16, 07:41 jeantunis

@jeantunis It looks to me like the (2nd) client ACK made its way to the server NIC, but somehow that ACK never climbs up the TCP stack. The capture was taken directly on a specific web server. I have shared the original capture, made with tcpdump, at https://www.cloudshark.org/captures/4a87072e66c5
(09 Nov '16, 08:26) diazdw

Hm, if the capture was taken at the web server, then the problem may be inside the system, after the capture point. I came to the same finding as @jeantunis.
(09 Nov '16, 09:11) Christian_R

This is a very interesting capture.
It seems that we're on the server side (on the server itself, actually). I think so because of: 1) timing analysis of the 3-way handshake; 2) TTL of outgoing packets = 64; 3) IP packets of 2960 bytes in size (in the working connection sample, which means we're capturing before the NIC does LSO). But in that case, how could an ACK (frame no. 72) that we've already seen in the capture be dropped? Only somewhere up the server's IP stack, after the capture point.
And two more points:
- Packet 74 (GET) has a TTL of 59, not 127 as packet 72 had.
- Packet 74 also has a wrong ACK of 2106390967, whereas the initial ACK packet 72 had 2962498563.
It looks like these two packets have different sources. How could that be?
(10 Nov '16, 02:46) Packet_vlad

Just looked at the working stream. It contains the same TTL transition; probably some proxy is involved on the path.
(10 Nov '16, 02:50) Packet_vlad

@packet_vlad: it seems that there is some kind of virtual environment. The lost ACK: yes, it is strange. Maybe it is some kind of driver issue.
(10 Nov '16, 02:55) Christian_R

@Christian_R is correct. There does seem to be some sort of virtual environment with VSS Monitoring. And in that environment, you need to be careful how and where you capture data.
The second thing concerns the retransmission timeout and IP ID. If you look at frame 74, originally sent at 161.2 seconds, we would expect a retransmission anywhere between 1 and 3 seconds later, depending on the TCP implementation, if there's no response from the server. The retransmission happens about 1.5 seconds later, at 162.6. With the backoff algorithm, we should expect a retransmission after 3 seconds, then 6 seconds, then 12 seconds, and so on. And the IP IDs should increment accordingly. Everything happens the way it should until after the retransmission in frame 83 at 165.5 with IP ID = 3334. The next frame we see from the client is frame 85 at 182.5 with IP ID = 3336. What happened to the frame that should have occurred around 171.5 (or so) with IP ID = 3335? It doesn't exist. The packet capture doesn't have it. And there could be a number of reasons for this. So wherever this capture was taken, whether on a physical or virtual server, it doesn't look like you can completely rely on the capture taken there.
My suggestion would be to 1) capture as close to the server as possible, but not on it, and 2) capture at multiple points. Based on the diagram you showed, you have a FW and ACE boxes that are manipulating the packets, and that's just what we know. There could be other things we don't know.
You want to be able to trace a stream of packets going from the edge of your network across the FW, ACE and anything else, all the way to the server.
Last, I could be wrong, but I don't think frame 72 was seen by the capture and then got dropped on its way up the stack. That's unlikely to happen for just one client and one TCP connection. I also don't think it's a NIC driver issue, because other communication between the client and server is happening without any problems earlier in the capture. If this is a physical server, and you are sure the issue is there, you should narrow your focus to that box with a profiling tool along with tcpdump. But don't just capture the communication between the server and client; capture everything to see what's happening to other clients as well. That could clue you in to whether this is server-related (like any recent changes to the server farm), network-related (like any recent network changes) or something else entirely.
(10 Nov '16, 08:52) jeantunis
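
The retransmission-timing and IP ID reasoning in the comment above can be reproduced with a small sketch. It assumes a pure doubling backoff starting from the first observed gap (162.6 - 161.2 = 1.4 s) and an IP ID that increases by one per packet without wrapping; both are simplifications, and the function names are hypothetical.

```python
def backoff_schedule(first_sent, initial_rto, retries):
    """Expected retransmission times, doubling the timeout after each try."""
    times = []
    t, rto = first_sent, initial_rto
    for _ in range(retries):
        t += rto
        times.append(round(t, 1))
        rto *= 2  # exponential backoff (assumed pure doubling)
    return times

def missing_ip_ids(seen_ids):
    """IP IDs skipped between consecutive observed packets (no wrap handling)."""
    missing = []
    for prev, cur in zip(seen_ids, seen_ids[1:]):
        missing.extend(range(prev + 1, cur))
    return missing

# Timestamps and IP IDs quoted in the comment above.
print(backoff_schedule(161.2, 1.4, 4))
print(missing_ip_ids([3334, 3336]))
```

Under these assumptions the schedule comes out close to the observed retransmissions at 162.6, 165.5 and 182.5, with one slot near 171 that has no matching frame, and the only skipped IP ID is 3335. That fits the point that a packet was sent but is absent from the capture.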
Here is a similar problem: https://ask.wireshark.org/questions/43648/spurious-retransmission-and-dup-ack
@Christian_R Thanks for commenting. I'd checked the thread you posted beforehand. The problem they are discussing is about a delayed client response, and no solution was found regarding the server side. In my case, I don't see that delay; I think the client's 2nd ACK is ignored/lost in my case too, though :-( I don't know why yet...
Can you share a capture in a publicly accessible spot, e.g. CloudShark?
If you have a layer 4 device in the path it is often a good idea to take a capture in front of and behind the layer 4 device.
@Jaap Capture file uploaded to CloudShark. Please check the link added to main question.
@Christian_R The device immediately before clients reach web servers is a Cisco ACE load balancer. Do you know if capture by using port mirroring is possible there?