A customer in Taiwan is connecting to my secured web server using Chrome. We're running JavaScript (compiled with GWT) in the browser session, sending requests back to the server via the XmlHttpRequest object. There seems to be one particular code path that gets "stuck": it will seem to be working fine, and then some change (presumably to the data being sent) will "break" the communication.

In the httpd access logs, we see 400 responses to these requests, sent after the 120 second timeout elapses. The access log also reports the SSL_PROTOCOL (TLSv1) and the SSL_CIPHER (DHE-RSA-AES256-SHA). In the access log we can actually see the request line and some of our custom cookies, so the message headers in the request appear to be complete. In the httpd error logs, we see a bunch of "(70007)The timeout specified has expired: proxy: prefetch request body failed" as mod_proxy tries to pass the request through to the application server on the back end.

We set up a dumpcap job (sketched below) which records traffic from this particular IP block. Because I wasn't sure how much disk would be necessary, I've been using a snaplen of 66 bytes (the data is supposed to be encrypted, so I wasn't expecting any more than that to be useful).

Reviewing the capture with Wireshark tells me that each of these failures matches a particular pattern of packets. For example:

Frame 3004: a packet from the client, Len: 69. The gap between this packet and the previous one in the tcp.stream is about 9 seconds, so this is clearly the beginning of the problem request (the timestamp matches the one found in the httpd logs). Wireshark identifies the protocol as "SSL".

Frame 3005: a packet from the client, Len: 377. Wireshark identifies the protocol as "TLSv1". It also reports "TCP Previous segment lost". Some quick math on the actual sequence number vs. the expected sequence number suggests about 1530 bytes of data are absent. This packet arrived about 40 ms after frame 3004, and it is the last packet we see from the client in this tcp.stream.

Frame 3006: an ACK sent to the client. Its sequence number is ...2669, which appears to be the acknowledgement number carried by both inbound packets in frames 3004 and 3005. The acknowledgement number of this outbound packet is ...2973, which matches the "next sequence number" for frame 3004 (consistent with the reported gap: the server can only acknowledge data up to the end of frame 3004).

Then there is a two minute pause with no packets exchanged.

Frames 3140/3141: the server sends a pair of SSL packets, which presumably is the status 400 page.

Frame 3142: FIN/ACK.

Frames 3143 and onward: throttled retransmissions of the packet sent in frame 3140.

One other data point: about 20 seconds after the problem request, we get an HTTP request reporting that this logic failed. Our client-side code includes a timer object which times out after about 15 seconds, at which point we drop into logic to report the error (no problems there), followed by a retry attempt (packet sizes on subsequent attempts can be slightly different because the client data changed during the 20 second interval). The retry attempt failed with the same symptoms.

As far as I can tell, once we get into this state, the problem persists until the client's browser process is killed and restarted. The server, during the problem interval, is happily fielding other requests, including those of the same type that are giving us problems here. And I'm stumped -- this doesn't match anything in my experience.
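For reference, the capture job is roughly along these lines; the interface name, address block, output path, and ring-buffer settings below are placeholders for illustration, not our actual configuration:

    # dumpcap job as currently configured (placeholder values throughout)
    # -s 66 truncates each captured frame to 66 bytes
    dumpcap -i eth0 -s 66 -f "net 203.0.113.0/24" \
            -b filesize:102400 -b files:20 -w /var/tmp/client-block.pcap

With a typical Ethernet + IPv4 + TCP header stack (14 + 20 + 20 bytes, plus any TCP options), a 66-byte snaplen leaves little or none of the TLS record layer in the capture, so there isn't much for Wireshark to dissect beyond the TCP headers.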
I don't have access to the client network, so I can't currently see what the traffic looks like from that end; we haven't been able to identify a trigger that would allow us to reproduce the problem elsewhere, and my Google searches don't seem to find a match for these symptoms.

asked 16 Jun '11, 13:49 DanilSuits
2 Answers:
This seems to be the most interesting part: 1530 bytes missing? That's an odd MTU size...

answered 16 Jun '11, 14:37 Jaap ♦
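For what it's worth, assuming plain IPv4 with no IP options, the usual per-segment payload on a 1500-byte MTU path works out to:

    1500 - 20 (IPv4 header) - 20 (TCP header)                   = 1460 bytes
    1500 - 20 (IPv4 header) - 20 (TCP header) - 12 (timestamps) = 1448 bytes

so a gap of roughly 1530 bytes is a little more than one full-sized segment and well short of two, which is what makes it an awkward number for a single lost packet.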
You might want to use a higher snaplength so that you will at least be able to identify the SSL handshake messages, which will give you more insight into the data stream. Also, a packet capture file (or at least some more output lines) of the complete TCP session will help us to help you...

answered 19 Jun '11, 14:47 SYN-bit ♦♦
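Building on that suggestion, once a capture with a full snaplen exists, something like the following would show the handshake and record layer for the suspect session. The interface, address block, file names, and stream number are placeholders, and depending on the tshark version the display-filter switch is -R (1.x) or -Y (newer releases):

    # re-run the capture without truncating the frames
    dumpcap -i eth0 -s 65535 -f "net 203.0.113.0/24" -w /var/tmp/client-block-full.pcap

    # then decode just the problem TCP stream in full detail
    tshark -r /var/tmp/client-block-full.pcap -R "tcp.stream == 12" -V

Even with the payload encrypted, the handshake messages, record lengths, and retransmission pattern in that output should help narrow down where the missing ~1530 bytes went.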
Plus or minus 20 bytes, for today's examples (referring to the ~1530-byte gap). I also need to see whether that estimate jibes with the successful saves we saw prior to the incident / after the restart.