We are having a strange issue with one of our clients. We have an SSL app that accepts an XML POST, processes the data and sends the response back to the client. We recently moved data centers and are now having an intermittent issue with one customer (our largest, of course!). While most transactions complete in less than 6 seconds, a small percentage take 20 seconds.

Analysis of those transactions shows some very odd communication. Everything starts out fine: we receive the request, process it and start streaming data back. About a second into the response streaming we start to see duplicate ACKs, fast retransmissions and retransmissions. This gets progressively worse until it becomes an ACK storm. For 16 seconds the client sends two duplicate ACKs (for the same original ACK) and our server responds with two ACKs. This continues until our server retransmits a 78-byte data packet and communication normalizes. The pattern, including the retransmission of a small packet, is fairly consistent. I have at least one capture that shows over 3400 dup ACKs to the same original!

Some duplicate ACKs are not unexpected. We have a DMZ firewall that is connected via a multi-gigabit EtherChannel, and with this particular firewall that results in out-of-order packets. We have taken captures at the host, at the Internet router and in between, and we do not see packet loss occurring in our network. We have pinned the traffic to each of our two ISPs without any change in performance. We have now opened a ticket with our ISP. We have requested a capture from the customer but have not yet received it.

Many references to ACK storms suggest a man-in-the-middle attack. Without a capture from the client I cannot confirm or rule out whether this is occurring. In some of the ugly captures there are a few ACKs from the customer that have the PSH bit set (it is not set on the bulk of the ACKs) and that have a different TTL than the other packets from this customer.
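For anyone who wants to compare a "bad" capture against a "good" one by the numbers, here is a rough sketch of how the duplicate-ACK counts could be tallied once the packets have been exported from the capture tool. The record layout (`src`, `dst`, `ack`, `payload_len`, `flags`) and the function name are assumptions made for illustration, not the output format of any particular tool, and the check is simplified: it counts repeated zero-payload ACKs per direction and ignores window size and SACK options.

```python
from collections import Counter


def count_duplicate_acks(packets):
    """Return {(src, dst, ack): duplicates} for pure ACKs seen more than once.

    `packets` is a hypothetical list of dicts exported from a capture.
    """
    seen = Counter()
    for pkt in packets:
        # Only zero-payload segments whose sole flag is ACK are candidates.
        if pkt["payload_len"] == 0 and pkt["flags"] == "A":
            seen[(pkt["src"], pkt["dst"], pkt["ack"])] += 1
    # The first ACK for a given number is the original; the rest are duplicates.
    return {key: n - 1 for key, n in seen.items() if n > 1}


# Made-up sample records, only to show the expected input shape.
sample = [
    {"src": "client", "dst": "server", "ack": 1000, "payload_len": 0, "flags": "A"},
    {"src": "client", "dst": "server", "ack": 1000, "payload_len": 0, "flags": "A"},
    {"src": "server", "dst": "client", "ack": 500, "payload_len": 0, "flags": "A"},
    {"src": "client", "dst": "server", "ack": 1000, "payload_len": 0, "flags": "A"},
    {"src": "server", "dst": "client", "ack": 500, "payload_len": 78, "flags": "PA"},
]

dups = count_duplicate_acks(sample)
for (src, dst, ack), n in dups.items():
    print(f"{src} -> {dst} ack={ack}: {n} duplicate ACK(s)")
```

Running the same tally on a capture from each vantage point (host, Internet router, and eventually the customer side) would show whether the counts agree or whether extra ACKs appear somewhere along the path.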
Obviously I really want to see things from the customer's perspective. Does anyone have any additional suggestions? Thank you!

asked 08 Jul '11, 09:28 ericinsd
One Answer:
First of all - good luck!

Are you using a load balancer or any kind of Layer 4 firewall? We've had problems in the past with certain devices (points finger at DataPower boxes) behaving very strangely when doing XML scrubbing over SSL with digital certificates. Our eventual band-aid was to reboot the boxes every Friday evening, which would stop all issues for about a week. A firmware update eventually provided a permanent fix.

I think you're on the right track by looking at the path. How far off are the TTLs? Don't worry about the PSH bits; those may or may not be a symptom of the issue.

When you look at the total amount of data transmitted from you to them, does a "bad" capture resemble a "good" capture? Have you performed a double-sided capture, one on your side and one on the client side? How do those captures compare? Is the end client actually sending all of those ACKs? Is there a VPN in play somewhere, and could this be a simple MTU issue?

answered 12 Jul '11, 05:33 GeonJay
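To put a number on the "how far off are the TTLs?" question, something along these lines could be run over the exported packets. This is only a sketch: the field names (`index`, `src`, `ttl`, `flags`), the function name and the sample values are made up for illustration. It takes the most common TTL per source as the baseline and lists every packet that deviates from it.

```python
from collections import Counter, defaultdict


def ttl_outliers(packets):
    """Per source, report packets whose TTL differs from that source's usual TTL."""
    by_src = defaultdict(list)
    for pkt in packets:
        by_src[pkt["src"]].append(pkt)

    report = {}
    for src, pkts in by_src.items():
        # Treat the most frequently seen TTL as the baseline for this source.
        usual_ttl, _ = Counter(p["ttl"] for p in pkts).most_common(1)[0]
        odd = [
            (p["index"], p["ttl"] - usual_ttl, p["flags"])
            for p in pkts
            if p["ttl"] != usual_ttl
        ]
        if odd:
            report[src] = {"usual_ttl": usual_ttl, "odd": odd}
    return report


# Made-up sample: three ordinary ACKs and one PSH+ACK with a different TTL.
sample = [
    {"index": 0, "src": "client", "ttl": 52, "flags": "A"},
    {"index": 1, "src": "client", "ttl": 52, "flags": "A"},
    {"index": 2, "src": "client", "ttl": 52, "flags": "A"},
    {"index": 3, "src": "client", "ttl": 115, "flags": "PA"},
]

report = ttl_outliers(sample)
print(report)
```

A difference of a hop or two could just be a routing change, while a large jump may point to a packet that was generated by some device other than the real end host. Either way it is only a hint until the client-side capture is available for comparison.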