I have been trying to diagnose a SQL timeout issue, and started a capture on one web server tonight. Ten minutes after starting, we began receiving monitoring pages and customer calls saying that we were offline. The issue affected all subnets, crossing VLANs, the Cisco ASA firewalls between subnets, and the Cisco ACE load balancers fronting a dozen web services. Various applications were unable to connect to the backend web services (through the load balancers and firewalls, with services both in the same subnet/VLAN and crossing through routers/firewalls/load balancers). As soon as we stopped capturing, all issues were resolved and have not recurred.

Our application logs contain many connection errors across all services (all backend services were sporadically affected); there were no errors on any switch, firewall, or load balancer; no unusual packet errors at the switches; and no server had any OS-level errors (all servers are Windows 2003/2008). I am at a loss to explain what happened; my team and our hosting provider cannot identify anything after several hours of investigation. We are hosted in a highly available datacenter with all Cisco equipment installed according to industry standards; I have no reason to believe that the equipment is causing our issues (without logging them). There are no spikes in inbound or outbound traffic that would suggest anything nefarious or unusual.

The timing is too coincidental to dismiss: the only time the issue occurred was while a Wireshark capture was running. There is a post at http://ask.wireshark.org/questions/8892/wireshark-causing-network-problems that describes a similar problem. Has anyone else ever seen something like this? We are baffled and looking for any help that may be out there. (I can't do any of the tests suggested in the linked post until our next maintenance window, in case the capture was the cause of the problem.)

asked 05 Apr '12, 22:44 bdstark
One Answer:
It is kinda hard to tell what happened without having a look at the captured data, and maybe even that won't help, since it could be the capturing machine that caused the effect. Usually, when someone says that the whole network got into trouble at the same time, the first thing I ask about is the spanning tree: you could look for a topology change that could explain the drastic effect on the network (the switches should be able to tell you that).

The other, more important thing I keep telling the students in my network analysis classes is this: never capture on a system that is part of the problem you want to solve. Always use a third, completely passive capture device, such as an additional PC or appliance, with the capture card being read-only. That is something professional capture cards do anyway, but you can simulate it on a normal PC NIC by removing all protocol bindings from it. That way your capture device can't interfere with anything the network does. It just listens, like a doctor using his or her stethoscope.

answered 06 Apr '12, 04:34 Jasper ♦♦ edited 06 Apr '12, 04:34
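
As a rough illustration of the spanning tree check mentioned above, here is a minimal Python sketch that scans saved switch output for recent topology changes. It assumes the text resembles Cisco IOS `show spanning-tree detail` output, where each VLAN block contains a "Number of topology changes N last change occurred hh:mm:ss ago" line. The function name, the sample text, and the 30-minute window are made up for illustration, and the sketch only handles ages printed as hh:mm:ss (other age formats would need extra parsing).

```python
import re
from datetime import timedelta

# Assumed line format, modeled on Cisco IOS "show spanning-tree detail" output.
TC_LINE = re.compile(
    r"Number of topology changes (\d+) last change occurred (\d+):(\d+):(\d+) ago"
)
VLAN_LINE = re.compile(r"(VLAN\d+) is executing")


def recent_topology_changes(text, window):
    """Return (vlan, count, age) for VLANs whose last change falls within window."""
    hits = []
    vlan = None
    for raw in text.splitlines():
        line = raw.strip()
        vlan_match = VLAN_LINE.match(line)
        if vlan_match:
            # Remember which VLAN block we are in.
            vlan = vlan_match.group(1)
            continue
        tc_match = TC_LINE.search(line)
        if tc_match:
            count = int(tc_match.group(1))
            hours, minutes, seconds = (int(tc_match.group(i)) for i in (2, 3, 4))
            age = timedelta(hours=hours, minutes=minutes, seconds=seconds)
            if count > 0 and age <= window:
                hits.append((vlan, count, age))
    return hits


# Hypothetical sample text, not taken from the network described above.
SAMPLE = """\
 VLAN0010 is executing the ieee compatible Spanning Tree protocol
  Number of topology changes 42 last change occurred 00:07:15 ago
 VLAN0020 is executing the ieee compatible Spanning Tree protocol
  Number of topology changes 3 last change occurred 120:14:02 ago
"""

if __name__ == "__main__":
    for vlan, count, age in recent_topology_changes(SAMPLE, timedelta(minutes=30)):
        print(vlan, count, age)
```

A VLAN that shows up here with a last change close to the time of the outage would be worth a closer look; an empty result would point away from spanning tree as the cause.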