We had an event in our network that generated an ARP storm. There were two contributors to this:

1) Gratuitous ARP requests from VMware: at 09:35:02, we started seeing gratuitous ARPs with the sending Ethernet MAC of a particular VM guest and destination Ethernet MAC = broadcast (all ff's). In ARP:
This gradually increased to a very high level from multiple VM Guests across every VLAN that the ESX host has on its trunk. Trying to understand:
2) Once this hit one of our VLANs, I had GARP replies from 5 different hosts, each with its own MAC in the Sender MAC field but all zeros in the other Sender/Target fields. This huge storm led to a collapse of our older core switches because the ARP storm reached >2GBps.

Question on this traffic is:
- Why would a host reply to the GARP in this way?

Packet dissection below for request and reply.

Thanks in advance for help.
-Tim
asked 07 May ‘16, 09:39 CMH_Tim
edited 07 May ‘16, 19:53
One Answer:
I don't think that the ARPs were the root cause. https://crnetpackets.com/2015/08/28/special-type-of-arp-packets/

answered 07 May '16, 14:10 Christian_R

Thanks, Christian, for the response and the link. My response below:

1) I believe a loop is a possibility, but we've not found proof of it. I'm scrubbing packets to see if there are any spanning tree changes noted. Nothing in our logs so far indicates any network change that would have suddenly caused a loop and triggered this. Regardless of that outcome, I do still believe that VMware's packets are not properly formatted.

2) Maybe I missed it, but that link doesn't show any packet with an all-zeros sender MAC or all zeros in the sender and target IP addresses. My understanding is that the sender should always put their own MAC at the very least.

Since posting, I've also confirmed that these packets are NOT actually originating from the guest but are generated by the ESX host (I had a local packet capture agent running on some guests during the start of this event that didn't show the GARP packets when my network taps did show them). I've done some more research, and it appears that this behavior may be related to VMware's "Notify Switch" setting, where it sends a RARP packet when a VM guest joins the network. This RARP has both sender and target MAC set to the guest MAC. Here's a reference: http://rickardnobel.se/vswitch-notify-switches-setting/

Since the original post, I've confirmed the RARPs are there and look correct (also not seen on the guest VM capture), but the GARPs are not mentioned in that reference and look wrong to me. So I'm working with our design engineers to figure out how a loop could have occurred, but I still need answers to my original questions about why the GARP requests and subsequent GARP replies were generated.

Thanks, Tim

(07 May '16, 19:51) CMH_Tim
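For anyone sorting through a similar capture, here is a minimal Python sketch of how the frame types described above could be told apart from the raw bytes. It is only an illustration under some assumptions: frames are untagged Ethernet (no 802.1Q header, which a trunk tap may well include), ARP is the usual IPv4-over-Ethernet layout, and the names `classify_arp_frame` and `build_frame` as well as the label strings are made up for this example. The "arp-probe" label follows the RFC 5227 convention of a request with an all-zero sender IP.

```python
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
ZERO_IP = bytes(4)
ZERO_MAC = bytes(6)


def classify_arp_frame(frame: bytes) -> str:
    """Return a coarse, hypothetical label for an untagged Ethernet ARP/RARP frame."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return "too-short"
    _dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype not in (ETHERTYPE_ARP, ETHERTYPE_RARP):
        return "not-arp"
    _htype, _ptype, _hlen, _plen, oper, sha, spa, tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", frame[14:42]
    )
    if ethertype == ETHERTYPE_RARP:
        # The "Notify Switch" RARP described above carries the guest MAC
        # in both the sender and target hardware address fields.
        return "rarp-self-mac" if sha == tha else "rarp"
    kind = {1: "request", 2: "reply"}.get(oper, "other")
    if spa == ZERO_IP and tpa == ZERO_IP:
        # The odd packets from this thread: own MAC, but all-zero IP fields.
        return f"arp-{kind}-all-zero-ips"
    if spa == ZERO_IP and oper == 1:
        return "arp-probe"
    if spa == tpa:
        return f"gratuitous-arp-{kind}"
    return f"arp-{kind}"


def build_frame(ethertype, oper, sha, spa, tha, tpa, dst=b"\xff" * 6):
    """Build a broadcast test frame; purely a helper for this example."""
    eth = struct.pack("!6s6sH", dst, sha, ethertype)
    arp = struct.pack("!HHBBH6s4s6s4s", 1, 0x0800, 6, 4, oper, sha, spa, tha, tpa)
    return eth + arp


if __name__ == "__main__":
    mac = bytes.fromhex("005056aabbcc")  # made-up guest MAC
    ip = bytes([10, 0, 0, 5])            # made-up guest IP
    print(classify_arp_frame(build_frame(ETHERTYPE_ARP, 1, mac, ip, ZERO_MAC, ip)))
    print(classify_arp_frame(build_frame(ETHERTYPE_ARP, 1, mac, ZERO_IP, ZERO_MAC, ZERO_IP)))
    print(classify_arp_frame(build_frame(ETHERTYPE_RARP, 3, mac, ZERO_IP, mac, ZERO_IP)))
```

Feeding it the frame bytes exported from a capture and counting labels per second would give a rough picture of which variant dominates the storm; a display filter in Wireshark can of course do the same job interactively.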
While it is hard to say why the original GARP requests have been generated (the most likely candidates are a mere bug or some malware), I'm afraid that many (if not all) implementations would respond to an ARP request containing
regardless of whether such a request has been sent as gratuitous or not. The reason is that 0.0.0.0 normally means "any of my local IPs" in local context (see section 3.2.1.3 of RFC 1122, especially the point "must not be sent except as source under specific circumstances"). So you don't even need a loop; it is enough that all machines on the LAN segment respond to a single broadcast ARP request to have a kind of LAN Smurf attack.

(08 May '16, 01:49) sindy

Well, if I were you, I would take my findings into the lab and try to get a better understanding of these ARPs and their interaction with the network devices. There are so many ARP implementations out there, and every one is a little bit different. Let's assume this is the root cause of your problem! How often do you see these requests? Does it always tear down the network? How many hosts answer? And for how long do you see this high broadcast rate?

Also, a loop can occur without any log entries. Your link was filled up with >2GBit/s of broadcast; this is normally caused by a loop. How did you stop this incident?

But I agree with you and @sindy that these ARP packets look different from normal GARPs.

(08 May '16, 12:12) Christian_R

@sindy - Thanks for the replies. That makes sense, although not every device replies like that, of course, so I think a lot of implementations are interpreting those as incorrect/not relevant and not responding.

@Christian_R - We are working with the vendor but have not had much luck getting them to figure out why they generate these packets. We do have a low volume still going on without noticeable impact, but I'm concerned that if we hit the VLAN where devices actually respond, we could see minor issues again. That being said, I'm fairly certain now that those are a symptom, no matter how strange, but not the root cause. As noted below, I'm pretty sure that you were right that the root cause was a loop.
I did go back into our logs from Friday's event and found a 3750 switch stack that logged MAC flapping shortly after the GARP storm started. The flaps were on its two uplinks to our core switches and contained the MACs of the core switches, so there was some type of loop at that time, just no logs of it before the GARPs. Switch/router engineers took a look but couldn't find any evidence of a loop, so on Saturday night we proceeded to bring redundancy back by restarting and reconnecting all links to the 2nd core switch. When we brought the 2nd core switch back into the network late Saturday night and began reconnecting the redundant links between the core and edges, all went well until that 3750 stack was connected. Once that happened, we had another GARP storm with the same impact as before.

No changes were made by anyone on Friday when this happened, but my guess is that something started the loop and nothing got logged. On Saturday's reconnect, we're certain there was no storm before the MAC flaps started again.

Thanks again for the help. Would love to hear from someone who could explain those GARP requests.

(09 May '16, 06:57) CMH_Tim

Well, more info on the ARPs can be found here:

Back to your loop:
- Sometimes high CPU load can cause a loop.
- Sometimes this load can be caused by ARP storms, when you use managed switches.
- Some Spanning Tree configurations might end in high CPU load, too.

(09 May '16, 11:24) Christian_R
Apologies for the poor formatting… first time posting, and I wasn’t sure how to maintain a clean view.
Please advise if a repost is needed.
Thanks, Jasper for the formatting assist! What button should I have hit?
Forget about the buttons, as many of the necessary ones are missing. Press “edit” under the Question or Answer post (the page layout is different when editing comments); to the right of the text entry pane with the buttons above it, there is another (read-only) pane called “Markdown Basics”, with a link to “learn more about Markdown”. Look for “code” there.
Or edit your own Question and see what formatting characters Jasper had to add so that the text would look that way.