Gratuitous ARP with Multiple Zero Fields

Question

We had an event in our network that generated an ARP storm. There were two contributors to this:

1) Gratuitous ARP requests from VMware: at 09:35:02, we started seeing gratuitous ARPs with the source Ethernet MAC of a particular VM guest and destination Ethernet MAC = broadcast (all ff's). In the ARP payload:

Sender MAC all zeros,
Sender IP all zeros,
Target MAC all ff's,
Target IP all zeros

This gradually increased to a very high level from multiple VM Guests across every VLAN that the ESX host has on its trunk. Trying to understand:

Has anyone seen this before?
Why are there All zeros in those ARP fields?
What could generate this traffic?

2) Once this hit one of our VLANs, I saw GARP replies from 5 different hosts, each with its own MAC in the Sender MAC field but all zeros in the other sender/target fields. This huge storm led to a collapse of our older core switches because the ARP storm reached >2GBps. My question on this traffic is: why would a host reply to the GARP in this way?

Packet dissection below for request and reply.

Thanks in advance for help.

-Tim

GARP Request:
 No.     RelativeTime   Delta Display  Time                          Source                Src Port Destination           vlan_id    Length Info 
2715936 96.183749      0.001018       2016-05-06 09:36:06.341627    Vmware_ac:00:57                Broadcast             2081,127   68     Gratuitous ARP for 0.0.0.0 (Request)
Frame 2715936: 68 bytes on wire (544 bits), 68 bytes captured (544 bits)
    Encapsulation type: Ethernet (1)
    Arrival Time: May  6, 2016 09:36:06.341627000 Eastern Daylight Time
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1462541766.341627000 seconds
    [Time delta from previous captured frame: 0.000000000 seconds]
    [Time delta from previous displayed frame: 0.001018000 seconds]
    [Time since reference or first frame: 96.183749000 seconds]
    Frame Number: 2715936
    Frame Length: 68 bytes (544 bits)
    Capture Length: 68 bytes (544 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:vlan:ethertype:vlan:ethertype:arp]
    [Coloring Rule Name: ARP]
    [Coloring Rule String: arp]
Ethernet II, Src: Vmware_ac:00:57 (00:50:56:ac:00:57), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
        Address: Broadcast (ff:ff:ff:ff:ff:ff)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
    Source: Vmware_ac:00:57 (00:50:56:ac:00:57)
        Address: Vmware_ac:00:57 (00:50:56:ac:00:57)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: 802.1Q Virtual LAN (0x8100)
802.1Q Virtual LAN, PRI: 1, CFI: 0, ID: 2081
    001. .... .... .... = Priority: Background (1)
    ...0 .... .... .... = CFI: Canonical (0)
    .... 1000 0010 0001 = ID: 2081
    Type: 802.1Q Virtual LAN (0x8100)
802.1Q Virtual LAN, PRI: 0, CFI: 0, ID: 127
    000. .... .... .... = Priority: Best Effort (default) (0)
    ...0 .... .... .... = CFI: Canonical (0)
    .... 0000 0111 1111 = ID: 127
    Type: ARP (0x0806)
    Padding: 00000000000000000000
    Trailer: 0000000000000000
Address Resolution Protocol (request/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    [Is gratuitous: True]
    Sender MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Sender IP address: 0.0.0.0
    Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
    Target IP address: 0.0.0.0
GARP Reply:
No.     RelativeTime   Delta Display  Time                          Source                Src Port Destination           Dst Port vlan_id    Length
1571028 0.984442       0.000000       2016-05-06 09:37:23.970442    Netscout_02:ce:e2              00:00:00_00:00:00              2340,228   68     Gratuitous ARP for 0.0.0.0 (Reply)
Frame 1571028: 68 bytes on wire (544 bits), 68 bytes captured (544 bits)
    Encapsulation type: Ethernet (1)
    Arrival Time: May  6, 2016 09:37:23.970442000 Eastern Daylight Time
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1462541843.970442000 seconds
    [Time delta from previous captured frame: 0.000000000 seconds]
    [Time delta from previous displayed frame: 0.000000000 seconds]
    [Time since reference or first frame: 0.984442000 seconds]
    Frame Number: 1571028
    Frame Length: 68 bytes (544 bits)
    Capture Length: 68 bytes (544 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:vlan:ethertype:vlan:ethertype:arp]
    [Coloring Rule Name: ARP]
    [Coloring Rule String: arp]
Ethernet II, Src: Netscout_02:ce:e2 (00:80:8c:02:ce:e2), Dst: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Destination: 00:00:00_00:00:00 (00:00:00:00:00:00)
        Address: 00:00:00_00:00:00 (00:00:00:00:00:00)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: Netscout_02:ce:e2 (00:80:8c:02:ce:e2)
        Address: Netscout_02:ce:e2 (00:80:8c:02:ce:e2)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: 802.1Q Virtual LAN (0x8100)
802.1Q Virtual LAN, PRI: 1, CFI: 0, ID: 2340
    001. .... .... .... = Priority: Background (1)
    ...0 .... .... .... = CFI: Canonical (0)
    .... 1001 0010 0100 = ID: 2340
    Type: 802.1Q Virtual LAN (0x8100)
802.1Q Virtual LAN, PRI: 0, CFI: 0, ID: 228
    000. .... .... .... = Priority: Best Effort (default) (0)
    ...0 .... .... .... = CFI: Canonical (0)
    .... 0000 1110 0100 = ID: 228
    Type: ARP (0x0806)
    Padding: 00000000000000000000
    Trailer: 0000000000000000
Address Resolution Protocol (reply/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    [Is gratuitous: True]
    Sender MAC address: Netscout_02:ce:e2 (00:80:8c:02:ce:e2)
    Sender IP address: 0.0.0.0
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 0.0.0.0

Answer 1

I don't think the ARPs were the root cause.
I think the broadcast storm is more of a follow-on effect, and I suspect you had a loop in the network. But some info about that kind of ARP packet can be found here:

https://crnetpackets.com/2015/08/28/special-type-of-arp-packets/

answered 07 May '16, 14:10

Christian_R

Thanks Christian for the response and the link. My response below:

1) I believe a loop is a possibility, but we've not found proof of it. I'm scrubbing packets to see if there are any spanning tree changes noted. Nothing in our logs so far indicates any network change that would have suddenly caused us to have a loop and triggered this. Regardless of that outcome, I do still believe that VMware's packets are not properly formatted.

2) Maybe I missed it, but that link doesn't show any packet with an all-zeros sender MAC or all zeros in both the sender and target IP address. My understanding is that the sender should always put their own MAC in there at the very least.

Since posting, I've also confirmed that these packets are NOT actually originating from the guest but are generated by the ESX host (had a local packet capture agent running on some guests during the start of this event that didn't show the GARP packets when my network taps did show them).

I've done some more research and it appears that this behavior may be related to VMware's "Notify Switch" setting, where it sends a RARP packet when a VM guest joins the network. This RARP has both sender and target MAC set to the guest MAC. Here's a reference: http://rickardnobel.se/vswitch-notify-switches-setting/ Since the original post, I've confirmed the RARPs are there and look correct (also not seen on the guest VM capture), but the GARPs are not mentioned in that reference and look wrong to me.
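For comparison with the odd GARPs, here is a rough sketch of what such a "Notify Switch" RARP would look like on the wire. Only the "sender and target MAC both set to the guest MAC" part comes from the reference above. The broadcast destination, the all-zero IP fields, and the function name `build_notify_rarp` are my assumptions for illustration.

```python
import struct

ETH_P_RARP = 0x8035   # EtherType for RARP (RFC 903)
RARP_REQUEST = 3      # "request reverse" opcode


def build_notify_rarp(guest_mac: bytes) -> bytes:
    """Build an untagged RARP frame with sender and target MAC set to guest_mac."""
    if len(guest_mac) != 6:
        raise ValueError("MAC must be exactly 6 bytes")
    # Assumption: the frame is sent to the Ethernet broadcast address.
    eth = b"\xff" * 6 + guest_mac + struct.pack("!H", ETH_P_RARP)
    # Assumption: both IP fields are left as 0.0.0.0.
    rarp = struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4, RARP_REQUEST,
        guest_mac, bytes(4),   # sender MAC, sender IP
        guest_mac, bytes(4),   # target MAC, target IP
    )
    # Pad to the 60-byte minimum Ethernet frame size (FCS not included).
    return (eth + rarp).ljust(60, b"\x00")
```

The key difference from the GARPs in my capture is that here the ARP-layer sender MAC is the guest's real MAC, not all zeros.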

So I'm working with our design engineers to figure out how a loop could have occurred, but still need answers to my original questions about why the GARP requests and subsequent GARP replies were generated.

Thanks, Tim

(07 May '16, 19:51) CMH_Tim

still need answers to my original questions about why the GARP requests and subsequent GARP replies were generated

While it is hard to say why the original GARP requests were generated (the most likely candidates are a plain bug or some malware), I'm afraid that many (if not all) implementations would respond to an ARP request containing

Target MAC address: Broadcast (ff:ff:ff:ff:ff:ff)
Target IP address: 0.0.0.0

regardless of whether the request was sent as gratuitous or not. The reason is that 0.0.0.0 normally means "any of my local IPs" in a local context (see section 3.2.1.3 of RFC 1122, especially the point "must not be sent except as source under specific circumstances").

So you don't even need a loop. It is enough for all machines on the LAN segment to respond to a single broadcast ARP request to get a kind of LAN Smurf attack.
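To see how many of these requests are in a capture, something like the following could pick them out of raw frame bytes. This is only a minimal sketch: the function names are made up for illustration, it assumes Ethernet/IPv4 ARP behind any number of stacked 0x8100 tags (as in the dissection in the question), and it does no length checking.

```python
import struct

ETH_P_8021Q = 0x8100
ETH_P_ARP = 0x0806


def parse_arp(frame: bytes):
    """Return (vlan_ids, arp_fields) or None if the frame is not Ethernet/IPv4 ARP."""
    offset = 12  # skip destination and source MAC
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    offset += 2
    vlans = []
    # Walk through stacked 802.1Q tags, outermost first.
    while ethertype == ETH_P_8021Q:
        tci, ethertype = struct.unpack_from("!HH", frame, offset)
        vlans.append(tci & 0x0FFF)  # low 12 bits of the TCI are the VLAN ID
        offset += 4
    if ethertype != ETH_P_ARP:
        return None
    htype, ptype, hlen, plen, op = struct.unpack_from("!HHBBH", frame, offset)
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    sha, spa, tha, tpa = struct.unpack_from("!6s4s6s4s", frame, offset + 8)
    return vlans, {"op": op, "sha": sha, "spa": spa, "tha": tha, "tpa": tpa}


def is_zero_garp_request(arp) -> bool:
    """True for an ARP request with zero sender MAC, sender IP and target IP."""
    return (
        arp["op"] == 1
        and arp["sha"] == bytes(6)
        and arp["spa"] == bytes(4)
        and arp["tpa"] == bytes(4)
    )
```

Counting matches per source Ethernet MAC and per VLAN would then show which guests and which segments the requests are coming from.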

(08 May '16, 01:49) sindy

Well, if I were you, I would take my findings into the lab and try to get a better understanding of these ARPs and their interaction with the network devices.

There are so many ARP implementations out there, and every one is a little bit different.

Let's assume this is the root cause of your problem! How often do you see these requests? Do they always tear down the network? How many hosts answer? And for how long do you see this high broadcast rate?

Also, a loop can occur without any log entries. Your link was filled up with >2GBit/s of broadcast traffic, which is normally caused by a loop. How did you stop the incident?

But I agree with you and @sindy that these ARP packets look different from normal GARPs.

(08 May '16, 12:12) Christian_R

@sindy - Thanks for the replies. That makes sense, although of course not every device replies like that, so I think a lot of implementations are interpreting those requests as incorrect or not relevant and are not responding.

@Christian_R - We are working with the vendor but have not had much luck getting them to figure out why they generate these packets. We do still have a low volume of them going on without noticeable impact, but I'm concerned that if they hit the VLAN where devices actually respond, we could see minor issues again.

That being said, I'm fairly certain now that those are a symptom - no matter how strange - but not root cause. As noted below, I'm pretty sure that you were right that the root cause was a loop.

I did go back into our logs from Friday's event and found a 3750 switch stack that logged MAC flapping shortly after the GARP storm started. The flaps were on its two uplinks to our core switches and contained the MACs of the core switches, so there was some type of loop at that time, just no logs of it before the GARP's. Switch/router engineers took a look but couldn't find any evidence of a loop so Saturday night, we proceeded to bring redundancy back by restarting and reconnecting all links to the 2nd core switch.

When we brought the 2nd core switch back into the network late Saturday night and began reconnecting the redundant links between the core and edges, all went well until that 3750 stack was connected. Once that happened, we had another GARP storm and same impact as before.

We have no changes made by anyone on Friday when this happened, but my guess is that something happened that started the loop but nothing got logged. On Saturday's reconnect, we're certain there was no storm before the MAC flaps started again.
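For anyone wanting to look for the same MAC-flap signal in their own data (the same source MAC showing up on different ports in quick succession), here is a rough sketch. The observation format, the function name `find_mac_flaps`, and the default thresholds are all made up for illustration, not what the 3750 itself uses.

```python
from collections import defaultdict


def find_mac_flaps(observations, window=1.0, min_moves=3):
    """Count quick port changes per MAC.

    observations: iterable of (timestamp_seconds, mac, port), sorted by time.
    A "move" is a sighting on a different port than the previous sighting of
    the same MAC, no more than `window` seconds later.
    Returns {mac: move_count} for MACs with at least `min_moves` moves.
    """
    last_seen = {}            # mac -> (timestamp, port) of latest sighting
    moves = defaultdict(int)  # mac -> number of quick port changes
    for ts, mac, port in observations:
        prev = last_seen.get(mac)
        if prev is not None:
            prev_ts, prev_port = prev
            if port != prev_port and ts - prev_ts <= window:
                moves[mac] += 1
        last_seen[mac] = (ts, port)
    return {mac: n for mac, n in moves.items() if n >= min_moves}
```

In our case the MACs that flapped were the core switches' own MACs, seen alternately on the stack's two uplinks.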

Thanks again for the help. Would love to hear from someone who could explain those GARP requests.

(09 May '16, 06:57) CMH_Tim

Well, more info about these ARPs can be found here:
https://ask.wireshark.org/questions/5178/why-gratuitous-arps-for-0000
http://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/8021x/116529-problemsolution-product-00.html
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1028373
But I have not read them all, and I have no real-world experience with those special types of ARPs.

Back to your loop:
- Sometimes high CPU load can cause a loop.
- Sometimes this load can be caused by ARP storms, when you use managed switches.
- Some Spanning Tree configurations might end in high CPU load, too.

(09 May '16, 11:24) Christian_R