This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

How to stop packet flood in the switch?

Hi there!

I have a problem that goes beyond my networking knowledge, so I'm posting here hoping for help from network experts. Our network has been hanging for the past two weeks, although nothing changed in the configuration. At first I suspected the router was dying or overloaded, since it was the source of the freezes and rebooting it solved the problem. But I have swapped the router twice: from a Cisco Linksys E4200 with stock firmware to an Asus RT-AC87U with DD-WRT build 26635 (still in beta, so possibly unstable), then to a Netgear with stock firmware. All of them showed the same problem: the router stops responding to ping and, naturally, the network connection is dead. I then discovered that simply unplugging and replugging the network cable was enough to get back to a working state.

The router is plugged into a D-Link DGS-1210-52 switch (named A), which is linked to another DGS-1210-24 switch (named B) by optical fiber. All clients are connected to one of those switches, along with some wireless access points. I mirrored the switch port the router is connected to in order to analyze the traffic. During the freezes I saw TCP retransmissions like this:

0.000000 109.234.160.36 -> 172.20.30.12 IMAP Response: * OK Still here
0.000003 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000005 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000007 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000116 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000119 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000241 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000243 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000363 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here

And this, until I unplug/replug the cable or during many minutes. Looking at this packet, I found that 109.234.160.36 is our externally hosted mail server and 172.20.30.12 is the IP of a colleague. Her PC was asleep (so unreachable) at the time of the flood. I suspected bad behaviour from our mail client (Windows Live Mail 2009). Then the network froze again:

0.000000 179.60.192.2 -> 172.20.30.60 TCP https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000002 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000005 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000007 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000139 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000141 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000144 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832
0.000146 179.60.192.2 -> 172.20.30.60 TCP [TCP Retransmission] https > 46812 [FIN, ACK] Seq=0 Ack=0 Win=57 Len=0 TSV=1771622880 TSER=10028832

This time it has nothing to do with mail: it's a wireless notebook (172.20.30.60) having trouble receiving data from Facebook (179.60.192.2). The notebook lost its connection to the AP at the time of the flood. Again, unplugging and replugging the cable between the router and the switch stopped the retransmitted packets from flooding the network. Unplugging and replugging the cable between the router and the WAN modem had no effect (the packets stopped while it was unplugged but restarted when it was plugged in again).

Then another colleague got TCP retransmissions from the mail server (I don't know whether the computer was asleep at the time):

0.000000 109.234.160.36 -> 172.20.30.26 IMAP Response: * OK Still here
0.000003 109.234.160.36 -> 172.20.30.26 IMAP [TCP Retransmission] Response: * OK Still here
0.000005 109.234.160.36 -> 172.20.30.26 IMAP [TCP Retransmission] Response: * OK Still here
0.000007 109.234.160.36 -> 172.20.30.26 IMAP [TCP Retransmission] Response: * OK Still here
0.000112 109.234.160.36 -> 172.20.30.26 IMAP [TCP Retransmission] Response: * OK Still here
0.000114 109.234.160.36 -> 172.20.30.26 IMAP [TCP Retransmission] Response: * OK Still here
0.000228 109.234.160.36 -> 172.20.30.26 IMAP [TCP Retransmission] Response: * OK Still here

Then I had the same flood from a SIP connection (I don't know whether this was a malicious attempt):

0.000000 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000004 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000116 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000120 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000124 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000127 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000238 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000241 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]
0.000363 62.75.145.240 -> 172.20.30.129 SIP Request: OPTIONS sip:[email protected]

x is our external IP, and NAT is configured to let 172.20.30.129 make and receive SIP calls. But 172.20.30.129 was off at that moment and we weren't expecting a call. Then, very strangely, I got this:

0.000000  172.20.30.7 -> 172.20.30.11 TCP rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000004  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000007  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000009  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000012  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000014  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000017  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79
0.000020  172.20.30.7 -> 172.20.30.11 TCP [TCP Retransmission] rtip > 4580 [PSH, ACK] Seq=0 Ack=0 Win=8192 Len=79

172.20.30.7 is an IP temperature monitor and 172.20.30.11 is the PC collecting its data. For the first time the problem is LAN to LAN. And, interestingly, both devices are plugged into switch B, so if I'm correct, I should not see those packets at the router port on switch A. This time I had to completely reboot both switches to stop the flood. And this morning I got another one:

0.000000 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000004 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000006 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000009 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000011 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000013 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000016 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000018 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1

Another LAN-to-LAN case: 172.20.30.11 is a PC and 172.20.30.18 is a network printer (Canon 1022). And this time it's not a TCP retransmission but an SNMP request. I tried playing with all the switch options: Storm Control, Spanning Tree (STP), power saving, IEEE 802.3az, flow control, D-Link Safeguard Engine. Nothing worked. I googled a lot and found that damaged cables could also be the cause, so I replaced the cable to the router. Well, I'm rather desperate and am relying on your help so I can stop rebooting the network three times a day and focus on other tasks!

Thanks very much if you have read this far, and even more if you have a clue about what's going on in my network. I can provide more details as needed. Help is really appreciated!

asked 08 Jul '15, 09:11

vitoriodelage

I have the following remarks.

Is the IP ID identical in every packet of a given case, and is it non-zero?

Could you provide a quick drawing showing all of your switches, the gateway of this subnet and the cables between these devices?

Have you checked the interface counters on every uplink of the switches?

(08 Jul '15, 09:38) Christian_R
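
For readers wondering what the IP ID is: it is the 16-bit Identification field of the IPv4 header, which Wireshark exposes as `ip.id` in the packet details. A frame that is merely circulating in a layer-2 loop is forwarded unmodified, so every copy should carry the same ID and the same TTL, whereas a packet genuinely re-sent by the sender often gets a new ID (some stacks send 0, hence the second half of the question). The following is a minimal Python sketch, not taken from the thread, of how the field can be read from raw header bytes. The sample headers are built by the hypothetical helper `make_header` purely for illustration.

```python
import struct

IPV4_FIXED_HEADER = "!BBHHHBBH4s4s"  # the 20 fixed bytes of an IPv4 header


def ipv4_id_and_ttl(header):
    """Return (identification, ttl) parsed from raw IPv4 header bytes."""
    fields = struct.unpack(IPV4_FIXED_HEADER, header[:20])
    if fields[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    return fields[3], fields[5]


def make_header(ident, ttl):
    """Hypothetical helper: build a sample header (checksum left at 0)."""
    return struct.pack(IPV4_FIXED_HEADER, 0x45, 0, 40, ident, 0x4000, ttl, 6, 0,
                       bytes([109, 234, 160, 36]), bytes([172, 20, 30, 12]))


def same_frame(headers):
    """True if every header carries the same IP ID and TTL."""
    return len({ipv4_id_and_ttl(h) for h in headers}) == 1


# Invented sample data: three identical copies versus a packet sent twice.
looped = [make_header(0x1A2B, 57)] * 3
resent = [make_header(0x1A2B, 57), make_header(0x1A2C, 57)]
print(same_frame(looped), same_frame(resent))
```

Identical ID and TTL across all copies would point towards a layer-2 loop; a TTL that drops by one per copy would rather suggest a routing loop.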

First of all, thank you for your interest in my case :)

1) What do you mean by "IP id"? How can I see it? Sorry, I'm new to network analysis and just taking my first steps with Ethereal (Wireshark doesn't run on the PowerBook G4 that I was given for the job);

2) The schema is simple WAN <- Ethernet -> (WAN IP) Router (172.20.30.10/16 Gbit full duplex) <- Ethernet -> (172.20.30.205/16) Switch A <- Ethernet -> Gbit media converter <- optical fiber -> Gbit media converter <- Ethernet -> (172.20.30.203/16) Switch B

Both switches have many devices connected to them (computers, printers, wireless access points). The setup is not new and did not change when the network started hanging (about 15 days ago). There is a mix of Gbit, 100 Mbit and 10 Mbit devices.

3) Switch counters are OK and show no transmission errors or collisions. Apart from the Ethereal sniff, nothing abnormal is shown by the switch or router GUIs. That's why I'm going crazy! :)

(08 Jul '15, 13:09) vitoriodelage

2 Answers:

I also think you have a loop, of course.

The interesting thing seems to be that the issue only occurs when the destination MAC address has been deleted from the CAM table (the internal switch table used to decide the switching path). It apparently does not occur with normal broadcasts or multicasts. Strange; what is different in your config? (Are broadcast control mechanisms enabled?)
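
To illustrate why an aged-out CAM entry matters, here is a toy model of a learning switch in Python. It is only a sketch under simplifying assumptions (one VLAN, no STP, no storm control, and a hypothetical patch cable that loops ports 3 and 4 of the same switch together), not a description of how the DGS-1210 works internally. A frame to a known MAC leaves on one port; a frame to an unknown MAC is flooded, and the flooded copies re-enter through the loop and are flooded again.

```python
class LearningSwitch:
    """Toy layer-2 switch: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.cam = {}  # MAC address -> port it was last seen on

    def receive(self, src, dst, in_port):
        """Learn src on in_port and return the ports the frame is sent out of."""
        self.cam[src] = in_port
        out = self.cam.get(dst)
        if out is None:
            # Unknown unicast: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        return [] if out == in_port else [out]


def copies_sent(sw, src, dst, in_port, loop, rounds):
    """Count frame copies emitted when the two ports in `loop` are patched together."""
    a, b = loop
    pending, emitted = [in_port], 0
    for _ in range(rounds):
        arriving = []
        for port in pending:
            for out in sw.receive(src, dst, port):
                emitted += 1
                if out == a:
                    arriving.append(b)  # copy re-enters through the looped cable
                elif out == b:
                    arriving.append(a)
        pending = arriving
    return emitted


known = LearningSwitch(range(1, 5))
known.receive("printer", "pc", 2)   # the printer has talked, so the CAM knows port 2
aged = LearningSwitch(range(1, 5))  # the printer's entry has aged out (device asleep)

known_copies = copies_sent(known, "pc", "printer", 1, loop=(3, 4), rounds=3)
aged_copies = copies_sent(aged, "pc", "printer", 1, loop=(3, 4), rounds=3)
print(known_copies, aged_copies)
```

In this model the known destination produces a single frame, while the copy count for the unknown destination keeps growing with every round, which fits the observation that the floods involved sleeping or powered-off hosts.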

So maybe the problem could be reproduced if you add a wrong static arp entry for a ip of your subnet on a test device and then try to ping it with just one ping.

But the question will still be the same: where does the loop occur in your straight L2 design?

Maybe it is a device which is connected to more than one switch.

Maybe your WLAN is creating a loop.

Maybe it is a switch issue.

Maybe you have changed something (firmware, unimportant stuff) several weeks ago.

But indeed, if you can't answer these questions quickly, then the safest and fastest way to find the issue is the approach Kurt described.

answered 08 Jul '15, 23:16

Christian_R

"So maybe the problem could be reproduced if you add a wrong static arp entry for a ip of your subnet on a test device and then try to ping it with just one ping." -> I would like to try this, but it's maybe too late for it (since I rebooted the faulty modem). Where to enter the wrong address? At a computer or switch? On the switch I can only find MAC to port association, no MAC to IP. I choose the IP of an existing device or a free IP in the subnet? Sorry about the questions, I'm newbie to network at such a technical level.

(09 Jul '15, 03:11) vitoriodelage

I meant you could create a static ARP entry with an unused MAC address and, better, an unused IP address that belongs to your subnet. You could create this entry on any host in this subnet.

For example, this is the syntax for the ARP entry on Windows (on Linux it could be a little different):

arp -s 172.20.30.249 00-aa-00-62-c6-08

And then try to ping this address from that host, for example with the following command on Windows: ping -n 1 172.20.30.249

Beware! If it causes a looping frame, the issue can only be stopped by unplugging cables or switching off one switch!
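
Since the syntax differs slightly between operating systems, here is a small Python sketch that only builds the two command lines and does not run them. It assumes the classic `arp` tool is available (on Linux that means the net-tools package) and that anything other than Windows accepts the Unix-style form; the function name `loop_test_commands` is made up for this example. Adding a static ARP entry normally requires administrator rights.

```python
import platform


def loop_test_commands(ip, mac, system=None):
    """Return [arp_command, ping_command] as argument lists for the given OS."""
    system = system or platform.system()
    octets = mac.replace("-", ":").lower().split(":")
    if system == "Windows":
        # Windows writes MAC addresses with hyphens and uses -n for the ping count.
        return [["arp", "-s", ip, "-".join(octets)], ["ping", "-n", "1", ip]]
    # Unix-like systems write MAC addresses with colons and use -c for the count.
    return [["arp", "-s", ip, ":".join(octets)], ["ping", "-c", "1", ip]]


for name in ("Windows", "Linux"):
    for command in loop_test_commands("172.20.30.249", "00-aa-00-62-c6-08", name):
        print(name, " ".join(command))
```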

And look what happens.... Then tell me.

Could you explain how the router created the loop? Could it be that you always have a loop but the broadcast storms are filtered out by Storm Control?

Maybe you could show us the anonymised switch config.

You should really check whether one or more devices are connected to more than one switch port. Do you have such devices?

(09 Jul '15, 03:32) Christian_R

Thanks for the details, I will try this today and let you know.

I have a physical server with 4 Ethernet cards, but each card is used by a different virtual server. No two cards are configured on the same OS, and the server has not shown up as the culprit during the loops.

(09 Jul '15, 23:57) vitoriodelage

Tested the false ARP entry: the ping ended normally after 10 sec with a timeout, as expected. No trouble on the network, no flood triggered.

About the switch config, I will see how I can extract it.

(10 Jul '15, 02:45) vitoriodelage

OK, then we will see.

(10 Jul '15, 03:03) Christian_R

I have the impression that the flood involves a combination of several devices, not just one. Yesterday morning, as described in my comment on Kurt Knochner's answer, I found the old modem/router flooding packets into the network (packets that have nothing to do with it, as they are LAN-device-to-LAN-device packets).

Yesterday evening, during another flood (same case of an SNMP request from 172.20.30.11 to 172.20.30.18):

  1. unplugged the old modem/router -> flood continued -> replugged
  2. unplugged main router -> flood continued -> replugged
  3. unplugged switch B -> flood continued
  4. unplugged main router again -> flood stopped

It seems clear that it comes from a combination of several devices. I would like to know what is causing the devices to flood the network with those packets. I don't know very well how the network stack works, but isn't there a mechanism with ARP tables or something similar that, when it overflows, results in things like this? I should check those tables for abnormal activity.

I'm still in the dark and I still don't know what is triggering all this mess! Christian_R's idea of pinging a nonexistent MAC address unfortunately didn't trigger the loop. Any idea how to trigger the problem? That would help in understanding how to solve it…

As always, kind and warm thanks to everyone involved, your help is really appreciated!

(10 Jul '15, 03:14) vitoriodelage

I think the same as Kurt does. We do not have a lot of info. As you say, you have a sporadic issue, so you should watch the events, collect and compare the info, and maybe you will be able to isolate the root cause. Finding the root cause of a sporadic problem is mostly hard work.

(10 Jul '15, 03:29) Christian_R

Yes, it is! I would like to be able to provide more data; that's why I'm here, to learn a good methodology for analyzing the problem. It's sporadic and the symptoms are not always exactly the same (it's not always unplugging the same device that ends the flood).

I don't have the expertise to look into the transactions any deeper than sniffing the network. I guess it would be interesting to see the memory status of the switches and routers at the moment of the bug.

(10 Jul '15, 04:52) vitoriodelage

@Christian_R: My apologies for the delay, I had other problems to solve and it was an extended holiday here in France. Please don't take my silence to mean that I gave up. And many thanks again for your help.

I managed to upload the configuration of both switches at http://acadmed.o2switch.net/tmp/dlink/ My apologies for the format; screen captures were the best I could do to show you the config.

As for updates, it flooded again Friday morning. The flood was coming from switch B. I realized that its firmware was not up to date and flashed it with the latest version (2.01.001 -> 2.03.001). Friday evening it flooded again. As I started to take the screen captures of the config on switch B, I saw that Loopback Detection was disabled, so I enabled it.

Since this action I have had no more floods. Today (the first working day after Friday) everything was OK. While making all the screen captures I also changed another feature on switch A: the DoS detection was turned off. I also tried to turn on Loopback Detection on switch A, but it failed, complaining that it's incompatible with STP. What is strange is that switch B (older but same brand/model) accepts it.

Well, for the moment the network is working again (after so much pain, I'm cautious about saying it's over), so I thank you once more for your tips. And if you see anything in the config that could be improved, please let me know.

Cheers!

(15 Jul '15, 08:55) vitoriodelage

OK, that's a lot of pics. I will have a deeper look at them in two hours or so. But why do I see so many failures on Port 47 of Switch A? Is there a half-duplex device connected? And what is connected to Port 13?

(15 Jul '15, 09:35) Christian_R

So I checked the screenshots. The first thing I can say is that the two switches have different hardware revisions and firmware revisions.

Switch A: It can be seen that there is an active port mirroring configured between Port 8 and 45.

Port 13: Sometimes comes up with 10 MBit/s HDX and sometimes with 1GBit/s FDX

Failures can be seen on Ports 47 and 24.

Switch B: I would at least check Port 1 and Port 24.

On both switches, I suggest you check, on the port settings tab, the actual operational mode of each interface against your expected value.

I think a port at 100 MBit/s HDX is not very common. And for any port running HDX, it is worth investigating whether HDX is really the correct mode.

(15 Jul '15, 14:19) Christian_R

Maybe this could help you: disconnect Port 46 on Switch A and tell/show us the STP status of switch A. There is a very small chance that this leads us to the source of the problem. But keep your processes in mind before doing this.

(15 Jul '15, 22:21) Christian_R

Thank you for your time!

At switch A: The mirror at port 8 is for monitoring purposes. The port 45 is the main switch. I did the monitoring at the router cable because it was my first lead.

Port 13 is a copier; it switches between 10 Mbit and 1 Gbit because of its power saving mode. This used to be fixed by disabling power saving on the switch. I removed this setting when testing the config for the loop issue. Now that the loops seem to be gone, I'll disable power saving for it again.

Port 24 is an old Synel SY-751 badging system. Although there are a few errors reported, it's working fine.

Port 47 is the backup modem that was provoking one of the loops. It's one of the first ADSL modems, an Alcatel Speed Touch. It has had a lot of packet errors for a long time, without major trouble. It provides a connection.

Port 46 is the link between both switches. Why are you incriminating it? Did you mean port 47? I can't find anything wrong about it from port monitoring nor from the system log.

Switch B Port 24 is the link with switch A.

Port 1 is a standard PC from a colleague. Connectivity is OK. The 10/1Gbit dance comes from PC power saving.

Port 20, which is at 100 HDX, is the modem for temperature/humidity capture. This one was implicated in some of the loops. I will try to fix it at 100 FDX or 10 FDX.

Thank you for your fresh eye. Writing up the report here, I'm noticing some details I had overlooked on my own.

(16 Jul '15, 02:43) vitoriodelage

"Port 46 is the link between both switches. Why are you incriminating it? Did you mean port 47? I can't find anything wrong about it from port monitoring nor from the system log."

I meant Port 46: if you unplug it, the connection between the switches should be lost. If it is not, or if you get a new root port, then you know why you have a loop. It is just a test.

At port 47 there should normally not be failures. As I recall, it is configured as FDX.

(16 Jul '15, 03:20) Christian_R

Yes, if I disconnect port 46 the connection with switch B is lost. It's the only path to the other switch.

I can force the port 47 to 10Mbit FDX but it still makes some errors:

RX: InOctets 521160, InUcastPkts 1345, InNUcastPkts 32, InDiscards 0, InErrors 40, FCSErrors 37, FrameTooLongs 0, InternalMacReceiveErrors 0

TX: OutOctets 368765, OutUcastPkts 1409, OutNUcastPkts 1314, OutErrors 0, LateCollisions 0, ExcessiveCollisions 0, InternalMacTransmitErrors 0
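
To put the RX counters in proportion, here is a small Python sketch that parses the counter line and computes the share of received frames that arrived with errors. It assumes, as is usual for such counters but not confirmed for this switch, that errored frames are not included in the unicast and non-unicast packet counts. The function names are made up for this example.

```python
# RX counters for port 47, copied from the comment above.
RX = ("InOctets 521160 InUcastPkts 1345 InNUcastPkts 32 InDiscards 0 "
      "InErrors 40 FCSErrors 37 FrameTooLongs 0 InternalMacReceiveErrors 0")


def parse_counters(text):
    """Turn 'Name value Name value ...' into a dict of integers."""
    tokens = text.replace(",", " ").split()
    return {name: int(value) for name, value in zip(tokens[::2], tokens[1::2])}


def rx_error_ratio(counters):
    """Errored frames as a fraction of all frames seen on the wire (assumed)."""
    good = counters["InUcastPkts"] + counters["InNUcastPkts"]
    return counters["InErrors"] / (good + counters["InErrors"])


rx = parse_counters(RX)
ratio = rx_error_ratio(rx)
print(f"{ratio:.1%} of received frames had errors, "
      f"{rx['FCSErrors']} of {rx['InErrors']} were FCS errors")
```

A few percent of frames failing the FCS check is a lot for a wired link; it typically points to a duplex mismatch, a bad cable or a failing interface.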

(16 Jul '15, 05:25) vitoriodelage

"I can force the port 47 to 10Mbit FDX but it still makes some errors: RX InOctets 521160 InUcastPkts 1345 InNUcastPkts 32 InDiscards 0 InErrors 40 FCSErrors 37 FrameTooLongs 0 InternalMacReceiveErrors 0"

I don't know if this device can be forced to 10 MBit/s FDX. If a device can only talk HDX, then that's how it is. On the other hand, it sometimes happens that the autonegotiation mechanism fails and then one device sends FDX and the other HDX; that is the worst-case scenario for a link, because it is an undefined mode.


I have seen that flow control is enabled on one switch and disabled on the other. Check which is correct for your environment.

I normally start with flow control disabled. But this need not be correct for your environment.

What is connected to Port 48 of Switch A? Could you post a screenshot of the dynamic forwarding table for all ports of Switch A?
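
A dynamic forwarding table can be checked mechanically. Ports that carry many MAC addresses are normally uplinks or ports with further switches or access points behind them, and a MAC address that jumps between ports from one snapshot to the next is a classic symptom of a loop. The Python sketch below uses a few rows taken from the switch A table posted later in this thread; the second snapshot is invented for illustration, and the function names are made up.

```python
from collections import defaultdict

# A few rows from the switch A table (ID, port, MAC, VID, type).
SNAPSHOT = """\
2    46      00-0C-76-4C-37-3F   1   Dynamic
5    46      00-11-43-B6-75-B8   1   Dynamic
8    13      00-1E-8F-30-10-1F   1   Dynamic
12   47      00-90-D0-32-9F-F3   1   Dynamic
14   45      08-00-27-AB-8E-79   1   Dynamic
23   45      A4-BA-DB-3E-1C-4F   1   Dynamic
"""


def mac_to_port(text):
    """Map each MAC address to the port it was learned on."""
    table = {}
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            table[fields[2]] = int(fields[1])
    return table


def macs_per_port(table):
    """Group MAC addresses by port."""
    ports = defaultdict(list)
    for mac, port in table.items():
        ports[port].append(mac)
    return dict(ports)


def moved(before, after):
    """MACs that were learned on a different port in the later snapshot."""
    return {mac: (before[mac], after[mac])
            for mac in before if mac in after and before[mac] != after[mac]}


first = mac_to_port(SNAPSHOT)
# Hypothetical later snapshot, invented here: the MAC from port 13 appears on port 46.
later = dict(first, **{"00-1E-8F-30-10-1F": 46})
print(macs_per_port(first))
print(moved(first, later))
```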


I do not know why your switch loops. Maybe you have a loop the whole time, but it is kept in check by the options you enabled on the switches, like Storm Control and Loopback Detection.

Maybe you can perform a slightly dangerous test.

But again, keep your processes in mind before you do this. If it is forbidden, then stop right now! Maybe the loop can be provoked by disabling Storm Control and Loopback Detection. If a broadcast storm then appears, it can only be stopped by disconnecting one device after another.

(16 Jul '15, 12:55) Christian_R

For the modem/router, I plan to replace it with a newer model and stop the RX errors.

About flow control: I activated this option to see if it would stabilize my problem. It did not, and it's now disabled again (on both switches) because the literature gives mixed advice on it, especially if you want to do QoS.

The port 48 is connected to an internal server with some virtual machines on it (shared Ethernet card). What did you see that looked suspect about it?

Here is the dynamic forwarding table of switch A:

>      ID    Port    MAC Address     VID     Type
>       1    42      00-08-5D-86-99-58   1   Dynamic     
>       2    46      00-0C-76-4C-37-3F   1   Dynamic     
>       3    48      00-10-18-6C-4A-24   1   Dynamic     
>       4    25      00-11-32-1B-FA-77   1   Dynamic     
>       5    46      00-11-43-B6-75-B8   1   Dynamic     
>       6    46      00-12-3F-55-D3-5B   1   Dynamic     
>       7    44      00-13-72-73-F9-B5   1   Dynamic     
>       8    13      00-1E-8F-30-10-1F   1   Dynamic     
>       9    40      00-1E-8F-4C-59-9A   1   Dynamic     
>       10   46      00-1F-3A-4C-3A-34   1   Dynamic     
>       11   24      00-20-4A-54-4C-F0   1   Dynamic     
>       12   47      00-90-D0-32-9F-F3   1   Dynamic     
>       13   48      08-00-27-08-3B-CE   1   Dynamic     
>       14   45      08-00-27-AB-8E-79   1   Dynamic     
>       15   48      08-00-27-BB-F0-59   1   Dynamic     
>       16   44      08-00-27-E8-7D-3E   1   Dynamic     
>       17   14      08-00-37-C1-F4-47   1   Dynamic     
>       18   32      08-62-66-8E-8F-92   1   Dynamic     
>       19   46      14-8F-C6-5A-4D-07   1   Dynamic     
>       20   46      14-D6-4D-06-C7-0C   1   Dynamic     
>       21   46      14-D6-4D-06-C7-24   1   Dynamic     
>       22   29      3C-07-54-38-80-53   1   Dynamic     
>       23   45      A4-BA-DB-3E-1C-4F   1   Dynamic     
>       24   15      A4-BA-DB-3E-1C-51   1   Dynamic     
>       25   46      AC-9E-17-96-67-02   1   Dynamic     
>       26   29      B4-18-D1-D7-F0-A1   1   Dynamic     
>       27   29      C0-C1-C0-95-1A-72   1   Dynamic     
>       28   1       C4-34-6B-53-4C-AD   1   Dynamic     
>       29   46      C8-1F-66-1D-0A-61   1   Dynamic     
>       30   11      D4-BE-D9-C7-0F-B6   1   Dynamic     
>       31   9       D4-BE-D9-C7-10-44   1   Dynamic     
>       32   45      EC-22-80-6A-A5-F4   1   Dynamic     
>       33   45      F0-79-59-D4-88-18   1   Dynamic

Looking closely at the attached devices, I found a dead temperature/humidity modem (the power LED is on, but the network LED is off on both the modem side and the switch side). Maybe this device was short-circuiting the network while dying (it's off now). I haven't had any related problems since Wednesday.

About your test: I had a flood while both Storm Control and Loopback Detection were active. And simply switching off those features does not provoke a flood. I still don't know how to provoke the error. For all the tests I did and reported, I changed one option and then waited for the next flood (random, but normally within 24h). It got solved after I switched on Loopback Detection on switch B, Friday evening. But it reports everything as OK:

>     Port  Loopdetect Detection State  Loop Status
>     1     Enabled     Normal
>     2     Enabled     Normal
>     3     Enabled     Normal
>     4     Enabled     Normal
>     5     Enabled     Normal
>     6     Enabled     Normal
>     7     Enabled     Normal
>     8     Enabled     Normal
>     9     Enabled     Normal
>     10    Enabled     Normal
>     11    Enabled     Normal
>     12    Enabled     Normal
>     13    Enabled     Normal
>     14    Enabled     Normal
>     15    Enabled     Normal
>     16    Enabled     Normal
>     17    Enabled     Normal
>     18    Enabled     Normal
>     19    Enabled     Normal
>     20    Enabled     Normal
>     21    Enabled     Normal
>     22    Enabled     Normal
>     23    Enabled     Normal
>     24    Enabled     Normal

Well, I think we will never be able to know what went wrong with the network. I have now lots of other tasks waiting for my attention and I don't want to take more of your precious time.

A sincere, big thank you for your help and tips. Even if we didn't get to the root cause, your outside eye and the exchange brought up a lot of details that I would have overlooked alone. Thanks!

(17 Jul '15, 07:17) vitoriodelage

Indeed, Port 48 was not well aimed; I thought I saw something in the logfile. But I have seen in your logfiles that the root bridge changed once. Maybe that was because Switch B rebooted; then it is OK.

But such events could be a sign of a looping problem.

You are welcome; just ask if you need a helping hand again.

(17 Jul '15, 13:28) Christian_R

It's always the same frame with just a small time delta, so this is most certainly a loop somewhere in your switch infrastructure. Could be a direct loop (switch A -> switch B -> switch A), or a loop through any device connected to both switches.
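
The "same frame, small time delta" criterion can be applied automatically to the text output of a capture. A sender's retransmission timer normally fires after something on the order of hundreds of milliseconds or more, so copies that follow each other within microseconds are unlikely to be real retransmissions. Here is a minimal Python sketch, assuming the line format of the captures posted in the question; the thresholds `max_gap` and `min_copies` are arbitrary choices for this example.

```python
import re
from collections import Counter

# Excerpt in the same text format as the captures posted in the question.
CAPTURE = """\
0.000000 109.234.160.36 -> 172.20.30.12 IMAP Response: * OK Still here
0.000003 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000005 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
0.000116 109.234.160.36 -> 172.20.30.12 IMAP [TCP Retransmission] Response: * OK Still here
"""

LINE = re.compile(r"^\s*(\d+\.\d+)\s+(\S+) -> (\S+)\s+(.*)$")


def looped_frames(text, max_gap=0.001, min_copies=3):
    """Return {(src, dst, info): copies} for frames repeated within max_gap seconds."""
    last_seen = {}
    repeats = Counter()
    for raw in text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue
        when = float(match.group(1))
        # The retransmission label is added by the dissector, not part of the frame.
        info = match.group(4).replace("[TCP Retransmission] ", "")
        key = (match.group(2), match.group(3), info)
        if key in last_seen and when - last_seen[key] <= max_gap:
            repeats[key] += 1
        last_seen[key] = when
    return {key: n + 1 for key, n in repeats.items() if n + 1 >= min_copies}


suspects = looped_frames(CAPTURE)
for (src, dst, info), copies in suspects.items():
    print(f"{copies} copies of {src} -> {dst}: {info}")
```

Matching on the summary line is only a rough heuristic; comparing the IP ID and TTL of the copies, as Christian_R asked about, gives a firmer answer.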

If the problem occurs, disconnect one cable after the other until the problem stops. The last cable you unplugged is the one (somehow) involved in the loop.

"And this, until I unplug/replug the cable or during many minutes."

O.K. looks like a good candidate to start with ;-)

Regards
Kurt

answered 08 Jul '15, 15:45

Kurt Knochner ♦

Thank you indeed for your insights! I feel less alone facing this problem. Well, today I think I made a big step forward in troubleshooting this. I got a packet loop as usual:

0.000000 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000004 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000006 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000009 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000012 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000015 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1
0.000017 172.20.30.11 -> 172.20.30.18 SNMP GET 1.3.6.1.2.1.25.3.2.1.5.1 1.3.6.1.2.1.25.3.5.1.1.1 1.3.6.1.2.1.25.3.5.1.2.1

The same as yesterday: computer 172.20.30.11 making an SNMP request to the printer 172.20.30.18. Following Kurt's idea, I unplugged cables from:

  1. 172.20.30.11 computer -> Flood continues
  2. 172.20.30.18 printer -> Flood continues, so from here I can guess it's not a problem on devices but a loop inside the network
  3. On switch A, unplugged the connection to switch B -> Flood continues monitored on switch A
  4. Plugged the monitoring in switch B -> Flood is also there on every port
  5. Unplugged the link to switch A -> Flood stops in switch B.
  6. Replugged the link -> Flood continues. From there I can deduce the problem comes from a device in switch A.
  7. Unplugged the main router 172.20.30.10 from switch A -> Flood continues.
  8. There is an old backup ADSL modem/router (172.20.30.198) that has been plugged into switch A for many years. So discreet that I forgot about it. Unplugged -> Flood stopped!

I'm ashamed that the cause is so trivial; I was concentrating too much on equipment that is actually in use and forgot about it. I reset the ADSL modem/router to clear its ARP table, which had probably been saturated for a while. Hope that's all.

During this investigation I read up on Storm Control and Spanning Tree Protocol. Those technologies didn't solve the problem, but I think they somehow modified it. I can't be sure, but from my analysis the problem moved from WAN-to-LAN traffic to LAN-to-LAN traffic. Is it mere coincidence, or did activating those options on the switch change something?

(09 Jul '15, 03:08) vitoriodelage

Spanning Tree and Storm Control are not perfect solutions and they don't work in all cases. I can only speculate about what really happened on your switches, as there is not enough information available.

(10 Jul '15, 00:23) Kurt Knochner ♦