We have dual redundant connectivity from POP1 to POP2. The POP1 end device is Huawei; the POP2 end device is Cisco. Both switches have RSTP enabled on all ports, including the two uplink ports that carry the dual fiber paths mentioned above. Still, I have kept one path port shut down at POP1, thinking it might otherwise cause a problem or create some kind of loop. Now the issue is that, randomly during working hours, the POP2 IP gets 50-60% packet loss, and customers connected through POP2 start calling about slow speed and squeezed traffic. I then log in to the POP1 Huawei switch, shut down the active port and enable the other path port, and it immediately works fine. An hour later the same thing happens again, so I disable the live port and re-enable the port I had switched off an hour earlier, and it works again. It has become a daily ritual to keep toggling shutdown / undo shutdown on these ports. The customers connected to POP2 are frustrated, as it happens 2-3 times every day. Is there a way to debug this situation in Wireshark by connecting a PC to the POP2 Cisco switch and running Wireshark? Let me know what exactly I should do to debug this and find out what is actually going wrong.
asked 12 Jun '17, 05:35 soamz
What link speed and traffic volume are we talking about on the inter-POP links? Could it be that there is simply too much traffic to fit? By switching to the other link you kill enough traffic for a while, so the new link only gets clogged again once all those applications have re-established their sessions through it; when they then saturate the link during some peak, they make things worse by re-transmitting the dropped packets.
If this is not the case and something goes wrong inside one of the switches over time, you won't see it by capturing just at the fibre between them (using a third switch which would mirror the traffic transiting between its two fibre ports to your PC).
To find out whether a switch is broken itself, you would have to
In this case, if the captures match, the suspected switch is fine and the amount of traffic is the issue. If they don't, the suspected switch is broken itself.
Of course your capturing machines must be able to deal with the amount of traffic without losses, and you're likely to need two mirroring ports per capturing point, so four in total, if the total traffic in both directions exceeds (or is even just close to) the link capacity of the mirroring port.
It's just under 700 Mbps of traffic, I would say. And these are Cisco 2960G switches, so it's definitely not the traffic choking the link.
My guess is that whenever an STP recalculation or topology change happens for any VLAN, traffic on all VLANs of the whole POP slows down.
I've modified your Answer into a Comment as it wasn't answering your Question.
Now a 700 Mbps typical rate can easily reach 1 Gbit/s at peak (also mind all those leading and trailing bytes and gaps), but as I don't know how you calculate it, let's leave that aside; the capture will eventually show what the short-term rates are.
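To make the "leading and trailing bytes and gaps" remark concrete, here is a minimal sketch of the arithmetic. It assumes the measured rate counts whole Ethernet frames (header and FCS included) but not the 8-byte preamble/SFD and the 12-byte inter-frame gap; whether your counters work that way depends on the device. The function name and the average frame sizes are hypothetical, chosen only for illustration.

```python
# Per-frame bytes that occupy the wire but are usually not counted
# by interface counters: preamble + SFD (8) and inter-frame gap (12).
PREAMBLE_SFD_BYTES = 8
INTERFRAME_GAP_BYTES = 12
PER_FRAME_OVERHEAD = PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES


def wire_rate_mbps(counted_rate_mbps, avg_frame_bytes):
    """Estimate wire occupancy from a counted frame rate.

    counted_rate_mbps: rate as reported by counters (assumed to include
                       Ethernet header and FCS, nothing else).
    avg_frame_bytes:   assumed average frame size in bytes.
    """
    if avg_frame_bytes <= 0:
        raise ValueError("avg_frame_bytes must be positive")
    return counted_rate_mbps * (avg_frame_bytes + PER_FRAME_OVERHEAD) / avg_frame_bytes


if __name__ == "__main__":
    for size in (64, 512, 1518):
        print(f"avg frame {size:4d} B: {wire_rate_mbps(700, size):7.1f} Mbps on the wire")
```

The smaller the average frame, the larger the gap between the counted rate and what the link actually carries. This still says nothing about short-term bursts, which only a capture or fine-grained counters can show.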
If the two links between POP1 and POP2 are the only path between them, i.e. there is no L2 path between them through another site which could close a loop, you may disable STP on the open ports while the ports connected to the second link are shut down manually. This should tell you quickly whether the STP is guilty or not.
The peak is 600-700 Mbps at most. Memory, CPU, everything is fine.
What about the second part, the possibility of switching off STP temporarily?
If that's not possible, on Cisco you can debug both BPDU sending and reception, and as BPDUs are sent once every 2 seconds by default (the hello time), it should not be a huge amount of data to store and look through. Although I can see no reason why they should not be mirrored if you mirror a physical port, better check that. I have seen them mirrored on an HP switch, but haven't tried on a C2960 and I don't have one next to me to check.
Yes, my last option is port mirroring and reading the capture in Wireshark. I need to find a Wireshark expert to handle it for me remotely.
Well, then take a powerful PC with two gigabit ports, mirror each direction (ingress/egress) of the active port looking towards the other site to its own mirror port, and capture using dumpcap or tcpdump, not Wireshark or tshark, at both ports simultaneously until the event pops up and a couple of seconds after. If one PC cannot deal with the traffic, use two, the captures can be merged afterwards. The merging is much easier if the machines are synchronised using NTP so there is no time offset between the captures.
Note the time of the occurrence of the issue as precisely as possible. Then you may filter only the BPDUs from the traffic and see whether there was some change in BPDU "payload" (indicating the reconfiguration) at all and if yes, whether the change took place around the time of the event.
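As a rough illustration of what "looking at the BPDU payload" means, the sketch below decodes an IEEE STP/RSTP BPDU from a raw Ethernet frame and reports the topology-change indicators. It is an assumption-laden sketch, not a replacement for Wireshark's dissector: it only handles untagged IEEE BPDUs sent to 01:80:c2:00:00:00 with LLC encapsulation, and it ignores the per-VLAN PVST+ BPDUs that Cisco switches send to 01:00:0c:cc:cc:cd. The function name `parse_bpdu` is hypothetical.

```python
import struct

STP_MULTICAST = bytes.fromhex("0180c2000000")  # IEEE bridge group address
LLC_STP = b"\x42\x42\x03"                      # DSAP 0x42, SSAP 0x42, UI


def parse_bpdu(frame):
    """Decode an untagged IEEE BPDU frame; return a dict, or None if not a BPDU."""
    if len(frame) < 21 or frame[0:6] != STP_MULTICAST or frame[14:17] != LLC_STP:
        return None
    bpdu = frame[17:]
    proto_id, version, bpdu_type = struct.unpack("!HBB", bpdu[:4])
    if proto_id != 0:
        return None
    info = {
        "version": version,
        "type": bpdu_type,
        "tcn": bpdu_type == 0x80,  # dedicated Topology Change Notification BPDU
        "topology_change": False,
    }
    # Configuration (0x00) and RST (0x02) BPDUs carry flags, IDs and timers.
    if bpdu_type in (0x00, 0x02) and len(bpdu) >= 35:
        flags, root_id, cost, bridge_id, port_id = struct.unpack("!B8sI8sH", bpdu[4:27])
        msg_age, max_age, hello, fwd_delay = struct.unpack("!4H", bpdu[27:35])
        info.update(
            topology_change=bool(flags & 0x01),      # TC flag
            topology_change_ack=bool(flags & 0x80),  # TCA flag
            root_id=root_id.hex(),
            root_cost=cost,
            bridge_id=bridge_id.hex(),
            port_id=port_id,
            # Timers are encoded in units of 1/256 second.
            message_age=msg_age / 256,
            max_age=max_age / 256,
            hello_time=hello / 256,
            forward_delay=fwd_delay / 256,
        )
    return info
```

Running this over the captured frames and printing the timestamp whenever `tcn` or `topology_change` is set, or whenever `root_id` or `root_cost` differs from the previous BPDU, gives a timeline to compare against the noted time of the incident.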
You could set a capture filter for the BPDUs to significantly reduce the size of the capture, but in such case you'd have to capture again if STP is not guilty.
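A sketch of how the two dumpcap captures discussed above might be launched, either in full or with a BPDU-only capture filter. The interface names, file names and ring-buffer sizes are placeholders to adapt, and the helper `dumpcap_command` is hypothetical; the filter shown matches only IEEE BPDUs by destination MAC, so Cisco per-VLAN PVST+ BPDUs (sent to 01:00:0c:cc:cc:cd) would need an additional term.

```python
import shlex

# Destination MAC of IEEE STP/RSTP BPDUs, in pcap-filter syntax.
BPDU_FILTER = "ether dst 01:80:c2:00:00:00"


def dumpcap_command(interface, out_file, bpdu_only=False, file_mb=100, files=50):
    """Build a dumpcap argument list that writes to a ring buffer.

    interface: capture interface name (placeholder, e.g. "eth1").
    out_file:  base name of the capture files.
    bpdu_only: if True, add a capture filter that keeps only IEEE BPDUs.
    file_mb:   switch to the next file after roughly this many megabytes.
    files:     number of ring-buffer files kept before the oldest is overwritten.
    """
    cmd = [
        "dumpcap",
        "-i", interface,
        "-w", out_file,
        "-b", f"filesize:{file_mb * 1000}",  # dumpcap expects kilobytes
        "-b", f"files:{files}",
    ]
    if bpdu_only:
        cmd += ["-f", BPDU_FILTER]
    return cmd


if __name__ == "__main__":
    # One capture per mirror port; "eth1"/"eth2" are assumed interface names.
    for iface, name in (("eth1", "rx.pcapng"), ("eth2", "tx.pcapng")):
        print(shlex.join(dumpcap_command(iface, name)))
        print(shlex.join(dumpcap_command(iface, "bpdu_" + name, bpdu_only=True)))
```

Each list can be passed to `subprocess.Popen` to start the capture. The ring buffer keeps disk usage bounded while waiting for the event, and the resulting files from both ports can later be combined with mergecap.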
OK, I enabled the Cisco BPDU debug and found that the switch was getting TCNs because of some clients connected to an unmanaged switch.
I have fixed that now and it has been stable for 18 hours.