Hello guys, I'm working on an issue with my Nagios server. Nagios monitoring was working fine, but for a few days now I have been seeing these errors: "CHECK_NRPE: Error - Could not complete SSL handshake." The errors are not consistent: first a check fails with this error, but after 5 minutes the check becomes OK again. I checked my configuration, but since no changes had been made recently, I found no issues there. So I tried to analyse the Nagios traffic with Wireshark. Mostly it looks OK to me, but I found one strange thing: when Nagios tries to establish the SSL handshake, it sends a packet whose protocol Wireshark shows as "SSL". It receives no answer. Then, after a minute, it sends the same SSL handshake packet, but with the protocol shown as TLSv1, and then it works fine.

http://piccy.info/view3/4921610/583ca764bdb98d778b0f605c3e0b3a22/orig/

So the question is: what is the difference between these SSL and TLSv1 protocols? They look the same to me.

asked 30 Jul '13, 08:51 Macumazan
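For anyone reproducing this: the failing check can be run by hand, outside the Nagios scheduler, while capturing the NRPE traffic. A minimal sketch; the interface name, capture filename, and target host variable are placeholders, and the check_nrpe path is the usual source-install default:

    # Capture NRPE traffic (TCP port 5666 is the NRPE default) while
    # running the check manually, then inspect the handshake packets.
    NRPE_HOST=10.49.32.186                 # placeholder: your monitored host
    tshark -i eth0 -f "host $NRPE_HOST and tcp port 5666" -w nrpe.pcap &
    CAP=$!
    /usr/local/nagios/libexec/check_nrpe -H "$NRPE_HOST"
    sleep 2 && kill $CAP
    # Show only the handshake packets (older tshark: -R instead of -Y;
    # newer Wireshark versions call the filter tls.handshake):
    tshark -r nrpe.pcap -Y "ssl.handshake"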
2 Answers:
The Client Hello is a TLS 1.0 handshake in both connections - tcp.stream eq 10 and tcp.stream eq 11. The difference in the Protocol column (SSL vs. TLSv1) comes from the fact that in stream 11 the negotiation does not complete, and Wireshark displays SSL in that case. I extracted only the first 5 packets of tcp stream 10, and the Protocol field then changed to SSL as well, where it had been TLSv1 with the full handshake. So the real question is: why does the "server" send a FIN in the middle of the SSL handshake? Looking at the RTT and TTL it is probably NOT the real server but maybe the Riverbed appliance, but this is just a guess.

answered 31 Jul '13, 00:05 mrEEde2
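For reference, the extraction experiment described above can be reproduced with tshark and editcap; a minimal sketch, assuming the capture file is named nagios.pcap (the filenames are placeholders):

    # Write tcp stream 10 to its own file, then keep only its first 5
    # packets. Opening first5.pcap in Wireshark should now show the
    # Client Hello with Protocol "SSL" instead of "TLSv1", because the
    # rest of the completed handshake is no longer in the file.
    tshark -r nagios.pcap -Y "tcp.stream eq 10" -w stream10.pcap
    editcap -r stream10.pcap first5.pcap 1-5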
As @mrEEde2 points out, the SSL version of the Client Hello is actually the same in both streams; it is Wireshark's interpretation, based on the rest of the session, that makes it show SSL or TLSv1. So that is not the issue.

What I do see in your trace is that all traffic is sent to a TyanComp system with the mac address 00:e0:81:45:5c:a8, and that the return traffic comes either from a Cisco device with mac address 64:00:f1:c1:da:01 or from a Riverbed device with mac address 00:0e:b6:99:9e:e4. There is only one session in the trace file that fails. It comes after a couple of sessions over the Cisco and before a couple of sessions over the Riverbed. As the Riverbed device is most likely a WAN optimizer, could it be that the tunnel to the remote location is flapping, and that when Nagios polls while the tunnel is being rebuilt, the SSL session to the server 10.49.32.186 fails? What is the LAN setup at the Nagios side of the connection?

answered 31 Jul '13, 01:38 SYN-bit ♦♦

Thank you for the reply. Traffic from Nagios goes to the router. The router sends packets to the WAN provider's router through the Riverbed hardware. The setup is the same on the other side of the WAN. I don't think the tunnel to the other location is flapping, but is there a way to check this? I have access to the routers before the Riverbed, but the WAN provider's routers are not accessible to me. (31 Jul '13, 02:01) Macumazan

As the return packets from the Riverbed in stream 11 have an ip.ttl of 64, it looks like the Riverbed is directly connected to (in the same vlan/ip-subnet as) the Nagios server. Are you sure it is behind the router with the TyanComp mac-address (as seen from the Nagios server)? Can you identify all mac-addresses in the trace (TyanComp and Cisco) and tell me which device uses which mac-address? What does a traceroute from Nagios to the server show, and what does a traceroute from the server to Nagios show? Are the routers redundant? What kind of first hop redundancy protocol do you use (HSRP, VRRP, etc.)? (31 Jul '13, 02:18) SYN-bit ♦♦

So, the network looks like this: Nagios -> Linux host acting as router() -> HP switch -> Riverbed hardware -> Verizon router() -> ... WAN ... -> Verizon router -> Riverbed hardware -> HP switch -> Linux host acting as router -> Nagios client. The Cisco is the Verizon router(). Traceroute from Nagios: Traceroute from the Nagios client: I think we don't use any redundancy protocols. (31 Jul '13, 03:54) Macumazan

Is the Nagios host connected to the same HP switch? And is it on the same vlan as the Linux router and the Riverbed? What is the subnet mask used on the Nagios host, the Linux router and the Verizon router? I suspect a subnet mask of 255.255.240.0 (a /20), putting all devices in the same IP subnet and therefore creating asymmetric routing. I bet step 5 in the reverse Nagios trace was actually a response from the Verizon router (public interface). Does the Verizon router point back to the Linux router (if its subnet mask is smaller than 255.255.240.0)? Regarding the flapping of the Riverbed tunnel, do you have access to the Riverbed device? Or can you contact someone with access to it to check whether there is anything in the logging at the times Nagios reports the server as down? (31 Jul '13, 04:11) SYN-bit ♦♦
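For reference, the mac-address and TTL questions above can be answered straight from the trace with tshark; a minimal sketch, assuming the capture file is named nagios.pcap (the filename is a placeholder, the server IP is the one from the answer above):

    # For every reply packet from the server IP, print the TCP stream,
    # the source MAC, and the IP TTL -- this shows per stream whether
    # the return path was the Cisco or the Riverbed, and which TTL it had.
    tshark -r nagios.pcap -Y "ip.src == 10.49.32.186" \
           -T fields -e tcp.stream -e eth.src -e ip.ttl | sort -u
    # The forward/reverse path check asked for above:
    traceroute 10.49.32.186        # run from the Nagios server, and the
                                   # mirror-image traceroute from the client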
Yes, the vlan is the same for them. Nagios is connected to the same switch. The subnet mask is 255.255.240.0, yes. Step 5 is the Verizon router; it points to the Linux router. And I found that when Nagios shows this error, the Riverbed gives an error as well:

    [io/inner/prod.ERR] 128089795 {10.32.241.141:60513 10.49.32.192:5666} Err while reading: Connection reset by peer

That's another host with the same error, from today. (31 Jul '13, 04:30) Macumazan

OK, that indeed explains the asymmetric routing seen in the tracefile: as the Verizon host is in the same subnet as the Nagios host, it will send the return traffic straight to the Nagios host instead of via the Linux router. Although the reason for this design is not clear to me, I don't think it is the reason for the failing connections. It would be interesting to see a packet capture made on the Riverbed when this problem occurs, on both the inside interface (connected to the switch) and the outside interface (connected to the Verizon router). I still suspect the Riverbed tunnel, as the response time for the FIN after the Client Hello is ~11 ms, while the RTT in the 3-way handshake was ~45 ms. This means the Riverbed must have decided to send the FIN without waiting for the response to the Client Hello from the other side. Did you also make a packet capture on the remote side? It would be interesting to see what happens on the network there (preferably also before and after the Riverbed device). (31 Jul '13, 05:09) SYN-bit ♦♦

I think I found the cause -> the Riverbed hardware. After checking the logs from the Riverbed, I found that it sometimes gives this type of error:

    [admission_control.NOTICE] - {- -} Connection limit achieved. Total Connections 611,Branched Warmed Connections 0

So it looks like we have a limit of 611 optimized connections. I set the Nagios server's traffic to pass through the optimization tunnel, and I don't see any SSL issues for now. I will keep checking until tomorrow to see if the issue is fixed. It was very nice of you to help me figure this out! It was like a lesson, and a vector pointing me to where I need to develop my network debugging skills :) (31 Jul '13, 08:16) Macumazan
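A minimal sketch for the "keep checking until tomorrow" verification, assuming the standard source-install path for check_nrpe and the monitored host from the trace (adjust both, and the log filename, to your environment):

    # Run the NRPE check once a minute and timestamp every result, so any
    # remaining intermittent handshake failure shows up in the log with
    # the exact time it occurred.
    while true; do
        echo "$(date '+%F %T') $(/usr/local/nagios/libexec/check_nrpe -H 10.49.32.186 2>&1)"
        sleep 60
    done >> nrpe_watch.log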
Thank you for a great explanation.