Hi, I'm new to Wireshark and I have this problem on my network. I have a server that is losing its connection to the Omnicenter server. Every day at around 4:07 am, users complain that they can't access the application, and Omnicenter reports that it can't ping the server. In reality the server is up and running: I'm able to ping it, log in, and work on it without losing any connection. Users say they lose connectivity to the application for about 5 minutes. Here is the capture file: https://1drv.ms/u/s!AiNIPG8Ke4TTnXSdHGDLRG9iycLU Where is the problem, and how can I solve it?
asked 08 Nov '16, 13:21 helderguzman
I'm afraid I don't understand the picture. What is the GSFNDCMWMSDSI14? Why are the MSSQL_SVR cluster and the AS400 relevant to the issue between the users and the Omnicenter server? Why are two lines crossed?
If the GSFNDCMWMSDSI14 is a central routing and firewalling device and all plain lines in the diagram are (V)LAN segments, you should capture at the critical time simultaneously on all interfaces of all boxes that are relevant for complete processing of a user query. I.e. if the Omnicenter server needs to talk to the MSSQL_SVR and/or to the AS400 in order to process and respond to a request from a user, the interfaces of these boxes should be captured as well. That way, by comparing the captures, you should find which of the intermediate boxes doesn't let the packets through (if any), or which box does not respond to an incoming request in time due to some internal issue (e.g. a scheduled database backup preventing it from processing queries).
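To illustrate the "comparing the captures" step, here is a minimal Python sketch. It assumes you have already exported the packets from two capture points into simple records, and that source, destination and IP ID are unchanged between the two points (no NAT in between). The `Pkt` record and the `missing_downstream` helper are hypothetical names for this illustration, not part of Wireshark.

```python
from typing import Iterable, List, NamedTuple


class Pkt(NamedTuple):
    """One exported packet (hypothetical record layout)."""
    src: str      # source IP address
    dst: str      # destination IP address
    ip_id: int    # IP identification field
    ts: float     # capture timestamp in seconds


def missing_downstream(upstream: Iterable[Pkt],
                       downstream: Iterable[Pkt]) -> List[Pkt]:
    """Return packets seen at the upstream point but not downstream.

    Packets are matched on (src, dst, ip_id); timestamps are ignored
    because the two capture clocks are rarely perfectly aligned.
    """
    seen = {(p.src, p.dst, p.ip_id) for p in downstream}
    return [p for p in upstream if (p.src, p.dst, p.ip_id) not in seen]
```

If the returned list is non-empty only during the outage window, the box between the two capture points is the likely place where packets are being dropped.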
The application server is GSFNDCMWMSDSI14. Omnicenter is the monitor tool.
Users -> application server -> sql server and AS/400
Omnicenter pings the application server to make sure the server is online.
* The users complain that they lose connectivity with the application server around 4:05 am, and it lasts five minutes.
Omnicenter gets an alert that it is not able to ping the application server.
What is happening between 4 am and 4:30 am? Why do users and Omnicenter lose connectivity?
Server never went down.
- It is pingable from my end, and others can ping the server as well.
- I can access the server.
OK, what is the network infrastructure representing the two crossed logical links? Switches, routers... common for users and the supervisor server which are not used on the way to the sql server and AS/400?
Capturing at the GSFNDCMWMSDSI14 and user machine simultaneously should show immediately whether the issue is in network communication between the two or something else. Assuming that the supervisor server doesn't send application requests similar to those sent by users and still is affected by the issue, I vote for a network problem.
Another hint is that if you apply a display filter
tcp.flags.syn == 1 and tcp.flags.ack == 0
on your capture, you can see that 10.162.200.59 (the GSFNDCMWMSDSI14) repeatedly attempts to establish a TCP session towards 10.129.5.166, and most of these attempts (if not all of them) fail. So maybe it is a data fetch issue after all?
... and when you apply a filter
ip.addr == 10.129.5.166 and ip.addr == 10.162.200.59 and tcp.flags.syn == 1 and tcp.flags.ack == 1
you'll see that actually none of these attempts has succeeded. I don't know whether you were capturing only in the middle of the downtime or whether the capture also contains a time interval before or after the outage; if the capture also covers time outside the downtime, this issue is not the reason.
The capture contains around 30 minutes, and the issue happened in between.
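As a rough offline version of the two display filters above, the following Python sketch lists connection attempts (SYN without ACK) that never received a matching SYN-ACK. It assumes the TCP packets have been exported into simple records; the `TcpPkt` record and the `unanswered_syns` helper are hypothetical names for this illustration, not a Wireshark API.

```python
from typing import Iterable, List, NamedTuple, Tuple

Conn = Tuple[str, int, str, int]  # (src, sport, dst, dport)


class TcpPkt(NamedTuple):
    """One exported TCP packet (hypothetical record layout)."""
    src: str
    sport: int
    dst: str
    dport: int
    syn: bool
    ack: bool


def unanswered_syns(packets: Iterable[TcpPkt]) -> List[Conn]:
    """Return each attempted connection that never got a SYN-ACK back."""
    pkts = list(packets)
    # A SYN-ACK travels server -> client, so reverse it to get the
    # client -> server tuple of the SYN it answers.
    answered = {(p.dst, p.dport, p.src, p.sport)
                for p in pkts if p.syn and p.ack}
    result: List[Conn] = []
    seen = set()
    for p in pkts:
        key = (p.src, p.sport, p.dst, p.dport)
        # Retransmitted SYNs share the same tuple and are reported once.
        if p.syn and not p.ack and key not in answered and key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Comparing the result for the outage window against a window when everything works would show whether the failing attempts towards 10.129.5.166 are specific to the downtime.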
Well, the ping inside the trace does not show any problem. On which port does your service listen?
Here is some additional info. This is the error I'm getting:
WMI error on GSFNDCMWMSDSI14: Can't use string ("Timed out while retrieving data.") as a HASH ref while "strict refs" in use at /usr/local/share/perl5/Netreo/Poller/WMI.pm line 1019.
Does this error message bear a timestamp? If so, we could correlate it with some TCP RST in a capture taken simultaneously.
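A minimal sketch of that correlation in Python, assuming the error timestamp and the RST packet times have been collected by hand, and that the clocks of the monitoring server and the capturing machine are reasonably synchronized. The `rsts_near` helper and the 5-second default window are hypothetical choices for this illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def rsts_near(error_time: datetime,
              rst_times: Iterable[datetime],
              window_s: float = 5.0) -> List[datetime]:
    """Return the RST timestamps within window_s seconds of the error."""
    window = timedelta(seconds=window_s)
    return [t for t in rst_times if abs(t - error_time) <= window]
```

An empty result would suggest the WMI timeout is not tied to a reset connection, at least not within the chosen window.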