I am helping a volunteer community run a local library. We work with the local authority and have to use their book loan/reservation system. It is currently driving the volunteers crazy with unexplained and undiagnosed IT problems. The problem is not unique to our site: other volunteer libraries experience it with the same frequency (up to 10 or more times per day).

We are using a web-based app, and part way through a transaction the app freezes. This usually becomes apparent when the bar code reader beeps as the bar code is read but no response message from the central system is displayed (the response should appear without any key being pressed). If a volunteer presses the Enter key when they suspect a freeze, they get a message saying "xxx.gov.uk is not responding". If they take any action, the record in question is locked for the rest of the day. Alternatively they can do nothing and wait for the message to go away of its own accord, which takes around 2 minutes; in that case no record is locked and they can continue.

I have used Wireshark to trace the client side of the transactions and the page spanning the problem is shown here. The volunteers say the problem started at 12:41:26, but it definitely started earlier than that, because they had to realise they had a problem and then note the time stamp, so the problem will have started several seconds, or possibly longer, before. The system recovered by itself by 14:43:27. This time is probably 2 seconds or so after the "not responding" message had gone away, again because of the time it would have taken the volunteers to respond.

I'm really out of my depth in terms of understanding the exact meaning of these trace records, hence my appeal for anyone with better knowledge to help. I can supply the detailed trace file if needed. Many thanks for any help. asked 21 Jun '12, 08:19 Bernard46 edited 21 Jun '12, 08:24 |
3 Answers:
It would indeed really help if you could share the trace file; please see Jasper's suggestion for that. Looking at your description, a few things come to mind:
All in all, it feels like the application does not cope well with unexpected things in the network. Looking at the traces will tell more :-) answered 21 Jun '12, 08:42 SYN-bit ♦♦ |
looking at your trace files and accessing the https server, I wonder what I get:
Who is fred and why was he there??? Anyway, can you please connect to the site through Fiddler (http://www.fiddler2.com) and post the *.saz file somewhere? HINT: Please activate the SSL "proxy" in Fiddler (see options). Regards answered 21 Jun '12, 10:26 Kurt Knochner ♦ edited 21 Jun '12, 13:32

Hi Kurt, I didn't want to mention the actual site name in the Q&A (hence the xxx in the original post) because I didn't want to draw too much attention to where the issue is, but it's not that important. However, as you will probably appreciate from my earlier comments about us being volunteers and this being a local government website, I'm a bit limited in what I can do. The PC we have to use for this app is actually owned by the local government organisation and is locked down such that I cannot install anything on it or change anything. They have total control over it, and I mean total. I did manage to persuade them to install Wireshark, but I'm not at all sure I would get away with Fiddler as well. In any case I have to get them to come out to the PC to do anything with it; they won't give me the password to install software. If you can make any inspired guesses about what might be going on, I can try asking for permission to get further diagnostic info. One thing I have noticed is that the clock on this particular PC is about 7 minutes fast compared with real time (previously it was 60 minutes slow). I think it's irrelevant here, but could this have any bearing on the problem? regards, Bernard (21 Jun '12, 11:07) Bernard46
Sorry, I did not understand that. As you posted the whole capture file, I thought that information was public anyway. I "blanked" the site in my answer.
Well, not really ideal for troubleshooting ;-) Anyway, see my other answer/idea. (21 Jun '12, 13:23) Kurt Knochner ♦

No problem Kurt, if it had been really important I would have said so when I uploaded the trace, but thanks for editing your response anyway. I've passed on your comments to my contact, including the request for using Fiddler, so we'll have to wait and see what comes back later tomorrow. At least I hope they will respond tomorrow. Many thanks for your help, much appreciated. Bernard (21 Jun '12, 15:16) Bernard46 |
How about this? When we run this command:
we can see alternating IP IDs for connections from the same IP. The IP IDs increase in steps of one: 0x1063, 0x1064, 0x1065, .... Then there is a new set of IP IDs: 0x1f26, 0x1f27, 0x1f28, then they jump back to: 0x1069, 0x106a, 0x106b, then this sequence: 0x5cde, 0x2f90, 0x2f91, 0x5cdf. This looks like a load balancer to me. There are different TCP connections to at least 3-4 backend servers. Probably they did NOT configure "sticky sessions" on the load balancer (or they don't work properly). Again, I can only guess, but maybe your application needs the client to connect to the same server for a defined amount of time (usually the duration of an application session), which is probably not the case (according to your capture file). Please ask the server admins whether there is a load balancer in place and whether "sticky sessions" are enabled for that virtual server.
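The command referred to above is not reproduced here. Purely as an illustration of the reasoning, the following Python sketch groups the IP ID values quoted above into separate incrementing counters. The function name `split_ip_id_streams` and the `max_step` threshold are hypothetical choices made for this example, not something taken from the capture or from Wireshark. Several independent counters behind one address are only a hint that more than one device is generating the packets; they are not proof of a load balancer.

```python
from typing import List


def split_ip_id_streams(ip_ids: List[int], max_step: int = 16) -> List[List[int]]:
    """Group IP IDs (in arrival order) into separate incrementing counters.

    A value joins an existing stream if it is a small forward step
    (1..max_step, modulo 2**16) from that stream's last value.
    Otherwise it starts a new stream. max_step is an assumed tolerance.
    """
    streams: List[List[int]] = []
    for ip_id in ip_ids:
        for stream in streams:
            step = (ip_id - stream[-1]) % 65536  # the IP ID field is 16 bits wide
            if 1 <= step <= max_step:
                stream.append(ip_id)
                break
        else:
            # No existing counter fits, so treat this as a new counter.
            streams.append([ip_id])
    return streams


# IP IDs quoted in the answer, in the order they were listed.
observed = [
    0x1063, 0x1064, 0x1065,
    0x1F26, 0x1F27, 0x1F28,
    0x1069, 0x106A, 0x106B,
    0x5CDE, 0x2F90, 0x2F91, 0x5CDF,
]

streams = split_ip_id_streams(observed)
for number, stream in enumerate(streams, start=1):
    print(number, [hex(value) for value in stream])
```

For the values quoted above this yields four groups, which matches the "at least 3-4" estimate in the answer.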
Regards answered 21 Jun '12, 13:18 Kurt Knochner ♦ edited 21 Jun '12, 13:58

Hi Kurt, The response from the server admin is that there is no load balancer in place since there is only a single server involved, so it looks like we draw a blank on that one. I had a longish discussion this morning with the lady who is our interface point, and she told me the app was designed to run on a dedicated corporate network without any 'outside' internet connectivity. In this context she said her telco and server admin people think the issue is related to the relatively slow connection we have to their corporate network, and possibly to other components such as additional firewalls required to protect them from non-approved access attempts. We have an ADSL link giving us 8 Mbit/s download and up to 1 Mbit/s upload. There is no vast amount of data transfer involved in this app, in speed tests we get very near to the rated speeds, and our line utilisation is very low. I'm a bit reluctant to accept this explanation, since I'm not sure the relative slowness of our link compared with a corporate LAN (which will incorporate remote links over many miles) should give rise to this sort of problem. What do you think? regards, Bernard (22 Jun '12, 04:13) Bernard46

Bernard, O.K. if there is no load balancer in place, the "jumping" IP IDs still indicate that something interferes with the traffic. It could be a firewall with Layer 7 security (content scanning), a WAF (Web Application Firewall), or something similar. However, there are other "issues" in the capture file as well, which require the use of Fiddler. I found several TCP connections where the client does not send any data for a few seconds. Take a look at this:
Without further information about the application (ActiveX or Java used), and without insight into the decrypted traffic, it's hard to say why the client waits that long before it sends data to the server. It's most certainly the HTTP GET/POST, as it's always the first packet with 'Application Data' from the client. I really suggest using Fiddler! (28 Jun '12, 03:04) Kurt Knochner ♦

Kurt, Thanks for your feedback and the time you are spending on this. I'm sorry I can't reply in a more positive manner to some of your suggestions about follow-up actions. The issue is that I'm helping a volunteer organisation to run a library, and the local government organisation they have to interface with provides us with a PC to access their system for managing the library stock. Unfortunately this system is giving us a load of problems, but as you might expect we are finding it difficult to work with the organisation at a detail level because they quote government security regulations at us whenever I suggest anything. It is not as if we are dealing with state secrets (actually it's library books!) but it is very difficult to get them to do specific things for us. I'm trying to identify patterns in the problem from the outside so that I can try to persuade them to look at particular areas for further investigation. In your earlier analysis you mentioned increasing IP IDs. I asked the question you posed about load balancers and, as I mentioned, they replied there was only one server and no load balancer. However, I have to admit I did not really understand what you meant by increasing IP IDs. As far as I could see there was only one IP address involved at the server end; were you referring to the port number? There could certainly be firewalls and other 'security' mechanisms in place because, as I have mentioned, we are a 'non-trusted' external organisation, so we are not allowed to access the system through the same physical and logical path as the organisation's own libraries. 
It could well be that these additional security hurdles are causing the problem, but what I need is some evidence that points to this so that I can get them to investigate, hence my efforts and questions here. One thing that does bother me is that whenever we do get a problem we see a flurry of NTP records from the client. It seems to be asking 4 different time servers for the time and makes around 7 requests, each of which incurs a delay of 5 seconds or more. Now the clock on this client is 7 minutes ahead of UTC, and I'm wondering whether this out-of-sync time or the flurry of NTP requests has anything to do with our problem. Incidentally, each of these problem scenarios ends with an "encrypted alert" message. Does any of this help at all or just confuse the picture? Bernard (30 Jun '12, 15:12) Bernard46
That's a convenient way to say: It's not our problem ;-)
Don't worry about the IP IDs any longer. If there is no load balancer, the problem I suspected (sticky sessions) is not the reason. Regarding the NTP traffic: look at your original capture file with a filter on the client IP address (Tue). You will see NTP traffic, but there is no relation between the NTP requests (only a few) and the SSL connections. I don't think that the NTP requests are a problem, at least they are not a problem in the capture you posted.

Here is another idea: Can't you force the PC to use a proxy (Fiddler on another PC) by changing the proxy settings of IE? If that does not work (because IE is locked down), you could add a local DNS server (or an LLMNR responder) that answers the client's WPAD requests. See WPAD on Wikipedia. If that does not work, you could still add a transparent proxy between the client and the LAN and do SSL decryption there (SQUID). However, that's a lot of effort for a volunteer project. Maybe it's easier to insist that the local government organisation provides a working solution. BTW: Did you try to use that PC on another network? (01 Jul '12, 03:08) Kurt Knochner ♦ |
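The NTP check described in the comment above (is there any timing relation between NTP requests and the stalled SSL connections?) can also be done outside Wireshark once the relevant timestamps have been noted down. The sketch below is only an assumption about how one might do that: the function name `stalls_near_ntp`, the 10-second window, and the timestamps are all made up for illustration and are not taken from the posted capture.

```python
from typing import List


def stalls_near_ntp(stall_times: List[float],
                    ntp_times: List[float],
                    window: float = 10.0) -> List[float]:
    """Return the stall timestamps that have an NTP request within `window` seconds.

    All times are seconds since the start of the capture. If only a few
    stalls show up here, the NTP traffic is probably unrelated to them.
    """
    return [
        stall for stall in stall_times
        if any(abs(stall - ntp) <= window for ntp in ntp_times)
    ]


# Hypothetical timestamps, for illustration only.
example_stalls = [120.0, 480.5, 910.2]
example_ntp = [30.0, 475.0, 2000.0]

related = stalls_near_ntp(example_stalls, example_ntp)
print(f"{len(related)} of {len(example_stalls)} stalls had NTP traffic nearby: {related}")
```

If most freezes had an NTP request close by across several captures, that would be worth raising with the server admins; if hardly any do, the NTP traffic can be set aside.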
"...and the page spanning the problem is shown here." - where exactly? :) Can you upload the trace to www.cloudshark.org so we can take a look? Only do that if the trace doesn't contain critical data, like plain text login details etc.
Sorry about that - I couldn't quite work out how to get the image uploaded - despite copying what someone else appeared to have done I still got it wrong. I have uploaded the trace to Cloudshark as you suggested and the name is https://www.cloudshark.org/captures/546436d1939d - I noticed the times are displayed as relative times (not absolute) - do you need me to give the record numbers for the times I mentioned? Thanks for your help.
I've uploaded a trace for a successful session from logon to logoff as https://www.cloudshark.org/captures/a0fa1c782945 This consisted of the following end user actions:
1) App start
2) Logon
3) Enter Loan section of app
4) Read borrower's card via bar code scanner and auto submit
5) Return to main menu
6) Select book return option
7) Read book bar code via scanner and auto submit
8) Logoff
I agree with you that the app is probably not the best in the world. My objective is to try and provide the evidence for the Library Authority to go back to the vendor and get them to fix it so that it stops driving our volunteers mad. Right now we're in danger of losing volunteers because of the way this app is behaving, unfortunately. Anything you can suggest would be very welcome. Thanks for helping.