I am helping a volunteer community run a local library. We work with the local authority and have to use their book loan/reservation system. At the moment it is driving the volunteers crazy with unexplained and undiagnosed IT problems. The problem is not unique to our site: other volunteer libraries see it with the same frequency (as often as 10 or more times per day).
We use a web-based app, and part way through a transaction the app freezes. This usually becomes apparent when the barcode reader beeps as a barcode is read but no response message is displayed from the central system (the response should appear without any key being pressed). If a volunteer presses the Enter key when they suspect a freeze, they get a message saying "xxx.gov.uk is not responding". If they then take any action, the record in question is locked for the rest of the day. Alternatively, they can do nothing and wait for the message to go away of its own accord, which takes around 2 minutes; in that case no record is locked and they can continue.
I have used Wireshark to trace the client side of the transactions, and the page spanning the problem is shown here.
The volunteers say the problem started at 12:41:26, but it definitely started earlier than that, because they first had to realise there was a problem and then note the time stamp. The real start will have been several seconds, or possibly longer, before. The system recovered by itself by 14:43:27. This time is probably 2 seconds or so after the "not responding" message had gone away, again because of the time it would have taken the volunteers to respond.
I'm really out of my depth in understanding the exact meaning of these trace records, hence my appeal to anyone with better knowledge for help. I can supply the detailed trace file if needed.
Many thanks for any help.
It would indeed really help if you could share the trace file; please look at Jasper's suggestion for that.
Looking at your description, there are a few things that pop up in my mind:
All in all, it feels like the application does not cope well with unexpected things in the network. Looking at the traces will tell more :-)
answered 21 Jun '12, 08:42
Looking at your trace files and accessing the HTTPS server, I wonder about what I get:
Who is fred and why was he there???
Anyway, can you please connect to the site through Fiddler (http://www.fiddler2.com) and post the *.saz file somewhere? HINT: Please activate the SSL "proxy" in Fiddler (see options).
How about this?
When we run this command:
we can see alternating IP IDs for connections from the same IP. IP IDs are increased in steps of one: 0x1063, 0x1064, 0x1065, .... Then there is a new set of IP IDs: 0x1f26, 0x1f27, 0x1f28, then they jump back to: 0x1069, 0x106a, 0x106b, then this sequence: 0x5cde, 0x2f90, 0x2f91, 0x5cdf.
This looks like a load balancer to me. There are different TCP connections to at least 3-4 backend servers. They probably did NOT configure "sticky sessions" on the load balancer (or they don't work properly).
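To make the reasoning about the IP IDs concrete, here is a minimal sketch of how the interleaved values can be grouped into separate counters. It assumes each backend increments its own 16-bit IP ID counter by small steps, which is common but not universal (some stacks randomise the field or keep per-connection counters), so treat the result as a hint rather than proof. The function name `group_ip_ids` and the `max_gap` threshold are made up for illustration.

```python
def group_ip_ids(ip_ids, max_gap=16):
    """Assign each IP ID to the counter it most plausibly continues.

    Heuristic only: assumes every sender uses one incrementing counter.
    max_gap is an arbitrary tolerance for packets missing from the capture.
    """
    counters = []  # each entry holds the IDs attributed to one sender
    for ip_id in ip_ids:
        best, best_gap = None, None
        for seq in counters:
            # The IP ID field is 16 bits wide, so compute the step modulo 2**16.
            gap = (ip_id - seq[-1]) % 0x10000
            if 0 < gap <= max_gap and (best_gap is None or gap < best_gap):
                best, best_gap = seq, gap
        if best is None:
            counters.append([ip_id])  # no counter fits: assume a new sender
        else:
            best.append(ip_id)
    return counters


# The IP IDs quoted above, in the order they appeared.
observed = [0x1063, 0x1064, 0x1065, 0x1f26, 0x1f27, 0x1f28,
            0x1069, 0x106a, 0x106b, 0x5cde, 0x2f90, 0x2f91, 0x5cdf]

for seq in group_ip_ids(observed):
    print([hex(i) for i in seq])
```

With the values quoted above, this grouping yields four independent counters, which is where the estimate of 3-4 backend servers comes from.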
Again, I can only guess, but maybe your application needs the client to connect to the same server for a defined amount of time (usually the duration of an application session), which is probably not the case (according to your capture file). Please ask the server admins whether there is a load balancer in place and whether "sticky sessions" are enabled for that virtual server.
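As an illustration of why missing sticky sessions could matter, here is a toy model, not a description of the real library system. It assumes (purely hypothetically) that each backend keeps session state only in its own memory. The `Backend` class, the `simulate` function and the request names are invented for this sketch; real load balancers usually pin clients by cookie or source IP rather than by the hash used here.

```python
import itertools
import zlib


class Backend:
    """Hypothetical server that keeps session state only in local memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}

    def handle(self, session_id, action):
        if action == "login":
            self.sessions[session_id] = "in progress"
            return "ok"
        # Any later request needs state created by an earlier request here.
        return "ok" if session_id in self.sessions else "unknown session"


def simulate(sticky):
    backends = [Backend(f"srv{i}") for i in range(3)]
    round_robin = itertools.cycle(backends)

    def pick(session_id):
        if sticky:
            # Same session always maps to the same backend.
            return backends[zlib.crc32(session_id.encode()) % len(backends)]
        return next(round_robin)  # each request may land on another backend

    requests = [("volunteer-1", "login"),
                ("volunteer-1", "scan"),
                ("volunteer-1", "scan")]
    return [pick(sid).handle(sid, action) for sid, action in requests]


print("without sticky sessions:", simulate(sticky=False))
print("with sticky sessions:   ", simulate(sticky=True))
```

In this model the requests after the login fail without stickiness because they reach servers that never saw the session. Whether the real application behaves like this is exactly the question for the server admins.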