We have a Gateway server. It's a Windows 2003 O/S running a Java JBoss application. It uses TCP/IP to communicate with remote medical devices that are connected wirelessly.
These devices interchange basic XML messages using TCP/IP to our server which listens on a single port (51244).
At several of our client sites, we've noticed these messages aren't being processed properly. We performed a packet capture using Wireshark and noticed frequent RSTs that appear to be generated by the server. When this occurs, the client senses it and opens another connection to try sending its messages again. This creates a problem: since messages aren't being processed properly, the server gets too many requests from clients (up to several hundred XML message transmission requests per second).
Our 3rd party application provider says it's not his Java application that is causing the RST, but either the server or the VMware environment the server is running under. (Our Gateway is sometimes installed in a VMware solution.)
So my question is this: how can I determine if it's the application issuing the RST, or if somehow it's generic O/S or Java socket issues?
Supposedly, the JBOSS application uses basic java sockets. There's a thread pool that listens for incoming connection attempts on port 51244 and then handles the socket communications to those devices.
I am looking for some advice on how to determine whether the app issued the RST, whether a thread bombed abnormally and issued the RST, or whether the O/S issued the RST.
It's clear it's coming from the server, but I need some help in understanding if there are ways to track down WHAT COMPONENT on the server is causing the RST.
So I apologize that this isn't necessarily a specific Wireshark question, but I'm wondering if others have advice on finding the root cause.
BTW...this is a 3rd party developed application. we have zero insight into the code.
asked 20 Dec '10, 09:39
A couple of things:
My suggestions are:-
Well, this is not exactly easy to investigate. In my experience there are a couple of reasons for RSTs being sent:
I may have forgotten one or the other additional reason for RSTs being sent, but this is just what I could think of at the moment... hope it helps.
answered 20 Dec '10, 17:03
Applications don't actually issue RSTs - that's the job of the TCP stack. Normally, for instance, if an app cleanly close()s a socket, it will cause a FIN to be sent. You can however force a quick shutdown when you close a socket (by setting the SO_LINGER timer to 0), which it seems will cause a RST. (I looked at http://tangentsoft.net/wskfaq/articles/debugging-tcp.html amongst other references.) RSTs are also normally sent when a packet arrives on an unestablished socket - or at least one that the receiver has no resources to process. A RST might also be sent by an intervening device, like a firewall refusing a connection.
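To make the SO_LINGER point concrete, here is a minimal Java sketch of the difference between an orderly close and a close with the linger timer set to 0. The class and method names are made up for illustration and have nothing to do with the 3rd party application; the sketch only shows what a client on the loopback interface observes in each case, assuming the stack turns the zero-linger close into a RST as described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; not taken from the application discussed in this thread.
final class LingerResetDemo {

    /**
     * Accepts one loopback connection, closes it on the server side and
     * reports what the client observes: "EOF" after an orderly close (FIN),
     * or "RESET" when the read fails with an IOException (the usual
     * symptom of a RST).
     */
    static String closeAndObserve(boolean abortive) {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(listener.getInetAddress(), listener.getLocalPort());
             Socket accepted = listener.accept()) {
            if (abortive) {
                // Linger enabled with a timeout of 0: close() discards unsent
                // data and the stack aborts the connection instead of sending FIN.
                accepted.setSoLinger(true, 0);
            }
            accepted.close();
            InputStream in = client.getInputStream();
            try {
                return in.read() == -1 ? "EOF" : "DATA";
            } catch (IOException e) {
                return "RESET";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("orderly close:  " + closeAndObserve(false));
        System.out.println("abortive close: " + closeAndObserve(true));
    }
}
```

Running this while capturing on the loopback interface should show a FIN for the first case and a RST for the second, which is one way to see what an application-forced reset looks like on the wire.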
Assuming a firewall isn't the issue, your best bet is to turn up the level of logging on your application/server and see if it provides information on issues of state or resource exhaustion.
Also you may want to invoke "perfmon" on your server and monitor some of the TCPv4 parameters around connections.
In Linux I would suggest using "strace" as well to monitor system calls. There do appear to be strace-like tools for Windows that might also be useful.
Microsoft also have some Winsock tracing tools that might be useful - http://msdn.microsoft.com/en-us/library/bb892103%28v=vs.85%29.aspx
Can you use editcap to keep just 128 bytes or so and post the trace? The thing you have to look out for is the advertised window size from the server. TCP RST basically means something horribly went wrong and the stack is just going to give up. However, IE can and does use RST to quickly tear down sessions. So IE browser clients sending RST to close out SSL sessions is not unusual (nor is it a problem).
So look at the window size coming from the server to see if you see something odd. For example, do you see some time passing (more than 200ms) and the window size STILL has not incremented? This basically means the server is not accepting the data from the stack fast enough. In fact, this will be in one of my Sharkfest sessions this year. Also, do you have the SYN and the SYN-ACK? What window sizes are being negotiated? Could someone have jacked up the window scaling so much that the server is running out of memory? Finally, are the sequence/ack numbers within the expected range? It could be that something in the path (load balancers, WAN accelerators etc.) is getting confused. Good luck.
answered 20 Dec '10, 18:57
Guys, I appreciate the feedback.
I've gone down the rabbit hole on this one. I've read RFC 793 to understand TCP better and am getting a handle on things.
From the packet trace, the RSTs are coming from the server. From what I've read up on the Java ServerSocket class, it is possible for the application to initiate a forced close that triggers a RST. Unfortunately, we do not have access to the code to see whether the mechanism it uses for closing a socket is one that forces a reset.
We see plenty of TIME_WAITs on the server, and believe that the application is overloaded. What's peculiar is that this is a VMware installation and we've noted a performance decrease on our virtual gateway compared to our physical one.
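One way to put a number on "plenty of TIME_WAITs" is to tally the connection states that `netstat -an` reports for the listening port. The Java sketch below does that for text in the usual four-column layout (Proto, Local Address, Foreign Address, State). The class name is hypothetical and the sample lines in `main` are invented for illustration, not taken from our server.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the sample data below is made up.
final class NetstatStateTally {

    /** Counts TCP states for `netstat -an` lines whose local address ends in ":port". */
    static Map<String, Integer> tally(String netstatOutput, int port) {
        Map<String, Integer> counts = new TreeMap<>();
        String suffix = ":" + port;
        for (String line : netstatOutput.split("\\R")) {
            String[] f = line.trim().split("\\s+");
            // Expected columns: Proto, Local Address, Foreign Address, State
            if (f.length == 4 && f[0].equals("TCP") && f[1].endsWith(suffix)) {
                counts.merge(f[3], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample in the layout printed by `netstat -an`.
        String sample = String.join("\n",
            "  TCP    0.0.0.0:51244      0.0.0.0:0          LISTENING",
            "  TCP    10.0.0.5:51244     10.0.0.21:1301     ESTABLISHED",
            "  TCP    10.0.0.5:51244     10.0.0.22:1302     TIME_WAIT",
            "  TCP    10.0.0.5:51244     10.0.0.23:1303     TIME_WAIT",
            "  TCP    10.0.0.5:3389      10.0.0.9:4000      ESTABLISHED");
        System.out.println(tally(sample, 51244));
    }
}
```

Sampling this every few seconds during one of the reconnect floods would show whether the TIME_WAIT count tracks the load or keeps growing.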
Our application's lower level socket / connection processor to our medical devices is definitely overwhelmed, perhaps from poor design, but it is likely that VMware is somehow interfering with the normal processing of TCP socket connections.
What we noticed is that when the server is down for maintenance, our devices queue up messages. After some time, when the application is restored, 200+ devices instantly try to open connections to the server to hand off messages. The server application is apparently designed to handle eight (8) socket connections simultaneously. So there are all these other devices that don't get a connection to the application socket. After a 1 second timeout, they try again.
After some time, the application finally catches up with the overload (flood) of messages and normal order is restored.
Our developer does not let us look inside code, so we do not have any means of understanding whether this is efficient or not. It does NOT seem like a good architecture in today's technology.
However, this does not change the fact that the same code (application) works OK in a physical environment. Somehow, there's something causing interference. Perhaps it's the VMware NIC driver, the VM switch, an overwhelmed VM host or VM networking, I'm not sure. So even though we doubt the application architecture is solid, we do note that there's a difference between physical and virtualized instances.
Regarding the VMware virtualized instance of our application, even under reduced load (few devices), the application can't seem to keep up. We see what our developer is calling a "device Denial of Service" in VMware only. But it's not all implementations on virtualization. There are some that are working well on VMware, some that work well on Hyper-V, and all work well on physical hardware.
I think our issue is that all of our customers control their own VMWARE environments. So we have no control or knowledge of VMWARE configurations. We are just a VM inside their VMWARE infrastructure.
So i'm not saying it's a VMWARE issue for sure. It likely is a combination of poor architecture and perhaps some related VMWARE configuration or provisioning issue.
What I'd like to learn is how to determine or measure whether there's significant packet loss happening on the interface. Why are we getting so many RSTs and TIME_WAITs? Is it because normal socket connection closes are losing the FIN/ACK, FIN/ACK exchange and the O/S then resets the connection?
I've learned more about this than I really intended to learn, but it's still not enough to root cause why so many server RSTs and TIME_WAITs are occurring. Clearly the application's connection limits aren't keeping up with demand.
But it's so odd that when a connection gets established (SYN, SYN/ACK, ACK) and a PUSH/ACK starts to deliver data, there's an immediate RST. The device gets the RST after the three-way handshake, but by then it's in the middle of pushing some data. The RST comes in twice: once after the three-way handshake, then the PUSH/ACK is seen in the packet capture, then the RST comes in again.
So in the packet capture it goes like this:
So my thought is this:
The device initiates a connection. The Windows 2003 O/S processes that connection at the WINSOCK level. The application has no way to handle the socket connection. So WINSOCK knows that the buffer is full and the application is backlogged, and WINSOCK (the O/S) issues an immediate RST. Before the device sees that RST, it's already pushing (or attempting to push) data to the server. WINSOCK (the O/S) sees this data and sends a RST again.
that's my theory.
Is there a way, that you know of, to limit how many incoming TCP connections a server will handle? Our application is constrained in its socket connection handling and is overwhelmed. I wonder if there's a way to throttle the connections somehow, kind of like a governor, to reduce the stress on the application as a short-term remediation while they investigate why the application can't handle the load.
Also, if a RST is issued, does it therefore mean that the socket connection will go into TIME_WAIT because it never properly closed (FIN/ACK, FIN/ACK)?
answered 29 Dec '10, 06:04
Yikes, looks like a big mess. I'd say your primary problem is that you don't seem to have any kind of leverage on the developers to get their design fixed to scale better. You should really look into this and see if there's anything you can do to get them to cooperate on this - from what you tell us I'm pretty sure your trouble is caused by the application design/application scalability.
Regarding the SYN - SYN/ACK - ACK - RST sequence: I've seen that happen whenever an application is listening on a port and after the TCP socket has handled the connection establishment and tells the application that there is a new communication partner the application denies the new connection (for whatever reason the programmer chose). I don't think it's just a simple buffer issue; it's the application forcefully denying a new connection. For example: in one of my latest cases it was an FTP server that would deny connections from any IP except those coming from a specific range. Sometimes the client is happy that the three way handshake worked and sends data right away, which is why you get two resets: one caused by the application denying the new connection and one from the tcp stack that receives the client data after the first reset and resets again. So you got that one right.
And no, as far as I know there is no way to limit TCP connections except by deploying a firewall in front of the server that will only let a certain number of connections through. In my eyes this is a bad workaround and won't do much good.
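For what it's worth, if the throttle were added inside the application itself (or in a small relay placed in front of it), the "governor" idea could look roughly like the Java sketch below. This is purely illustrative: `ConnectionGovernor` is a made-up name, and nothing here reflects how the 3rd party application actually works. It only shows a counting semaphore admitting a fixed number of connections at a time and turning the rest away without blocking.

```java
import java.util.concurrent.Semaphore;

// Hypothetical admission gate; not part of any existing product or API.
final class ConnectionGovernor {
    private final Semaphore permits;

    ConnectionGovernor(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Takes a slot if one is free; returns false immediately when the limit is reached. */
    boolean tryAdmit() {
        return permits.tryAcquire();
    }

    /** Gives a slot back once the connection has been fully handled. */
    void release() {
        permits.release();
    }

    int freeSlots() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        ConnectionGovernor governor = new ConnectionGovernor(2);
        for (int device = 1; device <= 3; device++) {
            System.out.println("device " + device + " admitted: " + governor.tryAdmit());
        }
        governor.release(); // one connection finishes
        System.out.println("free slots after one release: " + governor.freeSlots());
    }
}
```

The caller would invoke `tryAdmit()` right after accepting a socket and `release()` in a `finally` block when the handler is done. A rejected connection still has to be closed somehow, so this shapes the load on the handler threads rather than removing the retries from the devices.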
Regarding the TIME_WAIT: this is NOT a result of a RST on a connection - those connections are shut down immediately and don't go through TIME_WAIT. TIME_WAIT happens after a graceful shutdown (FIN/ACK/FIN/ACK), and as far as I know on Windows it blocks resources for 240 seconds by default. That is an ancient mechanism (regarding the 240s) to cope with late-arriving network packets. Since you say the application only allows 8 concurrent connections, your problems may be caused by waiting too long for TIME_WAIT to complete, because it will block further connections from being accepted. You should configure the TIME_WAIT delay as low as possible (30 seconds on Windows), which is done through registry parameters:
Set the parameter "TcpTimedWaitDelay" at HKLM\SYSTEM\CurrentControlSet\Services\TcpIp\Parameters to 30.
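To get a feeling for what the delay means in numbers, here is a back-of-the-envelope sketch in Java. It assumes a steady rate of gracefully closed connections where the server is the side that ends up holding TIME_WAIT, and it uses 200 closes per second purely as an illustrative figure (the question only says "several hundred" requests per second). The class name is made up.

```java
// Hypothetical estimate; the 200/s rate is an assumption, not a measurement.
final class TimeWaitEstimate {

    /**
     * Little's law: the average number of sockets sitting in TIME_WAIT equals
     * the rate at which sockets enter that state times how long each one stays.
     */
    static long steadyStateSockets(long closesPerSecond, long timedWaitDelaySeconds) {
        return closesPerSecond * timedWaitDelaySeconds;
    }

    public static void main(String[] args) {
        long assumedClosesPerSecond = 200;
        System.out.println("delay 240 s: " + steadyStateSockets(assumedClosesPerSecond, 240));
        System.out.println("delay  30 s: " + steadyStateSockets(assumedClosesPerSecond, 30));
    }
}
```

Under those assumptions the default delay leaves about 48,000 sockets in TIME_WAIT at any moment, against about 6,000 with the delay set to 30.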
The other action you should take is to find a good way to handle more concurrent connections at the application level, which is where you need to talk to the developers. They could implement some sort of broker/agent architecture (or improve the existing one if there is any), or maybe a loadbalancer can balance the incoming transaction to multiple application servers.
Regarding VMware: if you think the problem is with the virtualized hosts you need to investigate what kind of environment it is. Is it an enterprise setup, or someone running VMs on free virtualization solutions on cheap hardware? How many VMs are there? Is there resource management in place, and does it grant enough resources to the application VMs? How about network bandwidth, NIC teaming, traffic shaping etc.? So far I haven't seen bad problems like yours being caused by VMware alone.
Hope this helps a bit.