Apparent Failure to Negotiate TCP Session

asked 2017-12-01 18:51:09 +0000

We have a web server running Windows Server 2012 and a SQL Server Analysis Server running Windows Server 2016. Both servers are virtual machines on ESXi 6.5 Update 1. The web server is in our DMZ; the SQL server is in our LAN, with connectivity through our Palo Alto firewall. We are getting intermittent errors when placing a load on the Analysis server by running queries. The firewall shows no dropped packets. By default, both operating systems have the ECN TCP enhancement enabled. The NICs on both servers are at their default settings, including TCP Chimney and various offloads.

A packet capture taken during an error shows an apparent failure to negotiate the TCP session. I don't see any data transmitted.

web server sends SYN, ECN, CWR (on a SYN this flag pair requests ECN support rather than signaling congestion) Window size value: 8192 [Calculated window size: 8192]

sql server sends SYN, ACK, ECN Window size value: 8192 [Calculated window size: 8192]

web server sends ACK Window size value: 4106 [Calculated window size: 1051136] [Window size scaling factor: 256]

web server sends SYN, ACK, ECN Window size value: 8192 [Calculated window size: 8192]

sql server sends SYN, ACK, ECN Window size value: 8192 [Calculated window size: 8192]

web server sends ACK Window size value: 4106 [Calculated window size: 1051136] [Window size scaling factor: 256]

sql server sends RST, ACK Window size value: 0 [Calculated window size: 0] [Window size scaling factor: 256]

sql server sends RST, ACK Window size value: 0 [Calculated window size: 0] [Window size scaling factor: 256]
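The calculated window sizes in the listing above follow from the TCP window scale option: the 16-bit window field is shifted left by the negotiated shift count, except on SYN and SYN/ACK segments, where the field is never scaled. A minimal Python sketch of that arithmetic is below. The function name is hypothetical, and the shift count of 8 is inferred from the scaling factor of 256 shown in the capture.

```python
def calculated_window(window_field: int, shift_count: int, is_syn: bool) -> int:
    """Return the effective receive window in bytes (illustrative helper)."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field is a 16-bit value")
    if not 0 <= shift_count <= 14:
        raise ValueError("window scale shift count is limited to 14")
    # The window field in a SYN or SYN/ACK segment is never scaled.
    if is_syn:
        return window_field
    # Scaling factor 256 corresponds to a shift count of 8 (2**8 == 256).
    return window_field << shift_count


# Values taken from the packet listing above.
print(calculated_window(8192, 8, is_syn=True))    # SYN and SYN/ACK segments
print(calculated_window(4106, 8, is_syn=False))   # the ACK from the web server
```

This is why the SYN segments show 8192 for both the raw and calculated values, while the ACK shows 4106 scaled up to 1051136.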

We've increased the timeouts in Analysis Server.

We've executed the following command on just one server, the SQL server, to turn off ECN and did not reboot: netsh int tcp set global ecncapability=disabled

The errors still occur. Can anyone give a clue as to what might be causing the issue? Thank you.

Comments

We need a trace file, otherwise we can't help you. Here you can find a secure way to share trace files: https://blog.packet-foo.com/2016/11/t...

Christian_R ( 2017-12-01 20:25:16 +0000 )

Christian, thank you for your reply! I am new to this site, so I had to figure out how to attach the captures. Our firewall generated three packet captures; they seem to show essentially the same thing.

Users are reporting that, at least so far, the errors are no longer occurring. It may be that turning off ECN did the trick and just took a while to take effect, since we didn't restart the server.

If you or anyone else sees anything in the trace that might explain the errors, it will be much appreciated.

https://www.cloudshark.org/captures/3... https://www.cloudshark.org/captures/5... https://www.cloudshark.org/captures/b...

Alan

adoughe ( 2017-12-01 22:28:07 +0000 )

Okay, I spoke too soon, the problem still exists. We restored a duplicate of the web server and changed it so that it's on our LAN instead of our DMZ. When the web server is on our LAN, and not going through our firewall, there are no errors. If the traffic has to go through our firewall, it produces errors; an example is in the packet captures. Three different firewall techs have told us it's not the firewall, so we continue to troubleshoot it. We're back on the phone with the firewall vendor. If anyone sees anything in the captures that identifies the issue, please let me know. Thanks, Alan

adoughe ( 2017-12-01 22:46:03 +0000 )

Hm, strange indeed. We can say for sure that the web server initiates the session, and it seems it also sends an ACK (or is that only the firewall we see here?). After 14 seconds the SQL server terminates the session because it never received any data from the web server after the ACK. So a trace on or next to the web server would be useful, to see whether the web server sends any packets to the firewall after the ACK.

Christian_R ( 2017-12-02 00:29:43 +0000 )

Christian, thank you for the reply. My co-worker ran a packet capture on both servers and from within the firewall. Even on the analysis server it simply shows the ACK from the web server and then a RST from the analysis server about 14 seconds later. He said that somewhere in the packet captures there are some successful transmissions, since it doesn't fail all the time, so I'm looking through them now to try to find what differences there are between a successful and a failed query.

adoughe ( 2017-12-04 21:38:34 +0000 )
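One way to compare successful and failed connections, as discussed in the comments above, is to label each TCP flow by whether the handshake completed and whether any payload followed before a reset. The Python sketch below is only an illustration: the `Segment` fields and the `classify_flow` name are hypothetical, and the sample timestamps are made up apart from the roughly 14 second gap reported in this thread. In practice the segments would be filled in from an exported capture.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Segment:
    time: float             # seconds since the first packet of the flow
    sender: str             # "client" (web server) or "server" (SQL server)
    flags: FrozenSet[str]   # e.g. frozenset({"SYN", "ACK"})
    payload_len: int        # TCP payload bytes


def classify_flow(segments: List[Segment]) -> str:
    """Label one flow by handshake state, payload, and reset (illustrative)."""
    saw_syn = saw_synack = handshake_done = False
    for seg in segments:
        f = seg.flags
        if "RST" in f:
            if not handshake_done:
                return "reset before handshake completed"
            return (f"handshake completed, no data, "
                    f"reset by {seg.sender} at {seg.time:.1f}s")
        if handshake_done and seg.payload_len > 0:
            return "data exchanged"
        if "SYN" in f and "ACK" not in f and seg.sender == "client":
            saw_syn = True
        elif "SYN" in f and "ACK" in f and seg.sender == "server" and saw_syn:
            saw_synack = True
        elif "ACK" in f and "SYN" not in f and seg.sender == "client" and saw_synack:
            handshake_done = True
    if handshake_done:
        return "handshake completed, no data, no reset seen"
    return "handshake incomplete"


# Hypothetical flow shaped like the failing case described in this thread.
failed = [
    Segment(0.000, "client", frozenset({"SYN", "ECE", "CWR"}), 0),
    Segment(0.001, "server", frozenset({"SYN", "ACK", "ECE"}), 0),
    Segment(0.002, "client", frozenset({"ACK"}), 0),
    Segment(14.0, "server", frozenset({"RST", "ACK"}), 0),
]
print(classify_flow(failed))
```

Running the same function over every flow in a capture and grouping by label would show whether the failing flows all share the pattern of a completed handshake followed by silence.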