
Troubleshooting dropped TCP connections

asked 2020-04-25 15:06:17 +0000

We're trying to figure out why connections to this server are dropping. In the capture, both sides go silent at some point; then 2 hours later the client tries to keep the connection alive and the server resets it. I would expect that either the server should realize it didn't get an acknowledgement for the missing segments and resend them, or the client should continue sending duplicate ACKs until the server resends them. The capture I have was taken on the client, so maybe the server is retrying but the retransmissions never arrive, though there's other traffic on the connection that indicates data is still being passed. When traffic from the server stops, is it possible it hit the limit of the congestion window? Any insight into the problem is appreciated.

https://imgur.com/a/1VL43Fj


Comments

We much prefer to analyze packets instead of pixels. Are you able to share the capture file on a public file share? Have a look at TraceWrangler if you need to anonymize the file.

If possible, a trace on both the client and server side would help the analysis.

SYN-bit ( 2020-04-26 07:16:16 +0000 )

Interesting little tool. Here are the anonymized packets in that stream. http://streaming2.thedavidcorrigan.co...

We're working on getting a capture from both sides. Right now we have the server owners, and the network team trying to figure this out and I've been bouncing ideas off both of them while we all try and figure it out. It's infrequent enough to be difficult to pin down but frequent enough to annoy a lot of people running big jobs.

dcorriga ( 2020-04-26 18:02:36 +0000 )

1 Answer


answered 2020-04-26 22:24:57 +0000

SYN-bit

There are some packets missing at the beginning of the large data transfer. Every time the client receives a segment it did not expect yet, it sends a DUP-ACK. The server responds with a fast retransmission after receiving the 3rd DUP-ACK (since the trace was made on the client side, it looks as if the retransmission is sent only after more than 3 DUP-ACKs, because packets need to travel back and forth).

This mechanism works well, until the client asks for the segment starting at sequence number 29567. This segment does not get retransmitted by the server (or gets lost on the way). But since there are no RTO retransmissions, it looks like the connection with the server is lost. The client stops sending DUP-ACKs by then as no more data comes in (the server is done sending data as you can see by the non-full-MSS segment in frame 189). It is up to the server to make sure the data gets to the client and is acknowledged.
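As a rough illustration of the sender side of this mechanism, here is a minimal Python sketch that counts duplicate ACKs and reports when a fast retransmission would be triggered. The function name and the ACK value 28107 are made up for the example; only 29567 comes from the trace. Real TCP stacks apply extra conditions (for instance, the duplicate must carry no data and no window update), so treat this as a simplification.

```python
def fast_retransmits(acks, threshold=3):
    """Return the sequence numbers a simplified sender would fast-retransmit.

    `acks` is the list of ACK numbers in arrival order. An ACK that repeats
    the previous ACK number counts as a duplicate; the threshold-th duplicate
    triggers one fast retransmission of the segment starting at that number.
    """
    retransmitted = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                retransmitted.append(ack)
        else:
            # A new ACK number resets the duplicate counter.
            last_ack = ack
            dup_count = 0
    return retransmitted


# One original ACK for 29567 followed by three duplicates.
print(fast_retransmits([28107, 29567, 29567, 29567, 29567]))  # [29567]
```

If that retransmission (or the DUP-ACKs that should trigger it) is lost, only the server's retransmission timer (RTO) can recover the segment, and that timer is what appears to be missing in this trace.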

After 7200 seconds (the default keep-alive idle time on most OSes) the client's TCP stack sends out a TCP keep-alive. The server, however, does not recognize the TCP session anymore and sends a TCP RST.
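For context on where the 7200-second figure sits, the following Python sketch shows how an application can turn on keep-alives for a socket and, where the platform allows it, set the idle time. The helper name `enable_keepalive` and the 600-second value are assumptions for illustration; whether the proprietary sync client exposes anything like this is unknown. `TCP_KEEPIDLE` is not available on every platform, hence the guard.

```python
import socket


def enable_keepalive(sock, idle_seconds=7200):
    """Enable TCP keep-alive and, if supported, set the idle time."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE exists on Linux; other platforms use different names.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle_seconds=600)  # 600 is an arbitrary example value
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```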

Can you tell us what kind of devices are in between the client and the server? I see a TTL of 61 for the server packets, which (assuming an initial TTL of 64) suggests 3 routing hops between the server and the client. Are there any firewalls and/or load balancers involved? Or just routers/L3 switches? I'm particularly interested in these kinds of devices as they are session based and might intervene.
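The hop estimate from the TTL can be written down as a small calculation. This sketch assumes the sender started from one of the commonly used initial TTL values (64, 128 or 255) and picks the smallest one that is not below the observed value; the function name is hypothetical.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl):
    """Guess the initial TTL and the number of routing hops traversed."""
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial, initial - observed_ttl


print(estimate_hops(61))  # (64, 3)
```

The result is only a guess: a server configured with a non-standard initial TTL, or a middlebox that rewrites the TTL, would throw it off.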

As the problem only occurs sporadically, are you aware that you can use dumpcap instead of Wireshark to do the capturing? You can use the -b options to create a ring buffer and capture for a long time without filling your disk. Have a look at dumpcap in your Wireshark folder and use "dumpcap -h" for options.
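As a sketch of what such a ring-buffer capture could look like, the Python helper below assembles a dumpcap command line using the `-b filesize:` and `-b files:` ring-buffer options. The interface name `eth0`, the output file name and the sizes are assumptions for illustration; check "dumpcap -h" on your system for the exact units and available options.

```python
def dumpcap_ring_command(interface, out_file, file_size_kb=100_000, num_files=20):
    """Build a dumpcap argument list for a ring-buffer capture."""
    return [
        "dumpcap",
        "-i", interface,                  # capture interface (assumed name)
        "-b", f"filesize:{file_size_kb}",  # switch to the next file at this size
        "-b", f"files:{num_files}",        # keep at most this many files
        "-w", out_file,                   # base name for the capture files
    ]


cmd = dumpcap_ring_command("eth0", "tcp_drop.pcapng")
print(" ".join(cmd))
# dumpcap -i eth0 -b filesize:100000 -b files:20 -w tcp_drop.pcapng
```

The list can be passed to `subprocess.run(cmd)` or simply typed at a shell prompt; with these values the oldest file is overwritten once 20 files exist.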

The server-side trace would be the most interesting one in this case, as from the client's perspective the server stops transmitting data (which might or might not be the case).


Comments

This is part of a proprietary file-syncing protocol and we've set it to use multiple threads. The other threads can be seen sending far more DUP-ACKs throughout their streams than this stream does at the end, which is puzzling me. Being on the client side, I would've expected to see them repeat until the server hit the reset even if it lost all the RTOs from the server. Or can it only continue to send them as a response to other server packets?

I'm trying to get more information about the network topology from our network team and get them to check packet counters on every link just to see if there's any physical issues exacerbating the packet loss we're seeing. They've basically said that the bits are getting through so it shouldn't be a network issue, and while I believe their ...(more)

dcorriga ( 2020-04-26 22:47:49 +0000 )

I would've expected to see them repeat until the server hit the reset even if it lost all the RTOs from the server. Or can it only continue to send them as a response to other server packets?

DUP-ACKs are only sent in response to received packets whose segment of data does not match the expected point in the data stream. Say I have received everything up to byte 1000 of the data stream, so I expect to see a sequence number of 1001 on the next packet. Any received packet with a sequence number higher than 1001 will trigger the sending of a DUP-ACK with ACK value 1001, indicating that the receiving side is still expecting data from byte 1001 onwards.
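To make the receiver-side rule concrete, here is a small Python simulation using the byte numbers from the example above. It is a simplified sketch (no SACK, no delayed ACKs, and a hypothetical function name), meant only to show that an ACK is produced per arriving segment, so no incoming segments means no DUP-ACKs.

```python
def receiver_acks(segments, next_expected):
    """Simulate the ACKs a simplified TCP receiver sends.

    `segments` is a list of (seq, length) tuples in arrival order.
    Returns a list of (ack_number, is_duplicate) tuples, one per segment.
    """
    acks = []
    buffered = {}  # out-of-order segments: seq -> length
    for seq, length in segments:
        if seq == next_expected:
            next_expected += length
            # Pull in any buffered segments that are now contiguous.
            while next_expected in buffered:
                next_expected += buffered.pop(next_expected)
            acks.append((next_expected, False))
        else:
            if seq > next_expected:
                buffered[seq] = length  # keep data beyond the gap
            # Repeat the ACK for the byte we are still waiting for.
            acks.append((next_expected, True))
    return acks


arrivals = [(1001, 100), (1201, 100), (1301, 100), (1101, 100)]
print(receiver_acks(arrivals, next_expected=1001))
# [(1101, False), (1101, True), (1101, True), (1401, False)]

print(receiver_acks([], next_expected=1001))  # [] : silence in, silence out
```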

SYN-bit ( 2020-04-26 23:00:26 +0000 )

