Intermittent Network Slowness/Complete loss of Connectivity

asked 2021-06-24 13:08:07 +0000

updated 2021-06-24 13:51:00 +0000

cmaynard

I have a network stood up with vSphere. Over the past couple of years I have been experiencing occasional spikes in network latency, or a complete loss of connectivity between servers. The interesting part here is that it's always the same servers that seem to have the issue. (I have a script that I wrote to detect network instability between one host and many others. Grepping through months of that data, several servers have upwards of 200 detected events, while others have 0.)

I have been trying desperately to determine the source of these network issues. Recently, I wrote a script that fires at the end of the cron job that detects the network events. The script tests ssh latency between one host and many others, but before it starts the latency test, it starts a packet trace using tshark, filtering on traffic between the host being tested and the host running the script. It also filters on port 22, as the latency test uses ssh commands.

Any help on this would be greatly appreciated. I'm a software engineer, not a network guy, just have enough knowledge to get this far.

Here is the tshark output I collected when a server was experiencing network degradation:

     10 0.574704304    host -> client    TCP 74 56104 > ssh [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=443780045 TSecr=0 WS=128
     33 0.622798903    client -> host    TCP 74 ssh > 56104 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=443784521 TSecr=443780045 WS=128
     34 0.622823233    host -> client    TCP 66 56104 > ssh [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=443780093 TSecr=443784521
     35 0.622872658    host -> client    SSH 87 Client Protocol: SSH-2.0-OpenSSH_7.4\r
     36 0.671884861    client -> host    TCP 66 ssh > 56104 [ACK] Seq=1 Ack=22 Win=29056 Len=0 TSval=443784570 TSecr=443780093
    530 16.713179167    client -> host    SSHv2 527 Server: Key Exchange Init
    531 16.713203567    host -> client    TCP 66 56104 > ssh [ACK] Seq=22 Ack=462 Win=30336 Len=0 TSval=443796184 TSecr=443800624
    532 16.714881821    host -> client    SSHv2 1314 Client: Diffie-Hellman Key Exchange Init
    533 16.715060860    client -> host    TCP 66 ssh > 56104 [ACK] Seq=462 Ack=1270 Win=31872 Len=0 TSval=443800639 TSecr=443796185
    534 16.717441079    client -> host    SSHv2 474 Server: New Keys
    535 16.718751241    host -> client    SSHv2 146 Client: New Keys
    536 16.719009654    client -> host    TCP 130 [TCP segment of a reassembled PDU]
    537 16.719097445    host -> client    TCP 146 56104 > ssh [PSH, ACK] Seq=1350 Ack=934 Win=31360 Len=80 TSval=443796189 TSecr=443800643[Reassembly error, protocol TCP: New fragment past old data limits]
    538 16.724163039    client -> ...

Comments

/rant
How I love debugging a network issue with a wall of text (better than a screenshot though). Having a capture file that I can open in Wireshark and use all the wonderful facilities makes life so much easier. It's also a network traffic trace not a stack trace.
/rant

I think you have the host/client names wrong in the text; it should be the client initiating the connection, not the host.

At first glance there seems to be a long delay, approx. 16 seconds, between the client acknowledging the SSH header in frame 36 and the client starting the Key Exchange in frame 530, so this would seem to be a client issue.
There also seem to be some frames missing, e.g. the "Server: Protocol" message.
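
To spot a pause like this without reading every line, the relative timestamps in the text output can be compared programmatically. The following is only a rough sketch: it assumes the default tshark summary layout shown in the question (frame number, then relative time in seconds), and the function name `largest_gap` is made up for illustration. The sample lines are shortened copies of frames from the output above.

```python
import re

# Start of a tshark summary line: frame number, then relative time in seconds.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+\.\d+)\s")


def largest_gap(lines):
    """Return (gap_seconds, earlier_frame, later_frame) for the longest pause
    between consecutive displayed frames, or None if fewer than two parse."""
    frames = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            frames.append((int(m.group(1)), float(m.group(2))))

    best = None
    for (f1, t1), (f2, t2) in zip(frames, frames[1:]):
        gap = t2 - t1
        if best is None or gap > best[0]:
            best = (gap, f1, f2)
    return best


# Shortened lines taken from the output in the question.
sample = [
    " 34 0.622823233    host -> client    TCP 66 56104 > ssh [ACK] Seq=1 Ack=1",
    " 36 0.671884861    client -> host    TCP 66 ssh > 56104 [ACK] Seq=1 Ack=22",
    "530 16.713179167    client -> host    SSHv2 527 Server: Key Exchange Init",
    "531 16.713203567    host -> client    TCP 66 56104 > ssh [ACK] Seq=22 Ack=462",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.3f} s between frame {before} and frame {after}")
```

Note that this only compares frames that survived the display filter, so a large gap says nothing about unrelated traffic in between.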

grahamb ( 2021-06-24 13:35:34 +0000 )

@grahamb sorry.... Like I said, I'm not a network engineer, I get things like host and client mixed up all the time... I thought it was the host initiating the connection. I wasn't sure if that time was in ms or seconds, but it sounds like that's actually seconds. Also the missing frames, that's really interesting. I removed IPs because I had to, but I didn't remove any whole lines. I wonder why they're missing... I have packet captures from a "good" connection. Let me see if I can upload both files for you, if that's easier.

Do you know what "New fragment past old data limits" might mean?

cdr113254 ( 2021-06-24 13:42:08 +0000 )

How do I upload a file?

cdr113254 ( 2021-06-24 13:46:37 +0000 )

Save the captures to a public share, e.g. Google Drive, Dropbox etc., and then post a link to them back here.

The TCP message is from the Wireshark reassembly code (anything in square brackets is inferred or synthesised by Wireshark) and indicates that it failed because the TCP segments don't appear to "add up". Access to the capture file will allow better analysis as we can then look at sequence numbers.
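
As a rough illustration of what "adding up" means: within one direction of a TCP stream, each segment's sequence number should equal the previous segment's sequence number plus its payload length. The sketch below checks that rule over a hand-built list; `check_continuity` is a hypothetical helper, not anything from Wireshark, and the Seq/Len values for frames 35, 532 and 535 are not printed in the output above but inferred from the ACK numbers and frame sizes, so treat them as assumptions.

```python
def check_continuity(segments):
    """segments: (frame, seq, length) tuples for ONE direction, in capture order.

    Returns a list of (frame, expected_seq, actual_seq) for every segment whose
    sequence number is not the next expected byte (a gap or an overlap).
    """
    problems = []
    expected = None
    for frame, seq, length in segments:
        if expected is not None and seq != expected:
            problems.append((frame, expected, seq))
        end = seq + length
        # Keep the highest byte seen so a retransmission does not move us back.
        expected = end if expected is None else max(expected, end)
    return problems


# host -> client direction, relative sequence numbers.
# Frames 35, 532 and 535: Seq/Len inferred from the peer's Ack values
# (Ack=22, Ack=1270) and the frame sizes; frame 537 is printed explicitly.
host_to_client = [
    (35, 1, 21),
    (532, 22, 1248),
    (535, 1270, 80),
    (537, 1350, 80),
]

print(check_continuity(host_to_client))

# A made-up stream with a hole: bytes 22..29 never show up.
print(check_continuity([(1, 1, 21), (2, 30, 10)]))
```

A check like this only covers the packets that were captured; the capture file is still needed to see why the reassembly code complained.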

grahamb ( 2021-06-24 14:13:52 +0000 )

Ok, I'll see what I can do about getting it hosted somewhere. Is there a difference between a capture file and what I have here? Do you mean like using the -w option on tshark and using that file? Currently, I'm using a display filter, which, from what I understand, cannot be used with the -w option, so instead I'm just redirecting STDOUT to a file with > and killing the tshark process after my latency test completes.

cdr113254 ( 2021-06-24 14:30:25 +0000 )
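
One way to get a real capture file while still limiting traffic to the two machines and port 22 is to use a capture filter (tshark's `-f` option, BPF syntax) together with `-w`, rather than a display filter. The sketch below is a minimal illustration of that idea, not the script from the question: the interface name, IP addresses, output path, and the helper names `build_tshark_cmd` and `capture_during` are all assumptions.

```python
import subprocess


def build_tshark_cmd(interface, local_ip, remote_ip, out_file, port=22):
    """Build a tshark command that writes matching packets to a capture file.

    -f takes a capture (BPF) filter, which is applied while capturing and can
    be combined with -w.
    """
    capture_filter = f"host {local_ip} and host {remote_ip} and tcp port {port}"
    return ["tshark", "-i", interface, "-f", capture_filter, "-w", out_file]


def capture_during(cmd, test_fn):
    """Run the capture command while test_fn executes, then stop the capture."""
    proc = subprocess.Popen(cmd)
    try:
        return test_fn()
    finally:
        proc.terminate()  # ask tshark to stop and close the file
        proc.wait()


# Example values only: "eth0", the addresses, and the path are placeholders.
cmd = build_tshark_cmd("eth0", "192.0.2.10", "192.0.2.20", "/tmp/ssh_test.pcapng")
print(" ".join(cmd))
```

Calling `capture_during(cmd, run_latency_test)`, where `run_latency_test` stands in for the existing latency check, would leave a file that can be opened directly in Wireshark and shared.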