Intermittent Network Slowness/Complete loss of Connectivity
I have a network stood up with vSphere. Over the past couple of years I have been experiencing occasional spikes in network latency, or a complete loss of connectivity, between servers. The interesting part is that it's always the same servers that seem to have the issue. For example, I wrote a script to detect network instability between one host and many others; grepping through months of that data, several servers have upwards of 200 detected events, while others have 0.
I have been trying desperately to determine the source of these network issues. Recently, I wrote a script that fires at the end of the cron job that detects the network events. The script tests ssh latency between one host and many others, but before it starts the latency test, it starts a packet trace using tshark, filtering on traffic between the host I'm testing and the host I'm running the script from. It also filters on port 22, since the latency test uses ssh commands.
Any help on this would be greatly appreciated. I'm a software engineer, not a network guy, just have enough knowledge to get this far.
Here is the tshark output I collected when a server was experiencing network degradation:
10 0.574704304 host -> client TCP 74 56104 > ssh [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=443780045 TSecr=0 WS=128
33 0.622798903 client -> host TCP 74 ssh > 56104 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=443784521 TSecr=443780045 WS=128
34 0.622823233 host -> client TCP 66 56104 > ssh [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=443780093 TSecr=443784521
35 0.622872658 host -> client SSH 87 Client Protocol: SSH-2.0-OpenSSH_7.4\r
36 0.671884861 client -> host TCP 66 ssh > 56104 [ACK] Seq=1 Ack=22 Win=29056 Len=0 TSval=443784570 TSecr=443780093
530 16.713179167 client -> host SSHv2 527 Server: Key Exchange Init
531 16.713203567 host -> client TCP 66 56104 > ssh [ACK] Seq=22 Ack=462 Win=30336 Len=0 TSval=443796184 TSecr=443800624
532 16.714881821 host -> client SSHv2 1314 Client: Diffie-Hellman Key Exchange Init
533 16.715060860 client -> host TCP 66 ssh > 56104 [ACK] Seq=462 Ack=1270 Win=31872 Len=0 TSval=443800639 TSecr=443796185
534 16.717441079 client -> host SSHv2 474 Server: New Keys
535 16.718751241 host -> client SSHv2 146 Client: New Keys
536 16.719009654 client -> host TCP 130 [TCP segment of a reassembled PDU]
537 16.719097445 host -> client TCP 146 56104 > ssh [PSH, ACK] Seq=1350 Ack=934 Win=31360 Len=80 TSval=443796189 TSecr=443800643[Reassembly error, protocol TCP: New fragment past old data limits]
538 16.724163039 client -> ...
/rant
How I love debugging a network issue with a wall of text (better than a screenshot though). Having a capture file that I can open in Wireshark and use all the wonderful facilities makes life so much easier. It's also a network traffic trace not a stack trace.
/rant
I think you have the host\client names wrong in the text, it should be the client initiating the connection, not the host.
At first glance there seems to be a long delay, approx. 16 seconds, between the client acknowledging the SSH header in frame 36 and then the client starting the Key Exchange in frame 530, so this would seem to be a client issue.
There also seem to be some frames missing, e.g. the "Server: Protocol" message.
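To make the delay easier to spot in a long text dump, here is a minimal Python sketch that scans tshark summary lines and reports the largest pause between consecutive frames. It assumes the default column layout shown in the question (frame number first, then relative time in seconds); the function name largest_gap is just illustrative, and the sample lines are copied from the trace above.

```python
import re

# Start of a default tshark summary line: frame number, then relative time (s).
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+\.\d+)\s")

def largest_gap(lines):
    """Return (gap_seconds, frame_before, frame_after) for the longest pause,
    or None if fewer than two parsable lines were found."""
    frames = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            frames.append((int(m.group(1)), float(m.group(2))))
    best = None
    for (f1, t1), (f2, t2) in zip(frames, frames[1:]):
        gap = t2 - t1
        if best is None or gap > best[0]:
            best = (gap, f1, f2)
    return best

# Four lines taken from the trace posted in the question.
sample = [
    "34 0.622823233 host -> client TCP 66 56104 > ssh [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=443780093 TSecr=443784521",
    "36 0.671884861 client -> host TCP 66 ssh > 56104 [ACK] Seq=1 Ack=22 Win=29056 Len=0 TSval=443784570 TSecr=443780093",
    "530 16.713179167 client -> host SSHv2 527 Server: Key Exchange Init",
    "531 16.713203567 host -> client TCP 66 56104 > ssh [ACK] Seq=22 Ack=462 Win=30336 Len=0 TSval=443796184 TSecr=443800624",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.3f} s between frames {before} and {after}")
# prints: 16.041 s between frames 36 and 530
```

Running it over a whole saved output file (e.g. passing an open file object as lines) would give the same kind of answer for every detected event, which might help compare "good" and "bad" connections.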
@grahamb sorry.... Like I said, I'm not a network engineer, I get things like host and client mixed up all the time... I thought it was host initing the connection. I wasn't sure if that time was in ms or seconds, but it sounds like that's actually seconds. Also the missing frames, that's really interesting. I removed IPs because I had to, but I didn't touch any total lines. I wonder why it's missing... I have packet captures from a "good" connection. Let me see if I can upload both files for you, if that's easier.
Do you know what "New fragment past old data limits" might mean?
How do I upload a file?
Save the captures to a public share, e.g. Google Drive, Dropbox etc., and then post a link to them back here.
The TCP message is from the Wireshark reassembly code (anything in square brackets is inferred or synthesised by Wireshark) and indicates that it failed because the TCP segments don't appear to "add up". Access to the capture file will allow better analysis as we can then look at sequence numbers.
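As a rough illustration of what "adding up" means here: within one direction of a TCP stream, each segment's sequence number should normally equal the previous segment's sequence number plus its payload length. The Python sketch below checks that for a list of segments. It is a simplification and not Wireshark's actual reassembly logic: it assumes relative sequence numbers, ignores wraparound and the sequence number consumed by SYN/FIN, and would also flag ordinary retransmissions. The function name and the example values are hypothetical, not taken from the capture.

```python
def find_discontinuities(segments):
    """segments: (frame_no, seq, length) tuples for ONE direction of a stream,
    in capture order. Returns (frame_no, expected_seq, actual_seq) for every
    segment that does not start where the previous one ended."""
    problems = []
    expected = None
    for frame_no, seq, length in segments:
        if expected is not None and seq != expected:
            problems.append((frame_no, expected, seq))
        # The next segment should start right after this one's payload.
        expected = seq + length
    return problems

# Hypothetical values: frame 3 starts at 200 although the data so far ends at 122.
example = [(1, 1, 21), (2, 22, 100), (3, 200, 50)]
print(find_discontinuities(example))
# prints: [(3, 122, 200)]
```

A gap like the one in frame 3 of the example usually means segments were lost or never captured, which is why having the real capture file matters.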
Ok, I'll see what I can do about getting it hosted somewhere. Is there a difference between a capture file and what I have here? Do you mean like using the -w option on tshark and using that file? Currently I'm using a display filter, which, from what I understand, cannot be used with the -w option, so instead I just redirect STDOUT to a file and kill the tshark process after my latency test completes.
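One possible way around the display filter limitation is to use a capture filter (tshark's -f option, libpcap syntax) instead, since capture filters can be combined with -w. Below is a minimal Python sketch that only builds such a command line; the interface name, peer address, and output file name are placeholders, and build_capture_cmd is a made-up helper rather than anything from the original script.

```python
import shlex

def build_capture_cmd(interface, peer_ip, out_file, port=22):
    """Build a tshark command that writes raw packets to out_file.
    -f takes a capture filter; "host X" matches traffic to or from X,
    which is enough when capturing on the machine that runs the test."""
    capture_filter = f"host {peer_ip} and tcp port {port}"
    return ["tshark", "-i", interface, "-f", capture_filter, "-w", out_file]

# Placeholder values: eth0, 192.0.2.10 and the file name are assumptions.
cmd = build_capture_cmd("eth0", "192.0.2.10", "ssh_latency.pcapng")
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.Popen before the latency test and the process terminated afterwards, much like the existing script does, but the output would then be a capture file that opens directly in Wireshark rather than a text summary.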