
Random Application Slowdown - Many TCP Retransmissions

asked 2021-05-13 22:31:22 +0000 by KBolt

updated 2021-05-14 15:05:36 +0000 by Eddi

Hello all. I've been asked to look into an application performance issue. Certain procedures in the app require SMB operations, but there are instances where the app freezes for about 20 seconds. The Windows PC itself isn't having a resource issue, and neither is the server, so I thought I'd check the network, and that's where I see a lot of TCP retransmissions and other packets colored black in Wireshark.

I've sanitized and uploaded a capture. I'm hoping someone can help me understand the many TCP Dup ACKs and retransmissions. My switches don't show any errors on their interfaces or any signs of congestion that would point to packet loss, so I can't see why there would be retransmits. I'd appreciate any help offered.

https://www.cloudshark.org/captures/7...

Dropbox link: https://www.dropbox.com/s/56m4rx4kpx4...

I didn't realize Cloudshark was so restricted - apologies.


Comments

Can you make the captures public to look at without a login?

Chuckc ( 2021-05-14 00:20:20 +0000 )

Hi @Chuckc, I've added a Dropbox link to the pcapng file. Thanks for your quick reply.

KBolt ( 2021-05-14 12:32:09 +0000 )

@KBolt, you have to set the option in CloudShark to make the capture publicly visible.

grahamb ( 2021-05-14 13:06:57 +0000 )

Since the capture was anonymized with TraceWrangler, can you confirm:

- the IP addresses of the client and server
- where the capture was made
- the network architecture (inbound and outbound traffic are from different MAC addresses)

Chuckc ( 2021-05-14 13:29:19 +0000 )

Thank you @grahamb, I've made it public.

@Chuckc - 192.168.0.1 is the client, 10.10.10.1 is the server. The capture was made at the server. The different pairs of MAC addresses (2, I believe) are because we use HSRP on the server side, so 00:00:0c:77:4f:30 is really 00:00:0c:9f:f0:02: server -> client traffic shows the HSRP virtual MAC, while client -> server traffic shows the actual switch interface MAC and the server MAC.

The architecture (simply put) is: client - access switch - (port-channel) - aggregation switch stack in vPC - (port-channel) - core switch stack in vPC - (HSRP) - server farm

KBolt ( 2021-05-14 15:09:58 +0000 )

2 Answers


answered 2021-05-14 15:01:48 +0000 by Eddi

Hello KBolt

The very first packets of your trace look like a capture taken from a SPAN port. Depending on the configuration, individual packets can be sent to the SPAN port twice: once when the packet arrives at the switch, and again when the packet is delivered to the destination port.

This becomes immediately clear when you look at the three-way handshake: the SYN/ACK from the server was recorded twice with a delta time of 150 microseconds. All other packets from 10.10.10.1 also show up twice. This is usually caused by the SPAN port definition.

Clearly, Wireshark is confused by the duplicate packets. You might want to use the editcap utility, which is part of the Wireshark distribution: run editcap -d to remove these duplicate packets.
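For example (a sketch; the file names are placeholders, not from this thread):

    editcap -d capture.pcapng capture_dedup.pcapng

The -d option compares each packet against the previous 5 packets and drops exact duplicates; editcap -D <window> widens that comparison window if the duplicates are spaced further apart.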

Since you are using SMB2 (or SMB3), I suggest trying Wireshark's excellent Service Response Time feature: Statistics -> Service Response Time -> SMB2

In most cases, Read and Write operations should complete within a few milliseconds. Anything longer is worth investigating.

You can locate long response times by using the Find Packet feature (Ctrl-F): the display filter smb2.time > 0.1 will take you to the next SMB2 transaction that took longer than 0.1 seconds.
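The same filter also works in tshark if you prefer the command line (a sketch; capture_dedup.pcapng is the assumed output of the editcap step above):

    tshark -r capture_dedup.pcapng -Y "smb2.time > 0.1" -T fields -e frame.number -e smb2.time

This prints the frame number and response time of every SMB2 transaction slower than 0.1 seconds.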

Good luck

Eddi


Comments

Impressive analysis, thank you for this feedback Eddi. I'll look into that over the weekend. The only part I wonder about is the SPAN port: there's nothing like that here. However, this server is a Hyper-V VM on a cluster. Could any of that possibly cause the retransmissions?

And based on your feedback, perhaps there's nothing problematic in the capture itself, because all those retransmits could be due to Wireshark seeing the packets twice, even those large blocks of retransmits at around packet 8394. Is that possible?

KBolt ( 2021-05-14 16:07:05 +0000 )

After speaking with the server admins, it seems there may be a load balancer in front of the server, so that might be the cause of the duplicates. I'm waiting on full confirmation of that. I've removed the duplicates, and since I saw some negative time deltas, I used reordercap to handle those.
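For reference, the reordercap step looks like this (assumed file names again):

    reordercap capture_dedup.pcapng capture_sorted.pcapng

reordercap rewrites the file with the packets sorted by timestamp, which clears up the negative time deltas.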

I found a single SMB2 Read Response that took 0.4389202 seconds. It came after a large block of TCP retransmits which (since the duplicates have largely been removed) I assume are due to packet loss. The retransmits seem to come from both endpoints, so now I'm wondering how I'd confirm whether this was due to application or network delays.
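One way to separate the two (a sketch, assuming the sorted file from the step above): compare the pure network round trip, tcp.analysis.ack_rtt, with the SMB2 service response time, smb2.time. If the ACK RTT stays low while smb2.time spikes, the delay is likely on the application side; if both rise together, suspect the network.

    tshark -r capture_sorted.pcapng -Y "tcp.analysis.ack_rtt > 0.05 || smb2.time > 0.1" -T fields -e frame.number -e tcp.analysis.ack_rtt -e smb2.time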

I use the SolarWinds Quality of Experience tool, which tries to show the source of latency/packet loss by comparing the TCP 3-way handshake time to the time taken to see the first byte from the ...(more)

KBolt ( 2021-05-15 17:30:17 +0000 )

answered 2021-05-14 19:14:57 +0000 by BigFatCat

updated 2021-05-14 19:17:02 +0000

You can remove most of the duplicate packets with editcap's -d or -D option. Some duplicates remain afterwards, but not enough to make analysis impossible. You can verify whether a packet is a duplicate by checking the IP identification numbers: if two packets carry the same IP ID, one is a duplicate of the other. There is packet loss in both directions, more in the 10.10.10.1 to 192.168.0.1 direction.

An example with this filter (with TCP relative sequence numbers turned off): tcp.seq==2271806588 || tcp.ack==2271806588 || tcp.options.sack_re==2271806588

Series of events:

  1. 10.10.10.1 sent the TCP sequence 2271806588

  2. 192.168.0.1 complains, via the TCP SACK option, that it didn't receive 2271806588

  3. 192.168.0.1 starts sending duplicate ACKs

  4. 10.10.10.1 resends the TCP sequence 2271806588

  5. 192.168.0.1 stops complaining because it received 2271806588
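These events correspond to Wireshark's built-in TCP analysis flags, so you can bring them all up at once with a display filter (standard Wireshark fields, offered here as a suggestion):

    tcp.analysis.duplicate_ack || tcp.analysis.retransmission || tcp.analysis.fast_retransmission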


Comments

I did a little out-of-the-box analysis. There are 17,516 packets after removing all packets with the same IP identification. I am focusing on the IP identification because TCP doesn't tell the network layer which IP identification to use, and the network layer does not retransmit packets, so a repeated IP ID means the frame was captured twice rather than retransmitted by TCP.
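One way to find the repeated IP IDs from the command line (a sketch; the file name is an assumption):

    tshark -r capture.pcapng -T fields -e ip.src -e ip.id -e tcp.seq | sort | uniq -d

Each line that uniq -d prints corresponds to a frame that appears more than once in the capture.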

BigFatCat ( 2021-05-14 20:27:01 +0000 )

Thank you for this. I guess I'll need to revisit the server itself to identify the cause of the packet loss, since multiple clients have the same issue and the network interfaces along the path aren't reporting errors.

A question about TCP Fast Retransmission: isn't that supposed to kick in well before the client has to send all those duplicate ACKs you mentioned in event 3? That's over 300 retransmits. I'd assume they happen due to packet loss, but since the pcap was taken at the server, these retransmits definitely arrived at the server's NIC. Do I need to consider that the application hosted on the server was unresponsive and thus unable to reply to the retransmits?

I'll be capturing at both sides in the coming week.

KBolt ( 2021-05-15 16:33:52 +0000 )
