
What is slowing down the restore?

asked 2025-03-01 21:57:34 +0000

net_tech

updated 2025-03-01 21:58:10 +0000

Hi,

Here are the first 15 seconds of a capture of a 500 GB SQL database restore from a physical Rubrik appliance (Ubuntu): https://drive.google.com/file/d/1HYy0...

Rubrik is at 192.168.243.229 and the SQL Server VM is at 172.30.138.132 (the capture has been anonymized with TraceWrangler). Both systems are on the same physical network, 1 hop away. The RTT measured on the 3-way handshake is 0.000083 s (83 µs).

This is a 10 Gbps network; however, the restore only runs at around 2-3 Gbps. There are a few tcp.analysis.flags, but they could be a result of the SPAN port capture at the virtualization layer (I don't have a TAP). I am only seeing [TCP ACKed unseen segment] and [TCP Previous segment not captured].

What in this capture indicates why the file restore is slow? Is it the source? Is it the destination OS, or a SQL Server that is not fine-tuned?


Thank you


1 Answer


answered 2025-03-02 08:50:23 +0000

SYN-bit

First, the low-hanging fruit: is the window size too small, or not scaled enough? To calculate this, you need the window size in use and the round-trip time. The window size on the receiving end is ~2 MB, while the round-trip time varies between 0.144 and 7.5 ms. That variation is a result of the way the capture was made and is not real. Even so, with these numbers a single TCP stream could reach about 2 Gbit/s (at 7.5 ms RTT) and well over 10 Gbit/s (over 100 Gbit/s at 0.144 ms RTT). As there are 16 parallel streams, even the worst case (16 × ~2.1 Gbit/s ≈ 34 Gbit/s) could easily fill the 10 Gbit/s pipe. So window sizes and scaling are not the issue.
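To make the window arithmetic concrete, here is a small Python sketch of the usual bound for a window-limited stream (throughput ≤ window ÷ RTT). It assumes "~2 MB" means 2,000,000 bytes and treats the 16 streams as independent; it deliberately ignores congestion control.

```python
# Window-limited TCP throughput: at most one receive window of data per round trip.
# Assumption: "~2 MB" window = 2 * 10**6 bytes; streams are independent.

def window_limited_gbps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on one stream's throughput in Gbit/s when only the window limits it."""
    return window_bytes * 8 / rtt_s / 1e9

WINDOW_BYTES = 2 * 10**6   # ~2 MB receive window seen in the capture
STREAMS = 16               # parallel TCP streams used by the restore
LINK_GBPS = 10.0           # nominal link speed

for rtt_ms in (0.144, 7.5):
    per_stream = window_limited_gbps(WINDOW_BYTES, rtt_ms / 1000)
    aggregate = per_stream * STREAMS
    print(f"RTT {rtt_ms} ms: {per_stream:.1f} Gbit/s per stream, "
          f"{aggregate:.1f} Gbit/s for {STREAMS} streams, "
          f"window-limited below link speed: {aggregate < LINK_GBPS}")
```

With these inputs the pessimistic 7.5 ms case still allows roughly 2.1 Gbit/s per stream and about 34 Gbit/s across all 16 streams, so the window alone would not hold the restore at 2-3 Gbit/s.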

Then, looking at the TCP throughput of a couple of those sessions, you can see the bandwidth rising, then dropping, then rising again, but not as high as it was before. This is typical of a congestion window that is reacting to network conditions. Looking at the packets, there are cases of TCP fast retransmissions, indicating there might actually be some packet loss on this connection. This pcap makes it hard to tell, as there are indeed also more cases of packets simply missing from the capture.
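The rise/drop/rise pattern is what additive-increase/multiplicative-decrease (AIMD) congestion control produces. The toy model below is Reno-style (halve the window on a fast retransmit, add one segment per RTT otherwise); that choice is an assumption for illustration, since modern stacks such as Linux CUBIC grow the window differently, and the loss rounds passed in are hypothetical inputs, not values measured from this capture.

```python
# Toy Reno-style AIMD model of a TCP congestion window, in segments per RTT round.
# Assumption: one loss event per listed round, detected by fast retransmit.

def aimd_trace(rounds: int, loss_rounds: set[int], start_cwnd: int = 10,
               ssthresh: int = 64) -> list[int]:
    """Return the congestion window (in segments) at the start of each RTT round."""
    cwnd = start_cwnd
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            # Multiplicative decrease: halve the window after a fast retransmit.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd *= 2   # slow start: double every round trip
        else:
            cwnd += 1   # congestion avoidance: +1 segment every round trip
    return trace
```

For example, `aimd_trace(40, {12, 25})` peaks at 89 segments, drops to 44, and only climbs back to 56 before the second loss halves it again, which is the "rising, but not as high as before" shape seen in the throughput graphs.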

Without a clean, 100% accurate packet capture, it is impossible to tell from the packets what the real cause is. Look at all the packets that have a 0 microsecond delta and all the missing packets. So if you need a proper analysis, get a 10 Gbit/s TAP and capture solution.
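Both capture-quality symptoms mentioned here can be counted per stream. The sketch below is a hypothetical helper, not a Wireshark feature: it assumes one direction of one TCP stream has been exported as (timestamp, relative sequence number, payload length) rows, for example with tshark's `-T fields` output of `frame.time_epoch`, `tcp.seq` and `tcp.len`. It ignores sequence-number wraparound, and a gap by itself cannot tell loss on the wire apart from a packet the capture simply dropped.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    time: float   # capture timestamp in seconds
    seq: int      # relative TCP sequence number
    length: int   # TCP payload length in bytes

def capture_quality(segments: list[Segment]) -> dict[str, int]:
    """Count identical consecutive timestamps and sequence gaps in one stream direction."""
    zero_deltas = 0
    gaps = 0
    prev_time = None
    expected_seq = None   # highest sequence number seen so far (next expected byte)
    for seg in segments:
        if prev_time is not None and seg.time == prev_time:
            zero_deltas += 1   # timestamp identical to the previous packet
        prev_time = seg.time
        if expected_seq is not None and seg.seq > expected_seq:
            gaps += 1          # bytes before this segment never appeared in the capture
        end = seg.seq + seg.length
        if expected_seq is None or end > expected_seq:
            expected_seq = end  # retransmissions (seq below expected) do not move it back
    return {"zero_deltas": zero_deltas, "seq_gaps": gaps}
```

Many zero deltas point at batched timestamping in the capture path rather than at the network, which is one reason a hardware TAP gives a more trustworthy picture.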

As that might be difficult to arrange, start by looking at the port statistics of both servers and both switch ports to see if there are any errors and/or discards. I suspect a bit of packet loss is shrinking the congestion window and limiting the throughput. If there are fiber connections involved, make sure they are cleaned properly, as dirty connectors are a very common source of a small number of errors, which is enough to lower the congestion window.
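On the Linux side (such as the Ubuntu-based appliance, if you have shell access), interface error and discard counters are exposed under `/sys/class/net/<iface>/statistics/`. The sketch below reads a few of them so you can compare samples taken before and after a restore; the interface name is hypothetical, not every driver exposes every counter, and switch ports and the Windows VM need their own tools.

```python
from pathlib import Path

# Error/discard counters worth checking; availability depends on the NIC driver.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped", "rx_crc_errors")

def read_counters(iface: str, base: str = "/sys/class/net") -> dict[str, int]:
    """Read error/discard counters for one Linux interface from sysfs."""
    stats_dir = Path(base) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.exists():   # skip counters this driver does not provide
            values[name] = int(path.read_text().strip())
    return values

def counter_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return only the counters that increased between two samples."""
    return {k: after[k] - before[k] for k in after if k in before and after[k] > before[k]}
```

Usage: take `before = read_counters("eth0")` (replace `eth0` with the real interface), run the restore, take `after = read_counters("eth0")`, and print `counter_deltas(before, after)`. Any steadily growing CRC or error counter during the restore supports the packet-loss theory.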

