what is slowing down the restore?

tcp
copy

asked 2025-03-01 21:57:34 +0000

net_tech
145 ●36 ●44 ●56

updated 2025-03-01 21:58:10 +0000

Hi,

Here are the first 15 seconds of a 500 GB SQL database restore from a physical Rubrik appliance (Ubuntu): https://drive.google.com/file/d/1HYy0...

Rubrik is at 192.168.243.229, the SQL server VM is at 172.30.138.132. (The capture has been "wranglered", i.e. anonymized.) Both systems are on the same physical network, 1 hop apart. The RTT to ACK the 3-way handshake is 0.000083000 s (83 µs).

This is a 10 Gbps network, but the restore only runs at around 2-3 Gbps. A small number of packets match tcp.analysis.flags, but those could be an artifact of capturing on a SPAN port at the virtualization layer; I don't have a tap. The only flags I see are [TCP ACKed unseen segment] and [TCP Previous segment not captured].

What in this capture indicates why the file restore is slow? Is it the source? Is it the destination OS, or a SQL Server that has not been fine-tuned?


Thank you


2 Answers


answered 2025-03-19 20:47:01 +0000

Christian_R

2059 ●11 ●74 ●51 http://crnetpackets.com

Hello, it is a little hard to read your capture because the packets are cut off in the middle of the TCP header, so the SACK options are not visible. At around 0.15 s there is a gap in the capture, and we also see several other "packet not captured" warnings. These can be an indication of packet loss, which can reduce the throughput in some cases. But I think the trace is probably not good enough to analyze this reliably.


answered 2025-03-02 08:50:23 +0000

SYN-bit

18600 ●9 ●361 ●255 https://SYN-b.it

First the low-hanging fruit: is the window size too small, or not scaled enough? To calculate this, you need the window size in use and the round-trip time. The window size on the receiving end is ~2 MB, but the round-trip time varies between 0.144 and 7.5 ms. This is a result of the way the capture was made; that variation is not real. Even so, with these numbers the bandwidth that could be reached per TCP stream is about 2 Gbit/s (7.5 ms RTT) to 40 Gbit/s or more (0.144 ms RTT). As there are 16 parallel streams, you could easily fill the 10 Gbit/s pipe. So window sizes and scaling are not the issue.
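The window check above is just "at most one receive window per round trip". A minimal sketch of that arithmetic, assuming the "~2 MB" window is exactly 2 MiB (the names below are made up for illustration; the short-RTT result in particular depends heavily on the exact window value you plug in):

```python
def window_limited_gbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP stream: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e9


WINDOW_BYTES = 2 * 1024 * 1024  # assumption: "~2 MB" taken as 2 MiB
LINK_GBPS = 10                  # link speed from the question
STREAMS = 16                    # parallel streams seen in the capture

for rtt_ms in (7.5, 0.144):     # the two RTT extremes quoted above
    per_stream = window_limited_gbps(WINDOW_BYTES, rtt_ms / 1000)
    total = per_stream * STREAMS
    verdict = "above" if total > LINK_GBPS else "below"
    print(f"RTT {rtt_ms} ms: {per_stream:.1f} Gbit/s per stream, "
          f"{total:.1f} Gbit/s for {STREAMS} streams ({verdict} link speed)")
```

Even in the worst case (7.5 ms), 16 streams together are not window-limited below 10 Gbit/s, which is the point of the paragraph above.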

Then, looking at the TCP throughput of a couple of those sessions, you can see the bandwidth rising, then dropping, and then rising again, but not as high as before. This is typical of a congestion window that is being influenced by network conditions. Looking at the packets, there are cases of TCP fast retransmission, indicating there might actually be some packet loss on this connection. This pcap makes it hard to tell, as there are indeed more cases of packets simply missing from the capture.
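To get a feel for how little loss it takes to hold a stream back, here is a rough sketch using the well-known Mathis et al. approximation, rate ≤ (MSS / RTT) · C / √p with C = √(3/2). This is not something measured from this capture: the MSS, the RTTs and the loss rates below are assumptions for illustration, and the formula models Reno-style congestion avoidance, so stacks using other algorithms will behave differently.

```python
import math


def mathis_limit_gbps(mss_bytes: int, rtt_seconds: float, loss_rate: float) -> float:
    """Mathis approximation of the loss-limited rate of one TCP stream, in Gbit/s."""
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_seconds) * c / math.sqrt(loss_rate) / 1e9


MSS = 1460  # assumption: standard Ethernet MSS, no jumbo frames

for rtt_ms in (0.144, 1.0):          # assumption: illustrative RTTs only
    for loss in (1e-6, 1e-5, 1e-4):  # assumption: hypothetical loss rates
        rate = mathis_limit_gbps(MSS, rtt_ms / 1000, loss)
        print(f"RTT {rtt_ms} ms, loss {loss:.0e}: ~{rate:.2f} Gbit/s per stream")
```

The absolute numbers should not be trusted, but the shape is the useful part: the ceiling falls with the square root of the loss rate and linearly with RTT.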

Without a clean, 100% accurate packet capture, it is impossible to tell from the packets what the cause really is. Look at all the packets that have a 0 microsecond delta and all the missing packets. So if you need a proper analysis, get a 10 Gbit/s TAP and capture solution.
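One way to put a number on the "0 microsecond delta" symptom is to count how many packets carry exactly the same timestamp as the packet before them. A minimal sketch, assuming you have exported one timestamp per packet as text (for example with tshark's `-T fields -e frame.time_relative`); the function name and the sample values are made up:

```python
from decimal import Decimal


def zero_delta_fraction(timestamps):
    """Fraction of packets whose timestamp equals the previous packet's."""
    ts = [Decimal(t) for t in timestamps]  # Decimal avoids float rounding
    if len(ts) < 2:
        return 0.0
    zero = sum(1 for prev, cur in zip(ts, ts[1:]) if cur == prev)
    return zero / (len(ts) - 1)


# Made-up sample, not taken from the capture in the question.
sample = ["0.000000", "0.000083", "0.000083", "0.000083", "0.000150"]
print(f"{zero_delta_fraction(sample):.0%} of the deltas are zero")
```

A high fraction suggests the capture point is batching packets, so the timing in the trace cannot be relied on.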

As getting one might be a problem, start by looking at the port statistics of both servers and both switch ports to see whether there are any errors and/or discards. I suspect a bit of packet loss is pulling down the congestion window and limiting the throughput. If fiber connections are involved, make sure they are cleaned properly, as dirty connectors are a very common source of a small number of errors that keep the congestion window low.
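For the Linux side (the Rubrik appliance is described as Ubuntu), the per-interface counters can be read from sysfs. A minimal sketch, assuming the standard `/sys/class/net/<interface>/statistics/` files are available; the helper names are made up, and the switch ports and the SQL server VM need their own tools:

```python
from pathlib import Path

# Standard Linux per-interface counter files under .../statistics/
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped", "rx_crc_errors")


def read_counters(iface: str, root: str = "/sys/class/net") -> dict:
    """Return the error/discard counters for one interface (missing files are skipped)."""
    stats_dir = Path(root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def nonzero(counters: dict) -> dict:
    """Keep only the counters that have actually counted something."""
    return {name: value for name, value in counters.items() if value > 0}


if __name__ == "__main__":
    root = Path("/sys/class/net")
    interfaces = sorted(p.name for p in root.iterdir()) if root.is_dir() else []
    for iface in interfaces:
        print(iface, nonzero(read_counters(iface)) or "no errors or drops counted")
```

Run it before and after a restore; counters that increase during the transfer are the ones worth chasing.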

