Trying to understand why RTT changes during a call

asked 2020-07-10 09:53:45 +0000

labreu gravatar image

Hello guys.

I'm trying to use Wireshark to debug a web service call that is taking more than one second. Looking at the package list, it's possible to see that the RTT increases during the TCP conversation required for getting the ACK of one of the packages. Since I'm new, I can't upload images, so I'm dumping the 2 entries here:

116 3.854565    10.50.100.28    10.50.110.100   TLSv1.2 470 Application Data
119 4.907522    10.50.110.100   10.50.100.28    TCP 1514    443 → 34588 [ACK] Seq=5359 Ack=1048 Win=64384 Len=1448 TSval=1807772277 TSecr=1446266633 [TCP segment of a reassembled PDU]

I'm not really a network admin and my TCP/IP knowledge is limited, so I was wondering if anyone could give me any hints on why the RTT goes up from ms to more than 1 second. Any clues on what's going on here? Btw, these traces were capture between 2 Ubuntus which are on the same network.

Thanks

edit retag flag offensive close merge delete

Comments

Maybe the ACK was not captured? Did you make traces on both sides? Are you able to share the traces on a public file sharing service OneDrive, GoogleDrive, Dropbox, etc?

SYN-bit gravatar imageSYN-bit ( 2020-07-10 10:29:08 +0000 )edit

Hello. Thanks for the quick reply. The logs I'm sending were captured by our firewall (we've tried putting the servers on different networks, but there were no changes on the RTT). As I said, the problems happen between 10.50.100.28 and 10.50.110.100 (the first calls a web service on the second). Please let me know if you have any problems downloading the file: https://www.dropbox.com/s/4scvbh05owg...

labreu gravatar imagelabreu ( 2020-07-10 11:05:48 +0000 )edit

Capture downloads fine. Not a TCP expert, but it seems as though the second TLS application data blob out of the three is held up somehow. Is this Nagle? (i.e. is TCP_NODELAY set on the server socket)

Jaap gravatar imageJaap ( 2020-07-10 16:58:37 +0000 )edit

Looks like the server/application causes the delay. When you filter to the second of the two streams in the capture (tcp.stream == 2), you will see that the TCP and TLS handshake RTTs are fine. After the TLS handshake the client sends data to the server (#48), and the server ACKs the packet after 40ms (#49). But it tooks another 900ms until the server send back data (#52). The first stream looks nearly the same. To confirm you should capture the traffic on the server 10.50.110.100.

@Jaap: I'm not an TCP expert too, but if I understood the Nagle algorithm right, it should not cause such a hugh delay.

JasMan gravatar imageJasMan ( 2020-07-10 21:42:19 +0000 )edit

Hello again. Thanks for the answers.

are you saying that the delay is on the web service and not on the tcp stack? Before using wireshark, we've activated xdebug on both sites (portal and services) and sampled it by navigating around with a single user (only one user and activated both logs at the same time). it seemed like the time spent inside the app that has the web services was minimum (for instance, the portal had 32 seconds of curl calls vs 1 second total time processing on the services app). Thanks.

labreu gravatar imagelabreu ( 2020-07-11 13:30:48 +0000 )edit