Can you explain this TCP sequence

answered 2018-07-19 15:49:31 +0000

11115 ●12 ●322 ●166 https://www.igt.com/

Given the 2 supplied capture files ...

I used Tracewrangler to restore the original IP's, namely 192.168.195.45 => 10.247.166.16 and 192.168.50.157 => 172.28.12.164 to more easily compare them and to be able to refer to the same IP addresses in both capture files.

The captures were apparently taken with a snaplen of 54 bytes, which is a bit unfortunate as we don't have full frames for checksum verification. That said, I focused on a single packet, namely the 1st retransmission of the 1460 byte segment being sent from 10.247.166.16 to 172.28.12.164 with sequence # 3020828.

In the snd.pcapng file, this is frame #58376; in the recv.pcapng file, this is frame #13878. Comparing these 2 frames, the only differences are:

TTL: It is 128 in snd.pcapng and 127 in recv.pcapng. This TTL difference was already noted by Packet_vlad.
TCP Checksum: It is 0x3f4f (unverified) in snd.pcapng and 0x31fa (unverified) in recv.pcapng.

Unfortunately, the TCP checksums can't be verified because of the snaplen used. That said, the TCP Checksums should be identical since TTL isn't included as part of the 96-bit pseudo header that goes into the TCP Checksum algorithm. The TCP checksums would/could differ though if:

There was some NAT'ing going on, in which case the IP addresses could have changed. Is this the case?
TCP Checksum Offloading is being done on the sender side, in which case 0x3f4f might not be the actual TCP checksum that was calculated and transmitted. If this could be the case, then it would be better to capture outside of the 10.247.166.16 host.

It might be worthwhile to capture both sides again, but without applying a snaplen.

In any case, IF the TCP checksum is wrong when the TCP segment arrives at its destination, then this would explain why the receiver never ACK's it.

edit flag offensive delete link

Comments

The checksum can be wrong, but it also can be due to offloading at least in the trace with the huge frames.

Christian_R ( 2018-07-19 18:01:58 +0000 )edit

I thought about that but how to explain such selectiveness - only original segment with sequence # 3020828 and all its subsequent retransmissions were dropped, while all of them had different checksums (because of different TCP ACK fields). At the same time none of pure ACKs from the same source was dropped.

Packet_vlad ( 2018-07-19 19:26:53 +0000 )edit

I guess, that the packets get stuck in the queue of the stack, as we can see that the stack is under stress due the advertised RWin of 4096. So I agree with Vladimir slowing down the traffic could be tried.

Christian_R ( 2018-07-19 20:43:40 +0000 )edit

Thanks, Maynard. This is a good hint - unfortunately, the original registrations are complete (possible that TraceWrangler truncates the payloads).

From Wireshark, I enabled checksum verification on the receiver, and the checksum are indeed correct, according to Wireshark.

Additional note: yes, there is Checksum Offloading enabled on both side, so checksum can only be verified on the registration side showing packet receive.

Luca ( 2018-07-20 08:15:08 +0000 )edit

Thanks, @Christian_R and @Packet_vlad. I see all the points about reducing/slowing down the traffic by tuning the TCP window scaling factor and the TCP buffer size. And they are indeed very pertinent. But both seems palliative actions: I would like to understand which is the root cause of the TCP behavior here and why the back pressure is not activated on the sender side.

Luca ( 2018-07-20 08:33:17 +0000 )edit

Hi Luca, I assumed reducing Window Scale factor as a measure to improve our observation ability. If you reduce WS factor, you'll be able to see what is actual (or at least more granular) RWIN size at the moment, not the one aligned to 4096-Bytes steps. And this would give us a possibility to do further analysis. Of course if it reveals us RWIN big enough to store a couple of MSS, the hypothesis is wrong and we'll search elsewhere.

Packet_vlad ( 2018-07-20 09:05:47 +0000 )edit

If we are right, and the systems are just under stress, then you can either slow down connections or making the systems itsself faster (checking and tuning CPU load, system load....and so on; It can be a complex task)

Christian_R ( 2018-07-20 11:02:52 +0000 )edit

Thanks guys. We will retry with the TCP scaling and window size changed, then. What remains unclear to me is why in such stress condition, the TCP machinery fails in triggering the (so much needed) congestion control.

Luca ( 2018-07-20 12:21:29 +0000 )edit

I talk about the snd.pcapng

It is triggered. The congestion controls are in play:

We see a lot of Zero Windows
We see also small advertised windows.
Congestion Avoidance comes into play with the Retransmission (#58376 for the SrcPort 64613 and # 91696 for the SrcPort 14004), as they are Retransmission of time (RTO) they will reduce the sending window dramatically but first at this point, but your RTT is small, too. So BDP might be still high.

Christian_R ( 2018-07-20 13:01:05 +0000 )edit

Luca, TCP assumes the sender will take care of: 1) not overloading the receiver; 2) not overloading the network. (2) is actually met, the sender stops sending further after packet was lost and no ACK came back. (1) - the sender keeps to retransmit the packet because it sees no Zero Window therefore it assumes the receiver has some room available. So TCP on the network works as needed.

The question is: what happened to the receiver so it ceases to clear its own receiving buffer and keeps sending minimal 4096 calculated R-window.

Check it out: 192.168.1.157 seems to get its buffer begin filling from packet No. 12801 down to No.13190, where it starts to report 1 (4096Bytes) Window Size and opens it only 14+ seconds later (to 5 (20480)Bytes, packet 17959). You can plot Windows Scaling graph and see this clearly.

All this time the ...(more)

Packet_vlad ( 2018-07-20 13:16:09 +0000 )edit

I understand the receiver application is temporarily busy and not consuming data from the receive window, while it is perfectly able to send data out. We know this could be normal for such kind of application at startup. However, either of the two must be true: 1) the receive buffer is eventually full: a Zero Window must be acknowledged. 2) the receive buffer is not full: the incoming retransmission packet must be (at least partially) accepted and acknowledged.

In the registration, each retransmission of len 1460 is not acknowledged, yet a receive window of 4096 is advertised. This is the only thing that still sounds inconsistent to me.

Luca ( 2018-07-20 13:43:54 +0000 )edit

The receiver application is a Java application, using java.nio and non-blocking sockets. An incoming request could easily take the application busy for a while, producing a significant amount of data out in result. So, no surprise its receive window shrinks from time to time -- but if it's full, it must go to 0 and not let the sender silently think the packet is lost.

Luca ( 2018-07-20 13:50:32 +0000 )edit

And this is exactly the point I'm trying to investigate :) Let's say receive buffer has 1000 Bytes free, it is not full. How are you going to report it IF your Window scaling Factor is 4096? Even if you set 1 as window size - it should be scaled and interpreted by sender as "Free 4096 Bytes". You can't decrease Scaling factor during connection, because it is negotiated in 3-way handshake and is constant throughout all connection.

Honestly I do not know how TCP has to behave in this sutiation. Set to 1? It's no true, the buffer is smaller. Set to 0? Not true either... Please guys point me where to read about such case.

I've also never seen such thing as "Partial ACK", when receiver ACKs less then one packet sent to it although it is not forbidden as I know.

Packet_vlad ( 2018-07-20 13:57:55 +0000 )edit

What will be if the packet is silently dropped by one of the internal queues? Just a thought.... have no evidence for this.

Christian_R ( 2018-07-20 14:12:48 +0000 )edit

Not sure whether it could be pertinent or it just adds another variable into the picture, but... The issue seems to be empirically solved by enabling jumbo frames on the sender box. Empirically means it used to reproduce 8 out of 10 tries before; it does not replicate (yet) now with jumbo frames on. Does this makes any sense? Clearly, the level of congestion both sides is identical, since neither the data volumes nor the application behavior changed.

Luca ( 2018-07-20 15:16:31 +0000 )edit

Jumbo Frames will fit better with your scaling factor, as Vladimir has mentioned about the size of 4096. And of course do Jumbo Frames take away some load from your system. So yes it can make sense especially with your high Window Size values and your small RTT.

But you must ensure that every box supports Jumbo Frames even the receiver. Otherwise Jumbo Frames will not be used. https://crnetpackets.com/2016/01/27/t...

Christian_R ( 2018-07-20 16:03:09 +0000 )edit

Could you please share this new trace for education purpose? I'm so interested to see current behavior.

Packet_vlad ( 2018-07-20 17:09:22 +0000 )edit

Yes, @Christian_R. It turned out jumbo frames were enabled everywhere, but on the sender box: so enabling it on the sender box made them flow all the way end-to-end.

@Packet_vlad, sure: I will do this first thing when I get my hands on them.

Luca ( 2018-07-20 17:17:00 +0000 )edit

"Let's say receive buffer has 1000 Bytes free, it is not full. How are you going to report it IF your Window scaling Factor is 4096?"

RFC 1323, which covers Window Scaling, does not specify how to handle this, but it shows one method in a model Window Scaling implementation.

A Window Scaling multiplier of 4,096 is a shift factor of 12. That is, the system that receives the packet should take the number in the window size field, expressed in binary, and shift it left 12 places. The system sending the packet should perform the reverse operation. It should take the true window size and shift it right 12 places and store that value in the window size field before sending the packet. If the true window size is 1,000 bytes (decimal), that's 11 1110 1000 binary. The system sending the packet should shift that ...(more)

Jim Aragon ( 2018-07-21 10:46:59 +0000 )edit

Many thanks, @JimAragon. So we're basically stuck at the beginning: we have a recv box dropping 1460-byte packets, yet acknowledging a 4k recv window.

Luca ( 2018-07-21 11:11:39 +0000 )edit

Jim, great explanation, thanks! I think I'll try to perform some experiments to observe how it happens in real systems. Luca, maybe this is a good idea to check some OS counters regarding dropped or error packets at the time when issue is happening.

Packet_vlad ( 2018-07-21 17:56:06 +0000 )edit

Hello, guys. Thanks for you interest so far: this is getting pretty catchy.

So I have another registration, washed with TraceWrangler, for the connection NOT working: https://drive.google.com/open?id=1qYp... This time the recv window gets down to 2k and there it remains until the RESET: see 14012 with packet seq #2937795, when the retransmissions start.

I also have the registration with Jumbo frames on, washed with TraceWrangler: https://drive.google.com/open?id=1PIA... This time, same applications, same data volumes, the connection stays perfectly alive. The receiver is eventually able to send Zero Window, see 17475 which seems to be first occurrence.

Luca ( 2018-07-23 09:06:05 +0000 )edit

Quite interesting. For the Jumbo Frames. We see packet loss and Fast Retransmission and RTO. Maybe something is wrong with your driver. Maybe there is a driver issue with the Offloading functions. Not so impossible. Do you use the newest Version of the drivers?

Christian_R ( 2018-07-23 09:12:48 +0000 )edit

And also it is interesting what OS and version is this..

Packet_vlad ( 2018-07-23 10:13:39 +0000 )edit

@Packet_vlad: Yes that would be interesting. And what network cards and drivers are used.

Christian_R ( 2018-07-23 10:36:57 +0000 )edit

What is interesting to me is the fact once you've decreased WS factor to 2048, the host started to report exactly 1(=2048) instead of 1(=4096). Could you please do couple steps further and lower it to, say, 256 or even 128? You have largest "Bytes in flight" number of 640k, so WS factor of 16 would be enough to handle this load.

In "Jumbo-trace" the receiver reported 3(=6144) RWIN size which is less than 1MSS (=8946) so the sender was able to create packet which fits perfectly into RWIN (packets 17472 + 17473) which caused Zero Window (this is good correct behavior).

Packet_vlad ( 2018-07-23 11:25:30 +0000 )edit

Sender is a Windows desktop, while Receiver is on RedHat 6, June update, Kernel 2.6.32-754.2.1

Luca ( 2018-07-23 12:08:08 +0000 )edit

Hello, guys. So basically, the issue is systematically reproducible even with simulator applications, under the following conditions: 1) the Sender running on Windows box 2) then Receiver running on Red Hat 6 box (we are on Kernel 2.6.32-754.2.1) 3) the Windows Scaling Factor of the Receiver to 4096 4) the Receiver uses SO_RCVBUF option on socket 5) Congestion is generated from the Sender to the Receiver, in such a way the Receiver is slow but keeps reading data.

With the above, after enough time the connection gets stuck in retransmissions, while the Receiver fails in acknowledging a Zero Window.

It seems certain conditions or combination of parameters are preventing Zero Window acknowledgements in the OS TCP stack.

Luca ( 2018-07-23 12:10:17 +0000 )edit

Have you ever tried "netstat -s" at receiver side?

I guess your system 10.113.130.171 of the new JumboTrace has problems with the Oflloading features (maybe it is like @cmaynard already mentioned a problem with the TCP checksum calculation during the offloading process). Otherwise I can´t explain packet #19.

Christian_R ( 2018-07-23 12:26:10 +0000 )edit

@Luca, does it mean we're facing Kernel TCP bug, what do you think? Could you run the same test on later Kernel versions? Or the same test with artificially lowered WS factor?

@Christian_R what's that you don't like in JumboTrace packet 19? The sender sends 5 big 18k segments, every of them are split into 2 x 9k jumbo frames. One of these 9k-ers is lost on the path, the receiver signals about that via dup ACKs, the sender crafts 1 9k-er Fast retransmission, which happens to be only one had been lost. The receiver ACKs whole sequence of 5 18k segments. Am I missing something? For sure it is bad we don't use SACK.

Packet_vlad ( 2018-07-23 13:15:00 +0000 )edit

Can the TCP Window Scaling factor be directly configured? I see it is now 2048 in the second registration for the failing connection (it used to be 4096 in the first), but not sure where they changed it

Luca ( 2018-07-23 13:17:19 +0000 )edit

@Luca, as I've seen in a couple of articles , it can't be changed directly, but only by changing socket memory allocation settings.

Packet_vlad ( 2018-07-23 13:26:17 +0000 )edit

@Packet_vlad: #19 is a partial retransmission of the 17946 byte large segment #11

And t does not happen only once in the trace, it does happen often; looks to bad for me.

Christian_R ( 2018-07-23 14:15:50 +0000 )edit

@Christian_R, we don't have direct access to the machine for netstat -s, but we will ask for it.

Luca ( 2018-07-23 14:18:09 +0000 )edit

@Luca: I missed it all the time. Please enable Selective ACK at the server side.

This is really really important.

Otherwise only normal ACK or Partial ACK algorithms will be used. The client uses it already:

For Redhat please use: echo 'net.ipv4.tcp_sack = 1' >> /etc/sysctl.conf

Redhat Network Tuning

Christian_R ( 2018-07-24 07:24:31 +0000 )edit

@Christian_R, will do that. I guess you actually meant tcp_sack = 1.

Luca ( 2018-07-24 07:35:56 +0000 )edit

Of course mean 1. I have corrected it.

Christian_R ( 2018-07-24 10:46:30 +0000 )edit

@Luca Have you solved the issue?

Christian_R ( 2018-08-01 17:26:46 +0000 )edit

@Christian_R, the issue did not reproduced with Jumbo Frames and reduced TCP window, so that was enough to close the case. The root cause remains unknown.

Luca ( 2018-08-09 10:47:47 +0000 )edit

see more comments

Without a trace it is hard to say, camn you share us a trace: https://blog.packet-foo.com/2016/11/t.... And the receiver side is done with Segmentation Offloading which does not reflect the situation on the wire.

Christian_R ( 2018-07-18 11:01:20 +0000 )edit

Hello, Christian. Thanks, I uploaded the receiver capture on Google Drive: https://drive.google.com/open?id=1V1m...

The capture shows the connection established at 15:35:47 from port 64613 to 14004. The TCP conversation goes on for almost a minute, transferring amount of data, with occasions of congestion in both directions.

The issue seems to start at 15:36:20.526157000: from this time on, the receiver seems to ACK seq=3020828 for each of the 5 packet retransmissions, until the connection RESET, received at 15:36:39.431508000 (when the Windows sender gives up and closes the connection).

Luca ( 2018-07-18 12:25:26 +0000 )edit

The capture file appears to have been taken on the machine with IP 192.168.50.157. Is there a corresponding capture file available from the other side, namely at 192.168.195.45?

cmaynard ( 2018-07-18 18:43:12 +0000 )edit

Yes, cmaynard. Note the registrations were sanitized via TraceWrangler, so the IPs will appear randomized. I do have the corresponding registration from the connection peer: I will just upload and make it available tomorrow, when I can access the registration file again. Thanks!

Luca ( 2018-07-18 19:07:44 +0000 )edit

The problem is very strange. The missed 1460 byte segment w/seq # 3020828 is retransmitted with the next seq # correctly indicating 3022288; yet 192.168.195.45 continues to only ack 3020828. It would appear that 192.168.195.45 never receives it, but the capture file from the other side would confirm or deny that.

cmaynard ( 2018-07-18 19:17:31 +0000 )edit

Can you explain this TCP sequence

Comments

1 Answer

Comments

Your Answer

Question Tools

Stats

Related questions

Can you explain this TCP sequence edit

Comments

1 Answer

Comments

Your Answer

Question Tools

Stats

Related questions

Can you explain this TCP sequence