TCP Dup ACK after reconnection - sequence number problem?

asked 2021-04-29 06:08:32 +0000

soapy

updated 2021-05-03 07:43:37 +0000

grahamb

Hi there,

currently I'm debugging the network traffic between two devices in the same network. The network architecture is quite simple.

Device 1 <-> Switch 1 <-> Switch 2 <-> Device 2

To verify the software on the devices, I check different scenarios. One of them is the correct reconnection after I unplug the network cable between Switch 1 and Switch 2 and plug it in again after a few seconds.

I uploaded a Wireshark capture to my OneDrive: Wireshark capture

Packets 1-9 are correct communication.

Between Packet 9 and Packet 10 the cable is unplugged.

In Packet 10 Device 1 tries to send data to Device 2 without receiving an ACK.

In Packets 11/12 Device 2 sends Keep-Alive messages.

Between Packet 12 and 13 the cable is plugged in again.

Packet 13 ACKs the last Keep-Alive message, which seems fine.

From now on, it gets weird. I only see TCP Dup ACK messages.

I assume that the TCP stacks get confused by the difference in the sequence numbers. While Device 1 thinks its own sequence number is 49, Device 2 thinks it is 37. Device 1 does not support Fast Retransmission.

Can someone explain what is happening here? I'm struggling to understand where the problem is. Is the problem in Device 1, where the TCP stack thinks it is on sequence number 49 while the packet is not yet acknowledged, or is it in Device 2?

I really appreciate your help.

Kindly, Philipp

Post answer captures:

ServerCapture_reconnection_after_unplugging_cleaned.pcapng ClientCapture_reconnection_after_unplugging_cleaned.pcapng


Comments

My understanding is the cable between switches was unplugged not the end device.

BigFatCat ( 2021-04-29 07:11:13 +0000 )

That is correct. For both devices the link is still up. The application has a timer, but the time the cable was unplugged is shorter than the timer, so it will not see the link as disconnected. As I understand it, TCP makes it possible to unplug and replug a cable and still keep the connection. But maybe this is just theoretical, and in practice one should use a much shorter time than I do. What would more experienced network architects recommend as a timer?

soapy ( 2021-04-29 07:26:04 +0000 )

2 Answers


answered 2021-04-29 10:04:45 +0000

SYN-bit

From the captured packets, I think there is a bug in the TCP/IP stack of device-2. I assume the span port was mirroring traffic on the inter-switch link because there is no traffic for 6 seconds. I also assume there were already some more TCP KeepAlive messages sent by device-2, and when the new data from device-1 comes in, it does not ACK the data as it might be strictly waiting for a KeepAlive-ACK.

I also think there is a bug in the TCP/IP stack of device-1, as it does not retransmit the segment starting at seq 37. But it might just be that the RTO falls outside of the capture interval and that after a while it does retransmit this data and things get back to normal. Please capture over a longer period next time (either until the connection restores itself or until it is reset by either one of the devices).


Comments

I'll try to investigate if there have been other Keep-Alive messages. Would it be a valid behaviour if the TCP/IP stack simply "forgets" about the KeepAlive-ACKs if there is valid data on the line? In my understanding that would be a correct behaviour, but I didn't find any resources on that.

Yes, there might be a bug on device 1 side. Normally there should be retransmissions. Not even fast retransmit. Simple retransmission should be seen on the line, as the query has never been ACKed. RTO is set to 1000ms, so there should definitely be retransmissions.

The problem is, that it seems not to restore the connection at all. I will try to wait for a RST in the next capture.

soapy ( 2021-04-29 12:20:56 +0000 )

It appears that my capture was too short in time.

And apparently there were no major bugs in the TCP stacks.

Server side capture || Client side capture

I took captures on both sides - client and server side. It appears that some retransmissions get lost due to the disconnected link. This is what @grahamb noted: inspecting ip.id shows that there are packets lost. Those packets are not shown in the server side capture but are clearly visible in the client side capture (ip.id = 0x0622 - 0x0624)

The Round Trip Time in my client device is set to 3 seconds. I would have assumed that a retransmission would occur after 3 seconds. Apparently the first retransmission occurs after 6 seconds, followed by a whole lot of TCP Dup ACK segments. After another 12 seconds the second retransmission occurs, which ends the TCP Dup ACKs, and the connection is correct again ...
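
The 3, 6 and 12 second figures above would fit a retransmission timer that doubles after every unanswered attempt. The following is only a sketch under that assumption: the 3 second starting value and the plain doubling are guesses about the client stack, not something the captures confirm, and `retransmission_times` is a hypothetical helper name.

```python
def retransmission_times(initial_rto, attempts):
    """Return (offsets, gaps) in seconds.

    gaps    : wait before each retransmission
    offsets : time of each retransmission, measured from the original send
    Assumes the timeout simply doubles after every attempt.
    """
    offsets, gaps = [], []
    rto, elapsed = initial_rto, 0.0
    for _ in range(attempts):
        elapsed += rto
        offsets.append(elapsed)
        gaps.append(rto)
        rto *= 2  # exponential backoff (assumed)
    return offsets, gaps


offsets, gaps = retransmission_times(3.0, 3)
print(gaps)     # wait before each attempt
print(offsets)  # seconds since the original segment
```

Under these assumptions, an attempt that is sent while the link is still down is lost, and the next one only follows after a doubled wait, which may be why recovery takes so long here.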

soapy ( 2021-05-03 07:12:07 +0000 )

@soapy, I uploaded and added the captures to the question.
That's a very long TCP timeout, not great for LAN performance.

grahamb ( 2021-05-03 07:45:15 +0000 )

The client side capture is a bit confusing to read. Wireshark displays TCP ACKed unseen segment and TCP spurious Retransmission, which is apparently due to a slow capturing device (which applies to the client side).

It seems the client side capture has different timing for the receive and send paths. From a quick look at the trace, it seems subtracting 50 microseconds from the timestamps of the packets from 192.168.1.19 does the trick to get a correct view of the order of packets:

tcpdump -r client-16200278029053096.pcapng -w 19 src host 192.168.1.19
tcpdump -r client-16200278029053096.pcapng -w 17 src host 192.168.1.17
editcap -t -0.000050 19 19b
mergecap -w client-corrected.pcapng 17 19b
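
What the four commands above do can be illustrated with a small Python sketch: shift the timestamps of one sender by a fixed offset, then re-sort everything by time. The timestamps below are made-up integers in microseconds and `shift_and_merge` is a hypothetical helper; it does not read pcapng files.

```python
def shift_and_merge(packets, src, offset_us):
    """packets: list of (timestamp_us, src_ip) tuples.

    Shift the timestamps of packets sent by `src` by `offset_us`
    and return all packets sorted by the adjusted time.
    """
    shifted = [(t + offset_us if s == src else t, s) for t, s in packets]
    return sorted(shifted)


# Hypothetical timestamps: the response from .19 is stamped 20 us
# after the request from .17 that it actually preceded.
packets = [(100, "192.168.1.17"), (120, "192.168.1.19"), (130, "192.168.1.17")]
merged = shift_and_merge(packets, "192.168.1.19", -50)
print(merged)
```

After the shift, the packet from 192.168.1.19 sorts before both packets from 192.168.1.17, which is the reordering effect the correction is meant to achieve.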

(for my interest, how was the capture on the client-side made?)

SYN-bit ( 2021-05-03 12:40:42 +0000 )

An RTO (retransmission timeout) of 6 seconds is really, really high. 3 seconds was a commonly used value 10+ years ago. Most systems now use a dynamic RTO, based on the RTT being experienced. Systems with a fixed RTO usually use 100-300ms. Lowering the RTO in these devices would reduce the time to converge.
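
As a rough illustration of a dynamic RTO, the sketch below follows the smoothing rules of RFC 6298 (SRTT, RTTVAR, alpha = 1/8, beta = 1/4, K = 4). The RTT samples, the clock granularity and the `min_rto` floor are assumptions chosen for the example; RFC 6298 recommends a 1 second floor, and many stacks use a lower one.

```python
ALPHA, BETA, K = 1 / 8, 1 / 4, 4  # constants from RFC 6298


def rto_estimates(rtt_samples, granularity=0.001, min_rto=0.2):
    """Yield an RTO (seconds) after each RTT sample (seconds).

    min_rto is an assumed floor, not a value taken from the thread.
    """
    srtt = rttvar = None
    for r in rtt_samples:
        if srtt is None:
            # First measurement initialises both estimators.
            srtt, rttvar = r, r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
        yield max(min_rto, srtt + max(granularity, K * rttvar))


samples = [0.001, 0.001, 0.002, 0.001]  # hypothetical LAN RTTs
rtos = list(rto_estimates(samples))
print(rtos)
```

With millisecond round trips the computed value is far below the floor, so on a LAN the floor effectively decides the RTO.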

So basically, the first retransmission after 6 seconds is missed and the second one after 12 seconds fixed the session. In the meantime, there is a duplicate ACK storm that should not happen IMHO. Only data segments should trigger ACK packets. So frame 2360 should trigger the ACK in frame 2361, but that ACK should not trigger a DUP ACK, as there was no data received.

SYN-bit ( 2021-05-03 13:27:22 +0000 )

answered 2021-04-29 08:02:19 +0000

grahamb

The capture was made using an older version of Wireshark (3.2.7) on a Windows 10 (1803) system. I'm guessing that the capture was made on the endpoint (1.19) simulating the Modbus device due to the very small (or 0) time delays between the Modbus query and the response. When running experiments like this it helps to capture at both ends, or even better off-machine e.g. with a tap or a span or mirror port from the switch(es).

I'm assuming that, as indicated by the 6 second gap between frames 9 & 10, the link between the 2 switches was disconnected and then reconnected. I'm wondering if there has been some other traffic that hasn't been captured, e.g. ICMP?

The other thing that seems a little odd is that the Modbus client (1.17) is querying the Modbus device very quickly with no delay between the previous response and the next request and then suddenly jumps to a 6 second delay as the link is disconnected?

The basic issue is that the Modbus query in Frame 10 hasn't been acknowledged (relative seq. #49) by the Modbus device and the Modbus client should timeout and retransmit it, especially when notified by the duplicate ACK messages from the Modbus device that only acknowledge relative seq. # 37.
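
The repeated ACK of relative seq. # 37 can be pictured with a toy model of cumulative acknowledgement. This is a simplified sketch, not the behaviour of either real stack: the `Receiver` class is hypothetical, it has no SACK and no out-of-order buffering, and the segments starting at 49 are invented to show the effect of a hole.

```python
class Receiver:
    """Toy cumulative-ACK receiver: no SACK, no out-of-order buffering."""

    def __init__(self, next_expected):
        self.next_expected = next_expected

    def on_segment(self, seq, length):
        # Only a segment starting exactly at next_expected advances the ACK.
        if seq == self.next_expected:
            self.next_expected += length
        return self.next_expected  # ACK number sent back


rx = Receiver(next_expected=37)
acks = [rx.on_segment(49, 12), rx.on_segment(49, 12)]  # data after the hole
acks.append(rx.on_segment(37, 12))                     # retransmission fills it
print(acks)
```

As long as the 12 bytes starting at 37 are missing, every reply carries ACK 37; only the retransmitted segment moves the ACK number forward to 49.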

Questions for @soapy

  1. How was the capture made?
  2. What capture filter, if any, was used?
  3. Why the 6 second delay in the query in frame 10?
  4. Can the capture be done again, this time at both ends if possible and with an up to date version of Wireshark so any capture filters are recorded?
  5. Is the Modbus device a real PLC or a simulated device?

Comments

Thanks for sharing your thoughts. Let me better explain the capture settings.

  1. Switch 2 is a port mirror switch which mirrors the traffic to the capturing device. Unfortunately Switch 1 is not a port mirror switch and Device 1 does not support Wireshark. If I can't come up with a solution I might have to use a port mirror switch for switch 1 as well. See ASCII diagram below.
  2. The capture filter was simply filtering the addresses (ip.src == 192.168.1.17 || ip.dst == 192.168.1.17). No other traffic like ICMP was seen, only what is displayed here.
  3. That is a good question. If I rethink it, it might be due to the unplugged connection. After plugging it again, the query which is in the send buffer of the Modbus client is transmitted to the Modbus server. So there is no actual delay in the query ...
soapy ( 2021-04-29 08:24:55 +0000 )

So device 1 is the client and device 2 is the server?

Inspecting ip.id you can see that frame 9 has id 0xda7 and then frame 10 has id 0xda9, so it would seem that there is at least one query from the client that hasn't been captured, likely when the link was disconnected.
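
The ip.id check can be sketched in a few lines of Python. It assumes the sender increments the 16-bit IP ID by one for every packet it sends, which not every stack does, and that the counter is not shared with traffic excluded by the capture filter; `missing_ip_ids` is a hypothetical helper.

```python
def missing_ip_ids(ids):
    """Return IP IDs skipped between consecutive captured packets
    from one sender, assuming the ID increments by one per packet."""
    missing = []
    for prev, cur in zip(ids, ids[1:]):
        gap = (cur - prev) % 0x10000  # the ID field wraps at 16 bits
        missing.extend((prev + k) % 0x10000 for k in range(1, gap))
    return missing


captured = [0xDA7, 0xDA9]  # frames 9 and 10 from the thread
print([hex(i) for i in missing_ip_ids(captured)])
```

For the two IDs quoted above this reports a single skipped ID, 0xda8, matching the observation that at least one packet from the client is absent from the capture.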

grahamb ( 2021-04-29 08:46:52 +0000 )

The SPAN configuration on switch 2, did it mirror the port to switch 1 (I assume from the packets) instead of the port to device-2? Could you mirror the port to device-2 in the next capture session?

Also, as @grahamb pointed out, it would be perfect if you could capture on both sides at the same time. If that is not possible, is it possible to swap the devices for a second test to get an idea of what is happening on the side of device-1 too?

SYN-bit ( 2021-04-29 09:54:38 +0000 )

@grahamb Exactly, device 1 is the client, device 2 is the server. I'll try to get another capture to see if there are packets filtered out somehow with captures on both sides to compare. @SYN-bit actually it mirrors all ports, both to device 2 and to switch 1.

soapy ( 2021-04-29 12:10:46 +0000 )
