Receiver sends window update instead of DUP ACK
I have an issue where the sender sends a bursty chunk of data across a 50 Mbps WAN link to the receiver, and some of the packets are lost in transit.
However, instead of sending DUP ACKs for the packets it did not receive (it did receive the later, out-of-order packets with higher SEQ numbers), the receiver repeatedly sends many window_update packets, each one changing the advertised receive window by 1 or 2 units (the window scale is 12, i.e. one unit is 4096 bytes).
This is despite the fact that the bytes in flight are far below the existing receive window (around 40 kbytes in flight against a receive window of 300*4096 = 1.2 Mbytes).
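For reference, the arithmetic behind those figures can be sketched as below. The raw window value of 300, the shift count of 12 and the roughly 40 kbytes in flight are the numbers quoted above; the variable names are only illustrative.

```python
# Numbers taken from the description above; the names are illustrative only.
WINDOW_SCALE_SHIFT = 12      # window scale option value (assumed from earlier traces)
raw_window_field = 300       # window field as carried in the TCP header
bytes_in_flight = 40_000     # approximate unacknowledged bytes

scale_factor = 1 << WINDOW_SCALE_SHIFT            # 2^12 bytes per window unit
receive_window = raw_window_field * scale_factor  # effective window in bytes

print(f"scale factor:   {scale_factor} bytes per window unit")
print(f"receive window: {receive_window} bytes (~{receive_window / 1e6:.1f} Mbytes)")
print(f"window in use:  {bytes_in_flight / receive_window:.1%}")
```

So a window_update that moves the field by 1 or 2 only changes the advertised window by 4096 or 8192 bytes, which is small compared with the total.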
The end result is that the sender never gets multiple DUP ACKs from the receiver, which would have triggered a fast retransmit. Instead, it timed out on the RTO and sent the lost packet 200ms later. The receiver ACKed this, but the issue does not end there. The sender then waited with binary backoff, i.e. 400ms before sending the next lost packet, 800ms before the third, and so on. Throughput dropped tremendously.
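To show how quickly this adds up, here is a rough sketch of the backoff pattern described above. It assumes the timeout simply doubles for every lost segment and is never reset or capped, which matches what was observed here but is not necessarily what every TCP stack does; `rto_recovery_schedule` is a hypothetical helper name.

```python
def rto_recovery_schedule(initial_rto_ms, lost_segments):
    """Return (wait_ms, cumulative_ms) for each lost segment.

    Assumption: the RTO doubles after every timeout and is never
    reset or capped, as in the behaviour described above.
    """
    schedule = []
    rto = initial_rto_ms
    elapsed = 0
    for _ in range(lost_segments):
        elapsed += rto                   # wait out the current timeout
        schedule.append((rto, elapsed))
        rto *= 2                         # binary backoff for the next loss
    return schedule


for i, (wait, total) in enumerate(rto_recovery_schedule(200, 5), start=1):
    print(f"lost segment {i}: waited {wait} ms, {total} ms since the first timeout started")
```

With five lost segments the sender would already have spent several seconds idle, compared with roughly one round trip per segment if fast retransmit had been triggered.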
What can I do to remediate this situation?
Questions:
- Is the receiver behaving correctly? I expected it to send DUP ACKs instead of window_update packets.
- Can the sender reset the RTO timer after getting the first ACK for the retransmission? I read that TCP NewReno uses partial acknowledgements that can speed up the recovery.
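For context on the first question, RFC 5681 only counts an ACK as a duplicate if, among other conditions, its advertised window equals the window in the previous ACK. The sketch below is a simplified check along those lines; the `Ack` class and `is_duplicate_ack` function are hypothetical names, and comparing only against the previous ACK (rather than the greatest ACK seen on the connection) is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    ack_no: int           # acknowledgment number
    window: int           # advertised window field
    payload_len: int = 0  # bytes of data carried by the segment
    syn: bool = False
    fin: bool = False


def is_duplicate_ack(prev: Ack, cur: Ack, outstanding_bytes: int) -> bool:
    """Simplified RFC 5681 duplicate-ACK test (hypothetical helper)."""
    return (
        outstanding_bytes > 0              # sender has unacknowledged data
        and cur.payload_len == 0           # the ACK carries no data
        and not (cur.syn or cur.fin)       # SYN and FIN are both off
        and cur.ack_no == prev.ack_no      # acknowledgment number did not advance
        and cur.window == prev.window      # advertised window is unchanged
    )


prev = Ack(ack_no=1000, window=300)
# Same ACK number and same window: counts towards fast retransmit.
print(is_duplicate_ack(prev, Ack(1000, 300), outstanding_bytes=40_000))
# Same ACK number but the window changed by 1: treated as a window update.
print(is_duplicate_ack(prev, Ack(1000, 301), outstanding_bytes=40_000))
```

If the sender follows this definition, ACKs whose window field keeps changing would not be counted towards the three duplicates needed for fast retransmit, which may be why it falls back to the RTO here.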
Does the connection have SACK enabled?
Edit: Added the missing and vital S to ACK.
I think @grahamb means SACK :-)
Are you able to post a capture file on a public file-sharing service like Dropbox, OneDrive, etc.? If so, please do and share the link here (and please make sure there is no sensitive data in it; you can slice the packets after the TCP header for this analysis).
SACK is not enabled. https://www.dropbox.com/sh/akzfjk2nb7... I have removed the data field, leaving only the headers.
Thanks for the traces. Do you know the window scaling factor that the client and the receiver are advertising? Or, even better, could you provide a trace that includes the 3-way handshake?
I am still trying to get the 3-way TCP handshake, which happened hours before the part of the capture that shows the above issue (i.e. the application complaining of data loss in the network after waiting a long time due to numerous RTOs with binary backoff). However, in previous traces the receiver's window scale value was 12, giving a scaling factor of 4096 bytes (i.e. 2^12). I don't think it changes between sessions, but I will confirm once I get the 3-way handshake for the above tcpdump in a day or two.