There's a very consistent regularity to the way the packets flow from the client to the server. A particular pattern is repeated again and again - at roughly 7 second intervals - with about 500,000 KB transferred per interval. I'll define the large burst of packets as the start of the pattern.
Here are my key observations (with some supporting TCP Trace charts below):
1) The client sends a large burst of 250 KB, but large portions are lost after the first 100 KB. In the first TCP-Trace chart below, we see the 100 KB successful burst, the yellow area with no packets, a few subsequent packets that made it through, then one RTT later the second 100 KB burst.
2) The server's receive window is close to 1 MB, but the sender appears to use its own RWIN of 261,288 bytes as its own transmit "limit". The sender manages to maintain close to this "in flight" value throughout the whole period.
3) One RTT later, the sender receives ACKs for the 100 KB that wasn't lost and manages to transmit a further 100 KB without any errors. The large number of original lost packets trigger many Dup-ACKs and in response, the sender retransmits a single packet to begin to fill the gap. Following the horizontal "Ack line" on the chart we see the single retransmitted packet and the step up of the Ack line.
4) One RTT after that, there's another single packet retransmission.
5) The large initial gap in the received data is then filled in at the rate of just one packet per RTT. Also, after several RTTs (perhaps as the sending congestion window is opened), the sender begins to send small bursts of new data so that the in-flight value of 250 KB is maintained.
6) On the second chart below, we've zoomed-out to encompass a full pattern and the start of the next one. The dark blue circle is around the initial two large bursts, the red circle is around all the single packet retransmissions and the light blue circle is around all the small bursts of new packets. It looks like the sender eventually waits for every sixth round trip so that it can send a full 6-packet application "block" (there's a Push flag at the end of these 6-packet bursts).
7) Eventually, the original large gap has been completely filled-in and we see the Ack line jump all the way up as the original two large bursts and all the smaller "new" bursts are fully acknowledged. At this point, in-flight data is zero and the sender is now free to begin the whole pattern all over again.
So, what are the things that need to investigated further?
A) The bulk packet loss, always after a large burst of 100 KB, points to a device in the path that only has a 100 KB buffer space. The most likely candidate will be the router where the path ... (more)
Could you please enable SACK on both endpoints and do the capture again? An absence of SACK option makes loss recovery very inefficient.
What is the "sender" capture location? It has a bit strange IP TTL of 60. Is it several hops away from the endpoint or just non-usual TTL?
The sender is an AIX server which is why the TTL is unusual and starts at 60. The receiver is a Linux server with SACK already enabled. I will check the setting on the AIX sender.
Based on what you're seeing so far, what do you think is the most likely cause?
I need to take a closer look on it but actually it looks like micro-bursting with 1Gbit interface speed hitting a buffer or policer so strongly so it is causing bulk packet loss. At the same time recovery process is extremely slow because of SACK absence.
The 3-way handshake in the sender capture tells us that the AIX sender doesn't support SACK. It's possible that SACKs may not help in this case - but as @Packet_vlad suggests, SACK is more efficient in general and so you should enable it if you can.
The MSS=1380 in the server's SYN-ACK is a strong clue that there's a Cisco ASA firewall in the path.
The minimum RTT of 68.1 ms means that the client and server are relatively far apart.
Thanks for this very interesting capture. There are a couple of elements to the problem.