Windows 10 stops answering TCP packets
I have been hitting my head on this problem for the past few days.
I have two devices connected to the same network: an ESP32 running the lwIP network stack over Wi-Fi, and a Windows 10 Pro machine on Ethernet. The ESP32 sets up a TCP server and the Windows machine connects to it. The communication works well for a long time, but at some point the Windows machine stops acknowledging incoming packets.
The capture was taken on the Windows 10 machine. 192.168.4.42 belongs to the ESP32 and 192.168.4.245 is the Windows 10 PC.
What could be the reason for this behavior? Is there something in this trace that shows the reason why the PC stops responding?
I tried disabling the Windows firewall, but this did not solve the problem. If the ESP32 does not receive a message from the PC within a certain time, it terminates the connection; you can see this happening in the trace. Increasing this timeout only postponed the disconnect.
A network trace can tell you _what_ is happening, not _why_. In this case the application log could be more informative.
Shouldn't Windows handle the TCP acknowledgements, rather than the application?
Sometimes a network capture can point to application issues, but it can also mislead you.
What specific "time" value have you increased? Does the Win10 machine always appear to stall at one hour (3600 seconds)? In this particular capture the Windows machine appears to stop responding after 3600 seconds. Over the next few seconds the ESP32 sent a few more packets along with a few TCP keep-alives before aborting the connection with the first RST. Interestingly, 432 ms after the ESP32 sent the first RST, the Win10 machine finally responded, ACKing all of the ESP32's data and sending 424 bytes of new data to the ESP32, but by then it was too late.
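To check whether the stall always lands at the same point, it may help to measure it across several captures instead of reading it off by eye. The following is only a rough sketch: it assumes the capture has been exported to a list of (timestamp, source address) records, and the `Packet` type and `longest_silence` function are hypothetical names made up for this illustration. The timestamps in the usage lines are invented sample values, not taken from the capture in the question.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    """One exported capture record (hypothetical format)."""
    time_s: float  # seconds since the start of the capture
    src: str       # source IP address of the packet


def longest_silence(packets: Iterable[Packet],
                    host: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the longest gap between two consecutive
    packets sent by `host`, or None if it sent fewer than two packets."""
    times = sorted(p.time_s for p in packets if p.src == host)
    if len(times) < 2:
        return None
    # Pair each timestamp with its successor and keep the widest pair.
    return max(zip(times, times[1:]), key=lambda gap: gap[1] - gap[0])


# Usage with invented sample values (NOT from the real trace):
sample = [
    Packet(0.0, "192.168.4.245"),
    Packet(0.5, "192.168.4.42"),
    Packet(1.0, "192.168.4.245"),
    Packet(4.0, "192.168.4.42"),
    Packet(9.0, "192.168.4.245"),
]
print(longest_silence(sample, "192.168.4.245"))
```

Running this over each capture and comparing where the longest silence from the PC starts would show whether the 3600-second mark is a pattern or a coincidence.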
The application on both sides shuts down the connection if it does not receive a packet within a configurable time. This is the timeout that I increased. I have enabled the TCP_NODELAY option on both sides. One thing I tried was running the client-side application on Linux; in that case it was still connected after 24 hours. So either the ESP32 does something wrong that confuses the Windows driver, or the way the application uses the TCP socket is incorrect and causes this behavior on Windows, or there is a bug in the Windows driver, which seems unlikely.
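For reference, the client-side behaviour described here (TCP_NODELAY plus an application-level idle timeout) could look roughly like the sketch below. This is not the actual application code: `configure_socket`, `IdleWatchdog` and `receive_loop` are hypothetical names, and the 1-second poll interval is an arbitrary assumption. The log lines carry a wall-clock timestamp so that the application log can be lined up against the network capture.

```python
import socket
import time


def configure_socket(sock: socket.socket) -> None:
    """Disable Nagle's algorithm (TCP_NODELAY), as both sides are said to do."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


class IdleWatchdog:
    """Tracks how long it has been since data last arrived (hypothetical helper)."""

    def __init__(self, limit_s: float, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock
        self.last_rx = clock()

    def feed(self) -> None:
        """Call whenever recv() returns data."""
        self.last_rx = self.clock()

    def idle_for(self) -> float:
        return self.clock() - self.last_rx

    def expired(self) -> bool:
        return self.idle_for() > self.limit_s


def receive_loop(sock, watchdog, on_data, log=print) -> str:
    """Read until the peer closes, the connection resets, or the idle limit passes.

    Returns "closed", "reset" or "idle" so the caller can log the reason.
    """
    sock.settimeout(1.0)  # assumed poll interval; wakes up to check the watchdog
    while True:
        try:
            data = sock.recv(4096)
        except socket.timeout:
            if watchdog.expired():
                log(f"{time.time():.3f} idle for {watchdog.idle_for():.1f}s, closing")
                return "idle"
            continue
        except ConnectionResetError:
            log(f"{time.time():.3f} connection reset by peer")
            return "reset"
        if not data:
            log(f"{time.time():.3f} peer closed the connection")
            return "closed"
        watchdog.feed()
        on_data(data)
```

If the application log shows the receive loop still running while the capture shows no ACKs leaving the PC, that would point away from the application and towards the stack or driver; if the loop itself is blocked elsewhere, the application is the more likely suspect.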
The issue does not always occur at 1 hour. Sometimes it happens within minutes, sometimes it takes hours. It seems related to congestion on the network: a busier network shortens the time it takes to occur.