
TCP DUP ACK/TCP Retransmission flood my network

asked 2019-10-23 08:02:51 +0000

loic

Hi, I did a Wireshark capture and discovered that I have a huge number of TCP Dup ACKs on my network. It occurs mainly with a WSUS server in Hyper-V on a ProLiant ML30 Gen9. In this capture I was using VNC (port 5900) to communicate with WSUS (IP 10.5); my IP is 10.100. I get tons of red and black lines on my VLAN 10 even when I don't use VNC. This causes big spikes across my whole network, with Internet disconnections and connections to local servers dropping. 10.2 is the gateway for VLAN 10.

Bare-metal ProLiant ML30 Gen9 with 2 NICs: the 1st NIC (10.7) is for Windows Server 2016 and iLO; the 2nd NIC is for Hyper-V running WSUS (10.5) in external mode.

Here is a capture: https://www.cloudshark.org/captures/1...

I already asked a network engineer, who spent 5 hours reducing broadcasts, configuring storm control and so on, on the switches (Cisco 2960 1Gb, running the latest firmware) without success.

I already get some "TTL expired in transit" replies when pinging my main router 0.1 (VLAN 1).

Does anyone have an idea?


2 Answers


answered 2019-10-23 22:27:25 +0000

SYN-bit

I totally agree with @Packet_vlad's analysis. Traffic towards 192.168.10.5 seems to be looping (L3) and forking. This causes each IP packet to 192.168.10.5 to reach 192.168.10.5 multiple times (with descending TTLs). This in turn causes DSACK messages from 192.168.10.5 telling 192.168.10.100 that it received the same segment multiple times.

From your descriptions and the trace file I understand the following:

  • There is one HP ML30-gen9 server with two NICs (and an ILO-NIC)
  • NIC1 has mac HP_f8:d4:d1 and IP address 192.168.10.6
  • NIC2 has mac HP_f8:d4:d0 and IP address 192.168.10.7
  • Both NICs and IP's belong to the same bare-metal Win2016 server, which also runs Hyper-V
  • There is a VM running in Hyper-V with the IP address 192.168.10.5 running WSUS

What is strange is that the ICMP TTL exceeded message has source IP address 192.168.10.6, yet it has the MAC address of 192.168.10.7 and a TTL of 127. Other packets from 192.168.10.6 have a TTL of 128. So the ICMP TTL exceeded packet from 192.168.10.6 was routed over 192.168.10.7. I don't have any Hyper-V experience, but this is something that needs to be looked into.
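The TTL reasoning above can be sketched in a few lines. This is only an illustration: it assumes the sender starts from one of the common initial TTL values (64, 128 or 255), which the trace itself does not prove, and `estimate_hops` is a hypothetical helper name.

```python
# Common initial TTL values used by IP stacks (assumption: the sender
# uses one of these; Windows hosts typically start at 128).
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl: int) -> int:
    """Estimate how many routing hops decremented the TTL before capture."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    # Pick the smallest common initial TTL that is >= the observed value.
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl


print(estimate_hops(128))  # 0: captured on the sender's own segment
print(estimate_hops(127))  # 1: the packet crossed one routing hop
```

Under that assumption, a TTL of 127 on a packet from a host that otherwise sends with 128 points to exactly one routing hop between the sender and the capture point.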

Also, since there are 2 NICs connected to the switch in the same VLAN, those NICs should either be teamed (with or without LACP), configured on both the HP and the switch, or be put in an Active/Standby configuration in which the host (Win2016) ensures that only one interface is active. This does not seem to be the case in your situation, as I see traffic coming from both MAC addresses.


Comments

Thanks for helping.

I knew that Hyper-V was a big part of the problem, but it is not the only thing creating this issue.

If I fix the problem without running WSUS, I will try to move it to another VLAN, let's say VLAN 99, and add a rule to my FW to allow computers to update through my WSUS.

I turned off Hyper-V and deactivated the 2nd NIC this morning; now there is only 10.7 (WS2016 with AD, ...) running on the ProLiant ML30.

I am still losing a lot of packets:

https://www.cloudshark.org/captures/9...

I sometimes lose the ping to 0.1 (my main router), and the ping very often goes over 20 ms.

As I am monitoring the ping to 0.1 and google.com with a ping plotter, I would advise filtering these packets out of this capture (not icmp).

The offices are closed ...

loic ( 2019-10-24 06:51:06 +0000 )

In the new trace I do not see pings (to 192.168.0.1) being lost or over 20ms. So if you do see that in your client, then there may be a problem on your client.

SYN-bit ( 2019-10-24 12:28:03 +0000 )

My 2nd NIC on the HP ProLiant was misconfigured for Hyper-V; I should have unchecked everything except the Hyper-V protocol.

So the 1st NIC (10.7) was OK, while the 10.6 (2nd NIC) was a bad idea.

But I still have a problem: we unplugged everything today and plugged a laptop directly into the Cisco 881, the main router (pool 192.168.0.1/24 -> public IP). Our laptop was 192.168.0.240/24 with GW 192.168.0.1, and we got pings of 1-5 or 10 ms.

My network engineer said it wasn't a big problem; I think it was.

How can a ping be so high when the two are connected directly to the same device?

I noticed a high number of requests on my public IP for video conferencing (LifeSize, for 213.xxx.xxx.218) when capturing with Wireshark on VLAN 1.

We are like:

  • 213.xxx.xxx.220 ...
loic ( 2019-10-24 14:00:22 +0000 )

This seems to be a different question, if so please ask a new one.

grahamb ( 2019-10-24 14:15:42 +0000 )

"My 2nd NIC on the HP ProLiant was misconfigured for Hyper-V; I should have unchecked everything except the Hyper-V protocol.

So the 1st NIC (10.7) was OK, while the 10.6 (2nd NIC) was a bad idea."

Does this mean the problem with the ACK storms is now solved?

"But I still have a problem: we unplugged everything today and plugged a laptop directly into the Cisco 881, the main router (pool 192.168.0.1/24 -> public IP). Our laptop was 192.168.0.240/24 with GW 192.168.0.1, and we got pings of 1-5 or 10 ms.

My network engineer said it wasn't a big problem; I think it was."

Ping packets to networking devices are handled by the CPU, not the data plane. And networking devices don't give priority to responding to ping packets. So it is not uncommon that response ...

SYN-bit ( 2019-10-24 20:06:32 +0000 )

answered 2019-10-23 08:30:08 +0000

updated 2019-10-23 08:36:12 +0000

From the very high packet rate and the TTL not decreasing on a per-packet basis, I guess you have a switching loop. Please review redundant links and STP settings.

UPD. Sorry, I missed the monotonically increasing IP ID field. This shows us that packets are not being looped, but that new ones are being injected. Now it looks more like a bug in the FW.
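The distinction drawn here (repeated IP IDs suggest the same packet going round a loop, steadily increasing IP IDs suggest freshly generated packets) can be sketched as follows. `classify_ip_ids` is a hypothetical helper, it expects the IP IDs of one sender in capture order, and it is only meaningful over a short window, since the 16-bit IP ID wraps around and not every stack increments it by one.

```python
from collections import Counter
from typing import Iterable


def classify_ip_ids(ip_ids: Iterable[int]) -> str:
    """Classify a one-direction packet stream by its IP ID sequence.

    Returns "looped" if any IP ID repeats (the same packet seen again),
    "new packets" if every ID is distinct and increasing (modulo 2**16),
    and "unclear" otherwise.
    """
    ids = list(ip_ids)
    if len(ids) < 2:
        return "unclear"  # not enough packets to say anything
    if any(count > 1 for count in Counter(ids).values()):
        return "looped"
    # Forward distance between consecutive IDs, allowing 16-bit wrap-around.
    steps = [(b - a) % 65536 for a, b in zip(ids, ids[1:])]
    if all(0 < step < 32768 for step in steps):
        return "new packets"
    return "unclear"


print(classify_ip_ids([100, 101, 102, 103]))  # new packets
print(classify_ip_ids([100, 100, 100, 101]))  # looped
```

A "new packets" result at one capture point does not rule out a loop elsewhere on the path; it only says that the copies seen at that point were generated separately.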


Comments

Hi, STP is already enabled on the switches with RPVST, and some older switches have PVST only.

There is nothing strange in the switch logs except a lot of dropped packets due to this flooding, I guess. Should I enable STP Loop Guard too?

I moved the server to another switch last week and I still have the problem.

I changed the FW too (a fresh pfSense install) 2 weeks ago, with new NICs and a new CPU.

I unplugged all unnecessary switches as we are closed at the moment, so I am running only the minimum.

loic ( 2019-10-23 09:09:44 +0000 )

Could you please share a network diagram (even a simple one, just to see the traffic path and endpoints) and the capture point location?

Packet_vlad ( 2019-10-23 09:21:09 +0000 )

Here is the image:

https://imgur.com/Qc9qkq5

My FW is running automatic routing rules; I removed everything else: traffic shaper, Squid proxy...

Last week, everything was connected to SW #2, and I had the problem too.

Then I moved the ProLiant ML30 server to another room and connected it to another switch (#3), but the problem is still there.

There is a 4th switch, a 3750, connected to SW2, but there are no routes defined on it and STP is on.

I unplugged all the other switches last week to run some tests.

I think this problem appeared when I replaced some 100 Mb 2960s with 1 Gb 2960s.

Maybe the problem was there before, but it has been amplified since I replaced the switches with faster ones.

There is some equipment running on VLAN 11 that uses multicast to ring the school bell every hour ...

loic ( 2019-10-23 14:07:27 +0000 )

Thanks for the detailed information, I will take a look soon. By FW I meant firmware, not firewall; that could have been misleading.

Packet_vlad ( 2019-10-23 18:10:20 +0000 )

Is it possible for you to arrange a packet capture on the Hyper-V side?

To me it looks like this:

Packets from 10.100 are getting routing(?)-looped on the path to the 10.5 server (but we don't see these looped packets because of the capture point location; they do not come back to 10.100).

The server, being bombarded with looped packets, responds as it should, issuing the DupAcks that we're observing. From time to time, when the TTL of 10.100's looped packets reaches 1, the NIC with IP 10.6 comes into play and issues ICMP TTL exceeded.

So we have to find the source of the routing loop. There IS a loop whose consequence we're observing, even if we don't see the loop itself.

So we're hunting for: tons of packets from 10.100 with decreasing TTL arriving at the Hyper-V NIC.
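One possible way to hunt for this in a capture taken on the Hyper-V side is to export the IP ID and TTL of the relevant packets and group them. The sketch below assumes the fields were exported with something like `tshark -r hyperv.pcapng -Y "ip.src == 192.168.10.100 && ip.dst == 192.168.10.5" -T fields -E occurrence=f -e ip.id -e ip.ttl` (the file name is hypothetical), which prints one tab-separated `ip.id`/`ip.ttl` pair per packet. The helper names and the sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def ttl_by_ip_id(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Group TTL values by IP ID from tab-separated "ip.id<TAB>ip.ttl" lines."""
    seen: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2 or not all(parts):
            continue  # skip blank or incomplete lines
        ip_id = int(parts[0], 0)  # accepts "0x1a2b" as well as decimal
        seen[ip_id].append(int(parts[1]))
    return dict(seen)


def looped_packets(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Return IP IDs that arrived more than once with strictly falling TTL."""
    result = {}
    for ip_id, ttls in ttl_by_ip_id(lines).items():
        falling = all(a > b for a, b in zip(ttls, ttls[1:]))
        if len(ttls) > 1 and falling:
            result[ip_id] = ttls
    return result


# Made-up sample: one packet seen three times, one seen once.
sample = ["0x1a2b\t128", "0x1a2b\t127", "0x1a2b\t126", "0x1a2c\t128"]
print(looped_packets(sample))  # {6699: [128, 127, 126]}
```

If the traffic is also forking, copies may arrive out of order, so relaxing the "strictly falling" check to "more than one distinct TTL" may be more useful in practice.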

Packet_vlad ( 2019-10-23 18:50:40 +0000 )


Stats

Asked: 2019-10-23 08:02:51 +0000

Seen: 5,617 times

Last updated: Nov 07 '19