PROBLEM DEFINITION:

Conversations between Server1 and Server2 are fraught with "TCP Out-of-Order" messages. Server1 is in VLAN X while Server2 is in VLAN Y. I run Wireshark from a laptop connected to a switchport on which SPAN is enabled.

PROBLEM ENVIRONMENT:

  1. All three devices (Server1, Server2, and Wireshark laptop) are connected to a Cisco 6509 switch.
  2. Communications between Server1 and Server2 traverse a firewall services module (FWSM) on the Cisco 6509.
  3. Server1 is an application server while Server2 is a SQL server.

OBSERVATIONS:

  1. When I run Wireshark from Server1 (instead of the laptop), I do not receive the Out-of-Order messages.
  2. When I move Server 1 into VLAN Y (so that Server1 and Server2 are in the same VLAN/subnet), and then run Wireshark from the laptop, I do not receive the Out-of-Order messages.

I have upgraded the FWSM on the Cisco 6509 to no avail. Does anyone have any thoughts/suggestions/feedback? I have been troubleshooting this issue for over a week and have ordered the Wireshark Network Analysis book from Amazon to further my research of this issue. Book has not arrived yet and so I'm sending this plea for help!

asked 25 Oct '10, 14:13

JamesClassV
1222
accept rate: 0%


Until the book gets there...

(FYI - I always check to see if the packets are truly out-of-order as some Cisco boxes duplicate the packets when spanning is enabled - since you don't see the condition when you move Server 1 to VLAN Y, then we have to assume this is not the issue and they truly are out-of-order. Always good to check though. Honestly, I'd rather see you have a full-duplex tap in front of each of the servers rather than rely on the span information.)

Your observations seem to indicate that the condition occurs when crossing from one VLAN to another VLAN. Got any load balancing set up along the path(s)? Any packet expediting?

There's an interesting write-up about troubleshooting FWSM performance at http://isamology.blogspot.com/2010/02/troubleshooting-fwsm-performance.html. Out-of-order packet issues are addressed in that article. My finger would be pointing at the FWSM as it does all sorts of freaky things to TCP traffic (see the Wireshark Tip #47 about 4 NOPS at http://www.wiresharktraining.com/tips-41-60.html.)

answered 25 Oct '10, 20:29

lchappell ♦
1.1k2728
accept rate: 8%

I'll second the vote on FWSM doing screwy things with packets. Is the 6500 also routing between the VLANs, or are you doing inter-VLAN routing on another device? Are you SPANning on the trusted or untrusted side of the FWSM? Does Server1 have a super-duper NIC that handles buffering and pre-processing of packets? I agree that a TAP will let you see exactly what's going down the wire. SPANs are red-headed step children from a switch processing priority point of view.

answered 27 Oct '10, 06:41

GeonJay
4554820
accept rate: 6%

Hey! What's wrong with red-heads? <g> Just kidding. You know no matter how much I scream to use a tap rather than span it doesn't seem to get across... sigh.

(29 Oct '10, 22:55) lchappell ♦

You didn't tell us how you're actually spanning on the switch. Hopefully, you didn't span the entire VLAN and chose to span the incoming or exiting port (not the bump in the wire vlan for the FW module).

The way to tell is by looking at the IP ID field. See if you see duplicate IDs. Are they the same? If so it's nothing more than duplicate packets causing Wireshark grief. You can use "editcap -d origfile newfile" to get rid of duplicates. This assumes that IP, MAC, or VLAN headers are the same.

When you have a "MASSIVE...." amount of OO packets, it's rarely as bad as it seems. Check for the duplicate IP ID.
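The duplicate check can also be scripted once the relevant fields are exported from the trace. Below is a minimal sketch, assuming each packet has been reduced to a (frame number, source IP, destination IP, IP ID) tuple; that tuple layout, the function name, and the sample addresses are all made up for illustration. It only helps when the sender actually varies the IP ID from packet to packet.

```python
from collections import defaultdict


def find_duplicate_ip_ids(packets):
    """Return {(src, dst, ip_id): [frame numbers]} for keys seen more than once.

    `packets` is an iterable of (frame, src_ip, dst_ip, ip_id) tuples.
    This layout is a hypothetical export format, not a Wireshark API.
    """
    seen = defaultdict(list)
    for frame, src, dst, ip_id in packets:
        seen[(src, dst, ip_id)].append(frame)
    # Keep only the keys that occur in more than one frame.
    return {key: frames for key, frames in seen.items() if len(frames) > 1}


# Hypothetical sample: frame 3 repeats the IP ID of frame 1 in the same direction.
sample = [
    (1, "10.0.1.10", "10.0.2.20", 0x1A2B),
    (2, "10.0.1.10", "10.0.2.20", 0x1A2C),
    (3, "10.0.1.10", "10.0.2.20", 0x1A2B),
    (4, "10.0.2.20", "10.0.1.10", 0x1A2B),  # reverse direction, not a duplicate
]

dupes = find_duplicate_ip_ids(sample)
for (src, dst, ip_id), frames in dupes.items():
    print(f"{src} -> {dst} IP ID 0x{ip_id:04x} seen in frames {frames}")
```

If most of the frames flagged as Out-of-Order show up in this list, the capture is most likely seeing SPAN duplicates rather than real reordering.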

answered 01 Nov '10, 14:33

hansangb
7661619
accept rate: 10%

Hello hansangb

I am having a very similar problem to what is being discussed in this thread, in that some of the traffic off of the client SAN is loaded with OOO frames, as well as Dup ACKs and a few retransmissions here and there.
I dug into a trace a minute ago after reading your response, and what I see repeatedly is an ID field of all zeros on the OOO frames. What does this mean? Thanks...

Oh and by the way, the issue is that devices on the SAN are intermittently disconnecting. The SAN configuration has ESX hardware that connects to a SAN to get to its VMDK's and its storage.

(01 Feb '11, 11:02) kmnruser

Hello Laura,

Thank you very much for your response. I will thoroughly investigate the articles you referenced and aggressively pursue them. I already implemented the sysoption np completion-unit command early in my troubleshooting effort.

In the meantime, I have one follow-up question: how do you explain that the condition does not manifest when I capture packets from Server 1 instead of the laptop connected to a span port? In other words, with everything being the same (each server is in its own VLAN), I simply change the packet collection point from a third party (laptop) to the Server 1 itself. When I do this, I no longer receive the Out-of-Order condition.

I have lost sleep over this issue and, along the way, I have been inspired to pursue the Wireshark Network Analyst certification as a result. Thank you very much Laura.

answered 26 Oct '10, 09:00

JamesClassV
1222
accept rate: 0%

Here's what I see in my head (correct me if I am wrong).

  • Scenario 1: Capture on Server 1 in VLAN X - no OoOs (Out-of-Orders)
  • Scenario 2: Capture on laptop (spanned port) Server 1 in VLAN X - OoOs
  • Scenario 3: Capture on laptop (spanned port) Server 1 in VLAN Y - no OoOs

My experience with port spanning has yielded such crappy results in the past because of poor implementations by the switch vendors, that I just don't trust it. Can you get a tap in there? It would be good to know what's REALLY crossing the network rather than relying on the switch to forward traffic down a switch port.

What is the main issue though? Are you seeing performance issues? Is the communication slower/faster when you move Server 1 to VLAN Y?

I can look at a trace if you'd like. Take a deep breath... get some sleep and hang in there!

answered 26 Oct '10, 09:12

lchappell ♦
1.1k2728
accept rate: 8%

Hello Laura,

You are correct on the two scenarios:

  • Scenario 1: Capture on Server 1 in VLAN X - no OoOs (Out-of-Orders)
  • Scenario 2: Capture on laptop (spanned port) Server 1 in VLAN X - OoOs

Scenario 3 is something I have never pursued. We are trying to resolve a performance issue with the application -- with 30 concurrent users, we experience degradation.

When I moved Server 1 to VLAN Y, we didn't have enough time to conduct a performance test, just enough time to capture Wireshark packets (it's a production environment).

How can I send you my trace file? Would I be able to send it to you by e-mail directly?

Thanks Laura.

answered 26 Oct '10, 11:21

JamesClassV
1222
accept rate: 0%

You can send the trace over to [email protected] and it will get forwarded to me today.

(26 Oct '10, 11:50) lchappell ♦

Thank you Laura and GeonJay:

I spent the last few days getting all the equipment I needed to implement a tap. With the tap in place, the "Out-of-Order" messages are gone and I now see "TCP Retransmission" errors instead. This outcome is explained in Chapter 13, Using Wireshark's Expert System (page 277): Wireshark sometimes flags a packet as Out-of-Order when it is actually a retransmission, because it is unable to relate the retransmission to an earlier packet.
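To make that distinction concrete for myself, here is a rough sketch of how a capture that misses the original segment can turn a retransmission into an apparent Out-of-Order. The rules below are a deliberately simplified assumption of mine and are not Wireshark's actual heuristics; the function name and the sequence numbers are invented for the example.

```python
def classify_segments(segments):
    """Label each (seq, length) data segment of one TCP direction.

    Simplified rules, assumed for illustration only:
      - "retransmission": this exact (seq, length) was already captured
      - "out-of-order":   seq is below the highest byte seen so far, but the
                          segment itself was never captured before
      - "in-order":       everything else
    """
    seen = set()
    next_expected = None  # one past the highest sequence number seen so far
    labels = []
    for seq, length in segments:
        if (seq, length) in seen:
            labels.append("retransmission")
        elif next_expected is not None and seq < next_expected:
            labels.append("out-of-order")
        else:
            labels.append("in-order")
        seen.add((seq, length))
        end = seq + length
        if next_expected is None or end > next_expected:
            next_expected = end
    return labels


# Capture that saw the original segment 101 and later its retransmission.
full_capture = [(1, 100), (101, 100), (201, 100), (101, 100)]
# Capture that missed the original segment 101 (e.g. dropped by the capture path).
lossy_capture = [(1, 100), (201, 100), (101, 100)]

print(classify_segments(full_capture))
print(classify_segments(lossy_capture))
```

In the first trace the repeated segment is labeled a retransmission; in the second, the same segment on the wire is labeled out-of-order only because the capture never saw the original.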

To answer GeonJay's question: Server1 and Server2 are connected to a Cisco 4506 which uplinks to a Cisco 6509, which facilitates inter-VLAN routing and provides FWSM functionality. I tapped into the uplink using a Netscout Multimode Fiberoptic splitter, which required me to install a Fiber NIC card (hard to find!) into my Wireshark computer.

Laura, I e-mailed you the trace. Unfortunately, I had to filter it due to its size. In the event it doesn't contain the information you seek, please let me know and I will send another trace. I look forward to solving this mystery!

answered 29 Oct '10, 14:28

JamesClassV
1222
accept rate: 0%

Ahh... interesting. Ok! I'll go grab the trace file and look at it tomorrow - I will let you know what I can discern/possible next steps soon (most likely over the weekend as I am curious).

(29 Oct '10, 22:53) lchappell ♦

Wireshark can misidentify retransmissions as well as OOOs, so be careful.

So you are using the 6509 as a router on a stick. Both servers connect to the same 4506, which uplinks to the 6509. Is this correct? What's the volume of traffic going across the uplink? Have you checked CPU/mem utilization on both switches? Be careful how you monitor utilization. I would gather a high resolution utilization baseline for both switches (grab an hour of info or so) then run your transaction and see what kind of impact it has.

I'd also verify that MTUs are set consistently across the tiers.

(02 Nov '10, 06:35) GeonJay

Hi Laura,

I just wanted to drop a note to follow-up to see if you've had a chance to look at the capture file.

Hello GeonJay, CPU utilization is low, below 10% consistently. Server MTUs are set to default (1500 bytes). I'll verify that the switches are set the same.

--James

answered 10 Nov '10, 15:12

JamesClassV
1222
accept rate: 0%

James, it's very important to distinguish real packet loss from fake loss (e.g. caused by overloading the span port, or by WinPcap not being able to keep up). In most cases, analysis of the trace file will provide the hints necessary to determine whether the loss is fake or not. This is so common that one of my case studies at Sharkfest dealt with this exact scenario. Feel free to email me the trace if you want and I'll take a look as well. Please use "editcap -s 128 origfile newfile" to truncate each packet to just 128 bytes. My email is: hbae at nyc.rr.com
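If you end up preparing several trace files, the editcap calls mentioned in this thread can be assembled with a small helper. This is only a sketch: the helper name and its parameters are made up for illustration, and it assumes editcap is installed and on the PATH. The -s (snapshot length) and -d (remove duplicates) options are the ones quoted above.

```python
import shlex


def editcap_command(infile, outfile, snaplen=None, dedup=False):
    """Build an editcap argument list (hypothetical helper, not part of Wireshark).

    snaplen -> adds "-s <snaplen>" to truncate each packet to that many bytes
    dedup   -> adds "-d" to drop duplicate packets
    """
    cmd = ["editcap"]
    if dedup:
        cmd.append("-d")
    if snaplen is not None:
        cmd += ["-s", str(snaplen)]
    cmd += [infile, outfile]
    return cmd


# Print the command line instead of running it, so it can be checked first.
print(shlex.join(editcap_command("origfile", "newfile", snaplen=128)))
print(shlex.join(editcap_command("origfile", "newfile", dedup=True)))
```

To actually run a command, the returned list can be passed to subprocess.run(cmd, check=True).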

(11 Nov '10, 08:15) hansangb
Asked: 25 Oct '10, 14:13

Seen: 32,091 times

Last updated: 01 Feb '11, 11:02