This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

iSCSI Communication appears incorrect handshaking


I have a customer that's having performance problems. Looking at their trace it seems clear why. See first 50 packets at link below. This pattern basically repeats itself. http://cloudshark.org/captures/68a28bf0fd38

Client IP = 172.18.1.14, storage IP = 172.18.1.15. Both are on the same switch and the same VLAN. First, the client sends a read LUN request in frame 8. This looks fine, but I'm not sure why it keeps sending ACKs. Then in frame 17 the storage sends back the data requested in the read. However, there are a few things to note here: 1) This is an ACK to frame 8. 2) The PUSH bit is set. 3) The frame is not a full jumbo frame of 9014 bytes.

Normally, what I see when iSCSI is working efficiently is this:

1) The client sends a read LUN request.
2) The next frame from the storage is an ACK only (60-byte packet), with no data.
3) The next frame from the storage is the requested read data with a full 9000 bytes. The ACK bit is set, but not PUSH.

The storage is capable of sending larger frames, yet the largest one it sends in this trace is 7583 bytes, never a full 9000 bytes. The PUSH bit is set there as well.

I have some suspicions about the NIC on the client not acting correctly based on some cases I've had in the past where the delayed ACK wasn't working. But I would like more proof of what this network pattern is telling me.

asked 29 Jan '13, 12:13


gipper


One Answer:


Sorry to rain on your parade, but I think your interpretation is a bit off. Let's see if I can shed some light on the matter here.

So, in frame 8 there is a read request, and then you do NOT see any data packets coming in from the server, BUT the client increases its acknowledgment number in huge steps. For example, in frame 9 the relative ACK number is 155697, while the relative ACK number in frame 8 was 131121. That means that your client received 24576 bytes (155697-131121) between frames 8 and 9. Your trace does not contain the packets with those bytes, but since the client acknowledges them they MUST have been on the wire. In case you're wondering what "TCP ACKed unseen segment" means (starting in frame 19): it means exactly what I just explained - there were packets that are acknowledged but they're NOT in the trace.
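The subtraction above can be written as a small sketch. The function names are made up for illustration and are not part of Wireshark; the only numbers taken from the trace are the two relative ACK values from frames 8 and 9.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence/ACK numbers are 32-bit and wrap around


def acked_bytes(prev_ack, next_ack):
    """Bytes newly acknowledged between two ACK numbers of the same sender."""
    return (next_ack - prev_ack) % SEQ_SPACE


def missing_from_trace(prev_ack, next_ack, captured_payload_bytes):
    """Bytes that were acknowledged but never showed up in the capture.

    captured_payload_bytes is the TCP payload actually seen from the peer
    between the two ACKs (0 between frames 8 and 9 of this trace).
    """
    return acked_bytes(prev_ack, next_ack) - captured_payload_bytes


# Relative ACK numbers from frames 8 and 9
print(acked_bytes(131121, 155697))            # 24576
print(missing_from_trace(131121, 155697, 0))  # 24576
```

Any positive result from `missing_from_trace` means data was on the wire that the capture did not record.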

The next question that will come up is "yeah, but why does this message only show up in frame 19, when there were already unseen packets before that?". Good question. The answer is: because your trace starts right in the middle, not at the SYN, and before frame 17 there NEVER was a data frame coming back. Wireshark only reports "unseen segment" once it has seen at least one data segment from that direction. By the way, I think frame 17 only made it into the capture because it is not the full 9000 bytes. And it makes sense that PSH is set, because it carries the last remaining bytes of a larger chunk, most of which was sliced into 9000-byte segments. So it is the signal saying "this is it, process, please".
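A rough sketch of that slicing, under assumptions: the MSS of 8960 bytes (9000-byte MTU minus 20 bytes IP and 20 bytes TCP header, no TCP options) is not taken from the trace, and setting PSH only on the final segment is a simplification of what real stacks do. The 24576-byte chunk is just the amount acknowledged between frames 8 and 9.

```python
def segment(chunk_bytes, mss):
    """Split one application chunk into TCP segments.

    Returns a list of (payload_size, psh_flag) tuples. In this simplified
    model only the last, usually smaller, segment carries PSH.
    """
    segments = []
    remaining = chunk_bytes
    while remaining > 0:
        size = min(mss, remaining)
        remaining -= size
        segments.append((size, remaining == 0))
    return segments


# Assumed MSS for a 9000-byte MTU without TCP options
for size, psh in segment(24576, 8960):
    print(size, "PSH" if psh else "")
```

In this model the full-size segments are the ones a slow capture setup is most likely to drop, while the short trailing segment with PSH is more likely to survive.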

Final verdict: unfortunately, your trace is useless and cannot be used to diagnose any trouble with the iSCSI connection. Your capturing device was not fast enough to record the jumbo data frames to disk as they appeared, and was only able to write the small ACKs and the occasional non-full jumbo frame to disk. You cannot diagnose iSCSI - or any high-bandwidth shared storage protocol - with capture hardware that cannot write at least 120MB/s to disk, because that's what a full Gigabit link will slam your capturing NIC with if it is running at full throttle in one direction. It gets worse if you capture a Gigabit link at full throttle in both directions, because then you need to write about 240MB/s. So unless you captured with a fast RAID disk configuration or an SSD setup, you're not fast enough. A laptop with a single non-SSD disk will never be fast enough for this if the storage system gets going at full speed.
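The numbers above follow from simple arithmetic. The sketch below computes the raw upper bound (1 Gbit/s is 125 MB/s per direction); the "about 120MB/s" figure is that bound minus some framing overhead. The function name and the `utilization` parameter are illustrative only.

```python
def disk_write_rate_mb_s(link_gbit_s, directions=1, utilization=1.0):
    """Upper bound on the disk write rate (MB/s) for a full-frame capture.

    link_gbit_s: nominal link speed in Gbit/s
    directions:  1 if traffic flows one way, 2 if both directions are busy
    utilization: fraction of the line rate actually in use (0.0 to 1.0)
    """
    bytes_per_s = link_gbit_s * 1e9 / 8 * directions * utilization
    return bytes_per_s / 1e6


print(disk_write_rate_mb_s(1))                # 125.0
print(disk_write_rate_mb_s(1, directions=2))  # 250.0
```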

Sorry for the bad news, but it's better to know when the captured data is no good than spending hours and hours trying to make sense of what isn't there.

answered 29 Jan '13, 16:45


Jasper ♦♦

edited 29 Jan '13, 16:48

On a side note - because this often happens with setups where a VMware virtualization host (in your case an ESX) connects to an iSCSI device: make sure that the customer understands the ESX virtual network load balancing strategies. It usually doesn't help to aggregate multiple Gigabit cards to gain speed, because if the communication is Single-ESX-VMKernel-IP to Single-Storage-IP you'll only be using ONE link all the time, while the others are doing nothing.

This might help (in case you need it): http://kensvirtualreality.wordpress.com/2009/04/05/the-great-vswitch-debate%E2%80%93part-3/

(29 Jan '13, 16:55) Jasper ♦♦
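To illustrate the load balancing point, here is a simplified model of a hash-based uplink policy. It is an assumption for illustration, not the exact algorithm of any ESX version: the uplink is picked from the XOR of source and destination address, modulo the number of uplinks. Whatever the exact formula, a deterministic per-IP-pair choice always lands on the same link.

```python
import ipaddress


def uplink_for(src_ip, dst_ip, n_uplinks):
    """Pick an uplink index from the IP pair (simplified hash model)."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_uplinks


# One VMkernel IP talking to one storage IP over 4 aggregated links:
# every packet of this pair maps to the same uplink index.
choices = {uplink_for("172.18.1.14", "172.18.1.15", 4) for _ in range(1000)}
print(choices)  # {1}
```

Spreading traffic over several links in such a model requires more than one IP pair, for example multiple storage target addresses.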

with capture hardware that doesn't at least write 120MB/s to disk.

Wouldn't it be sufficient to capture only the first 120-200 bytes of each frame to analyze performance problems (at least as a first shot), so that much lower disk I/O bandwidth would be needed?

(30 Jan '13, 12:53) Kurt Knochner ♦

Yes, it would. Good point. I just mentioned the write speed because the sample capture was taken with full frame size.

(30 Jan '13, 12:56) Jasper ♦♦
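A back-of-the-envelope sketch of how much frame slicing helps, under assumptions: all frames are full 9014-byte jumbo frames, the link delivers about 120MB/s, and each stored packet costs an extra 16 bytes for the classic pcap per-packet record header. The function name is made up for this example.

```python
PCAP_RECORD_HEADER = 16  # bytes per packet in the classic pcap file format


def sliced_write_rate_mb_s(link_rate_mb_s, frame_bytes, snaplen):
    """Approximate disk write rate (MB/s) when only snaplen bytes per frame are kept."""
    frames_per_s = link_rate_mb_s * 1e6 / frame_bytes
    stored_per_frame = min(snaplen, frame_bytes) + PCAP_RECORD_HEADER
    return frames_per_s * stored_per_frame / 1e6


# Full jumbo frames at about 120 MB/s, keeping only the first 200 bytes of each
print(round(sliced_write_rate_mb_s(120, 9014, 200), 2))
```

Under these assumptions the required write rate drops from roughly 120MB/s to a few MB/s, which an ordinary laptop disk can sustain. With tcpdump the slice length is set with the `-s` option.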