This is a static archive of our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

iSCSI Format Failure from Windows Server 2012

0

I don't expect anyone to be able to help me with this, but I didn't expect any help last time I posted a message yet was pleasantly surpised to get a very spot-on answer.

This one is a bit tougher though.

I've been working on an iSCSI server, mostly just for fun. Been chipping away at it in my spare time. I've managed to get through a lot of things. The Device is seen by Windows Server 2012. I can connect to it. I can initialize a disk. I can create partitions on the disk as far as I can tell (windows's own cache could be lying to itself about some of those things). But the actual formatting of the volume is where I run into trouble.

I've implemented all forms of ModeSense, ReadCapacity, Read, Write, Data-In, and Data-out. Also implemented a task-management class which seems to be functioning as intended.... but by golly, I just have no idea why this is still failing.

Some clues and relevant information:
1) For starters, here's the capture

2) Windows Server 2012 Successfully completes the "Gather information" step and the "Create new partition" step, but during the "Format Volume" Step reports "Failed to format volume - One or more parameter values passed to the method were invalid."

3) The failure occurs immediately AFTER Packet Number 6384 which I verified with breakpoints.
4) Packet 6384 is the longest Read operation during the process (128 blocks -- 65536 bytes).
5) This Read operation follows a write operation to the same LBA of the same length (128 blocks -- 65536 bytes)
6) The write operation is also the longest write operation of the entire process up to the point of failure. It is the first write operation to be >= 128 blocks.
7) There are other multi-block read and write operations that do not return errors, which to the best of my knowledge point the finger away from the Task-management system. I'll try to briefly and succinctly describe how the task management system works
7a).The long writes will send up to 8192 bytes in the first request, but the "Final" flag in the PDU will NOT be set.
7b). The Absense of the "Final" flag causes a task to be registered in a list, indexed by the "InitiatorTaskTag"
7c). "DataOut" PDUs follow the Write request and reference the InitiatorTAskTag which keeps information about the origin of the write request. The data-out packets are written using an offset of the original Task.
7d). If the DataOut packet has the "Final" flag set, the task is cleared at the end of its operation.

All of that internal stuff kinda seems to be working, but the format still fails. I've been looking at this problem for quite some time and I'm out of ideas and will need to shelve it for a while. Unfortunately, I think that any one of these 6000+ packets could potentially be causing the problem. I wish windows would just tell me WHICH parameter is incorrect.

asked 22 Jul '13, 15:36

ev1lr0b0t's gravatar image

ev1lr0b0t
1224
accept rate: 50%

edited 22 Jul '13, 15:44

The failure occurs immediately AFTER Packet Number 40943

The posted capture file ends at frame 6442

but I didn't expect any help last time I posted a message

BTW: What was your user name last time?

(22 Jul '13, 15:43) Kurt Knochner ♦

I fixed the errant packet numbers in an edit. The exported packet numbers were different than what was displayed on my screen. I was unable to login with the account I used before for some reason. evilrobot.

(22 Jul '13, 15:46) ev1lr0b0t

BTW thanks for looking. 6384 is the last packet before the error (a reply to a Read(10) request that is 128 blocks. --Jason

(22 Jul '13, 15:50) ev1lr0b0t

6 Answers:

0

The wireshark capture might not show it. But after some prodding and poking and creating volumes of varying sizes, I determined that my WRITE operations were not writing the proper amount of data. I discovered this by comparing the subsequent READ operation that followed the WRITE operation. The fact is that the write operation would actually only write the first block to disk.

answered 24 Jul '13, 11:29

ev1lr0b0t's gravatar image

ev1lr0b0t
1224
accept rate: 50%

would you mind to post one capture file with a failure and one with a successful operation?

I wonder if the problem would be visible in the capture file. This would be a good chance to learn something more about iSCSI (for me) ;-)

(25 Jul '13, 11:28) Kurt Knochner ♦

I got sorta past that point, but unfortunately I don't yet have 100% success. My current issue is as big of a pain in the butt as my previous issue. Part of my problem is that I cannot find any sample captures that show Read() operations that exceed the MaxRecvDataLength value negotiated during login. It kinda seems like that for some reason my login negotiation is being ignored by the client, even though it matches the sample I used practically byte for byte. When talking to the Target that I reverse engineered, the client never requests more than 128 blocks of data per read (65536 bytes) which also matches the MacRecvDataLength negotiated during login. However, when talking to MY target, it will request up to 256 blocks, and my response to this request causes the client to reset the connection with no warnings or messages logged. I respond with Data-In followed by Data-In with the Final bit set. (2 PDUs). Maybe I need to try 3 PDUs sending a separate PDU for the status.... I dunno... running out of ideas.

(25 Jul '13, 11:34) ev1lr0b0t

0

In frame 6385 the 'client' (172.16.0.82) sends a SCSI Mode Sense. This packet is ACKed in the next frame. However, then the 'server' (172.16.0.33) does not react for 119 seconds!! Then it ACKs a packet that is not in the capture file (see frame 6387). This is totally different to previous behavior in the capture file. If we can assume that the capture file is complete (can we?), there seems to be a bug in the iSCSI server implementation. However, I don't think you will find the reason for the problem just by looking at the capture file, as neither the inactivity for 119 seconds can be explained with the observed behavior in the capture file nor the ACK for an unseen packet.

Regards
Kurt

answered 22 Jul '13, 15:57

Kurt%20Knochner's gravatar image

Kurt Knochner ♦
24.8k1039237
accept rate: 15%

edited 22 Jul '13, 16:04

The 119 second pause is the breakpoint. I didn't notice the ACK issue, I'm not sure I ever handle them... I think they're handled at the TCP layer if I'm not mistaken, and therefore I have no control over them. Whether I break on it for 119 seconds or not.. it fails in the same place in the same way. So I'm not sure that'll help me. Maybe I can find some insight with your help, however. Thanks for looking. -- Jason

(22 Jul '13, 16:30) ev1lr0b0t

The 119 second pause is the breakpoint.

Ah, that's O.K. then.

I didn't notice the ACK issue, I'm not sure I ever handle them... I think they're handled at the TCP layer if I'm not mistaken,

Yes, and the TCP layer should not ACK something it did not receive, if that was the case !?! Anyway, I'm not sure if this could be caused by a bug in your code as well.

Whether I break on it for 119 seconds or not.. it fails in the same place in the same way.

Can you please post a capture file without a breakpoint and 'add' some information about the frame where you think the problem shows up?

(22 Jul '13, 16:38) Kurt Knochner ♦

I will, but probably won't be able to get to it until tomorrow. Thanks again.

(22 Jul '13, 17:35) ev1lr0b0t

119 seconds delay would cause the initiator's upper layer timer to expire and it should perform some type of SCSI layer error recover if a frame was not received.

However, if there there was some type of protocol error in the frame that was not captured then it would be normal of the initiator to close the TCP connection provided that the 2 devices were operating at error recovery level 0.

NOT the iSCSI standard requires devices that only support error level 0 to close the connection when a protocol error is encountered

(14 Nov '16, 03:10) LouD

0

We've run into the same issue when assigning SAS disks (LUN's) to a server. Turned out that when automount is disabled (which we had because we don't want to see all SAS disks as drive letters since they're mounted in directories) this happened. So enabling automount through diskpart fixed the issue immediately. After assigning the disks to the directories we disabled automount again and all was well.

answered 18 May '15, 03:44

binbash's gravatar image

binbash
61
accept rate: 0%

0

For what it's worth, there was a bug in version 1.0.37 (and presumably earlier) of "tgt" on linux that prevented Windows initiators from being able to format.

(http://lists.wpkg.org/pipermail/stgt/2013-May/012368.html)

This also seems to have caused problems with file contents for a CentOS 7 guest running with the iscsi device as a root disk in OpenStack. Other guests seem to be fine.

answered 25 May '15, 10:43

cbf123's gravatar image

cbf123
61
accept rate: 0%

0

Hi I faced the same issue and finally I found that using Microsoft bridged NIC was behind all issues, I removed all NIC from bridge and gave dedicated IPs, this reaolved it with immediate action. Regards,

Ayed

answered 13 Nov '16, 06:47

Ayed's gravatar image

Ayed
61
accept rate: 0%

0

Sorry, I am reposting this because I posted it in a wrong location

119 seconds delay would cause the initiator's upper layer timer to expire and it should perform some type of SCSI layer error recover if a frame was not received. Since there is no SCSI layer error recovery I am guessing the frame was not sent. Unless there was some type of iSCSI protocol error in the frame - then it would be normal of the initiator to close the TCP connection provided that the 2 devices were operating at error recovery level 0.

NOT the iSCSI standard requires devices that only support error level 0 to close the connection when a protocol error is encountered.

answered 14 Nov '16, 04:48

LouD's gravatar image

LouD
6112
accept rate: 0%

edited 14 Nov '16, 06:34

sindy's gravatar image

sindy
6.0k4851