Help please. The Situation: Eight identical Dell computers (1 yr old) with Windows 7 Pro SP1 connected via an HP switch (all wired) and sharing via a homegroup. There is one shared folder on Computer 'S' containing a commonly queried file; no person sits at or works on Computer 'S', and the query software prevents any two people from querying the file at the same time (i.e. no overlapping queries).
The Problem: If no one queries that file from their computer for 15 seconds or more, then the next person to query it from any computer will receive their results back fast (<2 sec), and the Wireshark capture indicates that most packets are like: TCP 1514 [TCP segment of a reassembled PDU]
However, in the 15 seconds after someone queries the file, one of two things happens: 1) If the same person (same computer) re-queries the database (with the same or a different query) within 15 seconds, they get fast results again (<2 sec), with TCP frames of length 1514 just as above. But... 2) If anyone else (a second person at a second computer) queries the database in less than 15 seconds, they get slow results (>20 seconds), and Wireshark captures frame pairs like: "SMB2 191 Read Request Len:1024 Off:24194048 File: S Drive\database.xyz" / "SMB2 1182 Read Response". If this second person sits and waits for 15 seconds instead of placing the query right away, they will again have a fast response (<2 sec) and will again see frames like "TCP 1514 [TCP segment of a reassembled PDU]".
Unfortunately, making each person wait 15 seconds since the last person's query is not feasible. We need to place queries every 5-10 seconds, and we need them to all be fast. As things are now, if we keep placing queries every 5-10 seconds apart, then the 15 seconds of wait time never passes, so all the queries take >20 seconds.
The problem is like a swinging "bridge"... if it's pointed at your computer, you get fast queries. But unless everyone else waits 15 seconds after your last query for the bridge to "free up", they will get a slow response. (Again, two overlapping queries are not possible; all queries are placed sequentially.) I suspect that Windows 7 is enforcing a policy that selects TCP via IPv6 for the first query (fast), and then selects SMB2 and NetBIOS (slow) for any subsequent queries unless 15 seconds is allowed to lapse from the end of the last query (either TCP or SMB2); the 15 seconds seems to allow it to go back to TCP. I've searched for documentation of such a Windows behavior, but can't find any, nor have I found any way to affect this behavior. But I'm pretty clueless, so I'd really appreciate any ideas you all have. Thank you for your consideration. I'd be happy to include more details from the Wireshark captures if it would be helpful.
Thank you very much.
asked 15 Feb, 15:27
Your question and the brief traces reveal multiple problems.
You have a database application and placed a file on an SMB share. Multiple clients compete for access to that file. Both the file server and the application have to make sure that the file contains valid data at all times.
SMB offers a feature called "opportunistic locking" (oplocks). Client A can send a lock request to the server to signal that it wants exclusive access to a certain region. If client B is holding that lock, the server has to send a break request to client B, and client A can only continue once client B releases the lock. How fast that happens depends on client B: there are situations where the application has to get into a proper state before the lock can be released, such as flushing buffers.
Unfortunately, this opportunistic locking only works well for a small number of clients: lock management can put a tremendous load on the server.
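To see why one client's latency is dictated by another client, here is a toy simulation in plain Python (not real SMB; the class name, timings, and flush delay are all made-up illustrations of the handoff described above):

```python
import threading
import time

class ToyOplockServer:
    """Toy model of the SMB lock handoff: client A must wait until the
    server has broken client B's lock and B has flushed and released."""
    def __init__(self):
        self._holder = None
        self._cond = threading.Condition()

    def acquire(self, client):
        with self._cond:
            while self._holder is not None and self._holder != client:
                # Server has sent a break to the current holder; wait
                # for that client to flush its buffers and release.
                self._cond.wait()
            self._holder = client

    def release(self, client):
        with self._cond:
            if self._holder == client:
                self._holder = None
                self._cond.notify_all()

server = ToyOplockServer()
server.acquire("B")                # client B gets the lock first

start = time.monotonic()
def b_flushes_then_releases():
    time.sleep(0.2)                # pretend B needs 200 ms to flush
    server.release("B")

threading.Thread(target=b_flushes_then_releases).start()
server.acquire("A")                # A blocks here until B releases
elapsed = time.monotonic() - start
print(f"client A waited {elapsed:.1f} s")   # ~0.2 s, dictated entirely by B
```

The point of the sketch: client A's response time is not under A's control at all; it is whatever the current lock holder needs to clean up.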
It would be advisable to use a real database server for this application. "Database" here means something like MS-SQL, MySQL, or any similar database that is reached through a dedicated network port. SQLite databases do not count, as they are implemented as a file, so the whole locking game still applies.
Server Performance / Application Behavior
Your "slow trace" shows that the client is sending a large number of requests, reading 1024 bytes at a time. Each request is answered within 1 millisecond, but 15,000 requests times 1 millisecond still adds up to a response time of 15 seconds.
The "fast trace" shows read requests of 24 kByte, where each request is also served within 1 millisecond. With the request count cut by a factor of 24, the same transfer should, in theory, complete within 1 second.
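The arithmetic is easy to check. A minimal sketch, assuming ~1 ms per request/response pair and a file size inferred from the 15,000 reads (both numbers are illustrative, taken from the discussion above, not measured):

```python
# Back-of-the-envelope check: total time = number of requests * latency.
FILE_SIZE = 15_000 * 1024        # ~15 MB, inferred from 15,000 x 1 KiB reads
LATENCY_S = 0.001                # ~1 ms per request/response round trip

def transfer_time(read_size: int) -> float:
    """Seconds to fetch the whole file with fixed-size sequential reads."""
    requests = -(-FILE_SIZE // read_size)   # ceiling division
    return requests * LATENCY_S

print(f"1 KiB reads:  {transfer_time(1024):.1f} s")       # 15.0 s (slow trace)
print(f"24 KiB reads: {transfer_time(24 * 1024):.1f} s")   # 0.6 s (fast trace)
```

The read size alone accounts for the entire difference between the two traces.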
Imagine you host a party and want to order pizza for you and your guests. In the "slow trace" you order one slice at a time, and the next order is only placed after the previous slice has been consumed. No wonder it takes all night until everybody is fed. The "fast trace" orders 24 slices at a time.
Depending on the number of guests it might still take some time to feed everybody. At least the guests stop raiding your kitchen ...
Your question mentions that all systems run Windows 7, which is a workstation OS. Workstations are not optimized to act as file servers, just as Windows Server editions are not optimized for desktop applications.
If all the tuning and analysis does not help, you might want to consider a test with a Windows Server version.
Why the small block sizes?
There are a few possibilities. One is definitely the application: you might want to talk to your vendor or developer and have them check how their code reads the file.
Another factor is the server. Since the server uses SMB2, the client needs credits to run I/O operations. My first check would be whether the client has sufficient credits.
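As a rough model of why credits matter (all numbers here are illustrative assumptions, not values from your trace): a client can only keep as many requests in flight as it has credits, so its throughput ceiling is roughly credits times read size divided by round-trip time.

```python
# Simplified model: an SMB2 client with N credits can pipeline at most
# N requests, each moving read_size bytes per round trip. This ignores
# the wire-speed cap and server-side limits; it's only a ceiling.

def max_throughput(credits: int, read_size: int, rtt_s: float) -> float:
    """Upper bound on bytes/second for a pipelined client."""
    return credits * read_size / rtt_s

rtt = 0.001  # ~1 ms round trip, as in the traces
# One credit with 1 KiB reads: the pathological case in the slow trace.
print(f"{max_throughput(1, 1024, rtt) / 1e6:.1f} MB/s")      # 1.0 MB/s
# A few credits with 24 KiB reads: much closer to the fast trace.
print(f"{max_throughput(4, 24 * 1024, rtt) / 1e6:.1f} MB/s")  # 98.3 MB/s
```

If the server only ever grants one credit, the client is forced into the one-slice-at-a-time pattern no matter what the application asks for.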
For a full analysis we need a trace taken at the server side. No screenshots or text listings, please.
answered 17 Feb, 12:07