I/O starvation on NetWare, another update

I've spoken before about my latency problems on the MSA1500cs. Since my last update I've spoken with Novell at length. Their own back-line HP people were thinking firmware issues too, and recommended I open another case with HP support. If HP again tries to lay the blame on NetWare, I'm to point their techs at the NetWare back-line tech, who will then have a talk with them about why exactly NetWare isn't the problem in this case.

This time when I opened the case I mentioned that we also see performance problems on the backup-to-disk server, which is Windows. Which is true: when the problem occurs, B2D speeds drop through the floor; last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.
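To put rough numbers on that slowdown, here's my back-of-the-envelope math (decimal gigabytes assumed, and the job size is only approximate to begin with):

    # Average throughput of the backup-to-disk job, before and during the problem.
    def throughput_mb_s(size_gb, hours):
        """Average throughput in MB/s for a job of size_gb completed in hours."""
        return (size_gb * 1000) / (hours * 3600)

    print(round(throughput_mb_s(525, 6), 1))   # ~24.3 MB/s on a normal night
    print(round(throughput_mb_s(525, 50), 1))  # ~2.9 MB/s when the problem hits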

The first-line tech agreed and escalated. The second-line tech said (paraphrased):
"I'm seeing a lot of parity RAID LUNs out there. This sort of RAID uses CPU on the MSA1000 controllers, so the results you're seeing are normal for this storage system."
Which, if true, puts the onus of putting up with a badly behaved I/O system back onto NetWare. The tech went on to recommend RAID1 for the LUNs that need high performance during array operations that disable the internal cache. Which, as far as I can figure, would work. We're not bottlenecking on I/O to the physical disks; the bottleneck is CPU on the active MSA1000 controller. Going RAID1 on those LUNs would keep speeds fast even while array operations are running.
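For what it's worth, the textbook write penalties back up that reasoning. This is generic RAID arithmetic, not anything measured on the MSA itself:

    # Rule-of-thumb back-end cost of a small random write, per host write:
    #   RAID5: read old data + read old parity + write data + write parity = 4 I/Os,
    #          plus an XOR parity calculation on the controller CPU.
    #   RAID1: write both mirror members = 2 I/Os, no parity math.
    # Textbook figures, not MSA1000 measurements.
    WRITE_PENALTY = {"RAID5": 4, "RAID1": 2}

    def backend_ios(host_write_iops, raid_level):
        """Back-end disk I/Os generated by a given host write load."""
        return host_write_iops * WRITE_PENALTY[raid_level]

    for level in ("RAID5", "RAID1"):
        print(level, backend_ios(1000, level), "back-end I/Os for 1000 host writes")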

That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk available to fully mirror the cluster, and buying it would be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.

2 Comments

Hi. I am having a very similar issue with an MSA1500cs. In fact, I am currently doing some testing in the Novell SuperLab and I can bring the SAN to its knees by continuously copying a single 1MB file to it. We set up a 4-node NetWare 6.5 cluster and performance is terrible. The strange part is that everything was working well until we had 50 machines running a batch file that looped a copy job. We saw the disk requests go to 1000, just like you did, and the node would be cast out and the resource would fail over. The next node would suffer the same fate. We figured we were overloading the SAN and stopped the machines from copying. We then had to reboot the nodes in the cluster, and at that point we started having problems: it took 45 minutes for the first node to join the cluster, and an additional 20 minutes for the other 3 nodes to join. We spent many hours trying all kinds of stuff, deleting the SBD and re-creating it, deleting the logical drives in the array and re-creating them, even deleting the cluster and all its objects from eDir and re-creating them, but the result was the same. We then added a /MAXLUNS=6 parameter to the QLogic load line, and now a node can join the cluster in about 3 minutes, instead of the usual 30 seconds. Failovers take quite a while too. It's as if the SAN got mad at what we did and doesn't want to play anymore. Have you had any luck in getting better performance? Any help/insight would be greatly appreciated. Thanks
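(For reference, the looped copy job described above is easy to reproduce. A rough Python equivalent of that kind of batch file is below; the source and destination paths are hypothetical, and the original test used plain batch-file copies rather than Python.)

    # Hypothetical reconstruction of the looped 1MB copy job described above.
    # Point DEST at a volume on the SAN; SRC is any ~1MB local file.
    import itertools
    import shutil

    SRC = "C:\\temp\\testfile_1mb.bin"
    DEST = "M:\\stress\\testfile_1mb.bin"   # mapped drive on the cluster volume

    for i in itertools.count():
        shutil.copyfile(SRC, DEST)          # overwrite the same 1MB file forever
        if i % 100 == 0:
            print(i, "copies completed")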

Oh really? How interesting. I've done some concurrency testing in the past with some help from our desktop support team. That was done against an EVA3000 and a single NetWare server, and NetWare held up just fine under it. I haven't done the same test with the MSA because that test is hard to set up (I need a computer lab overnight, which is hard to get during term).

As for performance, there is not a whole lot I can tell you. One thing that may give you more detailed data about how the MSA is performing is to take advantage of the serial console of the active MSA1000 controller in the MSA1500. The default serial settings are 19200 8-N-1. Some commands that are downright handy:

start perf: Starts the built-in performance monitor. It polls every second and builds trends over time.

show perf: Gives a high-level view of performance on the MSA. It shows Average Command Latency (the number that goes to three digits when my own cluster nodes hit 1000+ disk requests), as well as average CPU utilization on the controller. Avg Completions/Sec tells you how many I/O ops are committed per second, useful for tracking.

clear perf: Resets the performance monitor. Handy. I'll let it run at idle for a while to see what idle looks like, then start a test case and "clear perf" to reset the averages so I can see what the test case does to the MSA.

show perf logical: Shows details on individual LUNs, including average write latency. This is the number that "show perf" averages to get the overall command latency, so it can show you specifically which LUN is experiencing the problem.

show perf physical: Not quite as useful. It shows the "maximum queue depth" statistic for each individual drive, which can get high when I/O is backing up.

show this_controller: Tells you if the controller is doing anything special, such as expanding a LUN, and whether or not the cache is enabled.

These commands won't fix your problem, but they can give you a much better feel for the state of the MSA, and should let you correlate MSA state with cluster performance.
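If you'd rather capture those numbers over time than eyeball them at the console, something along these lines works. It's a minimal sketch assuming a machine with a serial cable to the active controller and the pyserial library; the device name, timeouts, and log file are assumptions, and the commands are just the ones listed above.

    # Minimal sketch: poll the active MSA1000 controller's serial console for
    # "show perf" output and append it to a log file.
    # Assumes pyserial ("pip install pyserial"); /dev/ttyS0, the timeouts, and
    # the 60-second poll interval are assumptions, so adjust for your setup.
    import time
    import serial

    PORT = "/dev/ttyS0"      # serial device connected to the active controller
    POLL_INTERVAL = 60       # seconds between samples

    def send_command(conn, command):
        """Send one CLI command and return whatever the controller prints back."""
        conn.write((command + "\r").encode("ascii"))
        time.sleep(1)                    # give the controller a moment to respond
        return conn.read(8192).decode("ascii", errors="replace")

    with serial.Serial(PORT, 19200, bytesize=serial.EIGHTBITS,
                       parity=serial.PARITY_NONE, stopbits=serial.STOPBITS_ONE,
                       timeout=2) as conn:
        send_command(conn, "start perf")         # make sure the perf monitor is running
        with open("msa_perf.log", "a") as log:
            while True:
                sample = send_command(conn, "show perf")
                log.write(time.strftime("%Y-%m-%d %H:%M:%S\n") + sample + "\n")
                log.flush()
                time.sleep(POLL_INTERVAL)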