That darned MSA again

I'm not sure where this problem sits, but I'm having trouble with this MSA1500cs and my NetWare servers. I've found a failure case that is a bit unusual, but things shouldn't fail this way.

The setup:
  • NetWare 6.5, SP5 plus patches
  • EVA3000 visible
  • MSA1500cs visible
  • Pool in question hosted on the MSA
  • Pool in question has snapshots
  • Do a nss /poolrebuild on the pool
Do that, and at some point you'll get an error like this one:
 7-19-2007   9:48:22 am:    COMN-3.24-1092  [nmID=A0025]
NSS-3.00-5001: Pool FACSRV2/USER2DR is being deactivated.
An I/O error (20204(zio.c[2260])) at block 36640253(file block
-36640253)(ZID 1) has compromised pool integrity.
The block number changes every time, and when it decides to crap out of the rebuild also changes every time. No consistency. The I/O error (20204) decodes to:

zERR_WRITE_FAILURE 20204 /* the low level async block WRITE failed*/

Which, you know, shouldn't happen. And this error is consistent across the following changes:
  • Updating the HAM driver (QL2300.HAM) from version 6.90.08 (a.k.a. 6.90h) to 6.90.13 (6.90m).
  • Updating the firmware on the card from 1.43 to 1.45 (I needed to do this anyway for the EVA3000 VCS upgrade next month).
  • Applying the N65NSS5B patch; I had N65NSS5A on there before.
PoolVerifies, a pure Read operation, do not throw this error.
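Since the failing block moves around from run to run, it helps to pull the numbers out of each console message and compare them side by side. Here is a rough Python sketch that does that for the message quoted above. The function name, the regular expressions, and the field names are my own assumptions, not anything NSS provides, and the lookup table only holds the one code decoded in this post.

```python
import re

# Only the code decoded in the post; other NSS codes are deliberately not guessed.
KNOWN_CODES = {20204: "zERR_WRITE_FAILURE"}

# Patterns are assumptions based on the single message quoted above.
POOL_RE = re.compile(r"Pool (\S+) is being deactivated")
IO_RE = re.compile(r"I/O error \((\d+)\((.+?)\[(\d+)\]\)\) at block (\d+)")


def parse_nss_io_error(text):
    """Return the pool, error code, and block from one NSS I/O error message."""
    err = IO_RE.search(text)
    if err is None:
        return None
    pool = POOL_RE.search(text)
    code = int(err.group(1))
    return {
        "pool": pool.group(1) if pool else None,
        "code": code,
        "name": KNOWN_CODES.get(code, "unknown"),
        "source": err.group(2),           # e.g. zio.c
        "source_ref": int(err.group(3)),  # bracketed number, meaning assumed
        "block": int(err.group(4)),
    }


sample = """NSS-3.00-5001: Pool FACSRV2/USER2DR is being deactivated.
An I/O error (20204(zio.c[2260])) at block 36640253(file block
-36640253)(ZID 1) has compromised pool integrity."""

print(parse_nss_io_error(sample))
```

Run over the messages from several rebuild attempts, this would show at a glance whether the error code stays put while the block number wanders.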

I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;). This is one of those areas where I'm not sure who to call. Novell or HP? This is a critical error to get fixed as it impacts how we'll be replicating the EVA. It was errors similar to this, and activities similar to this, that caused all that EXCITEMENT about noon last Wednesday. That was not fun to live through, and we really really don't want to have that happen again.

Call Novell
Good:
  • Their storage geeks know NetWare a lot better.
  • Much more likely to know about Fibre Channel problems on NetWare.
Bad:
  • Not likely to know HP-specific problems.
  • More likely to recommend, "Well, then don't move your arm like that," as a solution.
The next step here is to delete these pools and volumes, recreate them, and see if things go Poink in quite the same way. I'm not convinced that'll fix the problem, as the errors being reported are Write errors, not Read errors, and the faulting blocks are different every time. I'm suspecting instability in the Write channel somewhere that is unique to a nss /poolrebuild, as I didn't get these errors when FILLING these volumes. Write channel in this case has a lot of Fibre Channel in it.

2 Comments

This might be a ridiculous suggestion from an NCS newbie (4 days of experience!), but what would be the harm of installing NetWare 6.5.6 on a plain old box, joining it to the cluster, migrating the pool/volume resource to that server, and trying the rebuild again? You could then just pull that server from the cluster regardless of the outcome. Clearly, it's a shot in the dark.

Oops, missed this bit: "I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;)." I'm still curious, however: what would the impact of adding a single SP6 node be?