Thursday, July 19, 2007

That darned MSA again

I'm not sure where this problem sits, but I'm having trouble with this MSA1500cs and my NetWare servers. I've found a failure case that is a bit unusual, but things shouldn't fail this way.

The setup:
  • NetWare 6.5, SP5 plus patches
  • EVA3000 visible
  • MSA1500cs visible
  • Pool in question hosted on the MSA
  • Pool in question has snapshots
  • Do a nss /poolrebuild on the pool
Do that, and at some point you'll get an error like this one:
 7-19-2007   9:48:22 am:    COMN-3.24-1092  [nmID=A0025]
NSS-3.00-5001: Pool FACSRV2/USER2DR is being deactivated.
An I/O error (20204(zio.c[2260])) at block 36640253(file block
-36640253)(ZID 1) has compromised pool integrity.
The block number changes every time, and when it decides to crap out of the rebuild also changes every time. No consistency. The I/O error (20204) decodes to:

zERR_WRITE_FAILURE 20204 /* the low level async block WRITE failed*/

Which, you know, shouldn't happen. And this error is consistent across the following changes:
  • Updating the HAM driver (QL2300.HAM) from version 6.90.08 (a.k.a 6.90h) to 6.90.13 (6.90m).
  • Updating the firmware on the card from 1.43 to 1.45 (I needed to do this anyway for the EVA3000 VCS upgrade next month)
  • Applying the N65NSS5B patch, I had N65NSS5A on there before
PoolVerifies, a pure Read operation, do not throw this error.

I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;). This is one of those areas where I'm not sure who to call. Novell or HP? This is a critical error to get fixed as it impacts how we'll be replicating the EVA. It was errors similar to this, and activities similar to this, that caused all that EXCITEMENT about noon last Wednesday. That was not fun to live through, and we really really don't want to have that happen again.

Call Novell
Good:Bad:
  • Their storage geeks know NetWare a lot better.
  • Much more likely to know about Fibre Channel problems on NetWare.
  • Not likely to know HP-specific problems.
  • More likely to recommend, "Well, then don't move your arm like that," as a solution.
The next step here is to delete these pools and volumes, recreate them, and see if things go Poink in quite the same way. I'm not convinced that'll fix the problem, as the errors being reported are Write errors, not Read errors, and the faulting blocks are different every time. I'm suspecting instability in the Write channel somewhere that is unique to a nss /poolrebuild, as I didn't get these errors when FILLING these volumes. Write channel in this case has a lot of Fibre Channel in it.

Labels: , , ,


Comments: Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?