I believe I know the root cause, but I want to see if someone can validate it, and also help me learn how to root-cause these issues faster on 3i in the future. The details are a bit long; I apologize, but I believe they provide the necessary information.
Background:
I have a lab with a number of servers. I started with 3.5U2, and when I got a new toy I installed it with 3i. I have 70+ virtual machines on my SAN. The SAN is an IBM DS4300 single-controller unit with 16x300GB Fibre drives. I decided to break the storage into two arrays (7+P Array1 VMFS0 and 6+P Array2 VMFS1) plus one hot spare. Also important: all servers in my lab are SAN boot. I have twenty 3GB boot LUNs sliced off the first array as VMware ESX boot targets. The issue I ran into was that the eight spindles were being crushed to the point that the VMs were timing out. I decided the only fix was to drop Array2 (VMFS1) and grow Array1 (VMFS0) into a 14+P array to get more aggregate spindles working for the VMs. I discovered that I could not expand Array1 by more than two spindles, so I figured I would upgrade the SAN controller from firmware 5.34.10.00 (the last supported firmware on the IBM single-controller DS4300) to 06.60.17.00, which is supposed to be for the dual-controller versions; I would just ignore the "missing controller" error if it gave me the ability to expand the array. I completed the upgrade and all my VMware servers booted fine. I expanded the array and again tested that the ESX servers booted fine with the new 14+P array.
All seemed to go well... but...
I went to start my virtual machine pool back up. I could see that the VMFS0 volume was still there, but all my virtual machines were greyed out as if they were offline or not accessible. I went to the Fibre HBA on a few of the servers and (as they were SAN booting fine) figured my SAN zoning and partitioning was OK. I scanned the HBAs for LUNs and they saw all the normal LUNs, including LUN 20, which is my 1.5TB VMFS0 volume. I went into storage and saw that VMFS0 was still there, but when I right-clicked and browsed, all I saw were "vpxa.log" and its rotated versions (vpxa-0.log, vpxa-1.log, etc.). I ssh'd into the box and saw:
root@8877cle2 volumes# pwd
/vmfs/volumes
root@8877cle2 volumes# ls -alh
total 0
drwxr-xr-x 1 root root 512 Nov 22 17:23 .
drwxrwxrwt 1 root root 512 Nov 22 08:04 ..
root@8877cle2 volumes#
At that point I was a bit perplexed about where my volume had gone.
root@8877cle2 volumes# tail /var/log/messages
... was not a huge help.
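In hindsight, on the ESX 3.5 console the storage stack logs to the vmkernel logs rather than /var/log/messages, so something like the following would probably have been more telling (just a rough sketch of what I'd grep for next time):
root@8877cle2 volumes# grep -iE "vmfs|lvm" /var/log/vmkernel | tail -50    # LVM/VMFS activity and errors
root@8877cle2 volumes# tail -50 /var/log/vmkwarning                        # vmkernel warnings only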
I looked at dmesg and found only these odd messages around my VMFS0 volume's LUN:
SCSI device sdv: 3383503093 512-byte hdwr sectors (1733327 MB)
sdv: sdv1
SCSI device sdw: 40960 512-byte hdwr sectors (21 MB)
sdw: unknown partition table
Vendor: IBM Model: Universal Xport Rev: 0617
Type: Direct-Access ANSI SCSI revision: 05
VMWARE SCSI Id: Supported VPD pages for sdx : 0x0 0x80 0x83 0x85 0xc0 0xc1 0xc2 0xc3 0xc4 0xc5 0xc7 0xc8 0xc9 0xca 0xd0
VMWARE SCSI Id: Device id info for sdx: 0x1 0x3 0x0 0x10 0x60 0xa 0xb 0x80 0x0 0x39 0xbf 0xa7 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x93 0x0 0x8 0x20 0x24 0x0 0xa0 0xb8 0x38 0x9a 0x97 0x1 0x94 0x0 0x4 0x0 0x0 0x0 0x1 0x1 0xa3 0x0 0x8 0x20 0x4 0x0 0xa0 0xb8 0x38 0x9a 0x97
VMWARE SCSI Id: Id for sdx 0x60 0x0a 0x0b 0x80 0x00 0x39 0xbf 0xa7 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x55 0x6e 0x69 0x76 0x65 0x72
Disk sdx is a pseudo device. lid = 31, ro = 0, cap: (512 * 40960) = 20971520
VMWARE: Unique Device attached as scsi disk sdx at scsi2, channel 0, id 0, lun 31
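(Side note, in case it throws anyone else off: the "Universal Xport" device at LUN 31 is the DS4300's in-band management LUN, not a data LUN, so those lines are expected. The more useful checks on the big LUN itself, assuming the 3.5 console and the sdv device name from the dmesg above, would be something like:)
root@8877cle2 volumes# fdisk -l /dev/sdv                     # partition 1 should still show type fb (VMware VMFS)
root@8877cle2 volumes# esxcfg-vmhbadevs -m                   # shows which vmhba path / console device backs each mounted VMFS
root@8877cle2 volumes# vmkfstools -P /vmfs/volumes/VMFS0     # prints the VMFS version, UUID and extents if the host can open the volume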
Everything else looked fine... At that point I was swearing at my SAN. Those 70 VMs represent about a month and a half of my life spent building demo systems!
So... after a cup of coffee...
I started back with the physical layer. QLogic saw all the LUNs OK, the SAN zoning looked fine, and on the controller all the WWNs were partitioned OK. I then removed one of the servers entirely and rebuilt it from scratch. At that point I noticed that the only Linux host-type option for the DS4300 was "Linux" rather than "Linux Cluster", which is what I had previously set it to... BINGO! I then realized that ALL the SAN HBAs in my VMware farm were now set to host type AIX!
I reset all my hosts to Linux and rebooted, but the issue remained. As I suspected, that would not fix it, because the SCSI reservation behaviour of host type "Linux Cluster" is required for ESX to properly tag (lock) and mount a VMFS volume.
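For what it's worth, here is roughly what I would run on a 3.5 host to look for the reservation errors while changing host types and rescanning (the HBA name is from my setup, so adjust as needed):
root@8877cle2 volumes# esxcfg-rescan vmhba1                          # rescan the FC HBA after changing the host type on the array
root@8877cle2 volumes# vmkfstools -V                                 # ask the vmkernel to re-scan for VMFS volumes without a reboot
root@8877cle2 volumes# grep -i reserv /var/log/vmkernel | tail -20   # SCSI reservation conflicts / failures
root@8877cle2 volumes# grep -iE "lock|vmfs" /var/log/vmkwarning | tail -20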
After reading a lot about the DS4xxx product line, I learned that IBM later added host kits for Linux that they charge for. They moved "Linux Cluster" into the new "VMware" host kit, which is also a paid option. So when I upgraded my DS4300 firmware, I lost the option to set the host type to "Linux Cluster"; all my HBAs then had invalid host types set and were reverted to the default top option of the host-type list, i.e. AIX.
*************
Questions:
1) How can I validate what I believe to be the case above concerning SCSI reservation locking?
2) I tried other host types such as "Windows NT 4.0 Cluster", "Windows 2003 Cluster", etc., but none worked. Can someone explain in a bit of detail what SCSI tagging VMware is looking for before it will mount the VMFS volume?
3) The only way I was able to debug this was through the VMware 3.5 console. I have one 3i host that I was just bringing up before this all happened, and I can't imagine how someone could debug / root-cause an issue like this without a shell. The only message in the VMware logs on the 3.5 system was that the VM systems could not be located. On the 3i server the message was:
11/22/2008 7:58:50 am, Issue detected on 8878cle1 in aesscle IBM: LVM: 4476: vml.020014........ (1:14:00:25:092 cpu8:440588)
Is there a VMFS flow diagram of how the mount and validation process is done, so we can key in on these issues faster? The above took me all day to work through. I can't imagine being in a production environment and losing 70 production systems due to a simple upgrade. Even if this is "an issue with IBM's SAN", that does not help IT people show the SAN group that their VMware servers are not the problem and that it is the storage controller's issue. (The LVM checks I want to run next are sketched just below.)
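One thing I still want to rule out regarding the LVM message above (purely a guess on my part) is whether the LVM layer decided the LUN's identity changed after the firmware/host-type change and is now treating it as a snapshot, which would also leave the volume unmounted. On a 3.5 console that would look something like this (the same LVM options exist under Advanced Settings in the VI client for 3i):
root@8877cle2 volumes# esxcfg-advcfg -g /LVM/EnableResignature       # 0 by default
root@8877cle2 volumes# esxcfg-advcfg -g /LVM/DisallowSnapshotLUN     # 1 by default
root@8877cle2 volumes# grep -i snapshot /var/log/vmkernel | tail -20 # did LVM flag the LUN as a snapshot?
Only if the LUN really is being flagged as a snapshot would temporarily enabling resignaturing and rescanning (esxcfg-advcfg -s 1 /LVM/EnableResignature, then vmkfstools -V) bring the volume back, and then under a new UUID, so I have not tried that yet.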
PS: The inability to highlight and copy text from the VMware client log is very irritating.