Good morning,
This past weekend I had a major outage in my ESX environment. At first glance, my entire infrastructure went down because of a faulty Fibre Channel switch. All hardware components in my infrastructure are supposed to be redundant, so when one switch failed, we assumed that I/O would fail over to our secondary switch.
Each server has four paths to our backend array (CLARiiON CX4). Two paths go through one switch (Brocade DCX), and the other two paths go through a second switch (Brocade DCX).
Two weeks ago, the first DCX switch rebooted and all paths failed over to the second switch as expected. What we didn't notice was that the first two paths didn't come back up after the switch recovered. The Emulex drivers forced the kernel to mark the paths as dead!
So when the secondary switch rebooted this weekend, unbeknownst to me, I/O didn't fail back to the first two paths because they had been marked dead for over a week!
So here is my question:
Can someone assist me with writing a script that can poll the VMkernel logs for such an event? I'd like to poll for the following:
$ grep EXPIRED *
vmkernel.4:May 25 02:41:03 sknxbldesx01 vmkernel: 2:14:45:15.264 cpu0:1024)<4>lpfc1:0250:DIe:EXPIRED nodev timer Data: x661713 x3 x7
vmkernel.4:May 25 02:41:03 sknxbldesx01 vmkernel: 2:14:45:15.264 cpu0:1024)<4>lpfc1:0250:DIe:EXPIRED nodev timer Data: x661715 x4 x8
Can a cron job be written to search for such an expression and then send me an email?
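Something along these lines is roughly what I have in mind. This is only a rough sketch, not a finished solution: the function names, the `MAIL_TO` address, and the use of `mail -s` are all assumptions on my part (I haven't confirmed that a working `mail` command and outbound mail are set up on the service console). The log files to search are passed as arguments.

```shell
#!/bin/sh
# Sketch: search VMkernel logs for lpfc "EXPIRED nodev timer" events
# and email any matching lines.
# MAIL_TO is a placeholder address (hypothetical); change it to suit.
MAIL_TO="${MAIL_TO:-admin@example.com}"
PATTERN='EXPIRED nodev timer'

# Print every matching line, prefixed with the name of the log file.
find_expired() {
    grep -H "$PATTERN" "$@" 2>/dev/null
}

# Mail the matches, if there are any. Assumes a `mail` command that
# accepts -s for the subject line.
check_and_mail() {
    # grep exits non-zero when nothing matched, so stop quietly then.
    matches=$(find_expired "$@") || return 0
    printf '%s\n' "$matches" |
        mail -s "FC path EXPIRED events on $(hostname)" "$MAIL_TO"
}

# Only run when log files are given on the command line.
if [ "$#" -gt 0 ]; then
    check_and_mail "$@"
fi
```

Saved as, say, /usr/local/bin/check_fc_paths.sh (a made-up name) and assuming the logs live under /var/log, a crontab entry such as `*/15 * * * * /usr/local/bin/check_fc_paths.sh /var/log/vmkernel*` would run it every 15 minutes. As written it would keep mailing the same lines on every run until the logs rotate away, so it would still need some way to remember what has already been reported.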
Thanks in advance.
Tim