Hi, we are running ESX 3.5 (the original release, with no updates). Another client of ours sees the same thing on U4.
We have a VM that locks up at seemingly random times. The machine is running RHEL 5 and has 2 vCPUs, 3 GB of RAM, 4 disks (two of which have been combined into a single large 500 GB volume within the VM using LVM) and two NICs. It runs Apache and acts as an app server that connects to another VM running Postgres. The machine in question also has a fairly large file store.
In general, this machine is very lightly loaded at night and just moderately loaded during the day.
What we see is that CPU usage on the VM shoots to 100%. At this point, the VM needs to be powered off or reset; the option to Shut Down guest OS is not available. The summary tab also stops showing the IP address of the machine and the status of VMware Tools.
Timing on these lockups appears to be random. It may happen under load or it may happen in the middle of the night with nothing going on.
The vmdk is on an iSCSI SAN. When I look at the /var/log/vmkernel log, I see quite a few of the following errors. They seem to come in batches every 20-40 minutes, and each batch lasts about as long as the excerpt below. They do NOT always correlate with a problem on the machine in question, but I am guessing they may be playing a part.
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu1:1066)LinSCSI: 3201: Abort failed for cmd with serial=13802204, status=bad0001, retval=bad0001
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu2:1097)iSCSI: session 0x352180c0 sending mgmt 511874685 abort for itt 511874679 task 0x352025e0 cmnd 0x5a2fa80 cdb 0x2a to (1 0 1 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu1:1066)LinSCSI: 3201: Abort failed for cmd with serial=6445015, status=bad0001, retval=bad0001
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu1:1066)LinSCSI: 3201: Abort failed for cmd with serial=12844297, status=bad0001, retval=bad0001
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu6:1075)iSCSI: session 0x35203f90 sending mgmt 214328217 abort for itt 214328211 task 0x35202180 cmnd 0x5a2d200 cdb 0x2a to (1 0 0 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu2:1201)iSCSI: session 0x35240320 sending mgmt 280194341 abort for itt 280194330 task 0x35202ce0 cmnd 0x5a3af80 cdb 0x2a to (1 0 3 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu0:1098)iSCSI: session 0x352180c0 abort success for mgmt 511874685, itt 511874679, task 0x352025e0, cmnd 0x5a2fa80, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu2:1097)iSCSI: session 0x352180c0 sending mgmt 511874686 abort for itt 511874680 task 0x35202490 cmnd 0x5a31d80 cdb 0x2a to (1 0 1 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu0:1076)iSCSI: session 0x35203f90 abort success for mgmt 214328217, itt 214328211, task 0x35202180, cmnd 0x5a2d200, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu6:1075)iSCSI: session 0x35203f90 sending mgmt 214328218 abort for itt 214328216 task 0x35201fc0 cmnd 0x5a2c080 cdb 0x2a to (1 0 0 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu4:1202)iSCSI: session 0x35240320 abort success for mgmt 280194341, itt 280194330, task 0x35202ce0, cmnd 0x5a3af80, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu2:1201)iSCSI: session 0x35240320 sending mgmt 280194342 abort for itt 280194335 task 0x35202260 cmnd 0x5a3cd80 cdb 0x2a to (1 0 3 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu0:1098)iSCSI: session 0x352180c0 abort success for mgmt 511874686, itt 511874680, task 0x35202490, cmnd 0x5a31d80, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu2:1097)iSCSI: session 0x352180c0 sending mgmt 511874687 abort for itt 511874683 task 0x35202c00 cmnd 0x5a33900 cdb 0x2a to (1 0 1 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu0:1076)iSCSI: session 0x35203f90 abort success for mgmt 214328218, itt 214328216, task 0x35201fc0, cmnd 0x5a2c080, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu6:1075)iSCSI: session 0x35203f90 (1 0 0 0) finished error recovery at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu4:1202)iSCSI: session 0x35240320 abort success for mgmt 280194342, itt 280194335, task 0x35202260, cmnd 0x5a3cd80, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.412 cpu2:1201)iSCSI: session 0x35240320 sending mgmt 280194343 abort for itt 280194336 task 0x35201070 cmnd 0x5a3d000 cdb 0x2a to (1 0 3 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.413 cpu0:1098)iSCSI: session 0x352180c0 abort success for mgmt 511874687, itt 511874683, task 0x35202c00, cmnd 0x5a33900, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.413 cpu2:1097)iSCSI: session 0x352180c0 (1 0 1 0) finished error recovery at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.413 cpu4:1202)iSCSI: session 0x35240320 abort success for mgmt 280194343, itt 280194336, task 0x35201070, cmnd 0x5a3d000, cdb 0x2a
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.413 cpu2:1201)iSCSI: session 0x35240320 sending mgmt 280194344 abort for itt 280194337 task 0x35202b90 cmnd 0x5a3d780 cdb 0x28 to (1 0 3 0) at 4232667423
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.413 cpu4:1202)iSCSI: session 0x35240320 abort success for mgmt 280194344, itt 280194337, task 0x35202b90, cmnd 0x5a3d780, cdb 0x28
Jun 9 09:00:57 virthost04 vmkernel: 489:21:24:33.413 cpu2:1201)iSCSI: session 0x35240320 (1 0 3 0) finished error recovery at 4232667423
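To make it easier to line these batches up against the lockup times, here is a rough sketch of a script that counts the "sending mgmt ... abort" lines per minute, per address tuple, and per SCSI opcode. It assumes every line has the same layout as the excerpt above. The function name summarize_aborts and the field names are just ones made up for this sketch, and the tuple in parentheses is passed through exactly as printed, without interpreting it. The only outside fact it relies on is that cdb 0x2a is WRITE(10) and 0x28 is READ(10). It is meant to be run with Python 3 on a copy of the log, not on the ESX host itself.

```python
import re
import sys
from collections import Counter

# Standard SCSI opcodes that appear in the excerpt.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Matches only the "sending mgmt ... abort" lines; "abort success" and
# LinSCSI lines are deliberately ignored so each abort is counted once.
ABORT_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+ \d\d:\d\d):\d\d \S+ vmkernel: .*"
    r"iSCSI: session (?P<session>0x[0-9a-f]+) sending mgmt \d+ abort "
    r".*cdb (?P<cdb>0x[0-9a-f]+) to \((?P<addr>[\d ]+)\)"
)


def summarize_aborts(lines):
    """Count abort requests keyed by (minute, address tuple, opcode name)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match is None:
            continue
        opcode = int(match.group("cdb"), 16)
        # Fall back to the raw hex string for opcodes not in the table.
        name = OPCODES.get(opcode, match.group("cdb"))
        counts[(match.group("minute"), match.group("addr"), name)] += 1
    return counts


if __name__ == "__main__":
    # Usage (path is whatever copy of the log you have):
    #   python3 aborts.py vmkernel.log
    with open(sys.argv[1]) as log:
        for key, count in sorted(summarize_aborts(log).items()):
            print(key, count)
```

Comparing the minutes this prints with the times the VM had to be reset should show whether the lockups really fall inside the abort batches or not.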
I spent an hour on the phone with HP yesterday. They ran diagnostics, checked the logs on the SAN, etc., and said the SAN was fine. They agreed that it appears to be a timeout issue, but thought that it was coming from ESX.
Is there a setting to increase the timeout limit? Has anyone else seen this? In searching for the above errors and for a scenario like this, I found a few posts, but in most of them either the problem appears to have cleared up on its own or a solution was never found.
Any help is much appreciated.