Channel: VMware Communities : Popular Discussions - VI: VMware ESX® 3.0

ESX Disaster This Past Weekend


 

Well, I almost had a heart attack.

 

 

I had our LAN and network guy reconfigure ESX this past weekend to use trunking.  In addition, we were moving two of our physical ESX boxes out of their cluster and onto another physical network, into an existing ESX cluster.  The two ESX boxes were running ESX 3.0.1.

 

 

Instead of doing an upgrade to ESX 3.0.2 and trying to change hostnames and IP addresses, the LAN guy did a re-install of ESX.

 

 

The problem was that he did not disconnect the SAN cable before doing the install.  The install basically wiped out the LUN that hosted 12 other critical VMs.

 

 

I got the call around 3AM and immediately went to work to assess the situation. 

 

 

Even though we had server tape backups, my image backups of ESX were 2.5 months old.

 

 

It took about 6 hours to recopy all my images back over to ESX and restore the necessary program files and data to get the servers up and running.  The only thing that kept me from running to our hostsite down in Florida is that I did have image backups.  Without these, the situation would have been much worse and we probably would have been down for a couple of days... maybe even longer, with many issues.

 

 

To top it off, a day later we were trying to add a third LUN to our ESX environment to better distribute our VMs, so that all our critical apps were not on the same LUN.  We discovered that some of our ESX servers could not recognize the new LUN or, in some cases, one of the two original LUNs.  I rebooted one of our ESX servers and it only found one LUN instead of three.  The next day, I had our network guy who has SAN experience call VMware to determine what the issue was, not knowing if it was related to the issue we had earlier in the week or something brand new.

 

 

After about an hour of working with a very knowledgeable VMware tech, they finally determined that the master partition on the SAN had also been wiped clean over the weekend.  Any ESX boxes that were already up and running still knew about the LUNs.  Any that were rebooted could not find the partition information.

 

 

Luckily for me, we stumbled across this early.  We could easily have run into it 2-3 months from now while performing some other maintenance on the servers, and not realized it was related to this week's issue.

 

 

All in all, I'm very happy with VMware and our ESX environment.  The uptime and reliability of ESX is top notch.  I only wish the price was lower so I could justify buying more servers!

 

 

One of my biggest concerns about this past weekend was the fact that one simple human error caused over half of our data center to go down.  Since we implemented our ESX environment in January, which took a lot of arm twisting and justification, I could easily have seen this incident being turned against it.  The fact that all our ESX boxes are on a SAN means a SAN failure (whether human or whatever) could be catastrophic for us.  I know the likelihood of this is small, but the fact that it was so easy for someone to wipe out our SAN could have cost some folks their jobs.

 

 

Since then, I have begun looking at esxRanger.  I now have it installed and have been using the trial for the past few days.  This product seems pretty wonderful and I think it will add a lot of value to our current backup strategy, plus allow us to reduce the disk space and the amount of time it takes to copy an image from ESX to an external NAS server.

 

 

 

 

 

