VMware HA is not Oracle RAC
This started out as a much longer rant about EMC/VMware vs. Oracle in general, but WordPress managed to eat the draft, and there’s no way I’m going to rewrite the whole thing. Instead, I’m going to write a series of blog posts, each one covering a point in EMC’s recent salvo against Oracle over its lack of VMware support.
In case you haven’t been following, a couple of EMC bloggers, Chuck Hollis and Jeff Browning, have been taking on Oracle for its lack of VMware support. While I don’t necessarily disagree with their perspective, they’ve gone beyond that to claiming that Oracle RAC can be supplanted by VMware HA, that it’s a more cost-effective solution, that it’s easier to administer, and that it offers better performance. Let’s start by looking at why VMware HA is not the same thing as Oracle RAC.
Jeff has a very nice-looking reference case from back in 2008, in which they configured and benchmarked a two-node Oracle EE RAC cluster against a standalone Oracle SE with VMware HA, and showed that the standalone + VMware HA config is cheaper, better, faster, etc. However, it misses the boat on the subject of HA, and on why Oracle RAC is more than HA.
Let’s begin by looking at the scenarios that VMware HA will protect you from. Admittedly, I have not worked with VMware HA, so all I have to go on is Jeff’s blog post and the documentation from EMC’s website. From that investigation, though, it appears that VMware HA will trigger the failover and restart of a VM in these situations:
- Physical server failure/crash
- OS lockup/kernel panic
That’s it. While this is definitely better than running a standalone Oracle database with no HA at all, let’s look at the list of failure scenarios that are not covered by VMware HA in an Oracle environment:
- Listener crash
- Accidental listener shutdown
- Oracle instance crash
- Oracle instance shutdown
- Listener IP failure
- Storage array failure
- Too many user sessions
- Oracle instance out of memory
- Oracle session crash
- ORA-600 errors
- Deletion of the Oracle binaries, intentional or accidental
- Downtime required for patching, upgrades
That’s quite a list. I didn’t even bother including the sort of transient problems that can cause trouble while the machine is still up, e.g. a FC switch path or storage array that is overly slow for some reason. In effect, the only situation that will cause a failover with VMware HA is one where the whole system goes down, and it’s actually a lot more likely that user error, a required patch event, or an Oracle bug will cause downtime than a loss of hardware. If you hit any of the situations on the second list in a VMware HA environment, your database will be down, and VMware will be completely oblivious to that fact.
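To make the distinction concrete, here is a minimal Python sketch of the kind of application-level check that a host-level failover mechanism does not perform. It only tests whether something is accepting TCP connections on the listener port. The function name, host name, and timeout are my own illustrative choices, and 1521 is simply the default Oracle listener port. A successful connect does not prove the instance is open, so treat this as an illustration rather than a real monitor.

```python
import socket


def listener_reachable(host: str, port: int = 1521, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is an application-level probe: the machine can be powered on and the
    OS healthy while this still returns False (listener crashed or shut down).
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A VM can be up and running while `listener_reachable("dbhost", 1521)` returns `False` (`dbhost` being a hypothetical host name), which is exactly the listener crash case on the second list.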
In contrast, Oracle RAC handles all of the things on both of these lists, because it’s a true active-active cluster. I can lose an instance, lose storage, run out of memory, lose my binaries, or patch my databases, all without taking any downtime. Which brings me to my second point:
VMware HA failover relies on an OS boot. Even if you do hit one of the only two cases where VMware HA will kick in, your OS has to boot, and your database has to be restarted. Your clients will need to have their connections reset and their sessions re-established. Again, this is better than no HA at all, but let’s look at the RAC case.
In a RAC environment, clients will transparently fail over to the other node. If I’m using connection pools or something similar with TAF (Transparent Application Failover) configured, sessions connected to the down node with open cursors will be told to retry, and most sessions will never even be aware a failure occurred. I’ve seen RAC shops that shut down nodes during production operations, since they know that their clients will simply get moved over to another active node. Zero downtime.
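The client-side idea can be sketched roughly like this. To be clear, this is not Oracle’s TAF implementation, just a toy Python model of "try the next node when one is down". The function name, the node names, and the `connect` callable are all hypothetical stand-ins for whatever your driver or connection pool actually provides.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(nodes: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each node in order and return the first successful connection.

    `connect` is a hypothetical callable that opens a session to one node and
    raises ConnectionError if that node is unavailable.
    """
    last_error: Optional[Exception] = None
    for node in nodes:
        try:
            return connect(node)
        except ConnectionError as exc:
            # This node is down: remember why and move on to the next one.
            last_error = exc
    # Every node failed, so there is nothing left to fail over to.
    raise ConnectionError(f"no node reachable: {list(nodes)}") from last_error
```

With two nodes, a client only sees an error if both are down at the same time. With a single restarted VM there is no second entry in the list to try.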
And of course, VMware HA doesn’t give you the scale-out capability that RAC does. Instead, Jeff simply stands up some straw men and argues that it doesn’t really matter. We’ll discuss that in the next blog post.
