VMware HA is not Oracle RAC

This started out as a much longer rant about EMC/VMware vs. Oracle in general, but WordPress managed to eat the draft, and there’s no way I’m going to rewrite the whole thing. Instead, I’m going to write a series of blog posts, each one covering a point in EMC’s recent salvo against Oracle and its lack of VMware support.

In case you haven’t been following along, a couple of EMC bloggers, Chuck Hollis and Jeff Browning, have been taking on Oracle for its lack of VMware support. While I don’t necessarily disagree with their perspective, they’ve gone beyond that to claim that Oracle RAC can be supplanted by VMware HA, and that VMware HA is more cost-effective, easier to administer, and better performing. Let’s start by looking at why VMware HA is not the same thing as Oracle RAC.

Jeff has a very nice-looking reference case from back in 2008, in which they configured and benchmarked a two-node Oracle EE RAC cluster against a standalone Oracle SE database with VMware HA, and showed that the standalone+VMware HA config is cheaper, better, faster, etc. However, it misses the boat on the subject of HA, and on why Oracle RAC is more than HA.

Let’s begin by looking at the scenarios that VMware HA will protect you from. Admittedly, I have not worked with VMware HA, so all I have to go on is Jeff’s blog post and the documentation from EMC’s website. From that investigation, it appears that VMware HA will trigger the failover and restart of a VM in these situations:

  • Physical server failure/crash
  • OS lockup/kernel panic

That’s it.  While this is definitely better than running a standalone Oracle database with no HA at all, let’s look at the list of failure scenarios that are not covered by VMware HA in an Oracle environment:

  • Listener crash
  • Accidental listener shutdown
  • Oracle instance crash
  • Oracle instance shutdown
  • Listener IP failure
  • Storage array failure
  • Too many user sessions
  • Oracle instance out of memory
  • Oracle session crash
  • ORA-600 errors
  • Deletion of the Oracle binaries, intentional or accidental
  • Downtime required for patching, upgrades

That’s quite a list.  I didn’t even bother including the sort of transient problems that cause trouble while the machine stays up, e.g. an FC switch path or storage array that is running unusually slowly.  In effect, the only situation that will cause a failover with VMware HA is one where the whole system goes down, and it’s actually a lot more likely that user error, a required patch, or an Oracle bug will cause downtime than a loss of hardware.  If you hit any of the situations on the second list in a VMware HA environment, your database will be down, and VMware will be completely oblivious to that fact.
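
To make the "oblivious" part concrete, here is a minimal Python sketch of an application-level check, the kind of thing a host-level heartbeat does not do. It only tests whether anything accepts TCP connections on the listener port. The function name `listener_reachable` is made up for this illustration, and 1521 is assumed as the port because it is the usual Oracle listener default. This is not how VMware HA or Oracle Clusterware monitor anything; it just shows that "the VM is up" and "the listener answers" are separate questions.

```python
import socket


def listener_reachable(host: str, port: int = 1521, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a successful connect only proves that something is
    listening on the port, not that the database behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A VM can be happily running while this returns False, which is exactly the listener crash and listener shutdown cases from the second list.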

In contrast, Oracle RAC handles all of the things on both of these lists, because it’s a true active-active cluster.  I can lose an instance or storage, run out of memory, lose my binaries, or patch my databases, all without taking any downtime.  Which brings me to my second point:

VMware HA failover leverages an OS boot.  Even if you do hit one of the only two cases where VMware HA will kick in, your OS has to boot, and your database will have to be restarted.  Your clients will need to have their connections reset, and their sessions re-established.  Again, this is better than no HA at all, but let’s look at the RAC case.

In a RAC environment, clients will transparently fail over to the other node.  If I’m using connection pools or something similar with TAF (Transparent Application Failover) configured, sessions connected to the down node with open cursors will be told to retry, and most sessions will never even be aware a failure occurred.  I’ve seen RAC shops where they shut down nodes during production operations, since they know that their clients will simply get moved over to another active node.  Zero downtime.
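
The client-side behavior described above can be sketched as a toy simulation in Python. This is not TAF and it uses no Oracle driver: `Node`, `NodeDown`, and `run_with_failover` are names invented here. The only point is that a client holding a list of active nodes can retry on a survivor right away instead of waiting for an OS to boot.

```python
from typing import Sequence


class NodeDown(Exception):
    """Raised by the simulated node when it is not running."""


class Node:
    """A stand-in for one RAC instance (hypothetical, for illustration only)."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up

    def execute(self, statement: str) -> str:
        if not self.up:
            raise NodeDown(self.name)
        return f"{self.name} ran: {statement}"


def run_with_failover(nodes: Sequence[Node], statement: str) -> str:
    """Try each node in order and return the first successful result."""
    for node in nodes:
        try:
            return node.execute(statement)
        except NodeDown:
            continue  # this node is gone; retry on the next one
    raise NodeDown("no surviving nodes")


cluster = [Node("rac1"), Node("rac2")]
cluster[0].up = False  # simulate losing the first node
print(run_with_failover(cluster, "select 1 from dual"))
# prints "rac2 ran: select 1 from dual"
```

In the VMware HA case there is no second entry in that list, so the equivalent loop has nothing to fall back on until the restart finishes.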

And of course, VMware HA doesn’t give you the scale-out capability that RAC does.  Instead, Jeff simply stands up some straw men and argues why that doesn’t really matter.  We’ll discuss that in the next blog post.

“SSH macbook”

“ssh macbook” is one of the most random search terms that sends a lot of hits to this blog.  Unfortunately, it leads people to the first section of a post I have yet to complete about setting up ssh equivalence.  And that post has nothing to do with ssh on the macbook; I just get hits because my hostname has “macbook” in it.

So – Internet.  Tell me.  What do you need to know about ssh on the macbook?

“Cloud is the new dotcom”

George Zachary is a partner at Charles River Ventures, and at the recent TechCrunch Cloud Computing Summit, he said, “Cloud is the new dotcom”.  This statement has inspired a variety of responses, but one I’d like to highlight is Reuven Cohen’s.  Reuven is the Founder of Enomaly, a company that makes a software product designed to manage virtualized environments.  He’s also managed to carve out a niche as one of the “leading minds” in cloud computing, which seems to mean a lot of long blog posts wherein he muses about the future of cloud computing.  One in particular, however, got my dander up, and I feel like it’s worth responding to.

Virtualized Hardware Faster than the Real Deal?

Timothy Prickett Morgan over at The Register has penned an article entitled, “Fake server beats real server in web test”.  The gist of the article is that VMware has released results showing that virtual Linux servers running on VMware’s ESX hypervisor have garnered the highest single-server performance for a 16-core machine, and significantly beat out a “comparably configured” non-virtualized server.  How is that possible?

Using OCFS2 the right way

Responding to Jeremy’s message on Oracle-L got me reading his blog.  In one post, he asks if OCFS2 has a future given the rumored introduction of “ASMfs”, and if it’s worth considering for various purposes, specifically:

  • database binaries (vs local files or NFS)
  • diag top (11g) or admin tree (10g) (vs local files or NFS)
  • archived logs
  • backups

I’ll go through my opinion on each of these four scenarios in a sec.  But first, I do believe OCFS2 has a long and prosperous future ahead of it.  For one thing, it’s part of the mainstream Linux kernel – so it’s grown beyond just something Wim Coekaerts cooked up in the lab.  Second, even if Oracle does introduce an ASM filesystem driver for Linux (I actually got about halfway through writing one myself, and I’m glad I dropped it if Oracle is building one), there are going to be people who don’t want to trust ASM as a filesystem store, or sysadmins who believe filesystems are meant to be in the OS, dammit.  Third, even if there’s an ASM filesystem, chicken-and-egg problems mean that things like Oracle binaries won’t be able to be stored in ASM.  So – Oracle ain’t dropping OCFS2 anytime soon.

But – back to the original question.  What is OCFS2 good for?  I’ve seen how a lot of shops use RAC, and there are some clear advantages to using OCFS2 for certain things in a block environment.   Let’s go through Jeremy’s areas and see where there’s a fit:

  • Database binaries – definitely not.  I don’t believe in shared ORACLE_HOME installs to begin with, whether it’s on NFS, OCFS2, GFS, or any CFS.  Shared storage is usually expensive storage, and having multiple homes allows you to survive patching problems or horrible operator error.  Local disk space is cheap.
  • Diag top/admin directories – yes, this probably makes sense. It’s convenient to be able to put your tracefiles and everything else in one place. Performance isn’t a real consideration there either, and this doesn’t take much disk space.
  • Archive logs – definitely. Especially if you’re using ASM in addition to OCFS2, keeping a copy of archive logs outside of ASM is just what the doctor ordered for when your ASM instance explodes and you need to recover your database.
  • Backups – well, it’s not clear what kind of backups we’re talking about here, but I typically don’t recommend putting your FRA on OCFS2, as it’s just easier to put it in ASM.  If we’re talking about exports, RMAN backups, etc. – I would say it’s probably easier to just put those on non-clustered shared disks, and you’ll get better performance.  You can always unmount and remount the disks on another node if you need to.

That covers Jeremy’s list.  But even if you’re looking at an environment where all the datafiles will be in ASM, there’s one more area where OCFS2 can help:

  • OCR and voting – why dedicate little baby SAN disks to OCR and voting when you can just put them on an OCFS2 filesystem?  If you’re at an unfortunate organization where you can only get disks of a certain size (9 GB, 18 GB, etc.), there’s no need to waste a whole disk on the OCR and voting.  Format the disks as OCFS2 filesystems, put the OCR and voting on them, and then use the rest of the space for scripts, dump areas, etc.

So, there we go.  IMHO, how to use OCFS2 the right way.

Why OS Packages and Databases Don’t Mix

There was an interesting post to the Oracle-L mailing list today about using OS packages in cluster database environments.   A quick snippet from the post: Read the rest of this entry »

Plura Processing’s in ur Browser, stealing yr cycles

So, I apologize up front for the title, but I couldn’t resist. Plura Processing came to my attention through my participation in the Cloud Computing Forum as a company with a genuinely interesting idea – harvesting the browser time people spend on websites to run compute jobs on behalf of third parties.

Google is broken!

I’m not actually pleased about this, since I used to be responsible for keeping a 24/7 web infrastructure up and know how painful that is, but Google is so omnipresent that any downtime is devastating. At the moment, every result from a search is flagged with a “Warning – visiting this web site may harm your computer” message:

[Image: google is broken]

It’s nice that Google is looking out for my safety, but protecting me from Wikipedia is probably going a little too far.

UPDATE: That was quick, as I expected it would be. It works now. Hopefully the folks who got called were already awake on a Saturday morning.

Fun & Games with SSH Equivalence Part 1: The good

In my day job, I rarely get the opportunity to get as “hands on” as I used to – I still work with Oracle, and of course I still work with our software every day.  But the days when I get to roll up my sleeves and dig into a UNIX, storage, or networking problem are few and far between, and I miss it sometimes.

So, today a customer contacted me and asked me for a favor – they were setting up a server to be part of an Oracle RAC cluster, and the ssh equivalence wasn’t working, and could I help try to figure it out?  I was more than happy to give it a shot.

Time() Gets Sequential

A friend of mine pointed out today that UNIX time is approaching an interesting event – one day next month, the number of seconds since the Epoch will (for one second) be 1234567890.

As if that’s not weird enough, it will happen on Friday the 13th, at 23:31:30 UTC (6:31 PM US Eastern).  Clearly the world will be coming to an end.  Consider yourself notified.
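
If you want to check the date yourself, a few lines of Python will do it. The fixed UTC-5 offset below is my assumption for US Eastern time in February; the “6:31” above matches it if it is read as early evening there.

```python
from datetime import datetime, timedelta, timezone

# Convert the magic number of seconds since the Epoch to a UTC timestamp.
moment = datetime.fromtimestamp(1234567890, tz=timezone.utc)
print(moment.isoformat())  # 2009-02-13T23:31:30+00:00
print(moment.weekday())    # 4 (Monday is 0, so 4 is Friday)

# Assumed fixed offset: US Eastern Standard Time is UTC-5 in February.
eastern = moment.astimezone(timezone(timedelta(hours=-5)))
print(eastern.strftime("%H:%M:%S"))  # 18:31:30
```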
