A Decade in High Availability

A recent post from Elden Christensen, Sr. Program Manager Lead for Clustering & High Availability, reminded me of one of my former employers. When I joined that company back in 2000 to start up a professional services practice based on Windows Server 2000 Data Center Edition, it was already an established professional services provider in the business-critical computing niche, e.g. Tandem/Compaq/HP NonStop systems, which were mostly used in financial markets such as banks and stock exchanges. The NonStop folks regarded the Windows platform as inferior at that time, and they had good arguments back then.

Remember, those were also the early days, when no one was surprised to see an occasional blue screen (people were also using Windows 9x) and what we now know as virtualization was already happening on mainframes in the form of partitioning. At that time, Microsoft had ambitions to enter the data center with its Windows Server platform, an environment where the NonStop platform had been established for ages and professionals had developed best practices.

Another part of the discussion was the Fault Tolerance versus High Availability topic. NonStop was already an established Fault Tolerant solution for business-critical environments, while Windows only had ambitions to move towards that market with the Data Center product. It was a logical move, looking at the status of (web) applications, SQL and, last but not least, Exchange, where they were going and what customers expected of those products regarding availability and reliability. To repeat an infamous quote from a NonStop colleague back then: “E-mail is not business critical”. But that was almost 10 years ago. Things have changed... or haven’t they?

Single Point of Failure
I’ll start by introducing the availability concept, which revolves around eliminating the single point of failure. A single point of failure is any element in the whole system of hardware, software and organization that can cause downtime, i.e. disruption of services. After identifying a single point of failure, we want to eliminate it to prevent downtime, which is, after all, the ultimate goal for a business-critical system. We can approach this task using two different strategies: Fault Tolerant (FT) or High Availability (HA). Identifying and eliminating single points of failure is an ongoing process, as most IT environments change over time.

Availability
To understand the Fault Tolerant and High Availability strategies we need to define the term “Availability”. In the dictionary, availability is defined as the quality or state of being available, or an available person or thing, where in both cases available means present or ready for immediate use. Availability is usually expressed as a percentage, for example in a service level agreement, but what does that percentage mean? To explain this, take a look at the following diagram:

[Diagram: Lifecycle]

I assume this lifecycle speaks for itself. Using this diagram, the availability is calculated as MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to repair. The related expected downtime is calculated as (1 – Availability) * 1 year. Note that the time between failure and recovery isn’t used in the calculation.

I’ll use a simple example: a 500 GB Seagate Barracuda 7200.12 (ST3500412AS) with an MTBF specification of 750,000 hours. You have a 24-hour replacement contract and need about 4 hours to restore the backup, so the MTTR is 28 hours. The availability would then be 750,000 / (750,000 + 28) = 0.9999626680, or about 99.9963%, resulting in a yearly downtime of (1 – 0.9999626680) * (365 days * 24 hours * 60 minutes) = 19.6 minutes.
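The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function names `availability` and `yearly_downtime_minutes` are made up for illustration, and the 28-hour MTTR is simply the 24-hour replacement window plus the 4-hour restore.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(mtbf_hours, mttr_hours):
    """Return availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail):
    """Return the expected downtime per year, in minutes."""
    return (1 - avail) * MINUTES_PER_YEAR


# Figures from the disk example: 750,000 h MTBF, 24 h replacement + 4 h restore.
a = availability(750_000, 24 + 4)
print(f"availability: {a:.6%}")
print(f"downtime:     {yearly_downtime_minutes(a):.1f} minutes per year")
```

Because the downtime is computed against minutes per year, the result comes out in minutes. Multiply by hours per year instead if you prefer hours.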

Of course, with hardware these numbers are theoretical and to some extent a marketing thing; how else can Seagate specify an MTBF of 750,000 hours (about 85 years)? I tend to look at it as an indication of the reliability you can expect. For example, compare the MTBF of the 7200.12 drive with that of an enterprise-class drive from Seagate’s ES product line: the ST3500320NS has an MTBF of 1,200,000 hours.

That’s the reason you should use enterprise-class drives in your storage solution instead of desktop drives, which aren’t supposed to run in 24×7 environments. To add to that, the MTBF decreases when components are used in series (RAID 0: MTBF = 1 / (1/MTBF1 + ... + 1/MTBFn)) and increases when they are used in parallel (RAID 1: MTBF = MTBF1 * (1 + 1/2 + ... + 1/n)). When you try to do these calculations for the whole supply chain, with all the elements and their individual specifications and support contracts, this can get very complex.
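As a rough illustration, the series and parallel formulas can be written as two small Python functions. The names `mtbf_series` and `mtbf_parallel` are my own. The parallel formula also rests on some textbook assumptions not spelled out above: identical components, independent failures and no repair of a failed component.

```python
def mtbf_series(mtbfs):
    """MTBF of components in series: any single failure fails the set (as in RAID 0)."""
    return 1 / sum(1 / m for m in mtbfs)


def mtbf_parallel(mtbf, n):
    """MTBF of n identical components in parallel: the set fails only when all have failed."""
    return mtbf * sum(1 / k for k in range(1, n + 1))


# Two of the 750,000-hour desktop drives from the example.
print(mtbf_series([750_000, 750_000]))  # striped pair
print(mtbf_parallel(750_000, 2))        # mirrored pair
```

For two identical drives the striped pair ends up with half the MTBF of a single drive, while the mirrored pair gets one and a half times that figure.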

The 9’s
Availability is often expressed as a series of 9’s, e.g. 99.9%. The more 9’s, the better (less downtime). Note that for each increased level of availability, the required effort increases significantly. By effort, don’t think of technical solutions only. It also means organizational measures like having skilled personnel and proper procedures.

In fact, only a small percentage of outage causes are technical; the majority of incidents are due to human error. And yes, that includes that bad driver, which was programmed by humans. This is why changes in a properly managed infrastructure should always go through test and acceptance procedures in environments representative of, or identical to, the production environment. Unfortunately, this doesn’t always happen, as not all IT departments have this luxury, mostly for financial reasons.

Availability%   Downtime / Year   Downtime / Month
99.0%           3.65 days         7.3 hrs
99.9%           8.76 hrs          43.8 min
99.99%          52 min            4.3 min
99.999%         5.2 min           26 sec
99.9999%        31 sec            2.6 sec
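The figures in this table follow directly from the availability percentage. The sketch below recomputes them, assuming a 365-day year and a month of one twelfth of a year; `downtime_hours` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours(availability_pct, period_hours):
    """Return expected downtime in hours for a period at the given availability percentage."""
    return (1 - availability_pct / 100) * period_hours


for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_hours(pct, HOURS_PER_YEAR / 12)
    print(f"{pct}%  {per_year:10.4f} h/year  {per_month * 60:10.4f} min/month")
```

Each extra 9 divides the allowed downtime by ten, which is why every step up takes so much more effort.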

Fault Tolerant
The goal of a Fault Tolerant solution is to maximize the Mean Time Between Failures (MTBF). This is achieved by mirroring or replicating systems. These monolithic systems run software in parallel on identical hardware. This is called Lockstep (which, for your information, refers to synchronized marching).

Because Fault Tolerant systems run in parallel, the results of an operation can be compared. When the results don’t match, a fault has occurred. Since the faulty system can’t be identified using only 2 parallel systems, there’s also a variation to this architecture where one server functions as master and the other as slave, the slave functioning as a hot standby. To resolve the ambiguity, you could use three systems, where the majority of the systems determines the right output.
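The majority idea can be sketched in a few lines of Python. This is a toy illustration of voting on replica outputs, not how any particular Fault Tolerant product implements it; `majority_vote` and `faulty_replicas` are hypothetical names.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value a strict majority of replicas agree on, or raise if none exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: faulty replica cannot be identified")
    return value


def faulty_replicas(outputs):
    """Return the indexes of replicas whose output differs from the majority."""
    good = majority_vote(outputs)
    return [i for i, out in enumerate(outputs) if out != good]


# Three replicas ran the same operation; replica 1 returned a different result.
print(majority_vote([42, 41, 42]))    # the agreed output
print(faulty_replicas([42, 41, 42]))  # which replica to disable
```

With only two replicas that disagree, `majority_vote` raises, which is exactly the ambiguity described above: a mismatch is detected, but there is no way to tell which side is wrong.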

When faults are detected in a Fault Tolerant system, the failing component (or system) is disabled and the mirror takes over. This makes the experience transparent for the end-user. There is one caveat: since Fault Tolerant systems run software in parallel, software faults are also mirrored.

Examples of Fault Tolerant components are ECC RAM, multiple NICs in a Fault Tolerant configuration, multipath network software, RAID 1+ disk systems or storage with replication technology. Examples of Fault Tolerant systems are HP NonStop (proprietary), Stratus ftServer or Unisys ES7000. There are also software-based solutions like Marathon EverRun or VMware’s FT offering.

High Availability
High Availability aims to minimize the MTTR. This can be achieved by redundant or standby (cold, hot) systems or non-technical measures like on-site support contracts. Systems take over the functionality of the failing system after the failure has occurred. Therefore, High Availability solutions aren’t always completely transparent for the user. The effects of a failing system and the consequences for the end-user depend on the software, e.g. a seamless reconnect or a requirement to log in again. Another point of attention is the potential loss of information when pending transactions are lost because of the failure. To make the experience more transparent for the user, applications need to be resilient, e.g. by detecting failure and retrying the transaction.
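What “detecting failure and retrying the transaction” might look like on the application side is sketched below. Everything here is illustrative: `run_with_retry` and `flaky_transaction` are made-up names, and `ConnectionError` stands in for whatever error your client library raises when a node fails over. Retrying is only safe when the transaction can be repeated without side effects.

```python
import time


def run_with_retry(transaction, attempts=3, delay_seconds=0.1):
    """Call transaction(); on ConnectionError wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except ConnectionError:
            if attempt == attempts:
                raise  # give up and surface the failure to the caller
            time.sleep(delay_seconds)  # give the standby time to take over


# Hypothetical transaction that fails once (the failover) and then succeeds.
calls = {"count": 0}


def flaky_transaction():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("primary node went away")
    return "committed"


print(run_with_retry(flaky_transaction))
```

From the user’s point of view the failover then shows up as a short delay instead of an error message.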

Examples of High Availability technologies are load balancing – software or hardware-based – and replication, where load balancing is used for static data and replication for dynamic data.

The Present
After a decade, technology has evolved but is still founded on old concepts. Network load balancing is still here and clustering (anyone remember Wolfpack?), although we moved from shared nothing to replication technology, remains largely unchanged. This means either there hasn’t been much innovation or the technologies do a decent job; after all, it’s still a matter of demand and supply. Yes, we moved from certified-configurations-only shared storage solutions to flexible Database Availability Groups (hey, this is still an Exchange blog), but most changes either add functionality or take away constraints, e.g. cluster modes (majority node set, etc.), multiple replicas and configurable replication.

Windows Server Data Center Edition   x86                              x64
2000                                 Max. 32 GB, 32 CPUs, 4 nodes     N/A
2003 SP2                             Max. 128 GB, 32 CPUs, 8 nodes    Max. 512 GB, 64 CPUs, 8 nodes
2003 R2 SP2                          Max. 64 GB, 32 CPUs, 8 nodes     Max. 2 TB, 64 CPUs, 8 nodes
2008                                 Max. 64 GB, 32 CPUs, 16 nodes    Max. 2 TB, 64 CPUs, 16 nodes
2008 R2                              N/A                              Max. 2 TB, 64 CPUs (256 logical), 16 nodes

What about Fault Tolerance and Windows’ Data Center Edition as the panacea for all your customers requiring “maximum uptime”? The issue with Fault Tolerant was that it came with a hefty price tag, especially in those days. Costs were a multiple of the costs involved with High Availability solutions on decent (read: stable) hardware. So, for those extra 9’s you needed deep pockets. For example, around 2001 a Compaq ES7000 with Windows Server 2000 Data Center Edition, the joint-support queue (i.e. Microsoft and OEM) and services came with a $2m price tag, for which you got the promise of 99.9% availability.

Compare that to buying a few ProLiants with Windows Server 2000 Advanced Server, some Fault Tolerant components (FT NICs, RAID), off-the-shelf High Availability technology and dedicated personnel (justifiable with that DCE price tag) for, say, $250,000. With skilled personnel and a controlled environment you could easily reach 99% availability. Is that price difference worth 3 days of downtime a year? Also, the simplicity of implementing those technologies made High Availability in Windows accessible to the masses. Nowadays, certainly in the Exchange world, you seldom see an environment that doesn’t use load balancing or some form of clustering.

Note that in the past decade, I’ve never encountered Data Center for hosting Exchange. In fact, as of Exchange 2003, support for running on Data Center was dropped. Nowadays, Data Center is regarded as an attractive option for large-scale virtualization based on Hyper-V. That is not only because Data Center costs less than back then (about $3000 per CPU, so hurray for multi-core, but with a 2 CPU minimum) and runs certified on more hardware, but also because it comes with unlimited virtualization rights. This means you may run Windows Server 2008 R2 (or a previous version) Standard, Enterprise and Datacenter in the virtual instances without purchasing additional licenses for those.

With all the large-scale virtualization and consolidation projects going on, virtualizing Exchange or other parts of your IT infrastructure, it’s good to know that there are other options when required by the business.
