There is a huge misunderstanding about Reliability and Availability. I recently came across a multi-million dollar project where the customer was asking for 99.9% reliability for a web-based "service". I could swear they meant Availability.
I have seen this confusion often enough that I think a bit of explanation could be useful. So in a series of postings I will address the following points:
1- The technical definition of each word in simple terms, and their mathematical calculations using real-world examples.
2- How to configure a system to meet different levels of availability.
3- When you should care about Reliability, and when you should care about Availability.
4- If you are drawing up an RFP or a contract, what parameters should be specified in each case when you ask for (or commit to) a specific level of Reliability or Availability?
This post, the first in the series, deals with the technical definition of each word and its mathematical calculation.
- Reliability: Reliability is the likelihood that a given component or system will function without failure throughout a given period of time.
For example, let's assume that you have a system with an MTBF (Mean Time Between Failures) of 3 years, or 26,280 hours. You want to calculate the likelihood that this system will have no outages during any 1-year observation period (8,760 hours). To do this, use the following formula:
R = e**(-t/MTBF)
where
R = Reliability
e = 2.71828182845904, the base of the natural logarithm
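t = the observation period, in the same units as MTBF
MTBF = Mean Time Between Failures of the given system
Using this formula with t = 8,760 hours and MTBF = 26,280 hours, we find that the answer to the question is e**(-1/3) = 0.7165. In other words, the chance of at least one breakdown during any 1-year observation period is 1-0.7165=0.2835 (~28%). Note that this formula assumes the system fails at a constant rate over time.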
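The calculation above can be reproduced in a few lines of Python. This is only a sketch: the function name `reliability` and the constant `HOURS_PER_YEAR` are my own, and the code assumes 365-day years and a constant failure rate.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years (assumption)

def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure during t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

mtbf = 3 * HOURS_PER_YEAR  # 3-year MTBF = 26,280 hours
r = reliability(HOURS_PER_YEAR, mtbf)

print(f"R = {r:.4f}")
print(f"Chance of at least one failure in 1 year = {1 - r:.4f}")
```

Changing the first argument lets you see how quickly reliability drops as the observation period grows relative to the MTBF.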
- Availability: Availability is the percentage of time that a given system will be functioning as required. The measurements that form the basis of calculation for this percentage may be discrete (the number of times an engine will start if tried 1000 times) or continuous (the number of hours that a telephone switch will be operational in a given year).
In the converged worlds of IT/Telecom, the word "Availability" is very commonly used to refer to "Steady State Availability", or up-time ratio.
There are two formulas for availability:
- First let's consider how an "unscheduled" outage can impact the availability of a system. To do that use this formula to measure availability:
A = 1 - (t_outage/T)
where
t_outage = Total duration of unscheduled outages within the window
T = Agreed upon window of time for Availability Measurements, in the same units as t_outage
So as an example, let's assume that your contract calls for 99.95% availability in any 1 month period of time. What would two unscheduled outages of 15 minutes each mean?
A = 1 - [(2*15)/(30*24*60)] = 1 - 30/43,200 => A = 99.93%. That is below the 99.95% target, so you may be in hot water!
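As a rough sketch, the same check can be scripted so that it works for any list of outages. The function name `measured_availability` and the 30-day month are assumptions made for illustration.

```python
def measured_availability(outage_minutes: list[float], window_minutes: float) -> float:
    """Availability over a window, given the duration of each unscheduled outage."""
    return 1 - sum(outage_minutes) / window_minutes

window = 30 * 24 * 60   # 30-day month in minutes (assumption): 43,200
target = 0.9995         # contracted availability

a = measured_availability([15, 15], window)
print(f"Measured availability: {a:.2%}")
print("Target met" if a >= target else "Target missed")
```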
- To estimate Availability in advance use this formula:
A = MTBF/(MTBF+MTTR)
where
A = Availability
MTBF = Mean Time Between Failures of the given system
MTTR = Mean Time To Repair
Now, let's say you are using the computer system above with an MTBF of 3 years, or 26,280 hours, and a 4 hour MTTR. What kind of Availability can you expect?
Plugging in the numbers gives you A = 26,280/(26,280+4) = 26,280/26,284 = 99.98%
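Here is a minimal sketch of the estimate in Python. The function name `steady_state_availability` is my own, and the downtime-per-year figure assumes 8,760-hour years.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = steady_state_availability(mtbf_hours=26_280, mttr_hours=4)
expected_downtime_hours = (1 - a) * 8_760  # average downtime per year (assumes 365-day years)

print(f"A = {a:.2%}")
print(f"Average downtime per year: {expected_downtime_hours:.2f} hours")
```

With these inputs the average downtime works out to roughly 1.3 hours per year, which is simply the 4-hour repair spread over the 3-year MTBF.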
So, let's ask a question now. Based on the example above, if you are providing a web-based service using the system above (with 3 year MTBF and 4 hours MTTR), it should be safe to commit to 99.98% availability, correct?
No, not necessarily. The figure above is a long-run average, so it only holds over an observation period comparable to the full duration of the MTBF (3 years). What it means is that, on average, many systems of this type observed over a 3 year period will have an availability of 99.98%.
When promising availability, one of the important factors to consider is the "observation period". As an example, if your contract calls for 99.98% availability in any given 30 day period, all you need is a single outage lasting 30 minutes, and the availability for that month drops to 1 - 30/43,200 = 99.93% (the same result as in the first calculation for Availability after an outage).
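To see how tight a short observation period is, it helps to turn an availability target into a downtime budget. The sketch below does that for a 30-day window; the function name `downtime_budget_minutes` and the list of targets are illustrative assumptions, not contract terms.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day observation period = 43,200 minutes

def downtime_budget_minutes(target: float, window_minutes: float) -> float:
    """Total outage time the window can absorb while still meeting the target."""
    return (1 - target) * window_minutes

for target in (0.999, 0.9995, 0.9998, 0.9999):
    budget = downtime_budget_minutes(target, WINDOW_MINUTES)
    print(f"{target:.2%} over 30 days allows {budget:.2f} minutes of outage")

# A single 30-minute outage measured against the same window
after_outage = 1 - 30 / WINDOW_MINUTES
print(f"Availability after one 30-minute outage: {after_outage:.2%}")
```

A 99.98% target leaves under 9 minutes of outage per 30 days, far less than the 4 hour MTTR used in the example above.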
It may not be practical, however, to ask your customers to measure the availability of the system over a period as long as 3 years, and they may insist on a 1 month observation period anyway. The question to answer is: how can you configure the system to meet this level of availability?
The answer to this question will be the topic of the next post in this series. In the meantime, feel free to send me your thoughts and comments.