3.6.1 Review of Reliability Terminology
Reliability (R) is the probability that a product or service will operate properly for a specified period of time under design operating conditions without failure.
The failure rate (λ) is defined as the probability per unit time that a failure occurs in a given interval, given that no failure has occurred prior to the beginning of the interval.
For a constant failure rate λ, reliability as a function of time is:

R(t) = e^(−λt)
Mean time between failures (MTBF), as its name implies, is the mean of the probability distribution of time to failure. For a statistically large sample, it is the average time the equipment performed its intended function between failures. For the example of a constant failure rate:

MTBF = 1/λ
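To make the constant-failure-rate relationships concrete, the short Python sketch below evaluates MTBF = 1/λ and R(t) = e^(−λt). The failure rate and mission time used here are illustrative assumptions, not figures from the text.

```python
import math

# Assumed constant failure rate: 2 failures per 10,000 operating hours.
failure_rate = 2 / 10_000          # lambda, in failures per hour

# Mean time between failures for a constant failure rate: MTBF = 1 / lambda.
mtbf_hours = 1 / failure_rate      # 5,000 hours for this assumed rate

# Reliability over a one-year mission (8,760 hours): R(t) = e^(-lambda * t).
mission_hours = 8_760
reliability = math.exp(-failure_rate * mission_hours)

print(f"MTBF: {mtbf_hours:,.0f} hours")
print(f"R(8,760 h): {reliability:.3f}")   # roughly 0.17 for these assumed values
```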
Mean time to repair (MTTR) is the average time it takes to repair the failure and get the equipment back into service.
Availability (A) is the long‐term average fraction of time that a component or system is in service and satisfactorily performing its intended function. This is also called steady‐state availability. Availability is defined as the mean time between failures divided by the mean time between failures plus the mean time to repair:

A = MTBF / (MTBF + MTTR)
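The availability definition can be checked the same way; the sketch below applies A = MTBF / (MTBF + MTTR) to assumed MTBF and MTTR values chosen only to show the arithmetic.

```python
# Steady-state availability from the definition above: A = MTBF / (MTBF + MTTR).
# The MTBF and MTTR figures are illustrative assumptions, not data from the text.
mtbf_hours = 5_000      # mean time between failures
mttr_hours = 8          # mean time to repair

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability: {availability:.5%}")   # about 99.840% for these values
```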
High reliability means that there is a high probability of good performance in a given time interval. High availability is a function of failure frequency and repair times and is a more accurate indication of data center performance.
As more and more buildings are required to deliver service guarantees, management must decide what performance is required from the facility. Availability levels of 99.999% (5.25 minutes of downtime per year) allow virtually no facility downtime for maintenance or other planned or unplanned events. Therefore, moving toward high reliability is imperative. Since the 1980s, the overall percentage of downtime events caused by facilities has grown as computers have become more reliable. And although this percentage remains small, total availability is dramatically affected because repair times for facility events are so long. A further analysis of downtime caused by facility failures indicates that utility outages have actually declined, primarily due to the installation of standby generators.
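The 5.25 minutes per year quoted above follows directly from the availability percentage; the small sketch below performs that conversion for several common availability levels.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a non-leap year

# Annual downtime implied by each availability level; 99.999% availability
# works out to roughly 5.25 minutes of downtime per year, as quoted above.
for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} availability -> "
          f"{downtime_minutes:,.2f} minutes of downtime per year")
```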
The most common response to these trends is reactive: that is, spending time and resources to repair the offender. If a utility goes down, install a generator. If a ground‐fault trips critical loads, redesign the distribution system. If a lightning strike burns power supplies, install a new lightning protection system. Such measures certainly make sense, as they address real risks in the data center. However, strategic planning can identify internal risks and provide a prioritized plan for reliability improvements. Planning and careful implementation will minimize disruptions while making the business case to fund these projects.
As technological advances find their way onto the data center’s raised floor, the facility will be impacted in unexpected ways. As equipment footprints shrink, free floor area is populated with more hardware. However, because the smaller equipment rejects the same amount of heat, power and cooling densities grow dramatically, and floor space for cooling equipment increases. The large footprint now required for reliable power without planned downtime – e.g., switchgear, generators, UPS modules, and batteries – also affects the planning and maintenance of data center facilities. Over the last two decades, the cost of the facility relative to the computer hardware it houses has not grown proportionately. Budget priorities that favor computer hardware over facilities improvement can lead to insufficient performance. The best way to ensure a balanced allocation of capital is to prepare a business analysis that includes costs associated with the risk of downtime. This cost depends on the consequences of an unplanned service outage in that facility and the probability that an outage will occur.