Maintaining Mission Critical Systems in a 24/7 Environment - Peter M. Curtis
1.1 Introduction
Continuous, clean, and uninterrupted power and cooling are the lifeblood of any data center, especially one that operates 24 hours a day, 7 days a week. Critical enterprise power is the power without which an organization would quickly be unable to achieve its business objectives. Today, more than ever, enterprises of all types and sizes are demanding 24‐hour system availability. This means enterprises must have 24‐hour power and cooling day after day, year after year. One such example is the banking and financial services industry. Business practices mandate continuous uptime for all computer and network equipment to facilitate round‐the‐clock trading and banking processes anywhere and everywhere, from any device in the world. Banking and financial service firms are completely intolerant of unscheduled downtime, given the guaranteed loss of business that invariably results. However, providing the best equipment is not enough to ensure 24‐hour operation throughout the year. The goal is to achieve reliable 24‐hour power, cooling, and processing at all times, regardless of the technological sophistication of the equipment or the demands placed upon that equipment by the end‐user, be it business or municipality.
Today most industries are constantly expanding to meet the needs of the growing global digital economy. The industry as a whole has been innovative in the design and use of the latest technologies, driving its businesses to become increasingly digitized in this highly competitive business environment. The industry is progressively more dependent on the continuous operation of its data centers in reaction to the competitive realities of a global economy. Achieving optimum reliability when the supply and availability of power are becoming less certain is challenging, to say the least. The data center of the past required only the installation of standalone protective electrical and mechanical equipment. Data centers today operate on a much larger scale, 24/7. The proliferation of distributed systems, in which thousands of desktop PCs and workstations connected through LANs, WANs, WLANs, SANs, VPNs, etc. simultaneously use dozens of business software applications and reporting tools, makes each building a “computer room.” These computer rooms are also known as Intermediate Distribution Frame (IDF) and Main Distribution Frame (MDF) critical spaces. As we add up the total number of locations used by each bank all over the world via the internet, we realize the necessity of a critical infrastructure and the associated benefits of uptime and reliability.
The reputation of Corporate America was severely harmed almost two decades ago by a number of historically significant events: the collapse of the dot‐com bubble and high‐profile corporate scandals. These events took a significant toll on financial markets and deflated the faith and confidence of investors. In response, governments and other global organizations enacted new laws, policies, and regulations or revised existing ones. Examples include the Sarbanes‐Oxley Act of 2002 (SOX) and the USA PATRIOT Act in the United States, and the international Basel II accord. In addition to management accountability, another embedded component of SOX makes it imperative that companies not risk losing data or even risk downtime that could jeopardize accessing information in a timely fashion. These laws can actually improve business productivity and processes.
Many companies, due to lack of awareness, a misunderstanding of reliability concepts, or other factors, fail to consider installing backup equipment or designing their systems with levels of redundancy commensurate with their risk profile. Then, when a major power outage occurs, these same companies suddenly discover that they will take a huge hit operationally and financially. Only then do they learn that the hit would have been avoided entirely or reduced in magnitude had they undertaken appropriate action beforehand. During the months following the Northeast Blackout of 2003, for example, there was a marked increase in the installation of UPS systems and standby generators. Small and large businesses alike learned how susceptible they were to power disturbances and the associated costs of not being prepared. Some businesses that were not typically considered mission critical learned that they could not afford to be unprotected during a power outage. The Northeast Blackout of 2003 emphasized the interdependencies across the critical infrastructure and the cascading impacts that occur when one component falters. Most ATMs in the affected areas stopped working, although several had backup systems that enabled them to function for a short period. Soon after the power went out, the Comptroller of the Currency signed an order authorizing national banks to close at their discretion. Governors in a number of affected states made similar proclamations for state‐chartered depository institutions. The end result was a loss of revenue and profits and a threat to confidence in our financial system. More prudent planning and the proper level of investment in mission critical infrastructure for electric, water, and telecommunications utilities, coupled with proactive preparation and operation of building infrastructure, could have saved the banking and financial services industry millions.
At the present time, the risks associated with cascading power supply interruptions from the public electrical grid in the United States have increased due to the ever‐increasing reliance on computer and related technologies. This has occurred while investments in the reliability and security of the grid have not kept pace with the levels recommended by industry experts. Today there are trillions of devices and billions of people connected to the World Wide Web. As the number of computers and related technologies continues to multiply in this increasingly digital world, the demand for reliability increases as well. Businesses are not only competing in the marketplace to deliver whatever goods and services are produced for consumption, but now they must compete to hire the best engineers from a dwindling pool of talent who can design the best infrastructures needed to obtain and deliver reliable power and cooling. This keeps the mission critical manufacturing and technology centers up and running with the ability to produce the very goods and services that sustain them. The idea that businesses today must compete for the best talent to obtain reliable power is not new, nor are the consequences of failing to meet this challenge. Without reliable power, there are no goods and services for sale, no revenues, and no profits ‐ only losses when power is not available. Hiring and keeping the best‐trained engineers, who employ the very best analyses, make the best strategic choices, and follow the best operational plans to keep ahead of the power supply curve, is essential for any technologically sophisticated business to thrive and prosper. A key to success is to provide proper training and educational resources to engineers so they may increase their knowledge and keep current on the latest mission critical technologies available all over the world, which is one of the purposes of this content.
In addition, companies need to pool their efforts toward improving educational opportunities and certification programs for young mission critical engineers to help address the decreasing workforce necessary to sustain the growing mission critical industry.
It is also essential for critical industries to constantly and systematically evaluate their mission critical systems, assess and reassess their level of risk tolerance versus the cost of downtime, and plan for future upgrades in equipment and services that are designed to meet business needs and ensure uninterrupted power and cooling supplies in the years ahead. Simply put, minimizing unplanned downtime reduces risk. Unfortunately, the most common approach is reactive, that is, spending time and resources to repair a failed piece of equipment after the fact as opposed to identifying when the equipment is likely to fail and repairing or replacing it without interruption. If the utility goes down, install a generator. If a ground‐fault trips critical loads, redesign the distribution system. If a lightning strike burns out power supplies, install a new lightning protection system. Such measures certainly make sense, as they address real risks associated with the critical infrastructure; however, they are always performed after the harm has occurred. Often, such efforts proceed in haste without enough consideration of how the short‐term fix fits into the larger picture of how the facility’s systems should operate in an integrated manner. This can result in the introduction of new vulnerabilities. Strategic planning, on the other hand, can identify internal risks and provide a prioritized plan for reliability improvements that addresses the root causes of failure before failures occur.
In the world of high‐powered business, owners of real estate have come to learn that they, too, must meet the demands for reliable power supply to their tenants. As more and more buildings are required to deliver service guarantees, management must decide what performance is required from each facility in the building. Availability levels of 99.999% (roughly 5.26 minutes of downtime per year) allow virtually no facility downtime for maintenance or other planned or unplanned events. Moving toward high reliability is imperative. Moreover, the work of avoiding the landmines that can cause outages and unscheduled downtime never ends. Event planning and impact assessments are tasks that are never truly completed; they should be viewed afresh at least once every budget cycle.
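The downtime allowed by a given availability level follows from simple arithmetic. The sketch below assumes a 365‐day year and treats availability as the fraction of the year the facility is in service; the function name `annual_downtime_minutes` is illustrative and does not come from the text.

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, implied by an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare a few common availability targets ("two nines" through "five nines").
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {annual_downtime_minutes(pct):.2f} minutes/year")
```

Each additional "nine" cuts the allowable downtime by a factor of ten, which is why a five‐nines target leaves essentially no room for maintenance that takes the facility offline.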
The evolution of data center design and function has been driven, in part, by the need for uninterrupted power. Data centers now employ many unique designs developed specifically to achieve the goal of uninterrupted power within defined project constraints based on technological need, budget limitations, and the specific tasks each center must achieve to function usefully and efficiently. Providing continuous operation under all foreseeable risks of failure such as power outages, equipment breakdown, internal fires, and so on requires the use of modern design and modeling techniques to enhance reliability. These include redundant systems and components, standby power generation, fuel systems, automatic transfer and static switches, pure power quality, UPS systems, cooling systems, raised access floors, fire protection, as well as the use of Probabilistic Risk Analysis modeling software (each will be discussed in detail later) to predict potential future outages and develop maintenance and upgrade action plans for all major systems.
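The effect of redundancy on availability can be illustrated with a textbook series/parallel calculation. This is not the Probabilistic Risk Analysis software mentioned above, only a minimal sketch that assumes component failures are independent, which real facilities only approximate (common‐cause failures and transfer equipment are ignored). All availability figures and function names are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """All components must work: multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """At least one component must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures, chosen only for illustration.
utility = 0.999      # utility feed
generator = 0.98     # standby generator, idealized as an independent backup
ups = 0.9999         # single UPS between the source and the critical load

# The load loses its source only if the utility and the generator both fail.
power_source = parallel_availability([utility, generator])

# The load then depends on the source AND the UPS, so they combine in series.
critical_load = series_availability([power_source, ups])

print(f"utility alone:           {utility:.5f}")
print(f"utility plus generator:  {power_source:.5f}")
print(f"source through one UPS:  {critical_load:.5f}")
```

Under these assumed numbers the redundant source is far more available than the utility alone, but the single UPS in series then becomes the limiting element, which is the kind of weak point a fuller risk model is meant to expose.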
Also vital to the facility's life cycle is two‐way communication between upper management and facilities management. Only when both ends fully understand the three pillars of infrastructure reliability ‐ design, maintenance, and operation of critical environments (including the potential risk of downtime and recovery time) ‐ can they fund and implement an effective plan. Because the costs associated with reliability enhancements are significant, sound decisions can only be made by quantifying performance benefits against downtime cost estimates for each upgrade option to determine the best course of action. Planning and careful implementation will minimize disruptions while making the business case to fund necessary capital improvements and implement comprehensive maintenance strategies. When the business case for additional redundancy, specialized consultants, documentation, and ongoing training reaches the boardroom, the entire organization can be galvanized to prevent catastrophic data losses, damage to capital equipment, and danger to life and limb.
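Weighing performance benefits against downtime cost estimates for each upgrade option can be sketched as a simple ranking. The option names, costs, avoided‐downtime estimates, and the hourly downtime cost below are all hypothetical placeholders; a real business case would draw them from the organization's own risk assessment.

```python
from dataclasses import dataclass


@dataclass
class UpgradeOption:
    name: str
    annualized_cost: float         # capital plus maintenance per year (hypothetical)
    downtime_hours_avoided: float  # expected downtime avoided per year (hypothetical)


def net_annual_benefit(option: UpgradeOption, downtime_cost_per_hour: float) -> float:
    """Expected downtime cost avoided per year minus the option's annualized cost."""
    return option.downtime_hours_avoided * downtime_cost_per_hour - option.annualized_cost


def rank_options(options, downtime_cost_per_hour: float):
    """Return the options sorted from highest to lowest net annual benefit."""
    return sorted(
        options,
        key=lambda o: net_annual_benefit(o, downtime_cost_per_hour),
        reverse=True,
    )


# Illustrative inputs only.
options = [
    UpgradeOption("Second standby generator", 120_000.0, 4.0),
    UpgradeOption("Static transfer switch", 40_000.0, 1.5),
    UpgradeOption("Lightning protection system", 15_000.0, 0.2),
]
DOWNTIME_COST_PER_HOUR = 50_000.0

for option in rank_options(options, DOWNTIME_COST_PER_HOUR):
    benefit = net_annual_benefit(option, DOWNTIME_COST_PER_HOUR)
    print(f"{option.name}: net annual benefit {benefit:,.0f}")
```

A negative result does not by itself rule out an option, since this sketch ignores safety, regulatory, and reputational factors that the business case may also need to weigh.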