Maintaining Mission Critical Systems in a 24/7 Environment - Peter M. Curtis

1.6 Documentation and Human Factor

The mission critical industry’s focus on physical infrastructure enhancements dates from the early days of the trade, when all effort went into design and construction techniques to improve mission critical equipment.

Twenty‐five years ago, the technology supporting mission critical loads was simple. The industry was in its infancy, and the electrical load profile had little sophistication. Since then, data centers have grown from a few mainframes supporting minimal software applications to server farms that can occupy 100,000 ft² or more, with Google and Microsoft being prime examples.

As more processing power has been required to sustain our global economy, the electrical and mechanical systems supporting the critical load have become increasingly complex. With businesses relying on this infrastructure, more capital dollars have been invested to improve the uptime of their business lines. Today billions of dollars are invested at the enterprise level in the infrastructure that supports the business’s 24/7 applications; the major investments are normally in design, equipment procurement, and project management. Few capital dollars are invested in documentation, change management, education/training, or operations and maintenance. The initial capital investment is just the tip of the iceberg (Figure 1.1).


Figure 1.1 Hidden Costs of Operations


Figure 1.2 Typical screenshot of SmartWALK™ dashboard

(Courtesy of PMC Group One, LLC.)

Years ago, most organizations relied heavily on their workforce to retain much of the information regarding the mission critical systems. A large body of personnel had a similar level of expertise and remained with the company for decades. As a result, little emphasis was placed on maintaining a living document for the critical infrastructure. Tables 1.4 to 1.6 list questions about managing the loss of personnel, managing documentation, and managing during a critical event.

Table 1.4 Managing Loss of Critical Personnel

The Issues: Employee Turnover, Retirement, Sick Leave or Vacation
Was knowledge lost?
Where is existing documentation?
How are new employees trained?
What risks are faced during the transition?

Table 1.5 Documentation Issues

The Issue: Traditional documentation systems are inconsistent, inaccessible, and unstructured.
How is information shared?
Is system data readily available?
Where is the documentation?
How are revisions approved and made available to all users?

Table 1.6 Managing During Critical Events

The Threats: Fires, Natural Disasters, Blackouts, Intentional Disruption
Who should be contacted?
Is your critical system data defined?
Where are the procedures?
Will you be able to respond in time?

Figure 1.3 SmartWALK™ mobile screenshot

(Courtesy of PMC Group One, LLC.)


Figure 1.4 Screenshot of SmartTEAM® Learning Management System

(Courtesy of PMC Group One, LLC.)

The mission critical industry can no longer manage its critical systems as it did twenty‐five years ago. The requirements are very different today: the sophisticated nature of data center infrastructure requires constant refreshing and updating of documentation. One way to achieve this is to build into each capital project a living document system that provides the level of granularity necessary to operate a mission critical infrastructure. This helps keep the living document current each time a project is completed or a milestone is reached. Accurate information is the first level of support, giving first responders the intelligence they need to make informed decisions during critical events. It also acts as a succession plan as employees retire and new employees are hired, reducing risk and shortening the new employees’ learning curve. Remember that more than 50% of all downtime can be traced to human error.

Human error as a cause of hazard scenarios must be identified, and the factors that influence human errors must be considered. Human error is a given and will arise in all stages of the process. It is vital that the factors influencing the likelihood of errors be identified and assessed to determine if improvements in the human factors design of a process are needed. Surprisingly, human factors are perhaps the most poorly understood aspect of process safety and reliability management.

Balancing system design against the training of operating staff in a cost‐effective manner is essential to critical infrastructure planning. When designing a mission critical facility, the level of complexity and the ease of maintenance are major concerns. When there is a problem, the Facilities Manager (FM) is under enormous pressure to isolate the faulty system while maintaining data center loads and other critical loads. The FM does not have time to work through complex switching procedures during a critical event. A recipe for human error exists when systems are complex, especially if key system operators are not immediately available, or if the documented Emergency Action Procedures (EAP) and Standard Operating Procedures (SOP) are not immediately available or have not been reviewed and updated periodically. A relatively simple electrical system design allows quicker and easier troubleshooting during this critical time.
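The point about procedures that have not been reviewed periodically can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Procedure` record layout, the sample entries, and the 365‐day review interval are hypothetical and do not come from the text or from any particular documentation product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Procedure:
    """One controlled document, e.g. an EAP or SOP (hypothetical record layout)."""
    name: str
    kind: str            # "EAP" or "SOP"
    last_reviewed: date


def overdue_procedures(procedures, today, max_age_days=365):
    """Return the procedures whose last review is older than max_age_days.

    The 365-day default is an assumed review interval, not a figure from the text.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [p for p in procedures if p.last_reviewed < cutoff]


# Hypothetical sample data for illustration only.
library = [
    Procedure("UPS bypass transfer", "EAP", date(2023, 1, 15)),
    Procedure("Generator monthly test", "SOP", date(2024, 3, 1)),
]

stale = overdue_procedures(library, today=date(2024, 6, 1))
for p in stale:
    print(f"{p.kind} '{p.name}' last reviewed {p.last_reviewed.isoformat()}: review overdue")
```

A check of this kind could run on a schedule so that stale procedures are flagged before a critical event rather than discovered during one.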

To further complicate the problem, equipment manufacturers and service providers are challenged to find and retain the industry’s top technicians within their own companies. As 24/7 operations become more prevalent, the available talent pool will continue to shrink. This suggests that response times could increase from the current standard of four hours to a much longer and less tolerable timeframe. The growing imbalance between the supply of and the demand for highly qualified mission critical technicians only strengthens the case for a simplified, easily accessible, and well‐documented design.

When designing a mission critical facility, a budgeting and auditing plan should be established. Each year substantial amounts of money are spent on building infrastructure, but inadequate capital is allocated to sustain that critical environment through the use of proper documentation, education, and training.
