The Failure of Risk Management - Douglas W. Hubbard

A “COMMON MODE FAILURE”

The year 2017 was remarkable for safety in commercial air travel. There was not a single fatality worldwide from an accident. Air travel had already been the safest form of travel for decades. Even so, luck played some part in the 2017 record, and that luck would not last. That same year, a new variation of the Boeing 737 MAX series passenger aircraft was introduced: the 737 MAX 8. Within twelve months of the initial rollout, well over one hundred MAX 8s were in service.

In 2018 and 2019, two crashes of the MAX 8, totaling 346 fatalities, showed that a particular category of failure was still very possible in air travel. Although the details of the two 737 crashes were still emerging as this book was written, they appear to be an example of a common mode failure. In other words, the two crashes may be linked to the same cause. The term is familiar in systems risk analysis in some areas of engineering, where several failures can have a common cause. It is like a weak link in a chain, but where the weak link is part of multiple chains.

I had an indirect connection to another common mode failure in air travel thirty years before this book came out. In July 1989, I was the commander of the Army Reserve unit in Sioux City, Iowa. It was the first day of our two-week annual training and I had already left for Fort McCoy, Wisconsin, with a small group of support staff. The convoy carrying the rest of the unit was going to leave that afternoon, about five hours behind us. But just before the main body was ready to leave for annual training, the rest of my unit was deployed for a major local emergency.

United Airlines flight 232 to Philadelphia was being redirected to the small Sioux City airport because of serious mechanical difficulties. It crashed, killing 111 passengers and crew. Fortunately, the large number of emergency workers available and the heroic airmanship of the crew helped make it possible to save 185 onboard. Most of my unit spent the first day of our annual training collecting the dead from the tarmac and the nearby cornfields.

During the flight, the DC-10's tail-mounted engine failed catastrophically, causing the fast-spinning turbine blades to fly out like shrapnel in all directions. The debris from the turbine managed to cut the lines to all three redundant hydraulic systems, making the aircraft nearly uncontrollable. Although the crew was able to guide the aircraft in the direction of the airport by varying the thrust to the two remaining wing-mounted engines, the lack of tail control made a normal landing impossible.

Aviation officials would refer to this as a “one-in-a-billion” event[2] and the media repeated this claim. But because mathematical misconceptions are much more common than one in a billion, if someone tells you that something that had just occurred had merely a one-in-a-billion chance of occurrence, you should consider the possibility that they calculated the odds incorrectly.
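This reasoning can be made concrete with Bayes' rule. The sketch below is illustrative only: the prior chance that the analysis was flawed and the chance of the event under a flawed analysis are hypothetical numbers chosen for the example, not figures from the aviation officials or the NTSB.

```python
# Illustrative only: every number below is a hypothetical assumption.
p_event_if_model_right = 1e-9   # the claimed "one-in-a-billion" chance
p_event_if_model_wrong = 1e-4   # assumed chance if the analysis missed something
p_model_wrong = 0.01            # assumed prior: 1% chance the analysis is flawed


def posterior_model_wrong(prior_wrong, p_event_wrong, p_event_right):
    """Bayes' rule: P(analysis was wrong | the event occurred)."""
    joint_wrong = prior_wrong * p_event_wrong
    joint_right = (1.0 - prior_wrong) * p_event_right
    return joint_wrong / (joint_wrong + joint_right)


posterior = posterior_model_wrong(
    p_model_wrong, p_event_if_model_wrong, p_event_if_model_right
)
print(f"P(analysis was wrong | event occurred) = {posterior:.4f}")
```

Under these assumed inputs, even a modest prior doubt about the analysis turns into near certainty once the supposedly one-in-a-billion event has actually happened.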

This event, as may be the case with the recent 737 MAX 8 crashes, was an example of a common mode failure because a single source caused multiple failures. If the failures of the three hydraulic systems were entirely independent of each other, then the failure of all three in the DC-10 would be extremely unlikely. But because all three hydraulic systems had lines near the tail engine, a single event could damage all of them. The common mode failure wiped out the benefits of redundancy. Likewise, a single software problem may have caused multiple 737 crashes.
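A small calculation shows why a shared cause wipes out the benefit of redundancy. This is a minimal sketch with hypothetical probabilities (they are not DC-10 data): each of three systems is assumed to fail on its own with probability `p_single`, and one common event that disables all of them at once is assumed to occur with probability `p_common`.

```python
# Illustrative only: p_single and p_common are hypothetical values.
p_single = 1e-3   # assumed chance that one hydraulic system fails on its own
p_common = 1e-6   # assumed chance of one event that disables all systems
n_systems = 3


def p_all_fail_independent(p, n):
    """All n systems fail, assuming the failures are fully independent."""
    return p ** n


def p_all_fail_with_common_cause(p, n, p_cc):
    """All n systems fail, either through the common cause or,
    if it does not occur, through n independent failures."""
    return p_cc + (1.0 - p_cc) * p ** n


independent = p_all_fail_independent(p_single, n_systems)
combined = p_all_fail_with_common_cause(p_single, n_systems, p_common)
print(f"independent only:  {independent:.3e}")
print(f"with common cause: {combined:.3e}")
print(f"ratio:             {combined / independent:.0f}x")
```

With these assumed inputs, the independent-failure estimate is about one in a billion, but the common cause alone makes total failure roughly a thousand times more likely. The redundant systems contribute almost nothing once a single event can reach all of them.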

Now consider that the cracks in the turbine blades of the DC-10 would have been detected except for what the National Transportation Safety Board (NTSB) called “inadequate consideration given to human factors” in the turbine blade inspection process. Is human error more likely than one in a billion? Absolutely. And human error in large complex software systems like those used on the 737 MAX 8 is almost inevitable and takes significant quality control to avoid. In a way, human error was an even-more-common common mode failure in the system.

But the common mode failure hierarchy could be taken even further. Suppose that the risk management method itself was fundamentally flawed. If that were the case, then problems in design and inspection procedures, whether in hydraulics or software, would be very hard to discover and much more likely to materialize. In effect, flawed risk management is the ultimate common mode failure.

And suppose risk management methods are flawed not just in one airline but in most organizations. The effects of disasters like Katrina, the financial crisis of 2008/2009, Deepwater Horizon, Fukushima, or even the 737 MAX 8 could be inadequately planned for simply because the methods used to assess the risk were misguided. Ineffective risk management methods that somehow manage to become standard spread this vulnerability to everything they touch.

The ultimate common mode failure would be a failure of the risk management process itself. A weak risk management approach is effectively the biggest risk in the organization.

The financial crisis that occurred while I wrote the first edition of this book was another example of a common mode failure, one that traces back to the failure of risk management at firms such as AIG, Lehman Brothers, and Bear Stearns, and at the federal agencies appointed to oversee them. Previously loose credit practices and overly leveraged positions combined with an economic downturn to create a cascade of loan defaults, tightening credit among institutions, and further economic downturns. Poor risk management methods are used in government and business to guide decisions that involve billions, or even trillions, of dollars and that affect human health and safety.

Fortunately, the cost to fix the problem is almost always a fraction of a percent of the size of what is being risked. For example, a more realistic evaluation of risks in a large IT portfolio worth over a hundred million dollars would not have to cost more than a million—probably a lot less. Unfortunately, the adoption of a more rigorous and scientific management of risk is still not widespread. And for major risks, such as those in the previous list, that is a big problem for corporate profits, the economy, public safety, national security, and you.

A NASA scientist once told me the way that NASA reacts to risk events. If she were driving to work, veered off the road and ran into a tree, NASA management would develop a class to teach everyone how not to run into that specific tree. In a way, that's how most organizations deal with risk events. They may fix that immediate cause but not address whether the original risk analysis allowed that entire category of flaws to happen in the first place.
