Reliability Assessment: A Guide to Aligning Expectations, Practices, and Performance - Daniel Daley
A Fictional Story: What Do You Have a Right to Expect?
Individual commitment to a group effort —
that is what makes a team work, a company work,
a society work, a civilization work.
Vince Lombardi
The setting is the waiting area outside the Plant Manager’s office. Sitting alone in the waiting area is Joe, the plant’s reliability engineer. He has been asked to meet with the plant manager at 3:00 pm and to bring with him the records for the recycle compressor in the P2S unit. It is now 3:20 pm and Joe can hear the voices of several people in the Plant Manager’s office. The voices are muffled so he cannot tell whose voices they are or what is being discussed. Joe has another meeting with his team and an equipment vendor at 4:00 so he was hoping this meeting would be over on time.
The recycle compressor in the P2S plant has had a sordid reliability history. It was the single largest cause of production losses in the P2S plant. Because that plant was in a “sold-out” position, every outage resulted in lost revenue.
At 3:35 pm, the door to the Plant Manager’s office opened; the Plant Manager looked out and invited Joe into his office. Inside, Joe found his boss (the Manager of Maintenance and Reliability), the Operations Manager for the P2S unit, and the Assistant Plant Manager. The Plant Manager dragged a chair from the back of the room into the middle of the group, then returned to his place behind the desk and took his seat.
“Have a seat, Joe. We have been discussing the recycle compressor in the P2S plant,” began the Plant Manager. “As you are aware, the machine has not been meeting our expectations and we need a solution.”
The Operations Manager interjected, “Our operators do their best to keep it running, but it is just a piece of junk.”
“It was designed, purchased, and built to the same corporate standards as the rest of this plant,” pointed out the Manager of Maintenance and Reliability, “and our maintenance department was just audited by corporate and found to be among the best in the company.”
“Let’s give Joe a chance to talk — that is what he was invited here to do,” chimed in the Assistant Plant Manager, doing his best to sound like a viable candidate for the next Plant Manager’s job that opened up. “Joe, you are the reliability expert. You probably know more than the rest of us put together,” he added.
“I am sure the machine was well designed. Corporate engineering purchased the best machine for the job, our operators are working with it as well as it can be operated, and our maintenance personnel are maintaining it as well as it can be maintained,” summarized the Plant Manager, showing his pride and ownership for each of those organizations. “It’s just not performing the way we expect it should operate and we are at a loss to understand why,” he added.
“Well, I don’t think you want to hear this, but your expectations may not be consistent with the facts,” began Joe.
“I don’t follow,” said the Plant Manager. “Are you disagreeing with what the others have said here today?”
“I assembled this file in preparation for this meeting,” began Joe. “There are a variety of records that are inconsistent with what was just said.”
“Joe, there is no reason to get defensive. No one is blaming you for the poor performance,” responded the Assistant Plant Manager.
“I was not trying to be defensive; I was just trying to lay out the history that provides some insight into what our expectations should be. The file paints a rather gloomy picture for this machine.
Of course, if that is not what you want to discuss, it is up to you,” Joe said, looking at the Plant Manager.
The Plant Manager waved Joe on saying, “I think you are right. Let’s hear about what is in the file. My impression is that we have given this machine every chance for success. Prove me wrong.”
Feeling a little like the defendant in a courtroom, Joe started down through a stack of papers in the file, sequentially handing each one to the Plant Manager and explaining what it said or meant.
“First,” Joe noted, “there is no record of concurrent design for reliability during the initial project development. Although the designers paid attention to the functionality of the system and system integrity, they did not take any formal steps to see that this machine (or any other part of the unit, for that matter) would provide any specific level of reliability or availability.”
The Assistant Plant Manager laughed and said, “You’re telling us that this machine is likely to blow up in our faces?”
Joe responded, “No, I said that integrity was addressed during the design, but not reliability or availability. It won’t blow up, but it will fail at unknown intervals. Based on the design, you don’t really know what percentage of the time the machine will need to be shut down for maintenance.”
Continuing, Joe pushed another document toward the Plant Manager saying, “This is the original bid comparison. This machine was the least expensive of all the alternatives. I am familiar with two of the other more expensive alternatives. They were selected and installed at two of our other plants. Both are experiencing much higher reliability and significantly lower maintenance costs.”
The Operations Manager commented, “If we selected the most expensive choice for every component, we would never get any new plants.”
Joe responded, “The fifteen-year lifecycle cost for this choice will end up being more than twice that of the closest alternative. And that is without considering the value of lost production. Again, a comprehensive lifecycle cost comparison was never made during the design. In other words, the cheapest choice up front is the most expensive choice over the long haul.”
By this time, the participants in the meeting other than Joe were giving each other nervous looks and were squirming in their seats.
Joe withdrew another document from the file and pushed it toward the Plant Manager, saying, “This is the record of alignment measurements completed during construction. What the records suggest is that there was an unusually high piping load on the inlet nozzle when the compressor was placed in service. The inlet piping is 24-inch diameter and the area it passes through is quite congested. Apparently re-routing the piping was viewed as too expensive. Therefore, the machine has had to deal with high nozzle stress for its entire life.”
“But that doesn’t directly cause failures,” responded the Assistant Plant Manager.
“Well, stress translates into strain, and strain translates into displacement, and displacement between stationary and rotating components results in more wear and early wear out,” explained Joe. “In other words, it is a ‘defect’ in the system.”
Pulling yet another small stack of papers from the manila folder, Joe described their content. “These reports cover a series of events that resulted in emergency shutdowns of this machine. You can see that most of them were situations when the feed drum was allowed to exceed high level. It appears that in several situations the machine ingested at least some liquid.”
The Operations Manager took the sheets from the Plant Manager’s desk, saying “Now you are trying to blame the operators. I can assure everyone here that my operators do a good job and this was not their fault.”
“I am not trying to blame anyone,” said Joe. “I am just trying to describe some of the things that affect the reliability of this machine. It really doesn’t matter to this machine if it was slugged with liquid due to operator oversight or a malfunctioning suction drum level instrument.”
About this time, several of the people in the meeting began to glaze over. Apparently the meeting was not going as they had expected. They wished they were elsewhere.
Joe again reached into his file and pulled out two documents. He was sensing that his air time was running out and, if he wanted to make a point, he would need to do so quickly. “This first document is a record of predictive and preventive maintenance,” started Joe. “It shows that roughly fifty percent of the time the recommended PM is not being done.”
Joe flipped to the second document, saying, “This is a record of the work that was recommended for the last turnaround and the work that was actually completed. Our analysis showed that several components were at the end of their life and several other components would not survive another run. You can see that the decision was made to defer the overhaul from the turnaround. Here you see that when the bearings failed, the other components with limited life were not changed because of the desire to get the machine back as quickly as possible. The decision to defer the overhaul from the turnaround caused the first shutdown. In turn, the decision to make only a partial repair caused the second.”
By this time, the Plant Manager had seen enough. He said, “I will have to take responsibility for those decisions. I made those choices.”
Again sensing that his audience was running out of patience, Joe pulled out the last two documents from the folder. The first was a document from a recent project and the second was a copy of a recent budget detail.
Pointing to the project document, Joe began by saying, “The first of these documents describes some communications that occurred during the recent capacity expansion project. You may or may not be aware that the corporate project management process does not include any design-for-reliability or reliability analysis steps. As a result, we in the plant reliability department compared the pre-project reliability to the reliability of the proposed post-project configuration. Our calculations showed that the proposed post-project configuration would be less reliable. This is the result of the redundant electrical feeder to this machine being used to supply a new load. We recommended to the project manager that the redundancy be maintained, but were told that was beyond the project scope.”
Once again, the Assistant Plant Manager interjected, “That was analyzed and viewed as an acceptable risk.”
Joe responded, “One problem is that risk is really a measure of hidden cost. In some cases, the costs appear later; in other cases, the costs appear sooner. In this case, we have already experienced a failure associated with this choice. Although the failure was charged to this machine, it was really a failure of the electrical system that supplies the motor, and of the decision-making process.”
Joe continued, referring to the final document, “This final document is a copy of the budget detail for last year. I have highlighted a line item that proposes replacement of several outdated controls on the machine. Also some of the instrument wiring shows deterioration and should be replaced as a part of our plant renewal initiative. As you can see, the line item was struck from the budget and will need to be proposed for some later time.”
“Is that all,” said the Plant Manager, “or is there anything else?”
“That’s about it,” answered Joe.
“Well, I can see how some of these things might have an effect, but I guess I don’t completely follow what you are saying,” said the Plant Manager.
Sensing that several of the participants were becoming defensive, Joe started slowly to explain, “At the beginning of the meeting, you said that this machine was not meeting expectations. For reliability, realistic expectations should be based on an assessment of risk. Risk is a measure of the likelihood that an undesired event will occur. In this case, the undesired event is a shutdown of this machine. Each choice and action during the life of the machine will affect the risk of failure. Some choices will improve reliability, some will maintain the same level of risk, and some will degrade reliability and increase the risk of failure. Each of the choices I have mentioned today tended to increase the risk of failure and reduce reliability. Thinking in terms of ‘what we have a right to expect,’ we should think of reducing our expectations or investing in efforts that will enhance reliability.”
“Well, Joe, I think I speak for everyone here when I say that we appreciate your efforts in assembling the information you shared with us today. I am sure you have other things you need to do and we have taken enough of your time,” said the Plant Manager, looking around the room. “If everyone else would hang around a few minutes, Joe, you can get back to work.”
Joe left the file with the Plant Manager and departed his office, closing the door behind him. The room was silent for a few minutes. Finally, the Plant Manager broke the silence asking, “Are there any comments?”
The Manager of Maintenance and Reliability (to whom Joe reported) started with, “Joe is a very conscientious employee. He takes his job seriously and works a lot of long hours.”
After another few moments of silence, the Plant Manager began, “I guess I have two observations. The first has to do with the information in this file. From my viewpoint, it is too late to bring these things up at this time. If they were as critical as Joe contends, he should have brought them up earlier.”
The good-hearted but naïve Operations Manager responded, “I think he did bring them up, but no one listened.”
His face reddening, the Plant Manager responded, ignoring the Operations Manager and speaking directly to the Assistant Plant Manager (to whom the Operations Manager reported), “You miss my point entirely. It is his job to get our attention. He needs to get our attention when there is a problem. He needs to be more persistent. That is his job.”
By this time, the room was completely silent. No one but the Plant Manager spoke. “I said there were two things. The second is the defeatist attitude I heard. What I heard in the tone of what he said, if not the words, was that he was giving up on this machine. We just cannot afford that kind of attitude.”
The meeting was over. Ignoring the others in his office, the Plant Manager looked down on his desk and began working on something else. One by one, the other members of the audience got up and left the office.
Some months later, Joe was given his annual appraisal. Although there was nothing specific, his supervisor mentioned that he was not viewed as a “team player.” Several months after that, Joe left and joined another company.
Joe’s new employer thought he walked on water. Joe’s old company continued to suffer along with frequent failures of the recycle compressor and poor reliability in general.
Although this story is only fictional, it is a compilation of a variety of real-life experiences. It is intended to impart several messages to the reader:
1. Each of the papers Joe extracted from his folder represents one of the elements that contribute to the overall reliability of any system or piece of equipment.
2. The composite reliability or “what you have a right to expect” is a combination of all the items mentioned.
3. Unless the impact of each choice is clearly quantified, it is impossible to have an accurate understanding of reasonable expectations. Most people like to recall only the good things.
4. People can become defensive when their decisions are shown to be faulty.
5. It may be better to have a third party perform the analysis than sacrifice an employee by asking him to perform the evaluation and deliver the bad news.
Let’s discuss the elements that should be included in a WDYHARTE (What do you have a right to expect?) analysis. As our fictional reliability engineer explained, each and every point in the life of a system affords us opportunities to make choices that will affect reliability. In some cases, the individuals involved are aware they are making choices that affect reliability. In other cases, they are not aware. Sometimes they make sound choices that positively affect the reliability, but sometimes they make choices that compromise the reliability. They then often rationalize that current savings are more important to the business than the added costs, stemming from poor reliability, that will be experienced much later (or by someone else).
Let’s go back and review the elements one by one that determine “what you have a right to expect.” Expectations for performance are often not based on any comprehensive analysis or assessment. Instead, they are based on a “gut feel” or “hoping for the best.” Expectations without the information needed to provide an informed opinion are misinformed and ultimately lead to disappointment. When expectations are aligned with reality, people and businesses are more likely to get what they expect and expect what they get.
Inherent reliability is probably the single most important characteristic of any system or piece of equipment in terms of determining overall reliability performance. The inherent reliability of a system or device is determined by its configuration and component selection. For instance, if a plant has redundant feed pumps or recycle compressors, that fact will profoundly affect the inherent reliability. Also, if the components were chosen based on lifecycle cost rather than just first cost, the inherent reliability will be enhanced. In performing this analysis, the lifecycle cost includes first cost, all forms of maintenance costs, the costs associated with unreliability (e.g., lost profit associated with unplanned outages), and costs associated with unavailability (e.g., lost profit associated with necessary planned outages).
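The lifecycle cost described above can be sketched as a simple sum of its four parts. The following is only a minimal illustration: the `Option` fields, the function name, and every dollar figure are hypothetical, and the sketch ignores discounting (the time value of money) to keep the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One candidate machine; every figure used below is hypothetical."""
    name: str
    first_cost: float             # purchase and installation
    annual_maintenance: float     # all forms of maintenance, per year
    annual_unplanned_loss: float  # lost profit from unplanned outages, per year
    annual_planned_loss: float    # lost profit from necessary planned outages, per year


def lifecycle_cost(option: Option, years: int) -> float:
    """Undiscounted lifecycle cost: first cost plus recurring costs over `years`."""
    recurring = (
        option.annual_maintenance
        + option.annual_unplanned_loss
        + option.annual_planned_loss
    )
    return option.first_cost + years * recurring


# Made-up numbers chosen only to show the shape of the comparison.
cheap = Option("low bid", 1_000_000, 200_000, 150_000, 50_000)
robust = Option("higher bid", 1_500_000, 60_000, 30_000, 30_000)

for opt in (cheap, robust):
    print(f"{opt.name}: {lifecycle_cost(opt, 15):,.0f}")
```

With these invented inputs, the option with the lower first cost ends up with the higher fifteen-year total, which is the pattern the story describes. Real comparisons would use plant data and would normally discount future costs.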
The inherent reliability is a measure of the overall “robustness” of a system or piece of equipment. It provides an upper limit to the reliability and availability that can be achieved. In other words, no matter how much inspection or maintenance you perform, you will never exceed the inherent reliability. If you operate, maintain, and inspect a device as well as possible, you will be able to harvest all of the inherent reliability. On the other hand, if there are gaps in your operating, maintenance, or inspection practices, you will harvest only some portion of the inherent reliability.
If you wish to improve the inherent reliability of an existing system or device, you will need to change the current configuration or component choices and you will need to do so in a manner that improves reliability rather than detracts from it.
Because most systems and devices spend their lives with much the same inherent reliability as was decided by the original design, it is critical that the initial design take reliability and availability requirements into consideration. Adding a redundant component is both difficult and expensive after the original system has been built. In the case of a plant, piping often has to be run a great distance to a spot where space is available. This awkward configuration is also confusing for operators. Although redundancy in printed electronic circuits is less expensive than in large physical systems, changing the software that controls the circuits and takes advantage of redundancy is complicated, and it is difficult to ensure that new defects have not been introduced.
It is best to apply one or more of the design techniques that fall under the heading of Design-For-Reliability to ensure that long-term reliability requirements are addressed concurrently with the physical design of any system. One example of a DFR technique is RBD or the Reliability Block Diagram technique. Using this technique, each of the elements of a system is represented by a block and connected to other elements in a manner that closely represents the manner in which they interact in the actual system. Characteristics are assigned to each block; they cause it to act mathematically in the same manner as the actual component.
If the actual component has poor reliability, it will fail frequently. If it has poor availability, it will have characteristics that cause it to be down for maintenance a large portion of the time. For manually constructed RBDs, there are techniques that allow the composite reliability to be calculated by hand. It is also possible to construct RBDs in software that simulates the actual performance of real systems. These programs simulate the planned and unplanned outages of components based on characteristics that accurately represent the real-life components that have been chosen.
After RBDs have been assembled and calculations completed, you will have an initial estimate of the inherent reliability that is reasonable to expect. If the calculated reliability does not meet requirements or expectations, either the configuration can be changed (e.g., adding redundancy) or different (more reliable) components can be selected. By inserting the new configuration or characteristics of new components into the model and re-running the calculations or software, it will be possible to estimate the improvement.
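The hand calculation for a simple RBD can be sketched in a few lines. This is a minimal illustration under two stated assumptions: blocks fail independently of one another, and each block is described by a single reliability value over some fixed mission time. The function names and all numeric values are hypothetical.

```python
from math import prod


def series(*reliabilities: float) -> float:
    """All blocks must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """At least one block must work: 1 minus the chance that all of them fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical block reliabilities over the same mission time.
motor, compressor, cooler = 0.98, 0.90, 0.97

# Original configuration: three blocks in series.
base = series(motor, compressor, cooler)

# Changed configuration: a second, identical compressor added as a redundant block.
with_spare = series(motor, parallel(compressor, compressor), cooler)

print(f"single compressor:    {base:.4f}")
print(f"redundant compressor: {with_spare:.4f}")
```

Re-running the calculation with the new configuration gives the estimated improvement. Simulation software goes further by modeling repair times and planned outages, which this sketch does not attempt.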
Once a configuration and list of component choices have been finalized, it is possible to perform lifecycle cost comparisons to evaluate if the cost of changes is justified by the reduction in lifecycle costs (resulting from fewer and/or shorter outages or by lower maintenance costs).
If initial project design procedures account only for system integrity (e.g., structural or pressure retaining capability) and not for reliability and availability performance, the owner will have to “take what he gets” for those two performance areas.
Another element of reliability mentioned in the fictional account described above is that of initial construction or assembly. It is possible to design a system to be reliable, but then lose a portion of the benefits of all that cost and effort when the system is constructed. Inherent reliability depends on things being assembled in a manner that does not introduce additional defects. All too often, shortcuts made to meet schedule or due to misunderstandings in how things should be assembled lead to the inclusion of defects. The example of pipe stress on the nozzles of rotating equipment is one that many reliability engineers have faced. Inadequate door seals that allow liquid intrusion and ultimately cause corrosion are another common example. The list is endless, but the solution is strict controls during construction.
Harvesting All the Inherent Reliability
As mentioned earlier, the inherent reliability is the maximum possible reliability performance, but it is possible to perform much worse. The portion of the inherent reliability that is actually harvested or achieved is a result of:
• How well the system is operated
• How well it is maintained
• How well it is inspected
An automobile is a good example of a device that has a usable life that is determined by how it is operated. For example, some vehicles last several hundred thousand miles for an original owner. Yet, the exact same models frequently last only tens of thousands of miles when they are traded from hand to hand. If the owner drives the vehicle conservatively, sees that it is regularly maintained, and is sensitive to unusual noises or behaviors, it is possible to achieve a long and reliable life. If the owner accelerates too quickly, rides the brakes, and is insensitive to minor problems until they turn into major problems, the car is likely to be less reliable and to have a shorter life.
Although failures that are caused by poor operation are typically charged to the equipment rather than to the operator, a significant portion of the reduced reliability is not the fault of the equipment. For instance, if the MTBF (Mean Time Between Failures) for a device is two years and every other failure is caused by mis-operation, then the equipment MTBF should be four years. If the MTBF of a device is two years and every second failure is due to mis-operation and every third failure is due to a power failure or an upstream instrument failure, the MTBF of the device should be six years, since only two of every six failures in sequence are then the device’s own. If you are blaming the device and, as a result, you are focusing your attention on the device only, you will never achieve the desired improvement.
In order to achieve the desired improvement and to harvest the full inherent reliability, it is important to clearly recognize the source of failures.
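The MTBF adjustment in the example above can be reproduced with a short calculation. This is a sketch under an assumption about how to read the example: failures are numbered in sequence, and failure number k is treated as externally caused when k is a multiple of one of the stated intervals. The function name and its parameters are hypothetical.

```python
from typing import Sequence


def equipment_mtbf(observed_mtbf: float, external_every: Sequence[int], window: int) -> float:
    """Recompute MTBF after setting aside failures with external causes.

    Failures are numbered 1..window; failure k counts as external if k is a
    multiple of any interval in `external_every`.
    """
    inherent = sum(
        1 for k in range(1, window + 1)
        if not any(k % n == 0 for n in external_every)
    )
    if inherent == 0:
        raise ValueError("no failures are attributable to the equipment")
    # The same elapsed time now contains fewer equipment-caused failures.
    return observed_mtbf * window / inherent


# Every other failure is mis-operation.
print(equipment_mtbf(2.0, [2], window=6))

# Every second failure is mis-operation and every third is power or instruments.
print(equipment_mtbf(2.0, [2, 3], window=6))
```

Under this reading, the two cases give four and six years. If the two external causes never coincided, so that one half plus one third of all failures were external, only one failure in six would belong to the device and the figure would be twelve years. Either way, the observed MTBF understates what the equipment itself can deliver.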
In addition to mis-operation, it is possible to cause failures or allow failures to occur because of inadequate maintenance or inspection. Let’s look at a few simple examples.
The “Path to Failure” is a series of causes and effects that ultimately lead to a failure. At the very beginning of the path is a Systemic Cause that creates a trap for some unsuspecting individual. The next step is a Human Cause leading to a Physical Cause and finally setting up a Failure Mechanism and, ultimately, a defect that will result in a failure. (The following diagram shows a cause-effect flow in which each effect sequentially becomes the cause of the following effect.)
A Failure Mechanism is a form of deterioration that ultimately produces a defect. For instance, for any mechanical device, the only possible failure mechanisms are corrosion, erosion, fatigue, and overload. Let’s take corrosion as an example. If a corrosion circuit exists (cathode, anode, electrolyte), there will be visible signs. First, it should be possible to see two dissimilar metals being joined by a liquid electrolyte, or the products of corrosion (rust) should be evident. If operators, craftspersons, and inspectors are keeping their eyes open, they should be able to recognize this failure mechanism at work. If this failure mechanism is allowed to go on working for a long enough period to result in a defect and a failure, it is not the fault of the device. It is the fault of the humans who operate, maintain, or inspect the device. In order to harvest all the inherent reliability, people need to:
• Know what they are looking for (e.g., understand failure mechanisms)
• Be placed by design and discipline in a position where deterioration or defects are evident (e.g., follow organized rounds in a disciplined manner)
• Keep their eyes open
Taken one step further, after a failure mechanism has been at work for a period of time, a defect will form. But the presence of a defect does not automatically result in a failure. Often nature “throws the dice” for some period of time after a defect has formed but before a failure occurs. By this I mean that several circumstances may need to be present to result in a failure. For example, corrosion may weaken a pipe, but the piping system may also have to experience unusual but not unexpected pressure increases before a failure will occur. This aspect of “forgiving nature” or a grace period between defect and failure provides another opportunity to prevent a failure. But, as with the case of active failure mechanisms, people need to play an active role in finding and removing defects.
Well-designed programs for operations, maintenance, and inspection are one of the keys to harvesting all the inherent reliability of a system. Poorly designed programs leave systems operating below the level that their inherent reliability makes possible.
Maintaining or Improving Inherent Reliability during Modification and Renewal
There are two distinctly different paradigms surrounding the aging of systems and equipment. One paradigm is best summed up by a remark often made about an aging system: “This plant is unreliable because it is getting old.” The other paradigm is the complete opposite: “We have been working with this unit for a long time, so we have worked out all the bugs and know how to stay ahead of the problems.” In the first case, aging is used as an excuse for poor reliability. The equipment is managing the personnel. In the second case, aging is used as a reason why reliability is good. The personnel are managing the equipment.
In addition to the short-term or day-to-day concerns affecting reliability, there are long-term concerns. For instance, most units go through some form of modernization, expansion, or renewal process during their life. These events are often used as opportunities to enhance reliability. Sometimes, however, the reliability after the event is worse than before.
One form of renewal is an overhaul or, for a complete plant, a turnaround. One philosophy espoused by those with a short-term point of view is to perform the absolute minimum amount of work during those events. Another viewpoint is to limit the work to the amount needed to fulfill requirements. If requirements call for reliable service for the next specified number of years, then the work scope will be designed to deliver that result.
A simple example that compares the minimum amount of work to the amount of work needed to provide reliable service for a specific period is the overhaul of a diesel engine. It may be possible to address immediate concerns and return the engine to service (albeit for a limited period) by replacing piston rings, fuel injectors and connecting rod bearings. This approach may even provide an engine that is usable for quite some time, depending on the condition of other parts. Yet, if you want to ensure the engine provides the same reliable life as a new engine, it is necessary to perform a careful tear-down, evaluating the condition and remaining life on each and every part. Components that have been worn beyond the point that they can provide the desired life must be replaced.
Other events that occur in the life of many plants and systems are a modification in service or an expansion. During these events, it is possible that current inherent reliability will be retained; it may also be enhanced or even reduced. As in the situation described in the fictional account above, it is not uncommon to see equipment that once provided a source of redundancy used instead as a source of additional capacity. In the example, a redundant electrical feeder was used as a source of power for new loads. It is not uncommon to see spare pumps placed in parallel service with primary pumps to increase throughput.
In some cases, this modification will reduce reliability simply by eliminating redundancy. In other cases, as with parallel pumps, in addition to the loss of redundancy, both pumps may actually wear faster because they are working against one another.
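The loss of redundancy can be put in rough numbers. The sketch below assumes two identical pumps that fail independently, each with the same availability; the function names and the 0.95 figure are hypothetical. It does not model the faster wear mentioned above, which would lower the second result further.

```python
def one_of_two(availability: float) -> float:
    """Spare arrangement: service continues if either unit is available."""
    return 1.0 - (1.0 - availability) ** 2


def two_of_two(availability: float) -> float:
    """Both units needed for the new throughput: an outage of either one counts."""
    return availability ** 2


a = 0.95  # hypothetical availability of each pump, assumed independent

print(f"one pump with a spare: {one_of_two(a):.4f}")
print(f"both pumps required:   {two_of_two(a):.4f}")
```

The same pair of pumps gives a very different system availability depending on whether the second unit is a spare or a required source of capacity.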
During the development of new facilities, we apply Design-For-Reliability techniques to ensure that the completed product is reliable. We can apply those same techniques during the design of modifications to ensure that the modified facility has an inherent reliability equal to or greater than before the change.
The fictional account provided at the beginning of this chapter paints a fairly gloomy picture of how the reliability engineer’s data is received by members of plant management. In some cases, I am sure it is an exaggeration; in others, it is fairly accurate. Think for a moment about issues in your personal life where you have built a set of expectations only to have them dashed by more accurate or realistic information. For many people, reliability is an abstract characteristic that is based more on luck and good intentions than it is on physical realities and solid analysis. For those individuals, it is often painful news when they learn that their systems and equipment are not reliable and that many of the elements contributing to the poor reliability were results of their own choices.
In order to minimize the negative impact of this discovery, it is best if the exercise of learning “what you have a right to expect” is accomplished as a part of a proactive exercise. This exercise should be done quite separate from any event resulting from poor reliability. Finding out that you have some opportunities for improvement feels a lot better when you are doing it on your own than when a catastrophic event has occurred and you are being forced to do so by your boss or his boss.
Independent third parties have little ownership for the programs that have been installed but are ineffective. They are also more likely to tell the complete and undistorted truth than someone who is dependent on the people receiving the report for pay increases and promotional opportunities. Another problem with using someone from inside your current organization is that each and every group has made some contribution to good or poor reliability. As a result, every employee within a plant can be biased in one way or another.
A comprehensive assessment of “what you have a right to expect” is different from an audit of your current reliability and maintenance programs. It is an evaluation of the effectiveness of all the elements important to reliability in the context of the inherent reliability of your current systems.
Using the example of an automobile, a concerned father may be willing to pay for the expectation of high reliability for his daughter’s vehicle by purchasing a new car for her. She has few of the other characteristics leading to high reliability (knowledge of how best to operate, maintain, or inspect it), but a new reliable vehicle can reasonably be expected to overcome those weaknesses.
Transferring the analogy to a system or piece of equipment, few of us have the luxury of replacing an item when it begins to age. In other words, we cannot “buy” reliability the way the protective father did. In most real-life cases, we have a right to expect only the level of reliability justified by our:
• Good operation
• Sound maintenance
• Thorough inspection
• Thoughtful renewal practices
In order to have a realistic assessment of “what we have a right to expect,” we must assess our expectations in light of inherent reliability as well as all other choices made over the life of the system or device.