Chapter 3

Reliability Engineering for the Maintenance Practitioner

We can now develop some of the reliability engineering concepts that we will need in subsequent chapters. Prior knowledge of the subject is not essential, as we will define the relevant terms and derive the necessary mathematical expressions. As this is not a text on reliability engineering, we will limit the scope of our discussion to the following areas of interest.

•Failure histograms and probability density curves;

•Survival probability and hazard rates;

•Constant hazard rates, calculation of test intervals, and errors with the use of approximations;

•Failure distributions and patterns, and the use of the Weibull distribution;

•Generation of Weibull plots from maintenance records;

•Weibull shape factor and its use in identifying maintenance strategies;

For a more detailed study of reliability engineering, we suggest that readers refer to the texts 3,4,6 listed at the end of the chapter.

3.1 FAILURE HISTOGRAMS

We discussed failures at the system level in Chapter 2. Failures develop as the result of one or more modes of failure at the component level. In the example of the engine’s failure to crank, we identified three of the failure modes that may cause the failure of the cranking mechanism.

If designers and manufacturers are able to predict the occurrence of these failures, they can advise the customers when to take corrective actions. With this knowledge, the customers can avoid unexpected production losses or safety incidents. Designers also require this information to improve the reliability of their products. In mass-produced items, the manufacturer can test representative samples from the production line and estimate their reliability performance. In order to obtain the results quickly, we use accelerated tests. In these tests, we subject the item to higher stress levels or operate it at higher speeds than normal in order to initiate failure earlier than it would naturally occur.

Let us take as an example the testing of a switch used in industrial applications. Using statistical sampling methods, the inspector selects a set of 37 switches from a given batch, to assess the life of the contacts. These contacts can burn out, resulting in the switch failing to close the circuit when in the closed position. In assessing the time-to-failure of switches, a good measure is the number of operations in service. The test consists of repeatedly moving the switch between the on and off positions under full load current conditions. During the test, we operate the switch at a much higher frequency than expected normally.

As the test progresses, the inspector records the failures against the number of operations. When measuring life performance, time-to-failure may be expressed in terms of the number of cycles, number of starts, distance traveled, or calendar time. We choose the parameter most representative of the life of the item. In our example, we measure ‘time’ in units of test cycles. The test continues until all the items have failed. Table 3.1 shows a record of the switch failures after every thousand cycles of operation.

We can plot this data as a bar chart (see Figure 3.1), with the number of switch failures along the y-axis and the life measured in cycles along the x-axis.

To find out how many switch failures occurred in the first three thousand cycles, we add the corresponding failures, namely 0 + 1 + 3 = 4. By deducting the cumulative failures from the sample size, we obtain the number of survivors at this point as 37 − 4 = 33. As a percentage of the total number of recorded failures, the corresponding figures are 4/37 or approximately 11% and 33/37 or approximately 89% respectively.


Table 3.1


Figure 3.1 Number of failures recorded per cycle.

We can view this information from a different angle. At the end of three thousand cycles, about 11% of the switches have failed and 89% have survived. Can we use this information to predict the performance of a single switch? We could state that a switch that had not failed during the first three thousand cycles had a survival probability of approximately 89%. Another way of stating this is to say that the reliability of the switch at this point is 89%. There is no guarantee that the switch will last any longer, but there is an 89% chance that it will survive beyond this point. As time passes, this reliability figure will keep falling. Referring to Table 3.1, we can see that at the end of five thousand cycles,

•The cumulative number of failures is 17;

•The proportion of cumulative failures to the sample size (37) is 46%;

•The proportion of survivors is about 100% − 46% = 54%.

In other words, the reliability is about 54% at the end of five thousand cycles. Using the same method, by the end of nine thousand cycles the reliability is less than 3%.
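For readers who prefer to see the arithmetic spelled out, the following minimal sketch (in Python) reproduces the cumulative-failure, survivor, and reliability calculation described above. Only the first three entries of Table 3.1 (0, 1, and 3 failures) are used here; the function itself would accept the full table.

```python
def reliability_from_counts(failures_per_interval, sample_size):
    """Cumulative failures, survivors, and reliability after each interval.

    failures_per_interval: failure counts per test interval (e.g. per
    thousand cycles); sample_size: number of items on test.
    """
    results = []
    cumulative = 0
    for count in failures_per_interval:
        cumulative += count
        survivors = sample_size - cumulative
        results.append({
            "cumulative_failures": cumulative,
            "survivors": survivors,
            "reliability": survivors / sample_size,
        })
    return results

# Using the first three intervals of the switch test (0, 1, and 3 failures
# out of 37 items): after 3,000 cycles, 4 items have failed and the
# estimated reliability is 33/37, or about 0.89.
print(reliability_from_counts([0, 1, 3], 37)[-1])
```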

How large should the sample be, and will the results be different with a larger sample? With a homogeneous sample, the actual percentages will not change significantly, but the confidence in the results increases as the sample becomes larger. The cost of testing increases with the sample size, so we have to find a balance and get meaningful results at an acceptable cost. With a larger sample, we can get a better resolution of the curve, as the steps will be smaller and the histogram will approach a smooth curve. We can normalize the curve by dividing the number of failures at any point by the sample size, so that the height of the curve shows the failures as a ratio of the sample size. The last column of Table 3.1 shows these normalized figures.

3.2 PROBABILITY DENSITY FUNCTION

This brings us to the concept of probability density functions. In the earlier example, we can smooth the histogram in Figure 3.1 and obtain a result as seen in Figure 3.2. The area under the curve represents the 37 failures, and is normalized by dividing the number of failures at any point by 37, the sample size. In reliability engineering terminology, we call this normalized curve a probability density function or pdf curve. Because we tested all the items in the sample to destruction, the ratio of the total number of failures to the sample size is 1. The total area under the pdf curve represents the proportion of cumulative failures, which is also 1.


Figure 3.2 Probability density function.

If we draw a vertical line at time t = 3,000 cycles, the height of the curve gives the number of failures as a proportion to the sample size, at this point in time. The area to the left of this line represents the cumulative failure probability of 11%, or the chance that 4 of the 37 items would have failed. The area to the right represents the survival probability of 89%. In reliability engineering terminology, the survival probability is the same as its reliability, and the terms are interchangeable.

3.3 MORTALITY

We now turn to the concept of mortality, which, when applied in the human context, is the ratio of the number of deaths to the surviving population. To illustrate this concept, let us consider the population in a geographical area. Let us say that there are 100,000 people in the area on the day in question. If there were ten deaths in all on that day, the mortality rate was 10/100,000, or 0.0001. Actuaries analyze the mortality of a population with respect to their age. They measure the proportion of the population who die within one, two, three,...n years. A plot of these mortality values is similar to Pattern A in Figure 3.3 (which refers to equipment component failures). In the first part of the curve (the so-called infant mortality section), the mortality rate keeps falling.


Figure 3.3 Failure Patterns

A baby has a high chance of dying at birth, and the longer it survives, the greater its chance of continuing to live. After the first few days or weeks, the mortality rate levels out. For the next 50–70 years, it is fairly constant. People die randomly, due to events such as road accidents, food poisoning, homicides, cancer, heart disease, or other causes. Depending on lifestyle, diet, race, and sex, mortality starts to rise from about 50 years of age onward. As people get older, they become susceptible to more diseases, their bones tend to become brittle, and their general resistance becomes lower. Not many people live to 100 years, though some ethnic groups have exceptional longevity. Insurance companies use these curves to calculate their risks. They adjust the premiums to reflect their assessment of the risks.

We use a similar concept in reliability engineering. The height of the pdf curve gives the number of failures at any point in time, and the area of the curve to the right of this point gives the number of survivors. The term hazard rate designates equipment mortality: we divide the number of failures by the number of survivors at that point. In the earlier example, the hazard rate at t = 3,000 cycles is 3/33, or 0.0909. The study of hazard rates gives us an insight into the behavior of equipment failures, and enables us to make predictions about future performance.
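The hazard rate calculation can be sketched in the same way. The example below follows the convention used in the text (failures recorded in an interval divided by the number of survivors at that point) and reproduces the 3/33 figure; again only the first three intervals of Table 3.1 are used.

```python
def hazard_rates(failures_per_interval, sample_size):
    """Hazard rate per interval: failures recorded in the interval divided
    by the number of survivors remaining at the end of that interval
    (the convention used in the 3/33 example above)."""
    rates = []
    survivors = sample_size
    for count in failures_per_interval:
        survivors -= count
        rates.append(count / survivors if survivors > 0 else float("inf"))
    return rates

# With the first three intervals of Table 3.1 (0, 1, 3 failures out of 37),
# the hazard rate at t = 3,000 cycles is 3/33, or about 0.0909.
print(hazard_rates([0, 1, 3], 37)[-1])
```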

3.4 HAZARD RATES AND FAILURE PATTERNS

Prior to World War II, industrial equipment was simple, sturdy, heavy, and robust in design. Repairs were fairly simple, and could easily be done on site using ordinary hand tools. Breakdown strategies were common, which meant that equipment operated till failures occurred. The introduction of mass production techniques meant that interruptions of production machinery or conveyors resulted in large losses. At the same time, the design of equipment became more complex. Greater knowledge of materials of construction led to better designs with a reduction in weight and cost. Computer-aided analysis and design tools became available, along with greater computing capacity. As a result, designers could reduce safety factors (which included a factor for uncertainty or ignorance). In order to reduce investment costs, designers also reduced the amount of standby equipment installed and the intermediate storage or buffer stocks.

These changes resulted in slender, light, and sleek machinery. These machines were not as rugged as their predecessors, but they met the design conditions. Machine uptime became important in order to keep unit costs down. The preferred strategy was to replace complete sub-assemblies, as it took more time to replace failed component parts.

A stoppage of high-volume production lines resulted in large losses of revenue. In order to prevent such breakdowns, manufacturers used a new strategy. They replaced the sub-assemblies or parts at a convenient time before the failures occurred, so that the equipment was in good shape when needed. The dawn of planned preventive maintenance had arrived.

Prior to the 1960s, people believed that most failures followed the so-called bath-tub curve. This model is very attractive, as it is so similar to the human mortality curves. By identifying the knee of the curve, namely the point where the flat curve starts to rise, one could determine the timing of maintenance actions. Later research1 showed that only a small proportion of component failures followed the bath-tub model, and that the constant hazard pattern accounted for the majority of failures. Even where the bath-tub model did apply, finding the knee of the curve was not a trivial task.

As a result, conservative judgment prevailed when estimating the remaining life of components. Preventive maintenance strategies require that we replace parts before failure, so the useful life became synonymous with the shortest recorded life. Thus the replacement of many components took place long before the end of their useful life. The opportunity cost of lost production justified the cost of replacing components that were still in good condition.

The popularity of preventive maintenance grew especially in industries where the cost of downtime was high. This strategy was clearly costly, but was well justified in some cases. However, the loss of production due to planned maintenance itself was a new source of concern. Managers who had to reduce unit costs in order to remain profitable started to take notice of the production losses and the rising cost of maintenance.

Use of steam and electrical power increased rapidly throughout the twentieth century. Unfortunately, there were a large number of industrial accidents associated with the use of steam and electricity, resulting in the introduction of safety legislation to regulate the industries. At this time, the belief was that all failures were age related, so it seemed appropriate to legislate time-based inspections. It was felt that the number of incidents would be reduced by increasing the inspection frequencies.

Intuitively, people felt more comfortable with these higher frequency inspection regimes. Industrial complexity increased greatly from the mid-1950s onwards with the expansion of the airline, nuclear, and chemical industries. The number of accidents involving multiple fatalities experienced by these industries rose steeply.

By the late 1950s, commercial aviation became quite popular. The large increase in the number of commercial flights resulted in a corresponding increase in accidents in the airline industry. Engine failures accounted for a majority of the accidents and the situation did not improve by increasing maintenance effort. The regulatory body, the U.S. Federal Aviation Agency, decided to take urgent action in 1960, and formed a joint project with the airline industry to find the underlying causes and propose effective solutions.

Stanley Nowlan and Howard Heap1, both of United Airlines, headed a research project team that categorized airline industry failures into one of six patterns. The patterns under consideration are plots of hazard rates against time. Their study revealed two important characteristics of failures in the airline industry, hitherto unknown or not fully appreciated.

1.The failures fell into six categories, illustrated in Figure 3.3.

2.The distribution of failures in each pattern revealed that only 11% were age-related. The remaining 89% appeared to be failures not related to component age. This is illustrated in the pie-chart, Figure 3.4.

The commonly held belief that all failures followed Pattern A, the bathtub curve, had been used to justify age-based preventive maintenance. This belief was now called into question, as Pattern A accounted for just a small percentage of all failures in the airline industry. Nowlan and Heap questioned the justification for doing all maintenance using age as the only criterion.

We will discuss these issues later in the book.

An explanation of these failure patterns and a method to derive them using a set of artificially created failure data is given in Appendix 3-1.


Figure 3.4 Failure Patterns. Patterns A, B, and C, which are age-related, account for 11% of failures studied in the research project.

3.5 THE TROUBLE WITH AVERAGES

As we know, the average height of a given population does not tell us a great deal. If the average is, say, 1.7 m, we know that there will be some people who are shorter, say under 1.5 m, and some who are taller, perhaps over 2 m. If you are a manufacturer of clothes, you would need to know the spread or distribution of the heights of the population in order to design a range of sizes that are suitable.

We use the average or mean as a measure to describe a set of values. The arithmetic average is the one most commonly used, because it is easy to understand. The term average may give the impression that it is the value we should expect to observe; in practice, individual values can differ considerably from the average.

There is a similar situation when we deal with equipment failure rates. The majority of the failures may take place in the last few weeks of operation, thereby skewing the distribution. For example, if we recorded failures of 100 tires, and their combined operational life was three million km, what can we learn from the mean operational life of 30,000 km? In practice, it is likely that there were very few failures within the first 5000 km or so, and that a significant number of tires failed after 30,000 km. Hence, the actual distribution of failures is important if we are to use this information for predicting future performance. Such predictions are useful in planning resources, ordering replacement spares, and preparing budgets.

As a refinement, we can define the spread further using the standard deviation. However, even this is inadequate to describe the distribution pattern itself, as illustrated by the following example. In Table 3.2, you can see three sets of failure records of a set of machine elements. Figures 3.5, 3.6, and 3.7 respectively illustrate the corresponding failure distributions, labeled P, Q, and R.

Note that all three distributions have nearly the same mean values and standard deviations. The failure distributions are however quite different. Most of the failures in distribution P occur after about 5 months, whereas in distribution R, there are relatively few failures after 20 months. Thus, the two distributions are skewed, one to the left and the other to the right. The distribution Q is fairly symmetrical. Knezevic2 discusses the importance of knowing the actual distribution in some detail. He concludes his paper with the following observations.

•Knowledge of the actual failure distribution can be important;

•Use of a constant failure rate is not always appropriate;

•As investment and operational expenditure get greater scrutiny, the pressure to predict performance will increase; in many cases, the use of mean values alone will reduce the accuracy of predictions;

•Understanding the distributions does not need more testing or data.


Table 3.2 Distribution of failures—elements P, Q, and R


Figure 3.5 Distribution P: Mean = 15.21; Std.Dev. = 13.79


Figure 3.6 Distribution Q: Mean = 15.21; Std.Dev. = 13.79


Figure 3.7 Distribution R: Mean = 15.21; Std.Dev. = 13.79

3.6 THE SPECIAL CASE OF THE CONSTANT HAZARD RATE

So far we have emphasized the importance of knowing the actual failure distribution. One should not assume a constant or average failure rate unless there is evidence to believe this to be true. However, we know that in the airline industry, of the six patterns (Figure 3.3), patterns D, E, and F account for about 89% of the failures. Patterns D and F are similar to pattern E over most of the life. If we ignore early failures, the constant hazard pattern accounts for nearly 89% of all the failures. The picture is similar in the offshore oil and gas industry.

The Broberg study, published in 1973, showed similar patterns and distributions, while a U.S. Navy study (MSP), released in 1993, also showed similar curves, though the distributions were somewhat different. Both are quoted in a paper by Timothy Allen3.

In view of its dominant status, the special case of the constant hazard rate merits further discussion.

Let us examine the underlying mathematical derivations relating to constant hazard rates. In section 3.3, we defined the hazard rate as the ratio of the probability of failure at any given time to the probability of survival at that time. We can express this using the following equation.

z(t) = f(t) / R(t)        (3.1)

where z(t) is the hazard rate, f(t) is the probability of failure, or the height of the pdf curve, and R(t) is the survival probability, or the area of the pdf curve to the right, at time t. The cumulative failure is the area of the curve to the left at time t. The total area under the pdf curve, that is, cumulative failures plus survivors has to be 100% or 1.

F(t) + R(t) = 1        (3.2)

and

f(t) = dF(t)/dt        (3.3)

or, differentiating expression 3.2, dF(t)/dt = -dR(t)/dt

hence

f(t) = -dR(t)/dt        (3.4)

The constant hazard rate is denoted by λ, and is given by

z(t) = λ = f(t) / R(t)        (3.5)

Combining expressions 3.4 and 3.5, we get

λ = -(1/R(t)) dR(t)/dt

or

dR(t)/R(t) = -λ dt

Integrating, and noting that R(0) = 1,

R(t) = e^(-λt)        (3.6)
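As a quick numerical check of expression 3.6, the sketch below uses the constant failure rate of 0.015 failures per week that appears later in the Section 3.8 example (Table 3.3), and reproduces the expected survivor counts for a population of 1,000 items.

```python
import math

# Check of expression 3.6, R(t) = e^(-lambda*t), using the failure rate
# from the Section 3.8 example (lambda = 0.015 failures per week).
lam = 0.015
for weeks in (1, 2, 13):
    survivors = 1000 * math.exp(-lam * weeks)
    print(weeks, round(survivors))
# After 1, 2, and 13 elapsed weeks, roughly 985, 970, and 823 of the
# original 1,000 items are expected to survive, as in Table 3.3.
```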

3.7 AVAILABILITY

Availability is a measure of the time equipment is able to perform to specified standards, in relation to the time it is in service. The item will be unable to perform when it is down for planned or unplanned maintenance, or when it has tripped. Note that it is only required that the equipment is able to operate, and not that it is actually running. If the operator chooses not to operate it, this does not reduce its availability.

Some items are only required to operate when another item fails, or a specific event takes place. If the first item itself is in a failed state, the operator will not be aware of its condition because it is not required to work till another event takes place. Such failures are called hidden failures. Items subject to hidden failures can be in a failed state any time after installation, but we will not be aware of this situation.

The only way to know if the item is working is to place a demand on it. For example, if we want to know whether a fire pump will start, it must actually be started, either in a test or in a real fire. At any point in its life, we will not know whether it is in working condition or has failed. If it has failed, it will not start. The survival probability gives us the expected value of its up-state, and hence its availability on demand at this time. Thus, the availability on demand is the same as the probability of survival at any point in time. This will vary with time, as the survival probability keeps decreasing, and with it the availability. This brings us to the concept of mean availability.

3.8 MEAN AVAILABILITY

If we know the shape of the pdf curve, we can estimate the item’s survival probability. If the item has not failed till time t, the reliability function R(t) gives us the probability of survival up to that point. As discussed above, this is the same as the instantaneous availability.

In the case of hidden failures, we will never know the exact time of failure. We need to collect data on failures by testing the item under consideration periodically. It is unlikely that a single item will fail often enough in a test situation to be able to evaluate its failure distribution. So we collect data from several similar items operating in a similar way and failing in a similar manner, to obtain a larger set (strictly speaking, the failures must be independent and identically distributed, so using similar failures is an approximation). We make a further assumption, that the hazard rate is constant. When the hazard rate is constant, we call it the failure rate. The inverse of the failure rate is the Mean Time To Failure, or MTTF. MTTF is a measure of average operating performance for non-repairable items, obtained by dividing the cumulative time in service (hours, cycles, miles, or other equivalent units) by the cumulative number of failures. By non-repairable, we mean items that are replaced as a whole, such as light bulbs, ball bearings, or printed circuit boards.

In the case of repairable items, a similar measure of average operating performance is used, called Mean Operating Time Between Failures, or MTBF. This is obtained by dividing the cumulative time in service (hours, cycles, miles, or other equivalent units) by the cumulative number of failures. If after each repair the item is as good as new (AGAN), MTBF has the same value as MTTF. In practice, the item may not be AGAN in every case. In the rest of this chapter, we will use the term MTBF to represent both terms.

Another term used in a related context is Mean Time to Restore, or MTTR. This is a measure of average maintenance performance, obtained by dividing the cumulative time for a number of consecutive repairs on a given repairable item (hours) by the cumulative number of failures of the item. The term restore means the time from when the equipment was stopped to the time the equipment was restarted and operated satisfactorily.
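A minimal sketch of these three measures follows; the service hours, failure count, and repair durations are hypothetical figures chosen purely for illustration.

```python
def mtbf(total_time_in_service, number_of_failures):
    """Mean operating time between failures (or MTTF for non-repairable
    items): cumulative time in service divided by cumulative failures."""
    return total_time_in_service / number_of_failures

def mttr(repair_durations):
    """Mean time to restore: cumulative restoration time divided by the
    number of repairs."""
    return sum(repair_durations) / len(repair_durations)

# Hypothetical figures, for illustration only: 10 failures over a combined
# 8,000 hours of service, with restorations lasting 4, 6, and 5 hours.
print(mtbf(8000, 10))   # 800.0 hours
print(mttr([4, 6, 5]))  # 5.0 hours
```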

Table 3.3 shows a set of data describing failure pattern E. Here we show the surviving population at the beginning of each week instead of that at the end of each week. Figure 3.8 shows the cumulative number of failures, and Figure 3.9 shows the surviving population at the beginning of the first 14 weeks.


Table 3.3


Figure 3.8 Cumulative failures against elapsed time.


Figure 3.9 Surviving population at the beginning of each week.

We can use this constant slope geometry in Figure 3.8 to calculate the MTBF and failure rates. When there are many items in a sample, each with a different service life, we obtain the MTBF by dividing the cumulative time in operation by the total number of failures. We obtain the failure rate by dividing the number of failures by the time in operation. Thus,

failure rate λ = (number of failures) / (cumulative time in operation) = 1 / MTBF        (3.7)

For a rigorous derivation, refer to Hoyland and Rausand4, page 31. Note that this is the only case in which the relationship applies; in the other failure distributions, the slope of the cumulative failure curve changes continuously.

Because these are hidden failures, we can only replace an item after a test; we do not know whether it is in a failed condition unless we try to use it. How do we determine a justifiable test interval T? If, at the time of the test, we find the majority of items in a failed state, we have probably waited too long. In other words, we expect a very high survival probability at the time of the test. Thus, in the case of systems affecting safety or environmental performance, it would be reasonable to expect this to be 97.5% or more, based on, for example, a Quantitative Risk Assessment.

Let us try to work out the test interval with a numerical example, using the data in Table 3.3. At the beginning of week 1, all 1000 items will be in sound working order (As Good As New, or AGAN). At the beginning of week 2, we can expect 985 items to be in working order, and 970 items at the beginning of week 3. At the beginning of week 14, we can expect only 823 items to be in working condition. So far, we have not replaced any of the defective items, because we have not tested them and do not know how many are in a failed state. Had we carried out a test at the beginning of week 2, we would have expected to find only 985 in working order. This is, therefore, the availability at the beginning of week 2. If we delay the test to the beginning of week 14, only 823 items are likely to be in working order. The availability at that time is thus 823 out of the 1000 items, or 0.823.

The mean availability over any time period, say a week, can be calculated by averaging the survival probabilities at the beginning and end of the week in question. For the whole period, we can add up the point availability at the beginning of each week, and divide it by the number of weeks. This is the same as measuring the area under the curve and dividing it by the base to get the mean height. In our example, this gives a value of 91.08%. If the test interval is sufficiently small, we can treat the curve as a straight line. Using this approximation, the value is 90.81%. The error increases as we increase the test interval, because the assumption of a linear relationship becomes less applicable. We will see later that the error using this approximation becomes unacceptable, once T/MTBF exceeds 0.2.

Within the limits of applicability, the error introduced by averaging the survival probabilities at the beginning and end of the test period is fairly small (~ 0.3 %). These requirements and limits are as follows.

•The failures are hidden and follow an exponential distribution;

•The MTBF > the test interval, say by a factor of 5 or more;

•The item is as good as new at the start of the interval;

•The time to carry out the test is negligible;

•The test interval > 0.

In the example, the test interval (14 weeks) is relatively small compared to the MTBF (which is 1/0.015 or 66.7 weeks). Figure 3.10 illustrates these conditions, and the terms used.

The objective is to have an acceptable survival probability at the time of the test. The difference in the number of survivors, calculated using the exact and approximate solutions is quite small, as can be seen in Figure 3.11. The mean availability and survival probability are related, and this is illustrated in Figure 3.12. The relationship is linear over the range under consideration.


Figure 3.10 Mean availability approximation.


Figure 3.11 Survivors; lower curve = exact value, upper curve = linear approximation


Figure 3.12 Mean availability and survival probability.

We will use this example to develop a generally applicable method to determine test intervals for hidden functions. The objective of the exercise is to find a test interval T that will give us the required mean availability A, when the failure rate is λ. We have noted that at any point in time, the availability of the item is the same as its survival probability, or the height of the R(t) curve. The mean availability is obtained by dividing the area of the R(t) curve by the base, thus,

A = (1/T) ∫₀ᵀ R(t) dt        (3.8)

When the hazard rate is constant, from the earlier derivation (expression 3.6),

R(t) = e^(-λt)  for t > 0

Substituting,

A = (1/T) ∫₀ᵀ e^(-λt) dt        (3.9)

Evaluating the integral explicitly gives

A = (1 - e^(-λT)) / (λT)        (3.10)

This gives an exact measure of mean availability. We cannot use algebraic methods to solve the equation, as T appears in the exponent as well as in the main expression. We can of course use numerical solutions, but these are not always convenient, so we suggest a simpler approximation, as follows.

The survival probability or R(t) curve (see Figure 3.10) is nearly linear over the test interval T, under the right conditions. The mean is the arithmetic average of the height of the curve at t=0 and t=T.

The mean value of availability A is then:

A = [R(0) + R(T)] / 2        (3.11)

A = [1 + e^(-λT)] / 2        (3.12)

or, solving for the test interval,

T = -(1/λ) ln(2A - 1)        (3.13)

The estimates produced by this expression are slightly optimistic. However, over the range of applicability, the magnitude of the deviation is quite small. Table 3.4 and Figure 3.11 show the error in using the exact and approximate equations for values of λT from 0.01 to 0.25.
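The kind of comparison shown in Table 3.4 can be reproduced along the following lines, assuming the exact and approximate expressions take the forms reconstructed in 3.10 and 3.12 above; the specific λT values used here are illustrative.

```python
import math

# Exact mean availability (expression 3.10) versus the approximation
# obtained by averaging R(0) and R(T) (expression 3.12), over a range of
# lambda*T values similar to that covered by Table 3.4.
for x in (0.01, 0.05, 0.10, 0.15, 0.20, 0.25):   # x = lambda * T
    exact = (1 - math.exp(-x)) / x
    approx = (1 + math.exp(-x)) / 2
    print(f"lambda*T={x:4.2f}  exact={exact:.4f}  approx={approx:.4f}  "
          f"error={approx - exact:+.4f}")
# At lambda*T = 0.2 the approximation is high by roughly 0.3%, and the
# error grows rapidly beyond that point.
```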

Figure 3.12 shows the relationship between survival probability and mean availability. In Figure 3.13, we compare the approximate value to the exact value of mean availability over the range. The difference is quite small up to a λT value of 0.2. We can see the magnitude of the error in Figure 3.14. From this, we can see that it is safe to use the approximation within these limits.

Table 3.4 Comparison of exact vs. approximate mean availability.


If the test interval is more than 20% of the MTBF, this approximation is not applicable. In such cases, we can use a numerical solution such as the Maximum Likelihood Estimation technique—refer to Edwards5 for details.
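Where the approximation no longer holds, expression 3.10 can also be solved for T directly by a simple root-finding routine. The bisection sketch below is offered as one illustrative numerical route (it is not the technique referenced in the text); the failure rate and target availability in the example are assumptions.

```python
import math

def mean_availability(lam, T):
    """Exact mean availability from expression 3.10 for a constant
    failure rate lam and test interval T."""
    return (1 - math.exp(-lam * T)) / (lam * T)

def test_interval(lam, target_A, lo=1e-6, hi=None, tol=1e-9):
    """Solve expression 3.10 for T by bisection: find the test interval
    whose exact mean availability equals target_A. Mean availability
    falls as T grows, so a simple bracket search works."""
    if hi is None:
        hi = 100.0 / lam                 # generous upper bracket
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_availability(lam, mid) > target_A:
            lo = mid                     # availability still above target
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Example with assumed figures: a failure rate of 0.015 per week and a
# required mean availability of 97.5% give a test interval of roughly
# 3.4 weeks.
print(round(test_interval(0.015, 0.975), 2))
```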

3.9 THE WEIBULL DISTRIBUTION

A number of failure distribution models are available. Among these are the exponential, gamma, Pareto, Weibull, normal or Gaussian, lognormal, Birnbaum-Saunders, inverse Gaussian, and extreme value distributions. Further details about these distributions are available in Hoyland and Rausand4 or other texts on reliability theory.

Weibull6 published a generalized equation to describe lifetime distributions in 1951. The two-parameter version of the Weibull equation is simpler and is suitable for many applications. The three-parameter version of the equation is suitable for situations where there is a clear threshold period before commencement of deterioration. By selecting suitable values of these parameters, the equation can represent a number of different failure distributions. Readers can refer to Davidson7 for details on the actual procedure to follow in doing the analysis.


Figure 3.13 Mean availability; exact vs. approximate values.


Figure 3.14 Error in estimate of availability vs. T/MTBF

The Weibull distribution is of special interest because it is very flexible and seems to describe many physical failure modes. It lends itself to graphical analysis, and the data required is usually available in most history records. We can obtain the survival probability at different ages directly from the analysis chart. We can also use software to analyze the data. Figure 3.15 shows a Weibull plot made using a commercial software application.

It is fairly easy to gather data required to carry out Weibull analysis, since time-to-failure and preventive replacement details for the failure mode are nearly all that we need. For this we need a record of the date and description of each failure. We also need the date of any preventive maintenance action that results in the component being repaired or replaced before failure occurs. Once we compute the values of the two parameters, we can obtain the distribution of failures. We can read the survival probabilities at the required age directly from the chart. We can then estimate the reliability parameters, and use this data for predicting the performance of the item.


Figure 3.15 Typical Weibull Plot.

The Weibull equation itself looks somewhat formidable. Using the simpler two-parameter version, the survival probability is given by the following expression.

R(t) = e^(-(t/η)^β)        (3.14)

where η is called a scale parameter or characteristic life, and β is called the shape parameter.

Using expression 3.14, when t = η there is a 63.2% probability that the component has failed. This helps us attribute a physical meaning to the scale parameter, namely that nearly two-thirds of the items in the sample will have failed by this time. The value gives us an idea of the longevity of the item. The shape factor β tells us about the distribution of the failures. Using expression 3.14, we can compute R(t), the survival probability, for a given set of values of η and β at any point in time t. In Appendix 3-2, we have provided the results of such a calculation as an example.
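A short sketch of this calculation, along the lines of Appendix 3-2, is given below; it also confirms the 63.2% failure probability at t = η, and the roughly 99% reliability at week 26 quoted later for η = 66.7 weeks and β = 5.

```python
import math

def weibull_reliability(t, eta, beta):
    """Two-parameter Weibull survival probability, expression 3.14:
    R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

# At t = eta the survival probability is exp(-1), about 36.8%, i.e. roughly
# 63.2% of the items have failed by the characteristic life.
print(round(weibull_reliability(66.7, 66.7, 5), 3))   # 0.368
# With eta = 66.7 weeks and beta = 5 (as in Figure 3.17), the reliability
# at week 26 is about 99%.
print(round(weibull_reliability(26, 66.7, 5), 3))     # 0.991
```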

In spite of the apparent complexity of the equation, the method itself is fairly easy to use. We need to track the run-lengths of equipment, and to record the failures and failure modes. Recording of preventive repair or replacement of components before the end of their useful life is not too demanding. These, along with the time of occurrence (or, if more appropriate, the number of cycles or starts), are adequate for Weibull (or other) analysis. We can obtain such data from the operating records and maintenance management systems.

Such analysis is carried out at the failure modes level. For example, we can look at the failures of a compressor seal or bearing. We need five (or more) failure points to do Weibull analysis. In other words, if we wished to carry out a Weibull analysis on the compressor seal, we should allow it to fail at least five times! This may not be acceptable in practice, because such failures can be costly, unsafe, and environmentally unacceptable. Usually, we will do all we can to prevent failures of critical equipment. This means that we cannot collect enough failure data to improve the preventive maintenance plan and thus improve their performance. On items that do not matter a great deal—for example, light bulbs, vee-belts, or guards—we can obtain a lot of failure data. However, these are not as interesting as compressor seals. This apparent contradiction or conundrum was first stated by Resnikoff8.
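For readers who wish to see how the two parameters can be estimated from a handful of failure records, the sketch below uses median-rank regression on complete (uncensored) time-to-failure data. The run-lengths are hypothetical, and preventive replacements (suspended or censored items) would require adjusted ranks, which this sketch does not handle.

```python
import math

def fit_weibull_mrr(failure_times):
    """Estimate the Weibull beta and eta from complete (uncensored)
    time-to-failure data by median-rank regression: plot
    ln(-ln(1 - F_i)) against ln(t_i) and fit a straight line."""
    t = sorted(failure_times)
    n = len(t)
    xs, ys = [], []
    for i, ti in enumerate(t, start=1):
        f = (i - 0.3) / (n + 0.4)            # Bernard's median rank
        xs.append(math.log(ti))
        ys.append(math.log(-math.log(1 - f)))
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
            sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    beta = slope
    eta = math.exp(-intercept / beta)
    return beta, eta

# Hypothetical run-lengths (months) for one failure mode of a seal.
print(fit_weibull_mrr([14, 23, 29, 35, 46]))
```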

3.10 DETERMINISTIC AND PROBABILISTIC DISTRIBUTIONS

Information about the distribution of time to failures helps us to predict failures. The value of the Weibull shape parameter β determines the sharpness of the pdf curve. When β is 3.44, the pdf curve approaches the normal or Gaussian distribution. High β values, typically over 5, indicate a peaky shape with a narrow spread. At very high values of β, the curve is almost a vertical line, and the failure behavior is therefore nearly deterministic. In these cases, we can be fairly certain that the failure will occur at or close to the η value. Figure 3.16 shows a set of pdf curves with the same η value of 66.7 weeks we used earlier, and different β values. Figure 3.17 shows the corresponding survival probability or reliability curves. From the latter, we can see that when β is 5, the reliability remains at about 99% up to the 26th week.

When we can be fairly sure about the time of failure, that is, with high Weibull β values, time-based strategies can be effective. If, on the other hand, the failure distribution is exponential, it is difficult to predict the failures using this information alone, and we need additional clues. If the failures are evident, and we can monitor them by measuring some deviation in performance such as vibration levels, condition-based strategies are effective and will usually be cost-effective as well.

If the failures are unrevealed or hidden, a failure-finding strategy will be effective and is likely to be cost-effective. Using a simplifying assumption that the failure distribution is exponential, we can use expression 3.13 to determine the test interval. In the case of failure modes with high safety consequence, we can use a pre-emptive overhaul or replacement strategy, or design the problem out altogether.

When β values are less than 1, this indicates premature or early failures. In such cases, the hazard rate falls with age, and exhibits the so-called infant mortality symptom. Assuming that the item has survived so far, the probability of failure will be lower tomorrow than it is today. Unless the item has already failed, it is better to leave it in service, and age-based preventive maintenance will not improve its performance. We must address the underlying quality problems before considering any additional maintenance effort. In most cases, a root cause analysis can help identify the focus area.
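The behavior described here follows directly from the standard Weibull hazard function, which is not written out in the text but is consistent with expression 3.14. A brief sketch with illustrative parameter values:

```python
import math

def weibull_hazard(t, eta, beta):
    """Standard two-parameter Weibull hazard rate,
    z(t) = (beta/eta) * (t/eta)**(beta - 1): falling with age when
    beta < 1 (infant mortality), constant when beta = 1 (exponential),
    rising when beta > 1 (wear-out)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# Illustrative values with eta = 66.7 weeks: for beta = 0.5 the hazard
# falls between week 10 and week 50, for beta = 1 it stays constant, and
# for beta = 3 it rises.
for beta in (0.5, 1.0, 3.0):
    print(beta, round(weibull_hazard(10, 66.7, beta), 4),
          round(weibull_hazard(50, 66.7, beta), 4))
```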


Figure 3.16 Probability density functions for varying beta values.

3.11 AGE-EXPLORATION

Sometimes it is difficult to assess the reliability of equipment, either because we do not have operating experience, as in the case of new designs, or because data is not available. In such cases, we initially estimate the reliability based on the performance of similar equipment used elsewhere, vendor data, or engineering judgment. We choose a test interval that we judge to be satisfactory based on this estimate. At this stage, it is advisable to choose a more stringent or conservative interval. If the selected test interval reveals zero or a negligible number of failures, we can increase it in small steps. In order to use this method, we have to keep a good record of the results of the tests. It is a trial and error method, and is applicable when we do not have access to historical data. This method is called age-exploration.

3.12 CHAPTER SUMMARY

In order to evaluate quantitative risks, we need to estimate the probability as well as the consequence of failures. Reliability engineering deals with the methods used to evaluate the probability of occurrence.


Figure 3.17 Survival probability for varying beta values.

We began with failure histograms and probability density curves. In this process we developed the calculation methodology with respect to survival probability and hazard rates, using numerical examples. Constant hazard rates are a special case and we examined their implications. Thereafter we derived a simple method to compute the test intervals in the case of constant hazard rates, quantifying the errors introduced by using the approximation.

Reliability analysis can be carried out graphically, or with suitable software, using data held in the maintenance records. The Weibull distribution appears to fit a wide range of failures and is suitable for many maintenance applications. The Weibull shape and scale factors are useful in identifying appropriate maintenance strategies.

We discussed age-exploration, and how we can use it to determine test intervals when we are short of historical performance data.

REFERENCES

1. Nowlan, F.S., and H.F. Heap. 1978. Reliability-Centered Maintenance. U.S. Department of Defense. Unclassified, MDA 903-75-C-0349.

2. Knezevic, J. 1996. Inherent and Avoidable Ambiguities in Data Analysis. UK: SRD Association, copyright held by AEA Technology, PLC. 31-39.

3. Allen, Timothy. RCM: The Navy Way for Optimal Submarine Operations. http://www.reliabilityweb.com/artO6/rcm_navy.htm

4. Hoyland, A., and M. Rausand. 2004. System Reliability Theory, 2nd ed. John Wiley and Sons, Inc. ISBN 978-0471471332.

5. Edwards, A.W.F. 1992. Likelihood. Johns Hopkins University Press. ISBN 0801844452.

6. Weibull, W. 1951. A Statistical Distribution of Wide Applicability. Journal of Applied Mechanics, 18: 293-297.

7. Davidson, J. 1994. The Reliability of Mechanical Systems. Mechanical Engineering Publications, Ltd. ISBN 0852988818. 22-33.

8. Resnikoff, H.L. 1978. Mathematical Aspects of Reliability Centered Maintenance. Dolby Access Press.
