Figure 4.11 Crack propagation in a weld.
The incipiency interval may be very short, as in the case of light bulbs, or very long, as in the case of weld crack propagation. A large number of failures have incipiency intervals ranging from weeks to several months or years. Bearing failures, general corrosion, and weld crack propagation are all examples of such failures. Nowlan and Heap2 refer to the point x in Figure 4.10 as the point of potential failure, and the point y as the point of functional failure. Moubray7 refers to the curve as the P-F curve, where points P and F correspond to points x and y in Figure 4.10. The range of variation in incipiency is shown in Figure 4.12.
Figure 4.12 Examples of incipiency intervals.
Even in the case of a single failure mode in a given operating context, the droop of the incipiency curve may vary. Thus, there is a range of incipiency intervals, as illustrated in Figure 4.13. This range introduces uncertainty in determining the incipiency interval.
Figure 4.13 Variations in incipiency intervals.
4.7 LIMITS TO THE APPLICATION OF CONDITION MONITORING
When the incipiency interval is very short, the time available to plan or execute maintenance action is also very short. In such cases, it is difficult to plan replacement before failure by monitoring the component’s condition. When incipiency intervals run to weeks, months, or years, condition monitoring is often an effective way to plan component replacement. Condition monitoring is feasible when it is possible to measure the change in performance, using human senses or instruments. It follows that we cannot monitor hidden or unrevealed failures.
Proponents of condition-based maintenance are correct when they highlight their ability to predict failures. Any predictive capability enhances the decision-making process. However, they sometimes give the impression that condition monitoring systems will solve all our problems. We know that not all failures lend themselves to condition monitoring. The failure must exhibit incipiency, it must be feasible to measure the deterioration, and the incipiency interval must be of reasonable duration. We must always ask the providers of condition monitoring services to demonstrate how they meet these requirements.
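The three requirements can be written as a simple screening check. The following is a minimal sketch, not a prescribed method: the class name, field names, the example intervals, and the 14-day minimum lead time are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    exhibits_incipiency: bool        # does performance degrade gradually before failure?
    measurable: bool                 # can senses or instruments detect the change?
    incipiency_interval_days: float  # illustrative estimate of the warning period


def suits_condition_monitoring(mode: FailureMode, min_lead_time_days: float) -> bool:
    """Apply the three tests: incipiency, measurability, adequate interval."""
    return (
        mode.exhibits_incipiency
        and mode.measurable
        and mode.incipiency_interval_days >= min_lead_time_days
    )


# Hypothetical figures, for illustration only.
modes = [
    FailureMode("light bulb filament", True, True, 0.001),
    FailureMode("ball bearing wear", True, True, 90.0),
    FailureMode("gas detector (hidden failure)", False, False, 0.0),
]

MIN_LEAD_TIME_DAYS = 14.0  # assumed time needed to plan and execute the work

for mode in modes:
    verdict = suits_condition_monitoring(mode, MIN_LEAD_TIME_DAYS)
    print(f"{mode.name}: {'candidate' if verdict else 'not a candidate'}")
```

Only the bearing passes all three tests in this example. The light bulb fails on the length of the interval, and the hidden failure fails because there is no measurable change to monitor.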
4.8 AGE RELATED FAILURE DISTRIBUTION
A system consists of many pieces of equipment, each of which has several components. Each component can fail in one or more ways. In Chapter 3, we looked at the six failure patterns identified by the Nowlan and Heap2 team. You will recall that these failure patterns are plots of the hazard rates against time. Other studies such as Broberg and MSP reported similar results—see Reference 3 in Chapter 3.
Prior to the Nowlan and Heap study, the belief was that all failures followed the so-called bath-tub curve. Their results showed that this pattern was only applicable to 4% of all the failure modes.
Fourteen percent showed a constant failure pattern, and if we ignore the failures that took place early in life, a further 75% also followed this pattern. The remaining 11% of the failure modes (including the 4% that followed the bath-tub curve) exhibited a distinct relationship to age. Should we concern ourselves with this relatively small proportion of failures that exhibit an age relationship?
To answer this question, we need to know whether any of these failure modes could result in serious consequences. If so, they acquire a new level of respect. With a skewed distribution, a strategy based on an assumed constant failure pattern will not be satisfactory. Therefore, we cannot assume that all failures exhibit a constant hazard rate pattern, as long as any of the remaining 11% matter.
When we assemble components to build equipment, each component failure-mode affects the overall failure rate. These individual component failure-modes may have exhibited a distinct age-related failure pattern. When any failure takes place, we replace the affected part with a new one. In an ideal case, we do not replace any of the other components at this point. The latter are at different stages of deterioration in their own life cycles. One of these will fail some time thereafter because it has reached the end of its life. We replace it and start a new cycle, while other components continue from their partly worn-out state. The result is that at the assembly level, the failures tend to be randomly distributed and follow the exponential distribution.
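We can illustrate this effect with a small simulation. The sketch below assumes 20 components with Weibull-distributed lives (shape 3, a clear wear-out pattern) and characteristic lives spread between 500 and 2,000 hours; all of these figures, and the function names, are illustrative assumptions. Each component is replaced only when it fails. We then compare the coefficient of variation (standard deviation divided by mean) of a single component's life with that of the intervals between successive failures of the assembly.

```python
import random
import statistics


def assembly_failure_times(scales, shape, horizon, rng):
    """Failure times of an assembly in which each failed part alone is replaced."""
    times = []
    for scale in scales:
        t = 0.0
        while True:
            # Life of the part currently fitted in this position (Weibull).
            t += rng.weibullvariate(scale, shape)
            if t > horizon:
                break
            times.append(t)
    return sorted(times)


def coefficient_of_variation(values):
    return statistics.pstdev(values) / statistics.mean(values)


rng = random.Random(42)   # fixed seed so the run is repeatable
shape = 3.0               # assumed wear-out pattern for every component
scales = [rng.uniform(500.0, 2000.0) for _ in range(20)]  # assumed lives, hours
horizon = 200_000.0       # simulated operating hours
burn_in = 20_000.0        # discard the start, when all parts are new together

times = [t for t in assembly_failure_times(scales, shape, horizon, rng) if t > burn_in]
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

single_lives = [rng.weibullvariate(1000.0, shape) for _ in range(5000)]

cv_component = coefficient_of_variation(single_lives)
cv_assembly = coefficient_of_variation(gaps)
print(f"Single component life, coefficient of variation: {cv_component:.2f}")
print(f"Assembly inter-failure time, coefficient of variation: {cv_assembly:.2f}")
```

An exponential distribution has a coefficient of variation of exactly 1, while a Weibull distribution with shape 3 has a value of about 0.36. The assembly-level intervals come out close to 1 even though every component wears out with age, which is the behavior described above.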
The concept of Mean Time To Failure, or MTTF, is worth further consideration at this point. As discussed in Chapter 3, the mean does not tell us much about the distribution. With a given sample, many of the failures could have taken place early or late in terms of age. In such a case, the use of the mean distorts the picture, because one may wrongly infer that the failures take place uniformly over the life. Hence, the use of MTTF without a full understanding of the distribution may lead to inappropriate decisions.
When the hazard rate is constant (meaning that the distribution is exponential), it is perfectly acceptable to use the MTTF. However, at t = MTTF there is (approximately) a 63% probability that the component has already failed, and only a 37% probability that it has survived. In cases where the consequences of failure are high, we must do whatever we can to reduce or eliminate them. If the failure is evident and exhibits incipiency, as with a ball bearing, we can take vibration or other condition monitoring action. If the failure is hidden, as with a gas detector, we carry out a test, or a failure-finding task. We must plan preventive maintenance action well before t = MTTF, because we cannot accept a survival probability as low as 37% at the time of the test or repair. The lay person often thinks of the MTTF as the expected time of failure and, therefore, as the maintenance interval, which is clearly not the case.
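The 63% figure follows directly from the exponential distribution, for which the cumulative probability of failure by time t is 1 - exp(-t/MTTF). The short sketch below works through the arithmetic. The MTTF of 10,000 hours and the 10% target are assumed values, used only for illustration.

```python
import math


def failure_probability(t: float, mttf: float) -> float:
    """Cumulative probability of failure by time t with a constant hazard rate."""
    return 1.0 - math.exp(-t / mttf)


def time_for_failure_probability(p: float, mttf: float) -> float:
    """Time at which the cumulative failure probability reaches p (0 <= p < 1)."""
    return -mttf * math.log(1.0 - p)


mttf = 10_000.0  # hours; an assumed illustrative value

p_fail = failure_probability(mttf, mttf)
print(f"P(failed by t = MTTF)    = {p_fail:.3f}")
print(f"P(surviving to t = MTTF) = {1.0 - p_fail:.3f}")

# Time by which only 10% of components would be expected to have failed.
t_10 = time_for_failure_probability(0.10, mttf)
print(f"Time to reach 10% failure probability = {t_10:.0f} hours")
```

The failure probability at t = MTTF is about 0.63 whatever the value of the MTTF. In this example, holding the failure probability to 10% would mean acting at roughly one-tenth of the MTTF.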
Nearly three quarters of all accidents are due to the action (or inaction) of human beings. We cannot wish human error away, as it is too large a contributor to ignore. Human beings are complex systems, with hundreds of failure modes. In the following discussion, we will use the terms human error and human failure interchangeably.
The causes of human error are many and varied. Lorenzo3 categorizes them as random, systematic, and sporadic. We can correct random errors by better training and supervision. A shift in performance in one direction indicates systematic variability. We can reduce this variability by providing regular performance feedback. Sporadic errors are the most difficult ones to predict or control. In this case, the person’s performance is fine most of the time, but a sudden distraction or loss of concentration results in an error.
There is an optimum level of stress at which human beings