Some items are only required to operate when another item fails, or a specific event takes place. If the first item itself is in a failed state, the operator will not be aware of its condition because it is not required to work till another event takes place. Such failures are called hidden failures. Items subject to hidden failures can be in a failed state any time after installation, but we will not be aware of this situation.
The only way to know whether the item is working is to place a demand on it. For example, if we want to know whether a fire pump will start, we must actually start it, either in a test or in response to a real fire. At any point in its life, we do not know whether it is in working condition or has failed. If it has failed, it will not start. The survival probability gives us the expected value of its up-state, and hence its availability on demand at that time. Thus, the availability on demand at any point in time is the same as the probability of survival to that point. The survival probability decreases with time, and the availability decreases with it. This brings us to the concept of mean availability.
If we know the shape of the pdf curve, we can estimate the item’s survival probability. The reliability function R(t) gives us the probability that the item survives, that is, has not failed, up to time t. As discussed above, this is the same as the instantaneous availability.
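The link between the pdf and the survival probability can be written out as follows. This is the standard textbook relationship rather than a formula quoted from this chapter; the symbols f(u) for the pdf, F(t) for the cumulative probability of failure, and the integration variable u are notation assumed here for illustration.

$$R(t) = 1 - F(t) = 1 - \int_0^t f(u)\,du = \int_t^\infty f(u)\,du$$

Read this way, R(t) is the area under the pdf curve to the right of t, which is why knowing the shape of the curve is enough to estimate the survival probability.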
In the case of hidden failures, we will never know the exact time of failure. We need to collect data on failures by testing the item under consideration periodically. It is unlikely that a single item will fail often enough in a test situation for us to evaluate its failure distribution. So we collect data from several similar items operating in a similar way and failing in a similar manner, to obtain a larger set (strictly speaking, all the failures must be independent and identically distributed, so using similar failures is an approximation).
We make a further assumption, that the hazard rate is constant. When the hazard rate is constant, we call it the failure rate. The inverse of the failure rate is the Mean Time To Failure or MTTF. MTTF is a measure of average operating performance for non-repairable items, obtained by dividing the cumulative time in service (hours, cycles, miles or other equivalent units) by the cumulative number of failures. By non-repairable, we mean items that are replaced as a whole, such as light bulbs, ball bearings, or printed circuit boards.
In the case of repairable items, a similar measure of average operating performance is used, called Mean Operating Time Between Failures, or MTBF. This is obtained by dividing the cumulative time in service (hours, cycles, miles or other equivalent units) by the cumulative number of failures. If the item is as good as new (AGAN) after each repair, the MTBF has the same value as the MTTF. In practice, the item may not be AGAN in every case. In the rest of this chapter, we will use the term MTBF to represent both terms.
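The calculation described above can be sketched in a few lines of Python. The service records below are hypothetical, chosen only to illustrate pooling data from several similar items; the function name `mtbf` and the variable names are likewise assumptions made for this sketch.

```python
def mtbf(service_times, failure_counts):
    """Cumulative time in service divided by cumulative number of failures."""
    total_failures = sum(failure_counts)
    if total_failures == 0:
        raise ValueError("no failures recorded; MTBF cannot be estimated this way")
    return sum(service_times) / total_failures


# Hypothetical records for four similar items: weeks in service and failures found.
service_weeks = [120.0, 95.0, 140.0, 45.0]
failure_counts = [2, 1, 2, 1]

mtbf_weeks = mtbf(service_weeks, failure_counts)

# The reciprocal is the failure rate only under the constant hazard rate assumption.
failure_rate = 1.0 / mtbf_weeks

print(f"MTBF = {mtbf_weeks:.1f} weeks, failure rate = {failure_rate:.4f} per week")
```

Any consistent unit of service (hours, cycles, miles) can be used in place of weeks, provided the same unit is used for every item in the pooled set.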
Another term used in a related context is Mean Time to Restore, or MTTR. This is a measure of average maintenance performance, obtained by dividing the cumulative time for a number of consecutive repairs on a given repairable item (hours) by the cumulative number of failures of the item. The time to restore is measured from when the equipment was stopped to when it was restarted and operated satisfactorily.
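As a minimal sketch of the MTTR calculation, assume each outage is logged as a pair of clock readings in hours: when the equipment was stopped and when it was back in satisfactory operation. The log entries and the names `mttr` and `outages` are hypothetical.

```python
def mttr(restore_durations):
    """Cumulative restore time divided by the number of failures restored."""
    if not restore_durations:
        raise ValueError("no repairs recorded; MTTR is undefined")
    return sum(restore_durations) / len(restore_durations)


# Hypothetical outage log: (stopped_at, restarted_at) on a running clock, in hours.
outages = [(100.0, 104.5), (530.0, 532.0), (910.0, 918.5)]

# Time to restore runs from the stop to the satisfactory restart.
durations = [restarted - stopped for stopped, restarted in outages]

mttr_hours = mttr(durations)
print(f"MTTR = {mttr_hours:.1f} hours over {len(durations)} repairs")
```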
Table 3.3 shows a set of data describing failure pattern E. Here we show the surviving population at the beginning of each week instead of that at the end of each week. Figure 3.8 shows the cumulative number of failures, and Figure 3.9 shows the surviving population at the beginning of each of the first 14 weeks.
Table 3.3
Figure 3.8 Cumulative failures against elapsed time.
Figure 3.9 Surviving population at the beginning of each week.
We can use the constant slope of the curve in Figure 3.8 to calculate the MTBF and the failure rate. When there are many items in a sample, each with a different service life, we obtain the MTBF by dividing the cumulative time in operation by the total number of failures. We obtain the failure rate by dividing the number of failures by the time in operation. Thus,
$$\text{MTBF} = \frac{1}{\lambda} \qquad (3.7)$$
where λ is the constant failure rate.
For a rigorous derivation, refer to Hoyland and Rausand [4], page 31. Note that the relationship applies only in this constant hazard rate case; with the other failure distributions, the slope of the cumulative failure curve changes continuously.
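As an outline of why the constant hazard rate case gives a reciprocal relationship between the mean life and the failure rate (a sketch only, not a substitute for the rigorous derivation cited above), write λ for the constant failure rate and R(t) for the survival probability. The symbols are assumed here for illustration.

$$R(t) = e^{-\lambda t}, \qquad \text{MTTF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$$

With any other distribution the hazard rate varies with time, so no single value of λ exists to invert.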
Because the failure is hidden, we can replace a failed item only after a test reveals it. We do not know if it is in a failed condition unless we try to use it. How do we determine a justifiable test interval T? At the time of test, if we find the majority of items in a failed state, we have probably waited too long. In other words, we expect a very high survival probability at the time of the test. Thus, in the case of systems affecting safety or environmental performance, it would be reasonable to expect this to be 97.5% or more, based on, for example, a Quantitative Risk Assessment.
Let us try to work out the test interval with a numerical example. Using the data in Table 3.3, at the beginning of week 1 all 1000 items will be in sound working order (As Good As New, or AGAN). At the beginning of week 2, we can expect 985 items to be in working order, and at the beginning of week 3, 970 items. At the beginning of week 14, we can expect only 823 items to be in working condition. So far, we have not replaced any of the defective items because we have not tested them and do not know how many are in a failed state. Had we carried out a test at the beginning of week 2, we would have expected to find only 985 in working order. The availability at the beginning of week 2 is therefore 985 out of the 1000 items, or 0.985. If we delay the test to the beginning of week 14, only 823 items are likely to be in working order. The availability at that time is thus 823 out of the 1000 items, or 0.823.
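The week-by-week figures above can be reproduced with a short sketch, assuming the tabulated survivors follow the exponential survival curve with the failure rate of 0.015 per week used in this example. That assumption, and the names `expected_survivors`, `FAILURE_RATE` and `POPULATION`, are introduced here for illustration; the rounded values agree with the 985, 970 and 823 quoted in the text.

```python
import math

FAILURE_RATE = 0.015  # per week, the value used in this example
POPULATION = 1000     # items installed, all as good as new in week 1


def expected_survivors(week, population=POPULATION, rate=FAILURE_RATE):
    """Expected number still working at the beginning of the given week."""
    elapsed_weeks = week - 1  # nothing has elapsed at the beginning of week 1
    return population * math.exp(-rate * elapsed_weeks)


for week in (1, 2, 3, 14):
    survivors = expected_survivors(week)
    availability = survivors / POPULATION
    print(f"week {week:2d}: {survivors:6.0f} working, availability {availability:.3f}")
```

Because no item is tested or replaced during the period, the availability at the beginning of any week is simply the fraction of the original population expected to survive to that point.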
The mean availability over any time period, say a week, can be calculated by averaging the survival probabilities at the beginning and end of the week in question. For the whole period, we can add up the point availabilities at the beginning of each week and divide the sum by the number of weeks. This is the same as measuring the area under the curve and dividing it by the base to get the mean height. In our example, this gives a value of 91.08%. If the test interval is sufficiently small, we can treat the curve as a straight line. Using this approximation, the value is 90.81%. The error increases as we increase the test interval, because the assumption of a linear relationship becomes less applicable. We will see later that the error from this approximation becomes unacceptable once T/MTBF exceeds 0.2.
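The growth of the error with the test interval can be illustrated with a sketch that compares the area-under-the-curve mean for an exponential survival curve with its straight-line approximation. The closed forms used here, (1 - exp(-x))/x and 1 - x/2 with x = T/MTBF, are standard results assumed for illustration, and the function names are hypothetical. The exact percentages quoted above depend on the tabulated values and the averaging convention, so this sketch is not meant to reproduce them digit for digit.

```python
import math


def mean_availability_exact(t_over_mtbf):
    """Area under exp(-t/MTBF) from 0 to T, divided by T, with x = T/MTBF."""
    x = t_over_mtbf
    return (1.0 - math.exp(-x)) / x


def mean_availability_linear(t_over_mtbf):
    """Straight-line approximation: the curve is treated as falling from 1 to 1 - x."""
    return 1.0 - 0.5 * t_over_mtbf


for x in (0.05, 0.1, 0.2, 0.5, 1.0):
    exact = mean_availability_exact(x)
    linear = mean_availability_linear(x)
    print(f"T/MTBF = {x:4.2f}  exact = {exact:.4f}  linear = {linear:.4f}  "
          f"error = {exact - linear:.4f}")
```

In this sketch the straight-line value always lies below the exact mean, so the approximation errs on the cautious side, but the gap widens steadily as T/MTBF grows.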
Within the limits of applicability, the error introduced by averaging the survival probabilities at the beginning and end of the test period is fairly small (about 0.3%). The requirements and limits are as follows.
• The failures are hidden and follow an exponential distribution;
• The MTBF > the test interval, say by a factor of 5 or more;
• The item is as good as new at the start of the interval;
• The time to carry out the test is negligible;
• The test interval > 0.
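The two numerical limits in this list can be checked mechanically, and the straight-line approximation can be inverted to suggest a test interval for a target mean availability. The sketch below assumes the approximation A = 1 - T/(2 x MTBF) and treats the 97.5% figure mentioned earlier as a target mean availability; both are assumptions made for illustration. The remaining conditions (hidden failures, exponential distribution, AGAN at the start, negligible test time) are taken as given and are not checked. The function names are hypothetical.

```python
def check_conditions(interval, mtbf, min_ratio=5.0):
    """Return a list of violated numerical conditions (empty if none)."""
    problems = []
    if interval <= 0:
        problems.append("test interval must be greater than zero")
    elif mtbf / interval < min_ratio:
        problems.append(f"MTBF should exceed the test interval by a factor of {min_ratio:g}")
    return problems


def interval_for_target(target_availability, mtbf):
    """Solve A = 1 - T / (2 * MTBF) for T (straight-line approximation)."""
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target availability must lie strictly between 0 and 1")
    return 2.0 * (1.0 - target_availability) * mtbf


mtbf_weeks = 1.0 / 0.015  # about 66.7 weeks, as in the example
interval = interval_for_target(0.975, mtbf_weeks)

print(f"suggested test interval: {interval:.2f} weeks")
print("violated conditions:", check_conditions(interval, mtbf_weeks))
```

A result from `interval_for_target` is only meaningful when `check_conditions` returns an empty list for it; for longer intervals the exact exponential mean should be used instead of the straight-line form.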
In the example, the test interval (14 weeks) is relatively small compared to the MTBF (which is 1/0.015 or 66.7 weeks). Figure 3.10 illustrates these conditions,