We now turn to the concept of mortality, which, when applied in the human context, is the ratio of the number of deaths to the surviving population. To illustrate this concept, consider the population of a geographical area. Suppose there are 100,000 people in the area on the day in question. If there were ten deaths in all on that day, the mortality rate was 10/100,000, or 0.0001. Actuaries analyze the mortality of a population with respect to age. They measure the proportion of the population who die within one, two, three, ..., n years. A plot of these mortality values is similar to Figure 3.3, Element A (which refers to equipment component failures). In the first part of the curve (the so-called infant mortality section), the mortality rate keeps falling.
Figure 3.3 Failure Patterns
A baby has a high chance of dying at birth, and the longer it survives, the greater the chance that it will continue to live. After the first few days or weeks, the mortality rate levels out. For the next 50–70 years, it is fairly constant. People die randomly, due to events such as road accidents, food poisoning, homicides, cancer, heart disease, or other reasons. From about 50 years on, depending on lifestyle, diet, race, and sex, the mortality rate starts to rise. As people get older, they become susceptible to more diseases, their bones tend to become brittle, and their general resistance becomes lower. Not many people live up to 100 years, though some ethnic groups have exceptional longevity. Insurance companies use these curves to calculate their risks. They adjust the premiums to reflect their assessment of the risks.
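The mortality arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names are assumptions introduced here, and the cohort size and yearly death counts in the second part are hypothetical figures, not actuarial data.

```python
from typing import List


def mortality_rate(deaths: int, population: int) -> float:
    """Deaths in a period divided by the population at the start of it."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population


def proportion_dying_within(deaths_per_year: List[int], cohort: int) -> List[float]:
    """Cumulative proportion of a cohort dying within 1, 2, ..., n years."""
    proportions = []
    total_deaths = 0
    for deaths in deaths_per_year:
        total_deaths += deaths
        proportions.append(total_deaths / cohort)
    return proportions


# The example from the text: ten deaths among 100,000 people in one day.
print(mortality_rate(10, 100_000))  # 0.0001

# Hypothetical cohort of 1,000 with made-up deaths in years 1, 2 and 3.
print(proportion_dying_within([20, 5, 3], 1_000))  # [0.02, 0.025, 0.028]
```

The second function mirrors what the actuaries are said to measure: the proportion of the population who die within one, two, three, or more years.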
We use a similar concept in reliability engineering. The height of the pdf curve gives the number of failures at any point in time, and the area under the curve to the right of this point gives the number of survivors. The term hazard rate designates equipment mortality. To obtain it, we divide the number of failures by the number of survivors at this point. In the earlier example, the hazard rate at t = 3,000 cycles is 3/33 or 0.0909. The study of hazard rates gives us an insight into the behavior of equipment failures, and enables us to make predictions about future performance.
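The hazard rate calculation can be sketched as follows. This is a minimal illustration, not the method used for the earlier example: the function names are assumptions, the failure counts in the second part are hypothetical, and survivors are counted at the start of each interval, which is a convention chosen here rather than one stated in the text.

```python
from typing import List


def hazard_rate(failures: int, survivors: int) -> float:
    """Number of failures divided by the number of survivors at that point."""
    if survivors <= 0:
        raise ValueError("survivors must be positive")
    return failures / survivors


def hazard_profile(failures_per_interval: List[int], population: int) -> List[float]:
    """Hazard rate for each successive interval of operation.

    Assumption: survivors are counted at the start of each interval,
    before that interval's failures are removed.
    """
    rates = []
    survivors = population
    for failures in failures_per_interval:
        if survivors <= 0:
            break  # nothing left to fail
        rates.append(hazard_rate(failures, survivors))
        survivors -= failures
    return rates


# The figures quoted in the text: 3 failures against 33 survivors.
print(round(hazard_rate(3, 33), 4))  # 0.0909

# Hypothetical failure counts per interval for 20 items (illustrative only).
print([round(h, 4) for h in hazard_profile([2, 5, 10, 3], 20)])
# [0.1, 0.2778, 0.7692, 1.0]
```

Plotting such a profile against time gives the kind of hazard rate curve discussed in the next section. In this made-up data set the hazard rate rises with age, which is only one of several possible shapes.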
3.4 HAZARD RATES AND FAILURE PATTERNS
Prior to World War II, industrial equipment was simple, sturdy, heavy, and robust in design. Repairs were fairly simple, and could easily be done on site using ordinary hand tools. Breakdown strategies were common, which meant that equipment operated until failures occurred. The introduction of mass production techniques meant that interruptions of production machinery or conveyors resulted in large losses. At the same time, the design of equipment became more complex. Greater knowledge of materials of construction led to better designs with a reduction in weight and cost. Computer-aided analysis and design tools became available, along with computing capacity. As a result, the designers could reduce safety factors (which included a factor for uncertainty or ignorance). In order to reduce investment costs, designers also reduced the amount of standby equipment installed and the intermediate storage or buffer stocks.
These changes resulted in slender, light, and sleek machinery. The new machines were not as rugged as their predecessors, but met the design conditions. In order to reduce unit costs, machine uptime was important. The preferred strategy was to replace complete sub-assemblies, as it took more time to replace failed component parts.
A stoppage of high-volume production lines resulted in large losses of revenue. In order to prevent such breakdowns, manufacturers used a new strategy. They replaced the sub-assemblies or parts at a convenient time before the failures occurred, so that the equipment was in good shape when needed. The dawn of planned preventive maintenance had arrived.
Prior to the 1960s, people believed that most failures followed the so-called bath-tub curve. This model is very attractive, as it is so similar to the human mortality curves. By identifying the knee of the curve, namely, the point where the flat curve starts to rise, one could determine the timing of maintenance actions. Later research¹ showed that only a small proportion of component failures followed the bath-tub model, and that the constant hazard pattern accounted for the majority of failures. Even where the bath-tub model did apply, finding the knee of the curve was not a trivial task.
As a result, conservative judgment prevailed when estimating the remaining life of components. Preventive maintenance strategies require that we replace parts before failure, so the useful life became synonymous with the shortest recorded life. Thus the replacement of many components took place long before the end of their useful life. The opportunity cost of lost production justified the cost of replacing components that were still in good condition.
The popularity of preventive maintenance grew especially in industries where the cost of downtime was high. This strategy was clearly costly, but was well justified in some cases. However, the loss of production due to planned maintenance itself was a new source of concern. Managers who had to reduce unit costs in order to remain profitable started to take notice of the production losses and the rising cost of maintenance.
Use of steam and electrical power increased rapidly throughout the twentieth century. Unfortunately, there were a large number of industrial accidents associated with the use of steam and electricity, resulting in the introduction of safety legislation to regulate the industries. At this time, the belief was that all failures were age related, so it was appropriate to legislate time-based inspections. It was felt that the number of incidents would be reduced by increasing the inspection frequencies.
Intuitively, people felt more comfortable with these higher frequency inspection regimes. Industrial complexity increased greatly from the mid-1950s onwards with the expansion of the airline, nuclear, and chemical industries. The number of accidents involving multiple fatalities experienced by these industries rose steeply.
By the late 1950s, commercial aviation became quite popular. The large increase in the number of commercial flights resulted in a corresponding increase in accidents in the airline industry. Engine failures accounted for a majority of the accidents and the situation did not improve by increasing maintenance effort. The regulatory body, the U.S. Federal Aviation Agency, decided to take urgent action in 1960, and formed a joint project with the airline industry to find the underlying causes and propose effective solutions.
Stanley Nowlan and Howard Heap¹, both of United Airlines, headed a research project team that categorized airline industry failures into one of six patterns. The patterns under consideration are plots of hazard rates against time. Their study revealed two important characteristics of failures in the airline industry, hitherto unknown or not fully appreciated.
1. The failures fell into six categories, illustrated in Figure 3.3.
2. The distribution of failures in each pattern revealed that only 11% were age-related. The remaining 89% appeared to be failures not related to component age. This is illustrated in the pie chart, Figure 3.4.
The commonly held belief was that all failures followed Pattern A, the bath-tub curve, and this belief justified age-based preventive maintenance. The study called it into question, as Pattern A accounted for just a small percentage of all failures (in the airline industry). Nowlan and Heap questioned the justification for doing all maintenance using age as the only criterion.
We will discuss these issues later in the book.
An explanation of these failure patterns and a method to derive them using a set of artificially created failure data is given in Appendix 3-1.