To understand the different levels of root causes, let us take one industrial case.
Consider this example: During the overhaul of a large reciprocating compressor, the maintenance supervisor discovers a damaged compressor rod that requires replacement. He decides to have a replacement rod fabricated with cut threads at a local shop, although the OEM’s design department recommends rolled threads for compressor rods of this frame size. As a result of the improper fabrication, the rod fails from fatigue in the thread area and causes extensive secondary damage inside the compressor.
Figure 2.1 Events leading to compressor failure.
If you study this example, you can discern the following events leading to the costly failure:
The warehouse did not stock spares for this rod because it was a new compressor installation.
The maintenance supervisor decided to have a rod fabricated without drawings.
Neither the user nor the local shop investigated the thread requirements.
Because the compressor was not equipped with vibration shutdowns, it ran for a significant amount of time before it was shut down.
There were several chances to break the chain of events leading to the catastrophic compressor failure. If the project engineer had ordered spare parts through the OEM, this failure probably would have been avoided. If either the maintenance supervisor or the local machine shop had talked to the OEM, or studied the failed rod, they would have been aware of the importance of rolled threads. Lastly, if a vibration shutdown had been in place, the compressor would have shut down after only minimal damage. In all, six major events led to the secondary compressor damage. These events were as follows:
No procedure in place to order spare parts for newly purchased equipment (latent root).
The improper installation of the packing leads to rod scoring.
Because a spare rod is not available and plant management wants the compressor back in operation as soon as possible, a replacement rod is fabricated at a local machine shop.
No one checks with the OEM about rod thread specifications (physical root).
The rod fails after two days of operation.
The broken rod causes extensive damage to the cylinder, packing box, distance piece, and cross‐head.
After examining the remains of the failure, the rotating equipment (RE) engineer would discover a fatigue failure in the threaded portion of the rod. From this, he would conclude that an improper thread design led to a stress riser and a shortened fatigue life. After talking to the OEM, he would write a report recommending that all compressor rods in the plant have rolled threads.
This recommendation will surely reduce rod failures, but the investigation did not uncover the latent root of failure. The stress riser, due to the improper thread design, is called the “physical root,” because it initiated the physical events leading to the secondary damage. However, there were significant events preceding the physical root that are of interest. If the RE engineer had had the time and resources, he would have discovered that the absence of a procedure requiring new equipment to be purchased with adequate spares directly initiated the sequence of events. This basic event is called the “latent root.”
Requiring that spare parts be purchased from the OEM for all new equipment eliminates the latent root, not only for this scenario but, potentially, for many other similar events. This example demonstrates the importance of finding the “latent root” of rotating equipment failures. Stopping at the “physical root” deprives the organization of a valuable opportunity for improvement. An RCFA, then, is a detailed analysis of a complex, multi‐event failure, such as the example above, that seeks to establish the sequence of events and the event that initiated it. The initiating event is called the root cause, and factors that contributed to the severity of the failure or perpetuated the events leading to the failure are called contributing events.
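The chain of events in the compressor example can be sketched as a small data structure. This is only an illustrative sketch, not part of any RCFA standard or tool: the names `Role`, `Event`, `CHAIN`, and `events_before` are hypothetical, and the event descriptions are paraphrased from the list above. It shows which events go unexamined when an investigation stops at the physical root.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Role(Enum):
    """Classification of an event in the failure chain (hypothetical names)."""
    LATENT_ROOT = "latent root"
    PHYSICAL_ROOT = "physical root"
    EVENT = "event"


@dataclass(frozen=True)
class Event:
    description: str
    role: Role = Role.EVENT


# The six events of the compressor example, in order of occurrence.
CHAIN: List[Event] = [
    Event("No procedure to order spares for new equipment", Role.LATENT_ROOT),
    Event("Improper packing installation leads to rod scoring"),
    Event("Replacement rod fabricated at a local machine shop"),
    Event("No one checks OEM rod thread specifications", Role.PHYSICAL_ROOT),
    Event("Rod fails after two days of operation"),
    Event("Broken rod causes extensive secondary damage"),
]


def events_before(chain: List[Event], role: Role) -> List[Event]:
    """Return the events that precede the first event with the given role.

    These are the events an investigation never examines if it stops
    as soon as it reaches an event of that role.
    """
    for index, event in enumerate(chain):
        if event.role is role:
            return chain[:index]
    return list(chain)  # no event has that role, so nothing was reached


if __name__ == "__main__":
    for event in events_before(CHAIN, Role.PHYSICAL_ROOT):
        print(f"Not examined: {event.description} ({event.role.value})")
```

In this sketch, stopping at the physical root leaves three earlier events unexamined, the first of which is the latent root.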
Industry personnel generally divide failure analysis into three categories, in order of increasing complexity and depth of investigation.
They are:
1 Component failure analysis (CFA) looks at the specific physical cause of failure, such as fatigue, overload, or corrosion, of the machine element that failed, for example, a bearing or a gear. This type of analysis concentrates on finding the physical causes of the failure.
2 Root cause investigation (RCI) is conducted in greater depth than the CFA and goes substantially beyond the physical root of a problem. It seeks the human errors involved but does not address management system deficiencies.
3 Root cause analysis (RCA) includes everything the RCI covers plus the management system problems that allow the human errors and other system weaknesses to exist.
Although the cost increases as the analyses become more complex, the benefit is a much more complete recognition of the true origins of the problem. A CFA answers why that specific part or machine failed and can be used to prevent similar future failures. An RCI costs 5–10 times as much as a CFA, but it adds a detailed understanding of the human errors contributing to the breakdown and can be used to eliminate groups of similar problems in the future. An RCA, in turn, may cost well into six figures and require several months. These costs may be intimidating to some, but correcting the major roots will eliminate huge classes of problems. The return will be many times the expenditure and will start to be realized within a few months of formal program implementation.
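The 5–10 times relationship between RCI and CFA cost can be worked through as simple arithmetic. The sketch below is illustrative only: the function name `rci_cost_range` is hypothetical, and the CFA cost of 2,000 is an assumed figure chosen for the example, not a value from the text.

```python
from typing import Tuple

# Multiplier range stated in the text: an RCI costs 5-10 times a CFA.
RCI_COST_MULTIPLE: Tuple[int, int] = (5, 10)


def rci_cost_range(cfa_cost: float) -> Tuple[float, float]:
    """Return the (low, high) estimated RCI cost for a given CFA cost."""
    if cfa_cost < 0:
        raise ValueError("cfa_cost must be non-negative")
    low_multiple, high_multiple = RCI_COST_MULTIPLE
    return (low_multiple * cfa_cost, high_multiple * cfa_cost)


if __name__ == "__main__":
    assumed_cfa_cost = 2_000.0  # hypothetical figure, for illustration only
    low, high = rci_cost_range(assumed_cfa_cost)
    print(f"Estimated RCI cost: {low:,.0f} to {high:,.0f}")
```

Under that assumed CFA cost, the RCI would fall between 10,000 and 20,000 in the same currency.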
Because of the time, manpower, and costs involved, it is essentially impossible to conduct an RCA on every failure. The cost and possible benefits have to be weighed, and a judgment made on the appropriate type of analysis.
When RCA Is Justified
Equipment Damage or Failure
An RCFA is normally justified for events associated with the partial or complete failure of critical production equipment, machinery, or systems. This type of incident can have a severe, negative impact on plant performance. Therefore, it often justifies the effort required to fully evaluate the event and to determine its root cause.
Operating Performance
Deviations in operating performance often occur without the physical failure of equipment or components. Chronic deviations may justify the use of RCFA as a means of resolving the recurring problem.
Product Quality
RCFA can be used to resolve most quality‐related problems. However, the analysis should not be used for all quality problems.
Capacity Restrictions
Many of the problems or events that occur affect a plant’s ability to consistently meet expected production or capacity rates. These problems may be suitable for RCFA, but further evaluation is recommended before beginning an analysis. After the initial investigation, if the event can be fully qualified and a cost‐effective solution is not found, then a full analysis should be considered. Note that an analysis normally is not performed on random, nonrecurring events or equipment failures.
Economic Performance
Deviations in economic performance, such as high production