Dr. Eric Bechhoefer, Chief Engineer and CEO, GPMS
We regularly get asked about the applicability of “big data”, “AI”, and “machine learning” in the field of fault detection, particularly in the area of HUMS and machine condition monitoring.
There is a lot of excitement around these new approaches to modeling machine health. Unfortunately, much of it is misplaced, particularly when it comes to complex rotating machinery. That’s because these approaches, however en vogue and exciting in the abstract, do not deal well with real-world asymmetric data sets, that is, data sets that contain exceedingly few faults.
The field of machine condition monitoring is an old one. Visit the Society for Machine Failure Prevention or the Vibration Institute and you can see that work related to machine health (as part of a larger effort to optimize assets for reliability and uptime) has been going on for over a century. Academic researchers and practitioners fundamentally agree that there are three core steps to the process of fault detection:
1. Extract the Fault Signal from the Noise
2. Identify the Particular Fault from the Fault Signal
3. Provide Predictive Estimates of Remaining Useful Life on the Component in Question
Significantly, Big Data approaches to the problem of machine failure have challenges at each of these stages. In contrast, GPMS uses a physics-based approach that is highly effective. Let’s look at this in more detail.
Step 1: Fault Signal Extraction
Big Data techniques in condition monitoring have mostly been tested on simple machines (e.g., an induction motor), not on more complex industrial or aviation gearboxes. For a simple induction motor, rudimentary condition indicators such as acceleration RMS are effective. But in real-world gearboxes, the measured signal is the superposition of many components, and the fault feature is so small relative to the measured noise that extensive signal processing is required to extract a feature representative of a fault. The lack of sophisticated algorithms to process those signals is the first fatal flaw of the Big Data approach.
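To make the point about rudimentary indicators concrete, here is a minimal sketch using only the Python standard library. The signal is entirely synthetic and hypothetical (a strong gear-mesh tone, broadband noise, and a small periodic impact standing in for a fault); the amplitudes and the names `rms` and `synth_signal` are assumptions chosen for illustration, not measurements from any real gearbox.

```python
import math
import random


def rms(samples):
    """Root-mean-square of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def synth_signal(n, fault_amplitude, seed=0):
    """Hypothetical gearbox-like signal.

    A dominant gear-mesh tone plus Gaussian noise, with an optional small
    impact every 200 samples standing in for a localized fault.
    """
    rng = random.Random(seed)
    out = []
    for i in range(n):
        mesh = math.sin(2 * math.pi * 0.05 * i)   # dominant gear-mesh tone
        noise = rng.gauss(0.0, 0.5)                # broadband measurement noise
        impact = fault_amplitude if i % 200 == 0 else 0.0
        out.append(mesh + noise + impact)
    return out


# Same seed, so the only difference between the two signals is the fault.
healthy = synth_signal(20000, fault_amplitude=0.0)
faulty = synth_signal(20000, fault_amplitude=0.5)

rms_healthy = rms(healthy)
rms_faulty = rms(faulty)
relative_change = abs(rms_faulty - rms_healthy) / rms_healthy

print(f"RMS healthy: {rms_healthy:.4f}")
print(f"RMS faulty:  {rms_faulty:.4f}")
print(f"Relative change: {relative_change:.4%}")
```

Under these assumptions the fault moves the overall RMS by well under one percent, which is why a broadband indicator alone is a poor detector once other components dominate the measured signal.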
GPMS has patented and trade-secret signal processing algorithms that allow our system to see fault signals in the noise that no other system on the market can identify. These algorithms are aided by hardware: sensors attached to the machine using high-resonance bracketry, and a true distributed-processing sensor network. The hardware is also key to the ability to extract fault signals, and it speaks to the need for a ‘full system’ hardware + software approach.
Step 2: Identification of a Fault based on the Fault Signal Extracted from the Noise
The essential aim of Big Data is to gather a large enough training library of fault information that the software would be able to match a known fault in the library with an emerging fault in a subject machine. But that prerequisite has proven impossible in condition monitoring for complex machines. Why? (i) faults happen infrequently; (ii) faults of different components happen with different modalities; (iii) each component has multiple fault modalities; and (iv) the same component in different positions in a gearbox will have different signatures, so you begin the problem all over again.
When you stack all of these factors, the training library is for all practical purposes unobtainable.
In contrast, GPMS models the “normal” gearbox components based on physics. We know that, due to the laws of physics, data from a nominal/normal component will have a non-Gaussian distribution and that the component health will have a Nakagami probability distribution. Therefore, when the calculated result follows a distribution that is no longer Nakagami, we know that the component is no longer good. The deviation from the Nakagami distribution that triggers a warning can be set based on the desired Probability of False Alarm (PFA). That is, if the desire is to trigger earlier, the PFA will be higher. Our standard is to set the PFA to 1 in a million when the Health Indicator (HI) registers at 0.5 (where an HI of 0.75 is the time to begin planning for maintenance). Furthermore, because the behavior of the Nakagami distribution is non-linear, the probability of false alarm by the time the Health Indicator registers 1.0 (time to perform maintenance) is infinitesimal (less than 1 in 10^300).
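As a rough illustration of how a warning threshold follows from a chosen PFA, the sketch below uses the simplest Nakagami case, shape m = 1, which reduces to the Rayleigh distribution with exceedance probability exp(-x²/Ω). The shape value, the spread `omega`, and the function names are assumptions made for this example; the shape parameter is not stated above, and the resulting numbers are not GPMS figures. The sketch only shows how quickly the tail probability collapses as the threshold is exceeded.

```python
import math


def rayleigh_exceedance(x, omega=1.0):
    """P(X > x) for a Nakagami variable with shape m = 1 (Rayleigh), spread omega."""
    return math.exp(-x * x / omega)


def threshold_for_pfa(pfa, omega=1.0):
    """Threshold that healthy data exceeds with probability `pfa`."""
    return math.sqrt(-omega * math.log(pfa))


# Assumed design target: 1-in-a-million false alarm rate at the warning level.
warn = threshold_for_pfa(1e-6)
pfa_at_warn = rayleigh_exceedance(warn)

# Doubling the level squares the exponent's argument: PFA becomes pfa**4.
pfa_at_double = rayleigh_exceedance(2 * warn)

print(f"warning threshold:        {warn:.3f}")
print(f"PFA at threshold:         {pfa_at_warn:.3e}")
print(f"PFA at twice the threshold: {pfa_at_double:.3e}")
```

In this simplified case, doubling the level takes the false alarm probability from 1e-6 to 1e-24, which is the kind of non-linear behavior described above.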
In short, because of the nature of the problem, it is not feasible to identify a bad component by comparing it to a library of bad components (as Big Data attempts to do). GPMS uses normal data, which is plentiful, and shows when a component is no longer good. The result: our system requires a very small data set (typically 10 minutes of data). We then refine the configuration after installing on 3 to 5 aircraft. The beauty is in the simplicity, but it is deceptively hard to implement, which is why no one else does it.
Step 3: Predictive Estimates of Remaining Useful Life (RUL)
A system provides little value if it can’t provide the asset owner (or its maintainer) with advance notice of maintenance requirements. For illustration: think of the difference between the oil light on a car dashboard and the gas gauge that provides range until empty. The first tells you the problem has happened. The second gives you advance notice and the ability to take corrective action.
(Incidentally, the increasing interest we hear in the helicopter industry for “Real Time HUMS” is recognition that the industry has failed to provide predictive solutions: they’re trying to work around the above-noted oil light limitation. But that’s a whole other topic for a different day.)
In terms of predictive solutions, Big Data provides no model for usage that would demonstrate progression of each type of fault on each component type in each location in a given gearbox. Assume, for argument’s sake, a complete Big Data library of fault information. If the library held a bearing fault that you knew trended to failure in 150 hours (i.e., progressed from a health indicator of 0.75 to 1), a Big Data system would estimate 150 hours of RUL any time it saw a component at 0.75, regardless of how that component was actually being used.
In contrast, GPMS’ model-based system looks at level (Health Indicator), the derivative (rate of change of level), and load to estimate Remaining Useful Life (RUL). If the load is low, the derivative will go to zero, and the RUL will increase. Big Data simply does not have this feedback capability.
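The contrast between a fixed library lookup and a trend-based estimate can be sketched as follows. This is a deliberately simple linear extrapolation, not the GPMS model: the function names, the `load_ratio` parameter, and the assumption that the rate of change of HI scales linearly with load are all hypothetical and chosen only to show the feedback effect.

```python
import math


def library_rul(hi, library_hours=150.0, library_hi=0.75):
    """Lookup-style estimate: a fixed number of hours once HI reaches the
    stored level, ignoring trend and load."""
    return library_hours if hi >= library_hi else math.inf


def trend_rul(hi, hi_rate_per_hour, load_ratio=1.0, limit=1.0):
    """Hours until HI reaches `limit`, by linear extrapolation.

    hi_rate_per_hour: observed derivative of the Health Indicator.
    load_ratio: expected future load relative to the load at which the
        derivative was observed (assumed here to scale the rate linearly).
    """
    if hi >= limit:
        return 0.0
    rate = hi_rate_per_hour * load_ratio
    if rate <= 0.0:
        return math.inf          # no progression: no predicted end of life
    return (limit - hi) / rate


nominal_rate = 0.25 / 150.0      # HI climbs from 0.75 to 1.0 in 150 hours

rul_lookup = library_rul(0.75)
rul_nominal = trend_rul(0.75, nominal_rate)
rul_half_load = trend_rul(0.75, nominal_rate, load_ratio=0.5)
rul_stalled = trend_rul(0.75, 0.0)

print(rul_lookup, rul_nominal, rul_half_load, rul_stalled)
```

With these assumed inputs the lookup always answers 150 hours at an HI of 0.75, while the trend-based estimate matches it at nominal load, doubles when the load is halved, and becomes unbounded when the derivative goes to zero.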
I’ve written 10 papers on Big Data approaches to RUL and reviewed 100+ papers over the last five years. In these papers, the time window that the leading Big Data researchers are able to establish for the RUL is 30 to 60 minutes. In contrast, the GPMS physics-based model can identify faults at the earliest stages and then provide a remaining useful life estimate of more than 150 hours. That is more than enough time to schedule maintenance (with the right people, parts and time on hand) around planned maintenance windows.
***
To be successful, Big Data and AI would need to excel at each of the steps outlined above. As noted, such approaches fail at each step. As the saying goes, “garbage in, garbage out.” If you don’t have the hardware and algorithms needed to extract faults, identify them, and provide meaningful predictive estimates for the underlying component, you haven’t accomplished much.
It’s time we got real about the potential of Big Data and AI in the area of machine condition monitoring. While these new methods are tremendously powerful in some applications, they are not suited to all applications. “If your only tool is a hammer, then every problem looks like a nail.” And complex rotating machinery? It’s more of a bolt.