People have lost confidence in anomaly detection
It is becoming quite common for companies to be reluctant to talk about anomaly detection, usually because they haven’t had positive experiences when they implemented this kind of technology on their network. I understand their frustration: lots of vendors promise the holy grail with anomaly detection and most of them fail to deliver. Even so, I still believe that anomaly detection is one of the cornerstones of autonomous networks; we just need to modernize our anomaly detection technologies.
Problem #1: anomaly detection has a different meaning for different people
The first issue I see is the concept itself: anomaly detection means different things to different people.
After talking with hundreds of network and infrastructure owners, I have found that some of them believe anomaly detection is what happens when a KPI goes above a static threshold, for example when the CPU of a device is over 80%. In my view, and in the way most people understand the term, that is not anomaly detection. Anomaly detection is when a system can detect some kind of unexpected behaviour, normally based on what the system has learned to be normal.
Therefore, anomaly detection requires some kind of learning of what is normal (and what is not); it is not just static thresholds. And it is important that this learning comes with a high level of granularity and seasonality.
Let’s look at an example. During the defence of an RFP that I ran for an MSP in Europe, the conversation turned to anomaly detection, and the customer mentioned that a competitor also did anomaly detection based on machine learning. After digging a bit deeper we realised that what this tool did was compute the average utilization over a 24h period, for a set of KPIs, and trigger an alert if the average for the last 24h deviated from the expected value by more than one standard deviation.
I understand, and agree, that this is a nice feature to have, but it is clearly incomplete for several reasons:
- You need to wait 24h before you can raise an alert (it is not real time)
- It does not handle weekends, when usage is drastically different
Therefore, as mentioned before, more granularity and seasonality are required; otherwise people lose confidence in anomaly detection.
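To make the point concrete, here is a minimal sketch (not any particular vendor’s implementation) of a seasonality-aware baseline: learn a per hour-of-week mean and standard deviation from history, then score each new sample as it arrives instead of waiting for a 24h average.

```python
import numpy as np
import pandas as pd

def fit_seasonal_baseline(history: pd.Series) -> pd.DataFrame:
    """Learn a per hour-of-week baseline (mean and std) from a
    datetime-indexed KPI series, e.g. 5-minute interface utilization."""
    hour_of_week = history.index.dayofweek * 24 + history.index.hour
    grouped = history.groupby(hour_of_week)
    return pd.DataFrame({"mean": grouped.mean(), "std": grouped.std()})

def is_anomalous(ts: pd.Timestamp, value: float,
                 baseline: pd.DataFrame, n_sigmas: float = 3.0) -> bool:
    """Score a single new sample as soon as it arrives (no 24h wait)."""
    expected = baseline.loc[ts.dayofweek * 24 + ts.hour]
    return abs(value - expected["mean"]) > n_sigmas * expected["std"]

# Synthetic example: weekday utilization ~60%, weekend ~20%.
idx = pd.date_range("2024-01-01", periods=12 * 24 * 28, freq="5min")  # 4 weeks
history = pd.Series(
    np.where(idx.dayofweek < 5, 60.0, 20.0) + np.random.normal(0, 2, len(idx)),
    index=idx,
)
baseline = fit_seasonal_baseline(history)

# 65% at 09:00 on a Saturday is an anomaly against the Saturday baseline,
# even though it would look perfectly normal against a flat 24h average.
print(is_anomalous(pd.Timestamp("2024-01-27 09:00"), 65.0, baseline))  # True
```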

Static threshold alerts will generate false alerts
Problem #2: Old algorithms vs time series foundation models
Another issue we face with anomaly detection is that we are still using algorithms that were created 10-15 years ago. They are still useful and detect anomalies in lots of different situations, such as deviations from normal, but they were not built to detect all the relevant anomalies that occur on a network. This creates a lot of frustration in network teams that have invested in anomaly detection technologies and still suffer incidents that their tools were not able to detect.
To solve this problem we need to embrace newer technologies that take advantage of the latest advances in artificial intelligence, for example the new time series foundation models such as TSPulse or Tiny Time Mixer (TTM) that are pretrained on domain-specific data (one model can be trained on MPLS data, another one on RAN data, and so on). These new foundation models allow us to detect anomalies that the old algorithms could not, reducing MTTR and MTBI.
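The exact inference API depends on the model and the library serving it, so the snippet below only sketches the general pattern under the assumption that a pretrained model gives us a forecast for the next window: flag the samples whose forecast error is far outside the error the model normally makes on known-good data (the `ttm_model.predict` call in the comments is a placeholder, not a real API).

```python
import numpy as np

def forecast_residual_anomalies(observed: np.ndarray,
                                predicted: np.ndarray,
                                calibration_residuals: np.ndarray,
                                n_sigmas: float = 4.0) -> np.ndarray:
    """Return indices where the KPI deviates from the model forecast by far
    more than the forecast error measured on known-good data.

    `predicted` would come from a pretrained time series foundation model
    (for example a TTM checkpoint fine-tuned on MPLS or RAN telemetry);
    the model call itself is left out because it depends on the serving stack.
    """
    threshold = n_sigmas * calibration_residuals.std()
    residual = np.abs(observed - predicted)
    return np.where(residual > threshold)[0]

# Hypothetical wiring (placeholder names, not a real API):
#   context    = kpi_history[-512:]             # model context window
#   predicted  = ttm_model.predict(context)     # next-horizon forecast
#   anomalies  = forecast_residual_anomalies(kpi_next_horizon, predicted,
#                                            calibration_residuals)
```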

Example of predicted behaviour using TTM time series foundation model
Problem #3: Malicious vs Benign anomalies
Are all anomalies bad? They don’t have to be; they could be neutral or even good. For example, wouldn’t it be a good thing that the traffic on a website is higher than expected during Black Friday? And wouldn’t it be a bad sign if the number of connections to an online streaming service during the Champions League final were merely normal for a Saturday night?
There are situations where a KPI behaving in an abnormal way doesn’t mean there is a problem (i.e. there is nothing to fix) and, therefore, we shouldn’t be notified.
Most companies embrace anomaly detection precisely to avoid being flooded with false alerts (what we call alert fatigue), and the old anomaly detection algorithms can’t be trained to separate benign or neutral anomalies from malicious ones.
To make this clearer, let’s take a more common example: a pair of internet routers configured in HA, where suddenly the traffic on router 1 goes down and the traffic on router 2 goes up. This could be considered a benign anomaly: even though lower-than-expected traffic on router 1 is an anomaly, it is not a problem; it means the HA configuration worked as expected.
How can we be sure whether an anomaly is malicious or benign?
My answer to that is context. If we could tag some of our time series data with labels that describe the situation, such as “football match”, “twitch stream” or “internet down in South London” (which could be the reason, or context, why there are more people in the office today), then we would be able to filter out those anomalies that are actually expected.
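A minimal sketch of that idea, assuming we simply keep a list of labelled time windows (the labels and KPI names below are illustrative): an anomaly that falls inside a window that explains it is classified as benign instead of paging anyone.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ContextWindow:
    label: str           # e.g. "black friday", "champions league final"
    start: datetime
    end: datetime
    kpis: List[str]      # KPIs whose deviations are expected in this window

def classify_anomaly(kpi: str, ts: datetime,
                     contexts: List[ContextWindow]) -> str:
    """Return 'benign' when the anomaly is explained by a context window,
    otherwise mark it as needing attention."""
    for ctx in contexts:
        if ctx.start <= ts <= ctx.end and kpi in ctx.kpis:
            return f"benign ({ctx.label})"
    return "needs attention"

contexts = [
    ContextWindow("black friday", datetime(2024, 11, 29), datetime(2024, 12, 2),
                  kpis=["web_traffic_gbps", "cdn_requests_per_s"]),
]

# Expected spike during Black Friday -> benign; unrelated KPI -> investigate.
print(classify_anomaly("web_traffic_gbps", datetime(2024, 11, 29, 20, 0), contexts))
print(classify_anomaly("bgp_session_flaps", datetime(2024, 11, 29, 20, 0), contexts))
```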

Some anomalies are neutral or benign
Problem #4: Lack of correlation between KPIs
And the last problem I see with current anomaly detection is the lack of correlation between KPIs. This relates to the fact that current anomaly detection algorithms work at a univariate level, meaning they only consider individual KPIs, with no context of the KPIs around them.
I can recall a real conversation with a company that was using anomaly detection on their network, but they used to complain that, even with anomaly detection enabled, they received lots of alerts for the same incident. One of the examples they mentioned was an issue (an anomaly) with the optical power of a switch interface. This single issue generated multiple anomalies: the loss of power, but also an increase in errors and discards on that interface, reduced traffic on that interface as well as on many other interfaces in their network, higher latency in some services… Basically, they ended up with tens of alerts triggered by the same issue. These are false alerts, because they don’t need to be actioned in any way, and they led to alert fatigue.
This is something that AIOps platforms are supposed to fix by correlating events together; however, they need training (i.e. seeing the issue several times) to be able to temporally correlate the issues and show them as a single incident, and these systems won’t be able to help with the 20% of incidents in the network that happen only once a year.
A solution to this problem is to use a multivariate anomaly detection algorithm, like the TTM time series foundation model mentioned above, that can detect anomalies by mixing multiple KPIs together and understanding the correlation among them. With these foundation models you get the best of both worlds: you detect anomalies that static thresholds wouldn’t, but you don’t generate hundreds of incidents that are all part of the same issue.
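Multivariate foundation models learn these cross-KPI relationships from data, but even a much simpler joint score illustrates why a single faulty optic should produce one alert instead of tens. The sketch below (synthetic data, with Mahalanobis distance as a stand-in for a learned multivariate model) scores the whole KPI vector at once.

```python
import numpy as np

def fit_joint_baseline(history: np.ndarray):
    """Learn the joint behaviour of several KPIs from history.

    `history` has shape (n_samples, n_kpis), e.g. columns for optical power,
    interface errors, discards, traffic and latency."""
    mean = history.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(history, rowvar=False))
    return mean, cov_inv

def joint_anomaly_score(sample: np.ndarray, mean: np.ndarray,
                        cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one multivariate sample from the baseline.
    A faulty optic that drags power, errors, discards, traffic and latency
    off their joint pattern shows up as ONE large score, not five alerts."""
    diff = sample - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Synthetic example with 5 KPIs observed every 5 minutes for ~5 weeks.
rng = np.random.default_rng(0)
history = rng.normal(size=(10_000, 5))
mean, cov_inv = fit_joint_baseline(history)

incident = np.array([-6.0, 5.5, 5.0, -4.0, 3.5])    # correlated symptoms of one fault
print(joint_anomaly_score(incident, mean, cov_inv))  # one big score -> one incident
```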

Some anomalies can't be detected using univariate algorithms
#TechnicalBlog