Not everyone understands why we need anomaly detection in network monitoring. Some companies only want to know when something reaches 90% utilization, and that’s it. But I’m a strong believer that this shouldn’t be the case: to be more proactive, and even predictive, we need to use AI and ML to help us analyse the data we are collecting and find those little anomalies that will help us avoid outages.
Scenario
Let’s review this with a very simple scenario:
We have a company with multiple branches and several data centres, and we want to monitor the availability of the internet connection at each of these locations using a monitoring tool.

One caveat: each location has one or more internet connections. Detecting an outage at a location with a single router is simple, but how do we detect outages at locations with more than one router?
This looks simple, with a very short answer: monitor each internet connection individually and trigger an alert when all of them are down, right?
If we want to be a little more clever we could create a custom metric that shows the total availability of the site, something like:
Total availability = (availability router1 + availability router2) / 2
* Assuming the location has only two routers
This way we can trigger different alert severities depending on the total availability:
· 100% -> all good
· 50% -> minor incident
· 0% -> major incident
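
The mapping above can be sketched in a few lines of Python. This is only an illustration, not how any particular monitoring tool implements it: the function names site_availability and severity are made up for this post, and treating every value strictly between 0% and 100% as a minor incident is my own assumption, so that the same rule also works for sites with more than two routers.

```python
def site_availability(router_availabilities):
    """Average availability (0-100) of all routers at one site.

    Hypothetical helper: generalises the two-router formula
    (router1 + router2) / 2 to any number of routers.
    """
    if not router_availabilities:
        raise ValueError("a site needs at least one router")
    return sum(router_availabilities) / len(router_availabilities)


def severity(total_availability):
    """Map total site availability to an alert severity.

    Assumption: anything strictly between 0 and 100 counts as minor.
    """
    if total_availability >= 100:
        return "ok"
    if total_availability > 0:
        return "minor"
    return "major"


# Two-router site with one router down: degraded, but still reachable.
print(severity(site_availability([100, 0])))
# Single-router site with its only router down: full outage.
print(severity(site_availability([0])))
```

With this rule, a single-router site can only ever be "ok" or "major", which matches the idea that losing the only router is always a full outage.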

Router001 is in a location with two routers, while Router003 is the only router at its location; hence they raise alerts of different severity.

This is good, but it does not give us all the information we need; networks are never that simple.
Why do we need anomaly detection?
We might have only two internet connections from two different routers, but many other elements of the network affect those connections: routing tables, firewall rules, and so on.
Think about this situation: the first router goes down, so HA kicks in (let’s say VRRP, HSRP, GLBP or any other flavour) and all traffic is now routed through the secondary internet connection. Unfortunately, a misconfiguration at the firewall level is not allowing the traffic through.
The second internet connection is active, so no alert is triggered and everyone thinks everything is working fine, but it is not… How can we be more proactive and avoid this situation?
Solution
Anomaly detection is key in this situation: if we know what normal traffic looks like at this location, we can trigger an alert when an anomaly in the traffic pattern is detected and flag the problem before it’s too late.
But there is another issue: each device reports only the traffic passing through that device, and what we need is the total traffic through both routers. It may be that traffic normally goes only through the primary internet connection, so it would be normal to see no traffic on the secondary connection. The solution is to create another custom metric with the total traffic going through that location:
Total traffic = Traffic router1 + Traffic router2
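
To make the idea concrete, here is a minimal sketch of that sum computed per polling interval. The data layout (a dictionary of timestamp -> bits per second for each router) and the name total_traffic are assumptions for illustration only; in a real deployment the monitoring tool's custom-metric feature would do this for you.

```python
def total_traffic(per_router_samples):
    """Sum traffic across all routers of a site, per timestamp.

    per_router_samples: {router_name: {timestamp: bits_per_second}}
    (hypothetical layout). A router with no sample at a given
    timestamp simply contributes nothing to that interval.
    """
    totals = {}
    for samples in per_router_samples.values():
        for timestamp, bps in samples.items():
            totals[timestamp] = totals.get(timestamp, 0.0) + bps
    return totals


# Primary carries everything; secondary is idle, which is normal here.
site = {
    "router1": {"10:00": 80e6, "10:05": 85e6},
    "router2": {"10:00": 0.0, "10:05": 0.0},
}
print(total_traffic(site))
```

The point of summing is that the site total stays roughly the same after a healthy failover, even though each individual router's traffic changes completely.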

Bear in mind that SevOne learns the normal behaviour not only of any metric monitored from the devices, but also of any custom metric created.

Now we can be proactive and trigger a major incident alert when at least one internet connection is down AND the total traffic through that location is lower than expected, meaning that even though the secondary internet connection is active, it is carrying less traffic than expected.
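
As a rough sketch of that combined rule: the snippet below stands in for the learned baseline with a very simple one, the mean of past total-traffic samples minus k standard deviations. This is my own simplification for illustration and not how SevOne models normal behaviour; the names lower_bound and major_incident and the default k = 3 are assumptions.

```python
from statistics import mean, stdev


def lower_bound(history, k=3.0):
    """Lowest total traffic still considered normal.

    Assumption: 'normal' is mean - k * standard deviation of past
    samples for a comparable time window (needs at least 2 samples).
    """
    return max(0.0, mean(history) - k * stdev(history))


def major_incident(routers_up, current_total, history, k=3.0):
    """True when at least one router is down AND total traffic is
    below the expected range for this site."""
    any_router_down = not all(routers_up)
    traffic_too_low = current_total < lower_bound(history, k)
    return any_router_down and traffic_too_low


history = [100.0, 110.0, 90.0, 105.0, 95.0]  # past totals, Mbps

# Failover worked: primary is down but the total looks normal.
print(major_incident([False, True], 98.0, history))
# Failover broken (e.g. firewall blocks traffic): total collapses.
print(major_incident([False, True], 5.0, history))
```

In the firewall scenario described earlier, the second call is the one that matters: the secondary connection is up, so an availability-only check stays quiet, but the collapse in total traffic turns it into a major incident.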
