What and why?
Recently, the AIOps team and I have been doing some work with IBM's CIO office to help manage their worldwide environment. As you can imagine, it is a large and distributed environment spread across many sites throughout the world.
Key goals of managing environments like IBM's are to quickly localise problems and equipment, unify operations across otherwise disparate management silos and domains, and present the data in an appealing and consumable way - exactly what the Cloud Pak for AIOps is designed to do. In this AIOps Bitesize, we explore how senior engineer James Kemish's fantastic new host-to-host grouping capabilities for IBM Cloud Pak for AIOps Topology Manager can help meet these goals by modelling sites. The new host-to-host option for tag and token templates provides a much simpler way of grouping devices and summarising their connectivity.
For extra context, read [this article] about Topology Manager's data processing, [this article] about how grouping and templates work and [this article] about our geospatial capabilities.
Let's Go
Our specific requirements and assumptions are:-
Our devices have site data including latitude and longitude properties. Each site can contain many devices.
Automatically identify sites in our data and create a correlate-able group of the devices in each site.
When viewing a site, show its topology as a set of connected devices.
View our sites on a geographical map.
Use custom icons to add some visual sparkle to our UI. BONUS - the SVG icons used in this article are available [from my Git repository].
When looking at a device, show its site data in a tooltip.
Topology Manager has been configured to use a GIS tile server.
Check out the product documentation at these links for this AIOps Bitesize :-
What do we mean by 'site' and why are they important to manage?
By site, we mean a physical location that contains resources that need to be managed. Other common terms for sites include PoP (Point-of-Presence) and Node. Sites represent anything that matters to you and can include buildings, regions or just geographical coordinates. The thing they have in common is that you can consider them as a group of resources and, naturally, sites can be related to other sites - such as in the case of a WAN.
Sites are really valuable from a management standpoint because they provide extra context. That context helps with localised procedures such as truck-rolls, with localised impact analysis and the notification of affected users, and it enables better alert grouping, visualisation, search and filtering.
How can we model sites?
There are a number of options here and the 'best' choice depends on specific use-cases. A good and basic starting point is to take advantage of our previous comment - sites are groups of resources and so we just need data to group on. Ideally our site data would include location geometry in GeoJSON format as it can provide a lot of extra value, but it's not required to get started - just a property that consistently reveals a site name is enough to do some grouping.
For the purposes of this article, we'll assume that the following site data is available for each device:-
siteCountry - the country in which the device resides, for example GB (Great Britain).
siteName - the short name of the site, for example 20 York Road.
siteCity - the name of the town or city nearest to the device, for example Coventry.
longitude - a WGS84 longitude floating point number. ****
latitude - a WGS84 latitude floating point number. ****
**** AIOps topology manager requires a GeoJSON feature to be provided as the value of a resource's geolocation property to enable geospatial capabilities.
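As a minimal sketch of that footnote, the following Python shows one way to turn separate longitude and latitude values into a GeoJSON Point feature of the same shape as the geolocation value in the example device record later in this article. The function name to_geojson_point and the sample dictionary are illustrative assumptions, not part of the product. Note that GeoJSON orders coordinates as longitude first, then latitude.

```python
import json


def to_geojson_point(longitude: float, latitude: float) -> dict:
    """Build a GeoJSON Feature holding a Point geometry.

    GeoJSON coordinate order is [longitude, latitude].
    """
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
    }


# Hypothetical source row with plain longitude/latitude columns.
device = {
    "siteCity": "Portsmouth",
    "siteName": "Neptune",
    "longitude": -1.0974285321365755,
    "latitude": 50.8015879383928,
}

# Store the feature under the geolocation property.
device["geolocation"] = to_geojson_point(device["longitude"], device["latitude"])
print(json.dumps(device["geolocation"]))
```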
If your use-cases need it, you could use more advanced models such as including sites as resources and explicitly relating them to their resources, or using list properties and token templates to add resources to multiple groups. Each approach has its pros and cons, including how sites are visualised and searched for, whether they can have explicit state or not, and how groups are built. The approach we're going to take can be thought of as just grouping and categorisation and, as a result, a con is that our site groups cannot have their own explicit state.
We can think of our approach here as being like an SQL SELECT ... GROUP BY query. We have a table of resources with site properties that we want to group on, the result being a group for each site and its devices. Because we're using the same latitude and longitude for each device in a given site, Topology Manager will consider these devices to be in the same location.
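To make the SQL analogy concrete, here is a small sketch that uses Python's built-in sqlite3 module. The table name resources and the second and third device rows are assumptions made up for illustration. This is only an analogy for how the grouping behaves, not how Topology Manager stores or queries its data.

```python
import sqlite3

# In-memory table standing in for our resources and their site properties.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (name TEXT, siteCity TEXT, siteName TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?, ?)",
    [
        ("WRTVL-R-73", "Portsmouth", "Neptune"),
        ("WRTVL-S-01", "Portsmouth", "Neptune"),     # hypothetical device
        ("CVTRY-R-10", "Coventry", "20 York Road"),  # hypothetical device
    ],
)

# One result row per site, with the number of devices in it.
sites = conn.execute(
    """
    SELECT siteCity, siteName, COUNT(*) AS devices
    FROM resources
    GROUP BY siteCity, siteName
    ORDER BY siteCity, siteName
    """
).fetchall()

for city, name, count in sites:
    print(f"{city} - {name}: {count} device(s)")
```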
How do I provide site data?
If your chosen data sources don't provide site data, then don't worry, you could merge your in-house site data as an overlay topology or [use a file enrichment rule as described in this article]. For this article, we'll use a File Observer file that models a WAN spread across a number of sites in mainland Great Britain. BONUS - The file contains devices, ports, connectivity and site data and is available [from my Git repository].
Example device and port records follow. The model follows the documented device modelling recipe, where the port is contained by and partOf the device, and connectivity is between ports. Note that our site data just needs to be on the devices.
V:{"_operation":"InsertReplace","uniqueId":"WRTVL-R-73","entityTypes":["router"],"name":"WRTVL-R-73","_references":[],"model":"8804","siteCountry":"GB","siteCity":"Portsmouth","siteName":"Neptune","geolocation":{"type":"Feature","geometry":{"type":"Point","coordinates":[-1.0974285321365755,50.8015879383928]}}}
V:{"_operation":"InsertReplace","uniqueId":"WRTVL-R-73[port4]","entityTypes":["networkinterface"],"name":"port4","_references":[{"_edgeType":"partOf","_toUniqueId":"WRTVL-R-73"},{"_edgeType":"contains","_fromUniqueId":"WRTVL-R-73"},{"_edgeType":"connectedTo","_toUniqueId":"MHRWL-R-73[port2]"}]}
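If you need to generate records like these from your own inventory, a rough Python sketch follows. It assumes that each line is the V: prefix followed by a single JSON object, as in the two records above. The helper names file_observer_line and port_vertex are hypothetical and the property names are copied from the example, so check the File Observer documentation before relying on it.

```python
import json
from typing import Optional


def file_observer_line(vertex: dict) -> str:
    """Serialise one vertex as a 'V:' line (format assumed from the examples)."""
    return "V:" + json.dumps(vertex, separators=(",", ":"))


def port_vertex(device_id: str, port_name: str,
                peer_port_id: Optional[str] = None) -> dict:
    """Build a port record that is partOf, and contained by, its device."""
    references = [
        {"_edgeType": "partOf", "_toUniqueId": device_id},
        {"_edgeType": "contains", "_fromUniqueId": device_id},
    ]
    if peer_port_id is not None:
        # Connectivity is modelled between ports, not between devices.
        references.append({"_edgeType": "connectedTo", "_toUniqueId": peer_port_id})
    return {
        "_operation": "InsertReplace",
        "uniqueId": f"{device_id}[{port_name}]",
        "entityTypes": ["networkinterface"],
        "name": port_name,
        "_references": references,
    }


line = file_observer_line(port_vertex("WRTVL-R-73", "port4", "MHRWL-R-73[port2]"))
print(line)
```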
Here's an example of what part of our ungrouped topology looks like once loaded, including the site data and use of custom icons.
Turbocharged Token Templates to the rescue
The challenge is how to efficiently group devices based on your criteria and quickly determine any connectivity between them. Prior to IBM Cloud Pak for AIOps v4.9 this was tricky because topology summaries can be expensive to calculate, so Topology Manager's templates could not make use of host-to-host views and could only use 'element-to-element' topologies. For those on older versions, it is possible with customisation but we'd highly recommend that you upgrade.
v4.9 solves this problem, because far more efficient processing can be done if you know the set of devices that are going to be in a given group. v4.9 enhances tag and token templates to take advantage of this by including a host-to-host hop type, which is what we'll use here - all we need to know are the devices of interest and Topology Manager efficiently figures out the connectivity between them.
The following figure demonstrates the nature of the problem. The input topology is shown on the left and even this simple graph has 13 resources and 12 relationships, whereas the one we typically want to see is shown on the right. It is tempting to use Dynamic or Exact templates and traversals to solve this problem, but they can yield very large groups and it's easy to run into selectivity issues - e.g. what if you want all devices in a given site but there are no tags, relationship types or resource types selective enough to help guide the traversals?
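To illustrate the idea of summarising port-level connectivity into device-level connectivity, here is a toy Python sketch. It is not how Topology Manager implements host-to-host processing. The function name host_to_host and the input structures are assumptions chosen for the example. It simply shows why knowing the set of devices of interest up-front keeps the work small.

```python
from typing import Dict, Iterable, Set, Tuple


def host_to_host(
    part_of: Dict[str, str],
    port_links: Iterable[Tuple[str, str]],
    hosts: Set[str],
) -> Set[Tuple[str, str]]:
    """Collapse port-to-port links into links between the given hosts.

    part_of maps a port id to the device it belongs to. Links whose
    endpoints are not both in hosts, or that stay within one device,
    are dropped.
    """
    edges = set()
    for port_a, port_b in port_links:
        host_a = part_of.get(port_a)
        host_b = part_of.get(port_b)
        if host_a in hosts and host_b in hosts and host_a != host_b:
            # Sort so that A-B and B-A count as the same undirected edge.
            edges.add(tuple(sorted((host_a, host_b))))
    return edges


part_of = {
    "WRTVL-R-73[port4]": "WRTVL-R-73",
    "MHRWL-R-73[port2]": "MHRWL-R-73",
    "MHRWL-R-73[port3]": "MHRWL-R-73",
    "OTHER-R-01[port1]": "OTHER-R-01",  # hypothetical device outside the group
}
port_links = [
    ("WRTVL-R-73[port4]", "MHRWL-R-73[port2]"),
    ("MHRWL-R-73[port3]", "OTHER-R-01[port1]"),
]

summary = host_to_host(part_of, port_links, hosts={"WRTVL-R-73", "MHRWL-R-73"})
print(summary)
```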
Where this new approach really pays dividends is when dealing with larger sites which can include large numbers of devices, such as the simulated one shown below.
Creating our Site Groups
Let's walk through the process of creating our groups. We'll use a token template because we (a) want a group for each site and (b) don't know up-front how many sites we have, just what the site properties are. Another benefit of Token Templates is that resources don't need to be related to other resources for them to be grouped.
1. Create a new Token Template as follows, completing the standard parameters and choosing Host-to-Host as the group hop type. Note that we've tagged the resulting groups with Site Blog and GB Site for extra context and we've used a custom location icon for the groups.
2. We now need a rule for the Token Template. This rule creates 'tokens' that are the values used to group by. You can think of them as the value that an SQL GROUP BY clause would use and, as with SQL, you don't need to use a single column - it could be the result of concatenation, a regex etc. These tokens control how the resulting groups will be identified and named and so have a bearing on the number of groups created and their size. [This article] provides more details of how all of this works and best-practice is to aim for a mid-ground on group quantity and size as large groups with thousands of resources may result in a poor UX.
Our devices have three site related properties we could use - siteCountry, siteName and siteCity. siteCountry alone may result in a large group. siteCity sounds appealing and could be a good option, although its groups could still be large and may lack the finer grained context of siteName. siteName appeals because it reveals something that's likely tangible to the user, but it does not include where the site is. In this case, we can combine siteCity and siteName to give the best of both worlds for context and size. I follow best-practice by targeting the rule to a given Observer job and type and I want our group names to be delimited and so I specify the following token: ${siteCity} - ${siteName}
3. I now re-run my site load Observer job as Token Templates need processing by Observers for them to work. Check the Template admin screen for confirmation of the groups being created.
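To picture what the token rule from step 2 does, the sketch below emulates it in Python with string.Template, which happens to share the ${name} placeholder syntax. This is only an emulation for illustration. The function group_by_token, the extra device rows and the choice to skip devices that lack a site property are all assumptions, not documented product behaviour.

```python
from collections import defaultdict
from string import Template
from typing import Dict, Iterable, List

# Same shape as the token used in the rule above.
TOKEN_RULE = Template("${siteCity} - ${siteName}")


def group_by_token(devices: Iterable[dict]) -> Dict[str, List[str]]:
    """Group device names by the token evaluated from their properties."""
    groups = defaultdict(list)
    for device in devices:
        try:
            token = TOKEN_RULE.substitute(device)
        except KeyError:
            # Assumption of this sketch: no token means no group membership.
            continue
        groups[token].append(device["name"])
    return dict(groups)


devices = [
    {"name": "WRTVL-R-73", "siteCity": "Portsmouth", "siteName": "Neptune"},
    {"name": "WRTVL-S-01", "siteCity": "Portsmouth", "siteName": "Neptune"},     # hypothetical
    {"name": "CVTRY-R-10", "siteCity": "Coventry", "siteName": "20 York Road"},  # hypothetical
    {"name": "LAB-R-99"},  # hypothetical device with no site data
]

site_groups = group_by_token(devices)
for token, members in sorted(site_groups.items()):
    print(token, members)
```

One group is produced per distinct token value, which is why the choice of properties in the token has a direct bearing on how many groups you get and how big they are.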
Using Sites
Once your groups have been created, your incident and alert management in AIOps will benefit from them, assuming they're configured for correlation and alerts are correlated to their members. Here's a quick tour of what you can do in Topology Manager once they're created.
Finding Sites
Because we tagged our sites, we can easily find any that were created by our Token Template. You can either do a Google-like search for an ad hoc term, such as blog, or create a more structured filter, which is what I did here. The benefit of using a filter over an ad hoc search is that (a) you can be much more specific with your criteria and (b) you can save filters for quick recall later and allocate them to users or user groups. For example, you can easily find all sites that have problems. Here we can also see how our Token Template rule combined the site city and name.
Viewing Devices at a Site
Now we have our sites, there are a couple of ways we can view which devices are present. From the list above, click on the More details link and then navigate to the Related resources tab.
Or you can get a more topological view by clicking on the Resource group name link.
Viewing Sites on a Map
Because we provided location data, we can also view where our sites are and a summary of the devices in them.
Final Thoughts
In this AIOps Bitesize, we used Topology Manager's new host-to-host token template capability to automatically find sites and show the device-to-device connectivity within them. You saw how token templates provide you with a simple way of finding sites and how the use of custom icons can add sparkle to your experience.
What would your scenario be?