Monitoring APIC Analytics

By Chris Dudley posted Sun September 01, 2024 07:35 AM

  

API Connect analytics stores the API event data for all the API calls to the gateway. This can add up to a large amount of data, especially if payload logging is enabled or you are using a long data retention policy. As the amount of data stored in analytics grows, it increases the system resources needed not just to store the data, but also to respond to queries in the UI and API. This means that not only is more disk space needed, but also more memory or more data nodes.

It is common to want to keep an eye on the health of your APIC Analytics deployment and, to this end, a dedicated API operation was added. It was designed for use in the Analytics UI to display early warnings of problems long before they become serious issues. As such, it returns booleans indicating whether there is a problem, rather than numerical values that then need interpreting or analysis.

This new operation is called the service-status operation and can be called at any scope (cloud, organization, catalog, space). The response returned is the same regardless of which scope is called; it is available at all of them for ease of authentication and so that it can be called from the Analytics UI at every scope.

The service-status operation was added in APIC 10.0.5.6, but was greatly expanded in 10.0.8, which is what this blog post focuses on. Only a couple of the attributes shown below are available in 10.0.5 (please check the 10.0.5 API reference docs for more information there), but I would strongly encourage planning for an upgrade to 10.0.8.

The 10.0.8 reference documentation is available here: https://apic-api.apiconnect.ibmcloud.com/v10/10.0.8.LATEST.html#/IBMAPIConnectAnalyticsAPI_200/operation/%2F{analytics-service}%2Fcloud%2Fservice-status/get

Make a REST API call like this (with a cloud level authentication token):
GET /analytics/{analytics-service}/cloud/service-status

or via the CLI (having done a cloud level apic login):
apic -m analytics service:cloudServicestatus --server <platform API host> --analytics-service <analytics service name> --format json

There are equivalent API and CLI operations for the other scopes.

It will then return a response like this:

{
  "initial_transform": true,
  "long_term_data_enabled": true,
  "storage_enabled": true,
  "rollover_ok": true,
  "aws_storage": true,
  "geoip_enabled": false,
  "diskspace_ok": true,
  "transform_ok": false,
  "storage_memory_ok": false,
  "reindex_ok": false
}

These attributes are all booleans used to control the behaviour of the Analytics UI, but several of them are health checks which can be useful to know about, or to use in your own system monitoring.
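As an example of the second use case, below is a minimal Python sketch of a monitoring check that calls the cloud scope service-status endpoint and reports any *_ok attribute that is false. The host name, analytics service name, token and the exact base path prefix are placeholders and assumptions to adapt to your own environment, not values from a real deployment:

#!/usr/bin/env python3
"""Minimal sketch of a service-status health check.

Assumes you already have a cloud scope bearer token for the platform API
(for example from an apic login). The host, service name, token and path
prefix below are placeholders to adapt to your own environment.
"""
import sys
import requests

PLATFORM_API = "https://platform-api.example.com/api"  # platform API base URL (placeholder; prefix may differ)
ANALYTICS_SERVICE = "analytics-service"                 # your analytics service name (placeholder)
TOKEN = "REPLACE_WITH_CLOUD_SCOPE_TOKEN"                # cloud scope bearer token (placeholder)

def failing_health_checks():
    # GET /analytics/{analytics-service}/cloud/service-status at cloud scope
    url = f"{PLATFORM_API}/analytics/{ANALYTICS_SERVICE}/cloud/service-status"
    resp = requests.get(
        url,
        headers={"Authorization": f"Bearer {TOKEN}", "Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    status = resp.json()
    # Any *_ok attribute that is false indicates a problem worth investigating.
    return [name for name, ok in status.items() if name.endswith("_ok") and ok is False]

if __name__ == "__main__":
    failed = failing_health_checks()
    if failed:
        print("Analytics health checks failing:", ", ".join(failed))
        sys.exit(1)
    print("All analytics health checks passing")

The same check could equally be scripted around the CLI command shown earlier; the important part is alerting on any false *_ok value.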

initial_transform

There can be an hour delay on first installation of Analytics before API event data is transformed into API Summary data. This boolean purely tells the UI whether that first transform has happened yet. It's used to allow the UI to display a message to “check back later” if someone tries to access long term trend reports before any transform runs have happened.

long_term_data_enabled

An internal attribute that says whether long term data is enabled or not. Disabling long term storage is not exposed to customer deployments but is used in some internal scenarios.

storage_enabled

Is analytics internal storage enabled? If analytics is set to offload only then the analytics UI will be disabled.

aws_storage

Only ever set to true for IBM managed deployments in SaaS.

geoip_enabled

Is the GeoIP feature enabled? Enabled by default in 10.0.8, but can be disabled if desired (for example, if all consumers are internal and so use internal IP addresses which can't be resolved to a physical location).

rollover_ok

Is API event rollover working correctly? This is a very important attribute to monitor. Some of the most common problems we see in Analytics are when rollover has failed for whatever reason. Unfortunately, once it fails it is not currently re-attempted. This means that all API events go to a single index which grows and grows, and the retention policy will not work correctly. This tends to lead to running out of disk space and system resources. This will return true if the latest API event index was created in the last couple of days. If it's older than that then it suggests rollover has failed, since there should be at least one new index each day.

diskspace_ok

Is the amount of analytics disk space used less than 70%? Issues can be seen as the amount of free disk space decreases, and it's definitely a very bad idea to completely run out of disk space. The watermark is set to 70%. Once disk usage increases beyond that, either additional disk space needs to be added or retention needs to be decreased to free up disk space.

transform_ok

Is the transformation to long term summary data working correctly? This checks that the hourly transformation job that converts API event data to long term summary data is working correctly and has not failed.

storage_memory_ok

Is analytics storage memory usage below 95%? It can be hard to closely monitor the memory usage of analytics storage because, as a Java application, its memory usage can grow and then be garbage collected even when everything is running perfectly happily. As such, the warning level has intentionally been set very high at 95%, as it was found that false positives were given at lower levels.

reindex_ok

Is the monthly reindex job working correctly? At the beginning of every month (on the 1st, 2nd or 3rd of the month) the previous month's daily summary indices are reindexed into a single monthly index. This decreases the system resources needed and allows us to store data for far longer, because we will be using fewer shards for a single monthly index instead of 30 daily indices. It can also handle aggregation: if there were multiple API calls for the same API in the same time period from the same consumer, it can store a single record with a count instead of a separate record for each one. This can greatly reduce the data stored and so decrease the system resources needed.
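As a purely conceptual illustration of that aggregation step (this is not the actual internal schema or code used by Analytics), reindexing many per-day summary records into one record per API, consumer and month might look like this:

from collections import Counter

# Purely illustrative records; not the real Analytics summary schema.
daily_records = [
    {"api": "orders:1.0.0", "consumer": "app-a", "month": "2024-08", "count": 3},
    {"api": "orders:1.0.0", "consumer": "app-a", "month": "2024-08", "count": 5},
    {"api": "orders:1.0.0", "consumer": "app-b", "month": "2024-08", "count": 2},
]

# One monthly record per (api, consumer, month) with a combined count,
# instead of a separate record for each day.
monthly = Counter()
for rec in daily_records:
    monthly[(rec["api"], rec["consumer"], rec["month"])] += rec["count"]

for (api, consumer, month), count in monthly.items():
    print(f"{month} {api} {consumer}: {count} calls")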

If any of the *_ok attributes return false then the analytics UI will display yellow warning alerts to show that there is a problem. The intention is that the system administrator can then investigate further and consult the documentation, or raise a support case if they need further assistance, and that this can be done calmly and in plenty of time rather than as an urgent high severity support case when the system fails.

If rollover fails then it can often simply be reattempted via the analytics API and that can be all that is needed to get things working again:

apic -m analytics clustermgmt:rolloverIndex --index apic-api-w --server <platform API host> --analytics-service <analytics service name> --format json

(this command needs a cloud admin login)

If the system is running out of disk space then there are instructions in the documentation on how to add more (depending on the storage provider used this might necessitate a disaster recovery exercise as only certain storage providers support adding disk space to existing PVCs): https://www.ibm.com/docs/en/api-connect/10.0.8?topic=analytics-running-out-storage-space

If the system is running out of memory, performing slowly or generally having performance issues then there are a few things that can be tried.

  1. Switch deployment profile

  2. Decrease retention

  3. Decrease log level

  4. Scale horizontally

1. Switch deployment profile

The first step should be ensuring that you are using the correct deployment profile for your usage. There are 7 profiles to choose from (though the n3c4.m16 profile is now deprecated as it is notorious for running out of memory).

Analytics deployment profiles on VMware: https://www.ibm.com/docs/en/api-connect/10.0.8?topic=deployment-planning-your-topology-profiles

Analytics deployment profiles on Kubernetes and OpenShift: https://www.ibm.com/docs/en/api-connect/10.0.8?topic=licensing-analytics-component-deployment-profile-limits

If you are using a profile too small for your usage then simply switch to a larger profile. There is no need to use custom overrides anymore - in some cases they made matters worse. The out of the box profiles have balanced resources for all analytics pods and switching to a larger profile will ensure all parts of analytics get the resources they need.

Changing deployment profile: https://www.ibm.com/docs/en/api-connect/10.0.8?topic=options-change-deployment-profile

2. Decrease retention

Should you not want to change deployment profile then you can look at decreasing analytics retention to keep data for a shorter time period. The longer analytics has to keep data for, the more system resources are needed.

3. Decrease log level

You can also look at decreasing how much data is stored for each event by lowering the API log level from payload or headers down to “activity”. Payload logging has a huge impact on the amount of resources needed in analytics: for example, we store about 4-5KB per event at “headers” level, but adding a 200KB payload would be roughly a 25-times increase in the system requirements.
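To give a feel for how per-event size multiplies out, here is a rough, purely illustrative calculation of raw API event data volume. The 100 TPS load and 30 day retention are assumptions for the example, the per-event sizes are the approximate figures above, and raw data volume is not the same thing as overall system requirements:

# Rough, purely illustrative sizing arithmetic; the TPS and retention are
# assumptions for this example, and the per-event sizes are approximate.
def storage_gb(tps, kb_per_event, retention_days):
    events = tps * 24 * 60 * 60 * retention_days
    return events * kb_per_event / (1024 * 1024)

# ~5KB per event at "headers" level vs the same event with a 200KB payload logged
for label, kb in [("headers (~5KB)", 5), ("with 200KB payload", 205)]:
    print(f"{label}: ~{storage_gb(tps=100, kb_per_event=kb, retention_days=30):,.0f} GB over 30 days")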

4. Scale horizontally

Those using Kubernetes or OpenShift can also make use of the ability to scale horizontally if using “dedicated storage”. Analytics scales very well horizontally and there are customers with 20 analytics data nodes in order to be able to keep data for multiple months at high TPS. More information on this can be found in the documentation: https://www.ibm.com/docs/en/api-connect/10.0.8?topic=options-dedicated-storage-scaling-up Note that dedicated storage and horizontal scaling are not possible on VMware.

If there are issues with the transform or reindex jobs then it is probably best to raise a support case so this can be investigated further: this is not something that is commonly seen or expected, and it likely needs the IBM support team to look into it.

Hopefully this information helps you monitor your Analytics deployment. Any feedback is always welcome.

#analytics #APIConnect #monitoring #IBMAPIConnect
