What's something you're working on that has excited you lately? As a developer working on Analytics, I can say that about the performance and resource utilisation improvements my colleagues and I have made recently, which are available as of version 10.0.5.2. So I'd like to tell you about some of the ways they could help you (infrastructure owners, ops teams) get more punch from your analytics system.
To cut a long story short, 10.0.5.2 comes packing cutting-edge versions of OpenSearch and Logstash, which together offer significant performance improvements over 10.0.1.9 and earlier versions. We've also defined new deployment profiles that allow much greater flexibility and scalability, aiming to suit every business need. For the first time, you can deploy Analytics with up to 8 CPU cores and 64GiB of RAM per node, which together should multiply the capacity of the largest system possible in previous releases.
This should give your business the power to do more with less. However, it's important to understand some key points about pipeline performance to make sure you're getting the most out of the system.
Key improvements
Let's cut straight to the question "how much faster is the Analytics subsystem of 10.0.5.2 than 10.0.1.9?" The short answer is: a lot faster. First, let's consider only the data store section of the pipeline. In my lab setup, I was able to isolate the data stores in each version and benchmark them on the same hardware, using Elasticsearch Rally and OpenSearch Benchmark respectively. It's not realistic to give specific numbers here because there are a lot of factors involved and you may not see the same results. However, all things being equal, document ingestion in 10.0.5.2 at that stage of the pipeline in isolation should be about 20% faster than on 10.0.1.9. How is this achieved? Mainly, we now ship OpenSearch 2.4 as opposed to the older Elasticsearch shipped in 10.0.1.9.
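If you want to get a rough feel for raw ingestion rate yourself, you don't necessarily need a full Rally or OpenSearch Benchmark run to start with. Purely as an illustration, here's a minimal sketch of a quick-and-dirty bulk-ingest timing check using the opensearch-py client; the endpoint, credentials, index name and document shape below are placeholders I've made up, not API Connect defaults, and this is no substitute for the proper benchmarking tools mentioned above.

```python
# Minimal bulk-ingest timing sketch (assumes the opensearch-py client and a
# reachable data store endpoint; host, credentials and index name are placeholders).
import time
from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),   # placeholder credentials
    use_ssl=True,
    verify_certs=False,
)

NUM_DOCS = 100_000
docs = (
    {"_index": "ingest-test", "_source": {"status_code": 200, "latency_ms": i % 500}}
    for i in range(NUM_DOCS)
)

start = time.perf_counter()
helpers.bulk(client, docs, chunk_size=5_000)   # send documents in batches
elapsed = time.perf_counter() - start
print(f"Indexed {NUM_DOCS} docs in {elapsed:.1f}s "
      f"(~{NUM_DOCS / elapsed:,.0f} docs/s)")
```

A crude test like this only exercises the data store in isolation, which is exactly why the whole-pipeline picture below matters just as much.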
That's not all. In addition to upgrading the data store section of Analytics, we also streamlined the entire pipeline and revised memory and CPU allocations across all of our pods to maximise throughput. In particular, we no longer need the Kafka or Zookeeper services, since that functionality is now built into our ingestion service. As a result, in my lab tests using the n3xc4.m16 profile, passing data through the whole pipeline showed that 10.0.5.2 can ingest data nearly twice as fast as 10.0.1.9!
Even better than that, we have new profiles available, including the rather capacious n3xc8.m64. In fact, my testbench system wasn't able to drive this profile all the way to its maximum performance, but with sufficiently fast storage we're confident that this profile will go much faster still.
Understanding pipeline performance and bottlenecks
To really dig into these performance improvements, let's take a few minutes to orient ourselves so we know what we're talking about and why it's important. I'm mainly going to talk about ingestion pipeline performance rather than search performance, simply because in this type of application you tend to care more about performance when storing data than when retrieving it.
So what is the Analytics data ingestion pipeline and what determines how well it performs? Simply put, the Analytics pipeline is a series of computing modules, defined as parts of a cluster, that pass data messages from one module to the next in sequence, much as water flows through a pipe. When an API call happens, the IBM DataPower gateway passes information about the call to our data ingestion pod, which runs the industry-standard Logstash software. That optionally buffers the message (see the docs for details on using persistent queues) and attempts to pass it to the data store, which then indexes all the messages and writes them to backend storage in such a way that we can easily retrieve them later.
We therefore have a chain of processes, one after the other, each in its own container. The upshot of all this is that any one link in the chain can act as a bottleneck if it is slower than the others, reducing the throughput of the pipeline to that of the slowest part - just as the narrowest part of a water pipe determines how much water can flow through the whole length.
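To put that in concrete terms, the end-to-end throughput of a chained pipeline is simply the throughput of its slowest stage. The stage names and figures below are made up purely for illustration, not measurements of any real deployment.

```python
# Illustrative only: end-to-end throughput of a chained pipeline is capped by
# its slowest stage, just like the narrowest section of a water pipe.
stage_throughput_docs_per_sec = {
    "gateway offload": 12_000,        # hypothetical figures, not measurements
    "ingestion (Logstash)": 9_000,
    "data store indexing": 6_500,
    "backend storage writes": 7_000,
}

# The stage with the lowest rate sets the ceiling for the whole pipeline.
bottleneck = min(stage_throughput_docs_per_sec, key=stage_throughput_docs_per_sec.get)
print(f"Pipeline throughput ~ {stage_throughput_docs_per_sec[bottleneck]:,} docs/s, "
      f"limited by: {bottleneck}")
```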
This is very important to understand before we try to quantify how fast your Analytics system will go, because if another link in the chain is the bottleneck, then upgrading from 10.0.1.9 or 10.0.4.0 to the latest version may show no improvement at all - and you would rightly be disappointed. In that case, you would need to look at your API gateway, your virtualisation infrastructure, your storage solution and so on. I'll give some hints on what to look at in this situation at the end of this blog.
Additional considerations affecting performance
Even with the switch to OpenSearch, the same limitation on ingestion capacity that applied with Elasticsearch (up to and including version 7.x) still applies - namely, the memory overhead of sharding. If you aren't familiar with sharding, put simply, the data store needs to spread its indices across multiple fragments in order to provide redundancy and improve search performance. Sharding is good for our data; however, shards do come with overheads, particularly the memory used by the data store to keep track of them.
A key point here is *peak* throughput capacity as opposed to *sustained* throughput capacity. As an illustration, suppose you were storing 50,000,000 API events every day on a single node. Since our default rollover document count is 25,000,000, that means 2 rolled-over indices created per day. With 5 shards per index, and assuming we keep rolled-over indices for 30 days, that gives a grand total of 300 shards in play at one time. The Elasticsearch docs recommend no more than 20 shards per GiB of heap memory, so keeping to that guideline would require 15 GiB of heap. That represents a relatively heavy workload but would fit within the allocation of our m32 profiles. If the event rate were doubled, though, which may be achievable depending on hardware or infrastructure performance, we might need to move up to the next profile option to have enough *memory* to handle the doubled shard count in a sustainable way. However, peak throughput can be higher for some period as long as it averages out to around the same over a day or a week. You could also try pushing past the 20 shards per GiB limit to see how your system performs, since it is only a guideline and the optimal figure will depend on many factors.
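If you want to repeat that back-of-the-envelope calculation for your own event rates, here's a small sketch. The numbers are the same defaults and guideline values discussed above; remember that 20 shards per GiB is a guideline, not a hard limit.

```python
# Back-of-the-envelope shard and heap estimate, using the defaults discussed above.
events_per_day = 50_000_000
rollover_doc_count = 25_000_000    # default rollover document count
shards_per_index = 5
retention_days = 30
shards_per_gib_heap = 20           # guideline, not a hard limit

indices_per_day = events_per_day / rollover_doc_count               # 2
total_shards = indices_per_day * shards_per_index * retention_days  # 300
heap_needed_gib = total_shards / shards_per_gib_heap                # 15

print(f"Indices rolled over per day: {indices_per_day:.0f}")
print(f"Shards in play at one time:  {total_shards:.0f}")
print(f"Heap needed at {shards_per_gib_heap} shards/GiB: {heap_needed_gib:.0f} GiB")
```

Plugging in a doubled event rate (100,000,000 per day) gives 600 shards and around 30 GiB of heap, which is why sustained rate, rather than short peaks, is what should drive your profile choice.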
Alternatively, or additionally, we can change the rollover and retention settings following the documentation. Depending on the exact use case, we could push the minimum age to 7 days and rely solely on document count (which defaults to 25 million) to roll over our indices. This should reduce the shard overhead going forward, assuming the daily number of new documents is less than 25 million. You can also consult the AWS OpenSearch docs, as there is significant commonality between that and API Connect Analytics.
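For API Connect itself, always make these changes through the documented Analytics rollover and retention settings. Purely as a hedged illustration of what the underlying OpenSearch-style rollover criteria look like, here's a hypothetical Index State Management policy fragment expressing the example above; it is not the API Connect configuration format.

```python
# Hypothetical illustration only: OpenSearch ISM-style rollover criteria matching
# the example above (roll over at 25 million documents or 7 days of age).
# In API Connect, use the documented Analytics settings instead of raw policies.
rollover_policy_fragment = {
    "policy": {
        "description": "Example: roll over on document count, with a 7-day minimum age",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "rollover": {
                            "min_doc_count": 25_000_000,  # default document count
                            "min_index_age": "7d",        # example age from the text above
                        }
                    }
                ],
                "transitions": [],
            }
        ],
    }
}
```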
Summary and takeaways
- Analytics in APIC 10.0.5.2 contains updated technology and streamlined resource utilisation, which together deliver much better ingestion performance than previous versions.
- Sharding is a good and necessary thing - but over-sharding can adversely affect the sustainability of heavy ingestion loads. So rather than relying on default settings, always review the rollover and retention settings in Analytics to make sure you have a good balance.
- Performance can also be heavily impacted by backend storage solutions that rely on spinning disk, as well as other infrastructure issues like CPU contention, so it's difficult to give projected figures for performance.
- If you have reached the capacity of your analytics system, you can consider upgrading to one of the newer performance profiles available.