Over the past several years, the IBM Storage Insights team has been hard at work on core architectural changes to the IBM Storage Insights code base. While the end user experience remains largely untouched, the majority of the code base has been rewritten from scratch to transition off our old architecture.
We've delivered the changes in phases, with the first deliverable last fall and the final changes arriving earlier this year. New IBM Storage Insights customers have already been placed on the new architecture, and existing customers are currently being migrated over. Let me explain what was involved in the transition of IBM Storage Insights to the new architecture.
IBM Storage Insights is a cloud-based Software as a Service (SaaS) product where the cloud provider such as IBM is responsible for maintaining the product. Maintenance tasks such as replacing a failed hard drive, applying security updates, upgrading the product, or performing backups are all performed by IBM on behalf of the IBM Storage Insights user. End users benefit by simply consuming IBM Storage Insights remotely over the Internet without having to spend time and money maintaining it locally.
Single-Tenant vs. Multi-Tenant
IBM Storage Insights was originally based on our on-premises sister product, IBM Spectrum Control. IBM Spectrum Control is a mature product that was developed years before cloud-based SaaS products existed. Like other legacy products, it requires a dedicated server to host the product. Since IBM Storage Insights shared the same code base as IBM Spectrum Control, each IBM Storage Insights instance also required a dedicated server in IBM Cloud to host the IBM Storage Insights server and its Db2 database. The result was a single-tenant SaaS architecture: one dedicated server per customer, hosting that customer's Db2 database and IBM Storage Insights server.
While this works, a single-tenant architecture has its drawbacks:
- Each individual instance is not fully utilized all the time, with machine resources often wasted waiting for activity to occur. While Virtual Machine (VM) over-provisioning can help with utilization, it can also cause severe problems when all physical resources on the hypervisor are consumed by unpredictable demand. Additionally, there is still the wasted overhead of the base operating system and product installation.
- If an instance needs to grow beyond the bounds of its current server, a complex scale-up task is required to migrate the data to another server with more resources. This most likely means additional downtime for the end user.
- Monitoring each individual instance is difficult given the isolated nature of each instance.
- Higher operating costs due to all the above.
With a multi-tenant SaaS architecture, multiple SaaS instances, or tenants, are hosted from a single multi-tenant application that spans the resources of many shared servers and services. Even though any two tenants might share common resources, the data of each tenant is isolated to ensure that no tenant can see the data of another, let alone know that others exist. For example, rival soft drink giants like Coca-Cola and Pepsi might both be customers with data in the same database but never know this is the case.
With the move to multi-tenancy, the development team investigated the many tools available to achieve that goal. In the following paragraphs I will go over the various components we decided on in our road to a multi-tenant architecture.
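The isolation guarantee boils down to scoping every read and write by a tenant identifier. The following toy sketch (the class and tenant names are hypothetical, purely for illustration; it is not our actual data layer) shows how a single shared store can still keep tenants invisible to one another:

```python
class TenantStore:
    """Toy model of tenant isolation in a shared store: every row is
    keyed by (tenant_id, key), so no query can cross tenant boundaries."""
    def __init__(self):
        self._rows = {}  # (tenant_id, key) -> value

    def put(self, tenant, key, value):
        self._rows[(tenant, key)] = value

    def get(self, tenant, key):
        # A tenant can only ever address rows under its own tenant id.
        return self._rows.get((tenant, key))

store = TenantStore()
store.put("coca-cola", "capacity", "500 TB")
store.put("pepsi", "capacity", "300 TB")

# Same shared store, but each tenant sees only its own data:
assert store.get("coca-cola", "capacity") == "500 TB"
assert store.get("pepsi", "capacity") == "300 TB"
assert store.get("pepsi", "rival-secrets") is None
```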
Instead of using multiple VMs for the various servers, the development team is using another virtualization technology: containers. If you are familiar with Docker, containers are the technology behind it. While VMs virtualize the hardware, containers take a different approach and virtualize the operating system. The resulting container consists of just the application and a very small overhead for dependencies, which makes the applications that run within containers very portable, lightweight, and quick to start up.
In our case, note that the application within the container is not the entire IBM Storage Insights application. Instead, the team broke the IBM Storage Insights server into multiple independent micro-services, each based on a functional area. Each micro-service is the application that runs within a container. For example, we have one micro-service for the web server and another to process performance data. The collection of containers for the various micro-service applications comprises the entire multi-tenant IBM Storage Insights server.
To keep track of all the IBM Storage Insights containers, the team is using Kubernetes as our container management tool. Kubernetes organizes containers into pods that are deployed on nodes in the cluster.
Pods can easily scale out as user demand grows. If we find any one pod is being overworked, we can spawn additional pods for that workload within a few seconds to handle the extra load. Note that this does not require us to spawn additional pods for all the micro-services; we are able to adjust the number of pods at a per-micro-service granularity depending on how demand grows. So if we find we need more processing power for reports, we can spawn additional pods for the micro-service responsible for reports. Finally, the converse is also true: if we find pods are under-utilized, we can scale down the number of pods to match demand and save on computing resources.
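Our exact autoscaling configuration isn't described here, but the standard formula used by the Kubernetes Horizontal Pod Autoscaler gives a feel for how the pod count tracks demand (the metric values below are made-up examples, not our production targets):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Standard Kubernetes HPA formula:
    desired = ceil(current * (current_metric / target_metric))."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# A reports micro-service running 4 pods at 90% average CPU against a
# 60% target is scaled out to 6 pods:
print(desired_replicas(4, 90, 60))   # -> 6

# The same micro-service idling at 30% is scaled back down to 3 pods:
print(desired_replicas(6, 30, 60))   # -> 3
```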
Kubernetes is also able to perform automatic monitoring of the pods for the development team. If it detects a pod is not responding, it will automatically spin up a replacement pod and destroy the non-responding one. If it detects a pod is consuming more CPU or memory than it is configured for, it can restart the pod. All this makes the application more robust while our DevOps and Support teams debug the underlying problem.
Finally, with rolling upgrades of Kubernetes pods, our customers can say goodbye to the quarterly downtime of their IBM Storage Insights server. Additional pods are deployed during the upgrade process so no end user downtime is observed. For example, suppose there are 10 pods configured for a micro-service. During an upgrade, Kubernetes will spin up an 11th pod with the new code base and, once that pod is up, destroy one of the down-level pods. During this time, any of the available pods, whether at the old or the new code level, will respond to requests. This process continues until all pods are at the latest level.
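The surge-then-retire loop described above can be sketched in a few lines. This is a simulation of the pattern (equivalent to a Kubernetes rolling update with one surge pod and zero unavailable pods), not our deployment tooling itself:

```python
def rolling_upgrade(pods, new_version):
    """Surge-style rolling upgrade: spin up one new-version pod, then
    retire one down-level pod, so capacity never drops below len(pods).
    Yields the pod list after every step."""
    pods = list(pods)
    remaining_old = list(pods)
    while remaining_old:
        pods.append(new_version)            # the "11th" pod, new code
        yield list(pods)
        pods.remove(remaining_old.pop())    # destroy one down-level pod
        yield list(pods)

steps = list(rolling_upgrade(["v1"] * 10, "v2"))

# Capacity never falls below the configured 10 pods during the upgrade,
# and every pod ends up on the new level:
assert all(len(s) >= 10 for s in steps)
assert steps[-1] == ["v2"] * 10
```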
The ease of upgrades along with no end user down time enables the team to deploy one-off fixes or updates outside of the traditional quarterly upgrade cycle of IBM Storage Insights. Since going live with the final multi-tenant architecture earlier this year, we have already utilized this approach to apply select critical updates to the cluster.
As we process user data, the data must flow from one micro-service to another. For example, once a probe payload is received from the Data Collector, it needs to be processed by one micro-service, sent to another micro-service to persist the data in the database, then have its job status updated via another micro-service that monitors jobs. For this communication, we are utilizing Kafka. Kafka is a distributed streaming platform where producers publish messages to topics and consumers subscribe to those topics to read them.
The big benefit of Kafka is that it provides message fault tolerance. If one container starts to process a message but crashes before it can commit it, Kafka will detect this situation and give that message to another container of the same type for processing.
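This at-least-once behavior hinges on the consumer committing its offset only after processing succeeds. The following stdlib-only toy model (not the Kafka client API) illustrates why a crash before the commit means the message is simply handed to the next consumer in the group:

```python
class TopicPartition:
    """Toy model of Kafka at-least-once delivery on one partition:
    a consumer that crashes before committing its offset leaves the
    same message available for the next consumer."""
    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next uncommitted message

    def poll(self):
        return self.messages[self.committed]

    def commit(self):
        self.committed += 1

topic = TopicPartition(["probe-payload-1", "probe-payload-2"])

msg = topic.poll()      # container A reads the message...
# ...and crashes here, before it can call topic.commit().

# Kafka hands the uncommitted message to another container of the same
# type; it sees the exact same payload:
assert topic.poll() == "probe-payload-1"

topic.commit()          # container B processes it and commits
assert topic.poll() == "probe-payload-2"
```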
For the IBM Storage Insights database, we switched from Db2 to a NoSQL (Not Only SQL) database. We evaluated many NoSQL databases and in the end decided on Cassandra. NoSQL databases like Cassandra offered several benefits over Db2 in a multi-tenant environment; the key ones are highlighted below:
- With all tenants using the same database, Db2 would struggle to handle the significant volume of data and concurrent access. NoSQL databases are designed to handle data at this scale.
- Db2 scales up, whereas NoSQL databases scale out. If Db2 hits a scaling limit, scaling up means moving the database to a more powerful server or larger hard drives, which requires downtime for the end user. NoSQL databases, on the other hand, scale out: you simply add another node to the database to absorb the extra workload. Additionally, scaling out a NoSQL database can be done live with no end user downtime.
- Cassandra is a multi-node database, meaning there is no single point of failure that results in end user downtime if any one node goes down. It also has remote cluster support, so data is automatically replicated to another data center.
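The reason Cassandra can scale out live is that data placement is driven by consistent hashing of the partition key, so adding a node only relocates the slice of data that falls into the new node's token range. A much-simplified sketch follows (real Cassandra uses virtual nodes and the Murmur3 partitioner; MD5 and single tokens per node stand in here to keep the example stdlib-only):

```python
import bisect
import hashlib

def token(key):
    """Hash a partition key onto the ring (a stand-in for Murmur3)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring: each node owns the token range up
    to its position, so adding a node moves only one slice of keys."""
    def __init__(self, nodes):
        self.tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        positions = [t for t, _ in self.tokens]
        i = bisect.bisect(positions, token(key)) % len(positions)
        return self.tokens[i][1]

keys = [f"tenant-{i}:perf-data" for i in range(1000)]
before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])  # live scale-out

# Every key that changed owner moved to the new node; all other keys
# stayed where they were, so the cluster keeps serving reads and writes.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
assert all(after.owner(k) == "node-d" for k in moved)
assert len(moved) < len(keys)
```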
Monitoring such a large environment has its challenges as well. Imagine all the individual monitoring needed: hardware components, Kubernetes pods, processing metrics, Kafka messages, traces, customer job activity, payload processing, user requests, and more. As the new code was developed for the multi-tenant architecture, key metrics were added to monitor all these areas. The metrics are stored in real time in a Prometheus time-series database, and a slew of Grafana dashboards were created to visualize the data. These dashboards are used internally by our Development, DevOps, and Support teams as we develop, monitor, maintain, and debug the IBM Storage Insights product across all users.
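Most of those dashboard panels boil down to rate calculations over monotonically increasing counters. As a flavor of what a panel computes, here is a stdlib-only sketch in the spirit of PromQL's rate() function (the sample data is invented, and this is not the PromQL engine itself):

```python
def rate(samples, window_s):
    """Per-second rate of a monotonically increasing counter over the
    trailing window, in the spirit of PromQL's rate().
    samples: list of (unix_ts, counter_value), oldest first."""
    cutoff = samples[-1][0] - window_s
    window = [s for s in samples if s[0] >= cutoff]
    (t0, v0), (t1, v1) = window[0], window[-1]
    return (v1 - v0) / (t1 - t0)

# A hypothetical payloads-processed counter scraped every 15 seconds:
samples = [(0, 0), (15, 30), (30, 60), (45, 120), (60, 180)]
print(rate(samples, 60))  # -> 3.0 payloads per second over the last minute
```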
The new monitoring capability changed how we debug problems. Previously, traces were one of the first things we looked at to see if there were any problems to resolve. Now, with the sheer volume of traces coming in across all users, that is a daunting task. Instead, the team can visually spot when problems or anomalies occur in the Grafana dashboards and use that as the first step to narrow the problem down to the specific micro-service, time, tenant, and device.
Now that the new architecture is in place, the team had to migrate the data stored in each single-tenant IBM Storage Insights Db2 database and import it into Cassandra. There could be up to a year's worth of historical data in each Db2 database, which takes some time to migrate to Cassandra. To minimize downtime during the cut-over from a single-tenant IBM Storage Insights instance to the multi-tenant one, the team performs a pre-migration of the static historical data ahead of the actual cut-over. This pre-migration occurs behind the scenes while the single-tenant IBM Storage Insights instance operates as normal.
When the time comes to cut over a single-tenant IBM Storage Insights instance to the multi-tenant offering, the single-tenant server is shut down during the upgrade window, and any remaining historical data is migrated along with the current IBM Storage Insights data, configuration, and systems' probe data. When the server comes back up after the maintenance window, it is live on the multi-tenant infrastructure.
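The two-phase pattern, bulk pre-migration followed by a small delta copy at cut-over, is what keeps the maintenance window short. A minimal sketch of the idea (the function and row names are hypothetical, not our migration tooling):

```python
def migrate(db2_rows, cassandra, checkpoint=None):
    """Copy rows newer than the checkpoint into the destination and
    return the new checkpoint (the newest timestamp seen).
    db2_rows: list of (timestamp, row); cassandra: destination list."""
    for ts, row in db2_rows:
        if checkpoint is None or ts > checkpoint:
            cassandra.append((ts, row))
    return max(ts for ts, _ in db2_rows)

db2 = [(1, "old-perf"), (2, "old-capacity")]
cassandra = []

# Phase 1: pre-migrate the static historical data while the
# single-tenant instance is still running as normal.
ckpt = migrate(db2, cassandra)

# The instance keeps operating; new rows accumulate in Db2.
db2 += [(3, "new-perf")]

# Phase 2: at cut-over, only the delta since the checkpoint is copied,
# so the maintenance window covers a day of data rather than a year.
migrate(db2, cassandra, ckpt)
assert cassandra == [(1, "old-perf"), (2, "old-capacity"), (3, "new-perf")]
```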
In conclusion, the development team is very excited about the changes made to IBM Storage Insights to make it a true multi-tenant SaaS offering. The team has worked tremendously hard to get to this point, and it is somewhat anticlimactic to have all the changes go mostly unnoticed by our users. Hopefully, the changes you do notice are for the better: improved performance, a more robust product, and no downtime during upgrades.