Like any other project, observability doesn’t just happen. There are various requirements that have to be planned for. One of these is the volume of generated data and the impact of that volume on things like overall OpEx, bandwidth, and the ability of your monitoring tools and teams to deal with it. The amount of data generated by metrics, logs and traces has a direct impact on the ongoing cost of operations. Most cloud platform vendors charge by data volume (for example, Amazon charges up to $0.09 / GB for egress data on AWS), which means your observability plans might create additional egress charges from your cloud provider.
It’s clear that observability will create operational costs. What’s unclear is just how much cost it will generate, and whether that cost will be more than the value received from creating observability in the first place. To answer these two questions, I conducted an experiment from which I could glean the overall and differential costs of operating – and then observing – an eCommerce application.
The Observability Data Volume Experiment
We’re trying to discover just how much data is generated by an application with full observability. To find this out, we’ll need a couple of things for our experiment:
- A standard test application that we can run with full observability
- A way to measure the observability data stream volume
For this particular experiment, I used Stan’s Robot Shop, a free sample microservices application provided by Instana. I ran my experiment with a constant load over a twenty-four-hour period.
To maintain a standard benchmark for control within our experiment, I selected key observability technologies:
- Metrics: Prometheus
- Tracing: Jaeger and OpenTracing
- Logs: Fluentd
- EUM: Instana EUM (which is open source)
Now for the experiment:
- The Robot Shop application was deployed to a four-node Kubernetes cluster on GKE. Load was generated using the script that comes with Robot Shop.
- I wrote some additional scripting to scrape the Prometheus endpoints and record the size of the data payloads (a sketch of that script appears just after this list).
- Another script accepted Jaeger tracing spans and EUM beacons, recording the size of the data payloads.
- Fluentd collected all the logs and concatenated them into one flat file. Using the timestamps from the log file, one hour was extracted into a new file, which was then measured.
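For reference, the Prometheus measurement script was little more than a loop that pulls each /metrics endpoint and records the response size. Here is a minimal sketch of that approach in Python – the endpoint addresses and output file name are placeholders, not the actual values from my setup:

```python
import time
import urllib.request

# Placeholder list of Prometheus /metrics endpoints exposed by the containers,
# worker nodes and kube-state-metrics in the cluster.
ENDPOINTS = [
    "http://10.0.0.11:9100/metrics",   # e.g. a node exporter
    "http://10.0.0.12:8080/metrics",   # e.g. kube-state-metrics
]

def scrape_once():
    """Fetch every endpoint once and return the combined payload size in bytes."""
    total = 0
    for url in ENDPOINTS:
        with urllib.request.urlopen(url, timeout=5) as resp:
            total += len(resp.read())
    return total

if __name__ == "__main__":
    with open("metric_payload_sizes.csv", "a") as out:
        while True:
            out.write(f"{int(time.time())},{scrape_once()}\n")
            out.flush()
            time.sleep(10)  # the 10-second sample period discussed below
```

The same size-recording idea applies to the script that received the Jaeger spans and EUM beacons; the only difference is that it listens for incoming payloads instead of polling.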
A note on data granularity: As you may or may not know, Instana collects all metrics at 1-second granularity. Doing this with Prometheus would have severely skewed the experiment results, since Prometheus has none of the optimizations built into the Instana sensors and agents. Thus, I conducted the experiment at a 10-second sample rate for Prometheus metrics. The load generation script produces one request per second to the Robot Shop back-end services.
Observability Data Volume Experiment – The Results
I found the results quite interesting, mainly because they weren’t what I was expecting. I had assumed that the traces would take the biggest chunk of the data and that the total data for a simple Hello World application like the Robot Shop would easily fit inside 100GB. I was way off – let’s see just how wrong I was.
Observability Data Volume: Tracing
At a rate of 1 trace per second, over 24 hours per day and 30 days in a month, the total number of traces is 2.5 million. The average trace size was 66kB. Therefore, the total data size for traces was 161GB. Looks like my estimate of fitting inside 100GB has already been proved wrong.
While tracing can be sampled at source, that would mean throwing away roughly 40% of the data to fit inside the original estimate of 100GB.
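For anyone who does go the sampling route, Jaeger’s probabilistic sampler is the usual head-based option. Below is a minimal sketch assuming the Python jaeger_client library; the service name is hypothetical and the ~60% rate is simply what it would take to bring 161GB of traces down under 100GB:

```python
from jaeger_client import Config

# Probabilistic head-based sampling: keep roughly 60% of traces at the source.
# The service name and sampling rate are illustrative, not from the experiment.
config = Config(
    config={
        "sampler": {"type": "probabilistic", "param": 0.6},
    },
    service_name="robot-shop-cart",  # hypothetical service name
    validate=True,
)
tracer = config.initialize_tracer()
```

Of course, whatever fraction you drop at the source is a fraction of your traces that is simply gone when you need it for troubleshooting.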
Observability Data Volume: EUM
Each back-end call is triggered by a user interaction at the browser, which produces an EUM beacon – conveniently making the number of beacons generated the same as the number of traces – 2.5 million. The average size of an EUM beacon is a lot smaller at 397 bytes (whew!), making our total data size for a month of EUM beacons 1GB.
Observability Data Volume: Logs
For logs, especially when it comes to data volumes, your mileage may vary – depending on your app, configuration settings, etc. The Robot Shop application logs quite a bit at INFO, though not nearly as much as some real-world applications. From the experiment, the log file size for one hour was 5MB, making the total log volume for one month 3.4GB. A lot smaller than I thought it would be.
Observability Data Volume: Metrics
We collected metrics – using Prometheus – from every container, each worker node, and from kube-state-metrics for the cluster, giving a total of 1.1MB per sample period. With a sample every ten seconds, that’s 259,200 samples per month, which results in a total data volume of 285GB. I was genuinely shocked to see that metric data volume not only exceeded trace volume, but exceeded it by nearly 75%.
Total Observability Data Volumes
The grand total across all datasets is 452GB per month for a simple Hello World application running on a small Kubernetes cluster.
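As a sanity check, those totals can be reproduced from the measured averages with a few lines of arithmetic. The sketch below simply re-runs the math using the figures quoted above; the small differences from my reported totals come down to rounding and decimal-versus-binary unit conventions:

```python
# Back-of-the-envelope check of the monthly observability data volumes,
# using the averages measured in the experiment.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30                 # 2,592,000

traces_gb  = SECONDS_PER_MONTH * 66_000 / 1e9         # 1 trace/s at ~66kB each
eum_gb     = SECONDS_PER_MONTH * 397 / 1e9            # 1 beacon/s at 397 bytes
logs_gb    = 5 * 24 * 30 / 1000                       # 5MB of logs per hour
metrics_gb = (SECONDS_PER_MONTH / 10) * 1.1 / 1000    # 1.1MB every 10 seconds

total_gb = traces_gb + eum_gb + logs_gb + metrics_gb
print(f"traces:  {traces_gb:6.1f} GB")                # ~171 GB
print(f"EUM:     {eum_gb:6.1f} GB")                   # ~1 GB
print(f"logs:    {logs_gb:6.1f} GB")                  # ~3.6 GB
print(f"metrics: {metrics_gb:6.1f} GB")               # ~285 GB
print(f"total:   {total_gb:6.1f} GB")                 # ~460 GB
```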
Conclusion
That’s a lot of data, and it will grow roughly linearly as the application gains complexity and the number of requests it processes increases. A real production application would span hundreds of hosts, running hundreds or thousands of containers to service multiple requests per second. It’s easy to extrapolate that the data requirement would reach many terabytes. Certainly something to factor into your calculations when looking at observability platforms.
Alternatively, there is Instana’s simple, all-inclusive pricing per OS instance: no additional charges, no matter how much data you send or how many engineers log in to work with that data. This makes it much easier to predict exactly how much observability is going to cost, keeping you in control of your operational expenses – rather than guessing how much observability data your applications will generate and either paying for a data allowance you don’t use or being surprised by a big bill for overages.