Using client quotas with IBM Event Streams

By Dale Lane posted Sun February 26, 2023 10:23 AM

  

In this post, I want to highlight a feature that I often see under-used in IBM Event Streams, and show how you can easily give it a try.

Kafka can enforce quotas to limit the impact that client applications can have on your cluster. To quote the Kafka documentation:

It is possible for producers and consumers to produce/consume very high volumes of data or generate requests at a very high rate and thus monopolize broker resources, cause network saturation and generally DOS other clients and the brokers themselves.

Having quotas protects against these issues and is all the more important in large multi-tenant clusters where a small set of badly behaved clients can degrade user experience for the well behaved ones.

In fact, when running Kafka as a service this even makes it possible to enforce API limits according to an agreed upon contract.

The Event Streams documentation adds:

... quotas protect from any single client producing or consuming significantly larger amounts of data than the other clients ... This prevents issues with broker resources not being available to other clients, DoS attacks on the cluster, or badly behaved clients impacting other users of the cluster

There are different types of quotas: quotas based on a network bandwidth usage threshold (available for both producers and consumers) and quotas based on CPU utilization.
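In an Event Streams KafkaUser (more on that below), these quota types map to fields under spec.quotas. As a rough sketch, with the values here purely illustrative:

spec:
  quotas:
    producerByteRate: 1048576   # network bandwidth quota: bytes/second each broker accepts from this user's producers
    consumerByteRate: 2097152   # network bandwidth quota: bytes/second each broker sends to this user's consumers
    requestPercentage: 55       # CPU-based quota: percentage of broker request-handling time this user may consume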

In this post, I'll use a network bandwidth usage threshold for producer applications as my example, but everything I show applies to other quota types as well.

To demonstrate the way this works, I started with the kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh tools that come with Apache Kafka.

I started up six producers and six consumers, all pointing at the same Kafka topic. Each of the six producers sent 1,000,000 128-byte messages as fast as it could, with no throttling, and with acks=all set so that I could verify the messages were being produced successfully.
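Each producer run was an invocation along these lines. This is a sketch rather than the exact command (the topic name and the connection.properties file holding the bootstrap address and credentials are placeholders); the real scripts are in the repository linked at the end of this post.

# sketch of a single producer run - topic name and connection.properties are placeholders
./bin/kafka-producer-perf-test.sh \
  --topic quota-test-topic \
  --num-records 1000000 \
  --record-size 128 \
  --throughput -1 \
  --producer-props acks=all \
  --producer.config connection.properties

The --throughput -1 flag disables the tool's own client-side throttling, so the only thing limiting the producers is the cluster itself.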

The summary report from the six producers said:

1000000 records sent, 13647.405628 records/sec (1.67 MB/sec), 15097.44 ms avg latency, 28727.00 ms max latency, 14045 ms 50th, 27803 ms 95th, 28416 ms 99th, 28617 ms 99.9th.
1000000 records sent, 13649.454704 records/sec (1.67 MB/sec), 15184.59 ms avg latency, 28819.00 ms max latency, 14072 ms 50th, 28081 ms 95th, 28602 ms 99th, 28749 ms 99.9th.
1000000 records sent, 13681.762211 records/sec (1.67 MB/sec), 15020.49 ms avg latency, 28231.00 ms max latency, 13950 ms 50th, 27848 ms 95th, 28072 ms 99th, 28171 ms 99.9th.
1000000 records sent, 13761.972916 records/sec (1.68 MB/sec), 14979.00 ms avg latency, 28559.00 ms max latency, 13921 ms 50th, 27947 ms 95th, 28281 ms 99th, 28453 ms 99.9th.
1000000 records sent, 13876.169067 records/sec (1.69 MB/sec), 14781.38 ms avg latency, 28420.00 ms max latency, 13916 ms 50th, 27910 ms 95th, 28173 ms 99th, 28333 ms 99.9th.
1000000 records sent, 13627.691469 records/sec (1.66 MB/sec), 15077.66 ms avg latency, 28504.00 ms max latency, 14083 ms 50th, 27925 ms 95th, 28224 ms 99th, 28390 ms 99.9th.

That gave me an idea of what my current Event Streams cluster setup can comfortably do.

Let's start adding quotas to slow things down.

I added a producer network bandwidth threshold quota to the credentials I was using for the performance test.

For example:

spec:
  quotas:
    producerByteRate: 250000

See this setting in the context of my KafkaUser client credentials configuration
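For reference, the whole KafkaUser might look something like the sketch below. The apiVersion, names and cluster label are illustrative, so match them to your own installation:

apiVersion: eventstreams.ibm.com/v1beta2
kind: KafkaUser
metadata:
  name: perf-test-user
  labels:
    # label identifying the Event Streams cluster this user belongs to
    eventstreams.ibm.com/cluster: my-event-streams-cluster
spec:
  authentication:
    type: scram-sha-512
  quotas:
    # each broker limits this user's producers to 250,000 bytes/second
    producerByteRate: 250000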

This told each of the Kafka brokers to restrict my producer applications to producing at most 250,000 bytes per second.

Then I ran the same test again.

This time, aside from the most obvious difference that the test took much longer to run, the summary report from the six producers said:

1000000 records sent, 2122.578933 records/sec (0.26 MB/sec), 101142.87 ms avg latency, 121005.00 ms max latency, 114136 ms 50th, 120002 ms 95th, 120014 ms 99th, 120050 ms 99.9th.
1000000 records sent, 2121.507204 records/sec (0.26 MB/sec), 100072.10 ms avg latency, 120916.00 ms max latency, 112556 ms 50th, 120002 ms 95th, 120026 ms 99th, 120060 ms 99.9th.
1000000 records sent, 2120.823304 records/sec (0.26 MB/sec), 99296.82 ms avg latency, 121015.00 ms max latency, 111734 ms 50th, 120002 ms 95th, 120011 ms 99th, 120029 ms 99.9th.
1000000 records sent, 2121.061719 records/sec (0.26 MB/sec), 99230.30 ms avg latency, 121110.00 ms max latency, 111043 ms 50th, 120002 ms 95th, 120057 ms 99th, 120144 ms 99.9th.
1000000 records sent, 2121.493701 records/sec (0.26 MB/sec), 98743.10 ms avg latency, 121152.00 ms max latency, 111780 ms 50th, 120001 ms 95th, 120006 ms 99th, 120031 ms 99.9th.
1000000 records sent, 2173.161885 records/sec (0.27 MB/sec), 95485.01 ms avg latency, 121043.00 ms max latency, 108325 ms 50th, 120001 ms 95th, 120009 ms 99th, 120041 ms 99.9th.

Notice that the same messages were still all successfully sent. Setting this quota doesn't prevent my applications from producing messages; the Kafka brokers just slow them down by delaying the responses they send back.

To illustrate this further, I re-ran the test with a variety of different producerByteRate values.

You can see the shell output of how I did this, which byte rates I tried, and the results showing the impact in docs/log.txt.
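If you want to script a similar sweep against your own cluster, one way to do it (not the exact approach used here - see the log above) is to patch the quota on the KafkaUser between runs. The user name, namespace and perf-test wrapper script in this sketch are placeholders:

# sweep through a few quota values, re-running the producer test each time
for rate in 100000 250000 500000 1000000; do
  kubectl patch kafkauser perf-test-user -n event-streams --type merge \
    -p "{\"spec\":{\"quotas\":{\"producerByteRate\":${rate}}}}"
  sleep 60                       # allow the operator to reconcile the change
  ./run-producer-perf-test.sh    # hypothetical wrapper around the six producer runs
done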

Throwing the summary reports into Excel and generating default graphs for a few of the columns makes it easier to see the impact.


Showing the number of records sent per second by each producer with different producer quota thresholds set.


Showing the total throughput of each producer with different producer quota thresholds set.


Showing the average latency for each producer with different producer quota thresholds set.

As you can see, this is very easy to set, and has an immediate impact. If you're running an Event Streams cluster that is used by different teams or projects, I recommend applying quotas to protect those teams from each other.

One last pointer: I mentioned above that quotas don't prevent applications from working; they just slow the applications down if they exceed their quota. So how can you tell when this is happening?

The easiest place to check is the Producers tab for your topic in the Event Streams admin interface.


Scroll down to the list of applications, and if everything is happy and none of the applications are being throttled by their quota, you'll see "Quota reached 0 times".


To show how this will change, I set a very very small producerByteRate quota and then re-ran my test one final time.

Checking back on the list of applications, I can see that my applications are trying to exceed this quota.


This makes it easy for you to keep an eye on whether you've set the quota to a level that is sufficient for what your project team's applications are trying to do.

To save you having to constantly watch the UI, Event Streams also makes these metrics available through the Grafana/Prometheus monitoring stack, so you could configure an alert to send you a notification if the "quota reached" metric value starts to increase.
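For example, a PrometheusRule along the lines of the sketch below could raise that notification. The metric name in the expression is a placeholder - substitute the quota/throttling metric that your Event Streams monitoring configuration actually exposes:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-client-quota-alerts
spec:
  groups:
    - name: kafka-client-quotas
      rules:
        - alert: KafkaProducerQuotaReached
          # placeholder metric name - replace with the "quota reached" metric
          # exposed by your Event Streams monitoring setup
          expr: rate(eventstreams_quota_reached_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Producer applications are being throttled by their client quota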

I've put the details of how I ran these performance tests in github.com/dalelane/event-streams-quotas-demo so you can see how you could run the same tests against your own Event Streams cluster.

For more information about any of this, please see the Kafka documentation on quotas or the Event Streams documentation on how to set your client quotas.
