IBM Event Streams and IBM Event Automation


Identifying incompatible Kafka APIs by using Event Gateway metrics

By KIT DAVIES posted 23 days ago

  

The Kafka protocol consists of many types of operations, known as API keys, that define the interactions a client may have with a broker. For example, consuming or producing a message, joining a group, committing an offset, and so on.

Each API key has a message format that is versioned independently of the other API keys, and one of the first interactions a client has with a broker is the ApiVersions operation. The client uses this to query the set of API versions offered by the broker and selects the ones it supports, allowing the two to communicate compatibly.

In each Kafka release, the Java client libraries provided by the Kafka project are kept up-to-date with the latest API versions defined in that release. However, there are quite a few third-party client libraries, over which the Kafka team have no control, and development on some of these has slowed or even stopped completely. Therefore, the API versions that they support often lag behind the latest provided by the brokers they are connecting to. Maintaining backward compatibility with old and outdated versions to serve these third-party clients has become a significant burden for the Kafka project developers. So, in Kafka 4.0, several of the API keys have had their support for the oldest versions pruned. See the table below for information about the pruned versions for each API key.

The Kafka team were careful to avoid impacting the most widely used third-party libraries where possible, but with Kafka 4.0, there is a possibility that an application that uses an older version of one of these libraries might not be able to find API versions that are compatible with the newer brokers. What is needed is a way for a broker, or an Event Gateway acting in its place, to identify clients that use older and potentially incompatible versions before it is upgraded to 4.0. This is the purpose of the new client_api_versions_gauge metric introduced in version 11.6.3 of IBM Event Endpoint Management.

For each call from a client, the client_api_versions_gauge metric records the API version used for the call, and includes the API key and client id as attributes. We can use a tool such as Prometheus to scan versions and filter by API key, client id, or both to find which clients might become incompatible after an upgrade to Kafka 4.0. The following table shows an example with raw data from the metric:
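As a sketch, the raw data can be retrieved in the Prometheus expression browser with a query consisting of just the metric name (here including the eemgateway_ prefix that our metrics collector prepends, as described below):

```promql
# Raw data: one instance per combination of attributes,
# with the metric value holding the API version used
eemgateway_client_api_versions_gauge
```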

Each row in the table is an instance of the metric, showing the attributes for that instance, including the API key and client id. Ignore the other attributes (exported_job, instance, and job), as they are injected by the reporting component. You can also see that, in our case, the metrics collector has prepended the component name eemgateway_ to the metric name. We will need to include this prefix when building our queries.

The numbers on the right are the value of the metric for each combination of attributes. For example, in the second row, we can see client producer-1 was using version 4 of the ApiVersions API key and, at the bottom, version 12 of the Metadata API key.

We can display a simplified version of the table by adjusting the query:
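One way to produce such a simplified view is to aggregate away the injected labels, keeping only the API key and client id. The sketch below assumes the API key attribute is exposed as a label named apiKey (the clientId label appears in the filters later in this article); max() preserves the version value while discarding the other labels:

```promql
# Keep only the apiKey and clientId labels
max by (apiKey, clientId) (eemgateway_client_api_versions_gauge)
```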

This only displays the API key and client id, making things a little easier to inspect. The metric becomes more useful when we start to filter the data. For example, let’s look at what versions are being used by a particular client id:
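In PromQL, filtering on a single client id looks like this:

```promql
# All API key versions reported for the client id "producer-1"
eemgateway_client_api_versions_gauge{clientId="producer-1"}
```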

Here, by adding the filter {clientId="producer-1"} to the query, we can see the versions for each API key used by an application using that client id. If we have clients that have ids matching a pattern, for example, because they belong to the same application or group of applications, we can use a regular expression to match ids as follows:
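As a sketch, assuming the related applications all use client ids beginning with producer- (the pattern here is illustrative), the regular-expression form of the filter might look like:

```promql
# Match any client id beginning with "producer-"
eemgateway_client_api_versions_gauge{clientId=~"producer-.*"}
```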

Note the use of =~ in the filter expression in the query. This defines a match by using regular expressions. Full details of the Prometheus query language are outside the scope of this article, but are available in the Prometheus documentation at https://prometheus.io/docs/prometheus/latest/querying/basics/

Now we come to the goal of this article: finding clients that might be using old API versions. First, let us find what versions of a particular API key the clients are using:
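Assuming the API key attribute is exposed as a label named apiKey, a query for the versions of the Fetch API key in use might look like:

```promql
# Versions of the Fetch API key in use, per client
eemgateway_client_api_versions_gauge{apiKey="Fetch"}
```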

We can see some are operating at version 17, but there are two still using version 11. Just as an example, we might know there is a problem with clients operating at versions 11 or earlier, so how do we isolate those? The answer is by applying a filter to the API version in the query as follows:
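This can be sketched by adding a value comparison to the previous query (again assuming an apiKey label):

```promql
# Only clients using Fetch at version 11 or earlier
eemgateway_client_api_versions_gauge{apiKey="Fetch"} <= 11
```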

By specifying <= 11 in our query, we are saying we only want results where the version of Fetch is within the problematic range. This is the information we require to pinpoint the applications that have to be upgraded with newer Kafka client libraries. Now we can use the client id to identify the application that needs updating, in this example, quotaControl.

An important note is that we are relying on the client id value to identify such applications. It should be apparent that using meaningful client ids is very important for fully understanding Kafka client activity. Without that, the metric loses much of its usefulness.

Other uses

We can use the client_api_versions_gauge metric for other purposes too. Because it shows the pattern of operations performed by a client, by observing the number of instances of each API key, we can get an idea of which clients are interacting efficiently with the gateway, and which are not. For instance, a badly written consuming client might perform a relatively expensive authentication and group allocation before each fetch of records. A well-written consumer would authenticate and join a group once, then perform all required fetches before finally disconnecting. We can analyse such patterns as follows:
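A sketch of such an analysis, counting the samples recorded for each API key and client id over the last five minutes (narrowed here to two illustrative client ids):

```promql
# Samples per API key and client id over the last 5 minutes
count_over_time(
  eemgateway_client_api_versions_gauge{clientId=~"goodConsumer|badConsumer"}[5m]
)
```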

In the previous table, we are using the aggregate function count_over_time() to count instances of each API key and client id over a 5-minute period (note the time range specifier [5m]). The client labelled goodConsumer performs the authentication, setup and close operations only once for its 20 fetches. The client labelled badConsumer performs all operations for every fetch, indicating possibly questionable design in the client application software, showing where we should focus our efforts if we want to improve our application efficiency and reduce network load.

Conclusion

Hopefully in this article I have shown you how to use the Event Gateway metrics to help you get a clear picture of what operations your Kafka client applications are performing, and what version levels they are operating at. This will help you keep your applications secure and up-to-date, and avoid unwanted and costly system failures.

Notes:

Summary of Kafka API keys and pruned versions

All ranges are inclusive.

Table 1: Pruned Kafka API versions

API key                  Pruned versions
-----------------------  ---------------
Produce                  V0-V2
Fetch                    V0-V3
ListOffset               V0
Metadata                 none
OffsetCommit             V0-V1
OffsetFetch              V0
FindCoordinator          none
JoinGroup                none
Heartbeat                none
LeaveGroup               none
SyncGroup                none
DescribeGroups           none
ListGroups               none
SaslHandshake            none
ApiVersions              none
CreateTopics             V0-V1
DeleteTopics             V0
DeleteRecords            none
InitProducerId           none
OffsetForLeaderEpoch     V0-V1
AddPartitionsToTxn       none
AddOffsetsToTxn          none
EndTxn                   none
WriteTxnMarkers          none
TxnOffsetCommit          none
DescribeAcls             V0
CreateAcls               V0
DeleteAcls               V0
DescribeConfigs          V0
AlterConfigs             none
AlterReplicaLogDirs      V0
DescribeLogDirs          V0
SaslAuthenticate         none
CreatePartitions         none
CreateDelegationToken    V0
RenewDelegationToken     V0
ExpireDelegationToken    V0
DescribeDelegationToken  V0
DeleteGroups             none

Full details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-896%3A+Remove+old+client+protocol+API+versions+in+Kafka+4.0
