Instana

Migrating from ZooKeeper to ClickHouse Keeper on an operator-based cluster

By Asif Akhtar posted Tue September 09, 2025 01:54 PM

  

As ClickHouse adoption grows, many teams are moving away from ZooKeeper to the newer ClickHouse Keeper. ClickHouse Keeper is a lightweight, purpose-built coordination service designed to handle ClickHouse workloads without the operational overhead of a full ZooKeeper cluster.

Why migrate to ClickHouse Keeper?

  • ClickHouse Keeper can run embedded within ClickHouse or as a standalone service, simplifying deployment and management within a ClickHouse ecosystem.

  • ClickHouse Keeper resolves known ZooKeeper issues, such as the default 1 MB limit on packet and node data size and the ZXID overflow problem, which can force ZooKeeper restarts.

  • ClickHouse Keeper is implemented in C++ and offers better performance with significantly lower resource consumption (CPU, memory, disk I/O) than ZooKeeper, which is Java-based and prone to issues such as full GC pauses.

In this guide, I will walk through how to migrate from ZooKeeper to ClickHouse Keeper on a GitOps-managed platform (Flux). We’ll cover the prerequisites, deployment, migration steps, and validation checks to ensure a smooth transition.

Prerequisites

Before starting, make sure you’re on supported versions:

  • IBM ClickHouse Operator: v1.2.5

  • Helm Chart: v1.2.0

  • ZooKeeper: v3.4 or higher (to run the converter utility).

You’ll also need the clickhouse-keeper-converter binary inside your ZooKeeper pod.
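As a rough sketch, the binary can be placed in the pod with kubectl cp. The namespace and pod name for ZooKeeper below are assumptions (adjust them to your cluster), the target directory matches the path used in Step 3, and kubectl cp needs tar to be available inside the container. How you obtain the binary is outside the scope of this sketch.

```shell
# Assumed names, not taken from a real cluster: adjust before use.
ZK_NS="datastores"
ZK_POD="zookeeper-0"
# Directory used by the converter command in Step 3.
CONVERTER_DIR="/tmp/clickhouse-keeper-converter"

# Copies a local converter binary ($1) into the ZooKeeper pod
# and marks it executable.
copy_converter() {
  kubectl exec -n "$ZK_NS" "$ZK_POD" -- mkdir -p "$CONVERTER_DIR" &&
  kubectl cp "$1" "$ZK_NS/$ZK_POD:$CONVERTER_DIR/clickhouse-keeper-converter" &&
  kubectl exec -n "$ZK_NS" "$ZK_POD" -- chmod +x "$CONVERTER_DIR/clickhouse-keeper-converter"
}

# Usage:
#   copy_converter ./clickhouse-keeper-converter
```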

Step 1: Deploy the ClickHouse Operator with Helm

In a GitOps setup, we manage operators with Flux + Helm.

Below is an example HelmRelease for the ClickHouse Operator pinned to the required versions:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: clickhouse
  namespace: flux-system
spec:
  interval: 10m  # required by the HelmRelease spec; 10m is an example value
  targetNamespace: ch-system
  chart:
    spec:
      chart: ibm-clickhouse-operator
      version: "v1.2.0"
      sourceRef:
        kind: HelmRepository
        name: clickhouse
  values:
    installCRDs: true
    image:
      repository: icr.io/clickhouse/clickhouse-operator
      tag: v1.2.5

Note: This step is only needed if the clickhouse-operator is older than v1.2.5. If it is already at v1.2.5 or above, you can skip it.
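To decide whether the upgrade is needed, you can compare the deployed operator image tag with v1.2.5. The sketch below assumes the operator runs as a Deployment named clickhouse-operator in the ch-system namespace (the Deployment name is an assumption) and that the image is referenced by tag rather than by digest. It relies on sort -V from GNU coreutils for the version comparison.

```shell
# Assumed names: adjust to your cluster.
OPERATOR_NS="ch-system"
OPERATOR_DEPLOY="clickhouse-operator"
REQUIRED_TAG="v1.2.5"

# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Uses `sort -V` (version sort) to find the lower of the two versions.
version_ge() {
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
}

# Prints the image tag of the first container in the operator Deployment.
deployed_operator_tag() {
  kubectl get deployment "$OPERATOR_DEPLOY" -n "$OPERATOR_NS" \
    -o jsonpath='{.spec.template.spec.containers[0].image}' | sed 's/.*://'
}

# Usage:
#   if version_ge "$(deployed_operator_tag)" "$REQUIRED_TAG"; then
#     echo "operator is recent enough, Step 1 can be skipped"
#   fi
```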

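Step 2: Deploy ClickHouse Keeper

Create a ClickHouseKeeperInstallation resource. The example below defines a single-replica Keeper cluster listening on port 2181: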

apiVersion: clickhouse-keeper.altinity.com/v1
kind: ClickHouseKeeperInstallation
metadata:
  name: clickhouse-keeper
  namespace: datastores
spec:
  configuration:
    clusters:
      - name: keeper
        layout:
          replicasCount: 1
    settings:
      keeper_server/tcp_port: "2181"

Deploy this with Flux or Kustomize. Once applied, verify:

kubectl get pods -n datastores -l app=clickhouse-keeper

kubectl get svc -n datastores | grep clickhouse-keeper

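Step 3: Convert ZooKeeper Metadata

Run the converter inside the ZooKeeper pod, pointing it at the ZooKeeper data directory (/data/version-2 in this setup):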

/tmp/clickhouse-keeper-converter/clickhouse-keeper-converter \
  --zookeeper-logs-dir /data/version-2 \
  --zookeeper-snapshots-dir /data/version-2 \
  --output-dir /tmp/ckeeper-snapshots

This generates Keeper-compatible snapshots in /tmp/ckeeper-snapshots. Copy them into the Keeper pod’s snapshot directory (the path is shown in Step 4).
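One way to move the snapshots is to stage them on your workstation and then push each file into the Keeper pod. This is only a sketch: the pod names are assumptions, the source directory is the --output-dir from the converter command, and the destination is the snapshot path from Step 4. Files are copied one by one so that they land directly in the snapshot directory rather than in a nested subdirectory. You may also need to fix file ownership inside the Keeper pod afterwards, depending on the user the container runs as.

```shell
# Assumed names: adjust to your cluster.
ZK_NS="datastores"
ZK_POD="zookeeper-0"
KEEPER_NS="datastores"
KEEPER_POD="clickhouse-keeper-0"
# Paths taken from the converter command and from Step 4.
CONVERTED_DIR="/tmp/ckeeper-snapshots"
KEEPER_SNAPSHOT_DIR="/var/lib/clickhouse-keeper/coordination/snapshots/store"

# Pulls the converted snapshots to a local temp dir, then pushes each
# file into the Keeper snapshot directory.
transfer_snapshots() {
  staging="$(mktemp -d)" || return 1
  # Pod-to-local copy: the directory contents end up under $staging.
  kubectl cp "$ZK_NS/$ZK_POD:$CONVERTED_DIR" "$staging" || return 1
  for f in "$staging"/*; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing was copied
    kubectl cp "$f" "$KEEPER_NS/$KEEPER_POD:$KEEPER_SNAPSHOT_DIR/$(basename "$f")" || return 1
  done
  rm -rf "$staging"
}

# Usage:
#   transfer_snapshots
```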

Step 4: Restart Keeper with Snapshots

Ensure snapshot_storage_path in the Keeper configuration points to the directory that holds the converted snapshots:

/var/lib/clickhouse-keeper/coordination/snapshots/store

Remove any old snapshots from that directory so that only the converted ones remain, restart Keeper, and confirm that the metadata matches ZooKeeper:

SELECT * FROM system.zookeeper WHERE path = '/';
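The restart and a basic health check could be scripted roughly as follows. This sketch makes several assumptions: the StatefulSet and pod names are hypothetical, the snapshot directory lives on a persistent volume (otherwise a restart would discard the copied snapshots), and nc is available in the Keeper image. If the operator owns the StatefulSet, prefer whatever restart mechanism it provides. Removing old snapshots is left out on purpose, since which files are safe to delete depends on your setup. The health check uses Keeper's four-letter-word command ruok, which answers imok when the server is running.

```shell
# Assumed names: adjust to your cluster.
KEEPER_NS="datastores"
KEEPER_STS="clickhouse-keeper"
KEEPER_POD="clickhouse-keeper-0"
KEEPER_PORT="2181"   # tcp_port from the Keeper manifest in Step 2

# Restarts the Keeper StatefulSet and waits until the rollout completes.
restart_keeper() {
  kubectl rollout restart "statefulset/$KEEPER_STS" -n "$KEEPER_NS" &&
  kubectl rollout status "statefulset/$KEEPER_STS" -n "$KEEPER_NS" --timeout=300s
}

# Succeeds when Keeper answers "imok" to the "ruok" command.
keeper_is_ok() {
  reply="$(kubectl exec -n "$KEEPER_NS" "$KEEPER_POD" -- \
    sh -c "echo ruok | nc localhost $KEEPER_PORT")"
  [ "$reply" = "imok" ]
}

# Usage:
#   restart_keeper && keeper_is_ok && echo "Keeper is up"
```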

Step 5: Update ClickHouse to Use Keeper

Point the zookeeper section of your ClickHouse installation manifest at the Keeper service:

zookeeper:
  nodes:
    - host: clickhouse-keeper-headless.datastores.svc.cluster.local
      port: 2181

Flux will reconcile and trigger a rolling restart (ensure spec.restart: "RollingUpdate" is enabled). Perform this during a maintenance window and pause heavy DDL jobs.

Step 6: Validate the Cluster

Run these checks after the switchover:

Keeper connectivity

SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 10;

Replication queues

SELECT database, table, count(*) FROM system.replication_queue GROUP BY database, table;

Replication delay (age in seconds of the oldest queued entry per table)

SELECT database, table, max(abs(now() - create_time)) FROM system.replication_queue GROUP BY database, table;

Read-only replicas (a healthy cluster returns no rows)

SELECT * FROM system.replicas WHERE is_readonly = 1;
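If you prefer to poll rather than run the checks by hand, a loop along these lines can wait until the replication queue is empty and no replica is read-only. It is a sketch under assumptions: the ClickHouse pod name is hypothetical, clickhouse-client is expected to be available in that pod, and only the one replica you point it at is checked. On a busy cluster the queue may never reach exactly zero, so treat the zero threshold as a placeholder.

```shell
# Assumed names: adjust to your cluster.
CH_NS="datastores"
CH_POD="chi-clickhouse-0-0-0"

# Runs a single SQL query ($1) through clickhouse-client inside the pod.
ch_query() {
  kubectl exec -n "$CH_NS" "$CH_POD" -- clickhouse-client --query "$1"
}

# Polls until the replication queue is empty and no replica is read-only.
# $1 = maximum attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_replication() {
  attempts="${1:-30}"
  delay="${2:-10}"
  while [ "$attempts" -gt 0 ]; do
    queued="$(ch_query "SELECT count() FROM system.replication_queue")"
    readonly_count="$(ch_query "SELECT count() FROM system.replicas WHERE is_readonly = 1")"
    if [ "$queued" = "0" ] && [ "$readonly_count" = "0" ]; then
      echo "replication healthy"
      return 0
    fi
    echo "waiting: $queued queued entries, $readonly_count read-only replicas"
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_replication 30 10 || echo "replication did not settle in time"
```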

Step 7: Decommission ZooKeeper

Once ClickHouse is stable and fully synced with Keeper, you can scale down ZooKeeper pods and remove them after an observation period.

Conclusion

Migrating from ZooKeeper to ClickHouse Keeper simplifies your stack and brings cluster coordination closer to ClickHouse itself. Using GitOps principles ensures the migration is auditable, repeatable, and automated.

If you’re planning this migration in production, test in staging first, schedule it during low-traffic windows, and monitor replication closely.


#Documentation
#Infrastructure
#Kubernetes
#Database
#SRE