IBM Storage Ceph

Governing the Lakehouse: Building a Zero Trust Architecture with IBM Storage Ceph and Polaris

By Daniel Alexander Parkes posted 2 days ago

  


In the modern data landscape, lakehouse architecture has become the de facto choice for organizations seeking to combine the flexibility of object storage with the transactional capabilities of data warehouses. It offers the ability to ingest diverse datasets, operate with multiple compute engines, and serve both structured and unstructured data—all from a single scalable backend.

But this architectural freedom comes with a cost.

As data platforms evolve to support Spark, Trino, Flink, and AI workloads, governance becomes the critical weak point. When multiple engines share access to the same object data, enforcing consistent and fine-grained access control becomes a daunting task. Traditional database GRANT semantics don’t map easily to object storage APIs. And standard cloud-native security patterns, such as reverse proxies or IAM-per-engine plugins, either don’t scale or compromise performance.

What organizations truly need is a way to govern access at the object layer, using familiar table-level RBAC concepts, without adding friction or architectural bottlenecks.

That’s the problem we set out to solve with the Zero Trust Lakehouse—a design pattern we’ve built and showcased using IBM Storage Ceph Object Gateway (RGW) and Polaris, an open-source Iceberg catalog that governs access through time-bound credentials.

The Governance Problem: Why Fine-Grained Control Is Hard in a Lakehouse

At its core, the problem is simple to describe but notoriously hard to solve:

“I want to say GRANT SELECT ON products, and I want that enforced at the storage layer, across all engines, without breaking performance.”

Let’s unpack that:

  • Data users (analysts, scientists, engineers) expect familiar database semantics: table names, privileges, and grants.

  • Security teams require enforcement at the lowest level—at the object store itself—so that data is protected no matter which engine accesses it.

  • Performance-critical workloads need direct object access—no proxies, no slow enforcement chains, no duplicate copies.

And yet, most patterns today fail at least one of those criteria:

  • Engine-level Policy Enforcement Points (PEPs): Trino or Spark plugins can enforce fine-grained access, but only until a new engine is introduced for which no equivalent plugin exists.

  • Kubernetes namespace-per-team silos: Safe by default, but inflexible. When team Blue needs to read from team Purple’s bucket, you’re either copying secrets (dangerous) or duplicating data (expensive).

  • Proxy-based PEPs: Enforcement is centralized, but a proxy can’t keep up with high-throughput object workloads, so it quickly becomes the bottleneck.

None of these patterns scale across engines, teams, or workloads.

The Zero Trust Lakehouse: A Catalog-Driven Model for Object-Level Governance

Rather than force every engine to re-implement authorization, the Zero Trust Lakehouse uses a central catalog, Polaris, as the single source of truth.

Each Iceberg table, role, and grant is defined once in Polaris. When a compute engine, such as Spark or Trino, needs access, it requests a short-lived credential from Polaris. That credential is scoped using IAM-style S3 policies, limiting access to precisely the authorized datasets. Crucially, the policy is enforced directly by Ceph RGW—so no proxy, sidecar, or plugin is required.
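To make the scoping step concrete, here is a minimal sketch of how a table-level grant could be translated into an IAM-style S3 policy restricted to one table's prefix. This is an illustration, not Polaris's actual implementation: the function name, the privilege-to-action mapping, and the bucket and prefix values are all assumptions made for the example. Only the policy document layout and the S3 action names follow the standard IAM policy format.

```python
import json

# Hypothetical mapping from table privileges to S3 actions.
# The mapping Polaris really applies may differ.
PRIVILEGE_ACTIONS = {
    "SELECT": ["s3:GetObject"],
    "INSERT": ["s3:GetObject", "s3:PutObject"],
}


def build_scoped_policy(bucket, table_prefix, privileges):
    """Return an IAM-style policy limited to a single table's prefix."""
    prefix = table_prefix.strip("/")
    # Collect the distinct S3 actions implied by the granted privileges.
    actions = sorted({a for p in privileges for a in PRIVILEGE_ACTIONS[p]})
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access only under the table's prefix.
                "Effect": "Allow",
                "Action": actions,
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                # Listing is allowed only for keys under the same prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


# Hypothetical bucket and table location, for illustration only.
policy = build_scoped_policy("lakehouse", "gold/products", ["SELECT"])
print(json.dumps(policy, indent=2))
```

A policy shaped like this can be attached to a temporary credential, so that a reader of one table cannot fetch objects belonging to any other table in the same bucket.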

This model offers several key advantages:

  • SQL-style authorization: Use familiar RBAC statements such as GRANT SELECT ON products_gold TO analyst.

  • Least-privilege credentials: Tokens are short-lived and scoped to exact resources and actions.

  • Engine-agnostic enforcement: Since enforcement occurs at the storage layer, Spark, Trino, notebooks, and other Iceberg-aware engines receive consistent access control.

  • No performance bottlenecks: The token is passed in each S3 call and enforced natively by Ceph RGW—there’s no intermediary layer to slow things down.

This transforms the object store into a secure backend that behaves like a governed SQL database, yet with the flexibility and scalability of object storage.
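The enforcement side can be pictured with a toy model. The sketch below is not how Ceph RGW evaluates policies internally. It is a deliberately simplified evaluator, with hypothetical names throughout, that shows the two checks the advantages above rely on: the credential must not have expired, and the requested action and object must match an Allow statement, with any matching Deny taking precedence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase


@dataclass
class ScopedCredential:
    """Hypothetical stand-in for a short-lived, policy-scoped credential."""
    policy: dict
    expires_at: datetime


def is_allowed(cred, action, resource, now=None):
    """Toy policy check: expiry first, then explicit Deny, then Allow."""
    now = now or datetime.now(timezone.utc)
    if now >= cred.expires_at:
        return False  # expired credentials are rejected outright

    def matches(stmt):
        # Shell-style wildcard matching approximates IAM's "*" handling.
        return (any(fnmatchcase(action, a) for a in stmt["Action"])
                and any(fnmatchcase(resource, r) for r in stmt["Resource"]))

    statements = cred.policy["Statement"]
    if any(s["Effect"] == "Deny" and matches(s) for s in statements):
        return False
    return any(s["Effect"] == "Allow" and matches(s) for s in statements)


# Hypothetical policy granting read access to a single table prefix.
READ_PRODUCTS = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::lakehouse/gold/products/*"],
    }],
}

issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
cred = ScopedCredential(READ_PRODUCTS, issued + timedelta(minutes=15))
in_scope = "arn:aws:s3:::lakehouse/gold/products/data/part-0.parquet"
out_of_scope = "arn:aws:s3:::lakehouse/gold/customers/data/part-0.parquet"

print(is_allowed(cred, "s3:GetObject", in_scope, now=issued))
print(is_allowed(cred, "s3:GetObject", out_of_scope, now=issued))
print(is_allowed(cred, "s3:GetObject", in_scope, now=issued + timedelta(hours=1)))
```

Because the decision depends only on the credential and the request, the same outcome applies whichever engine sends the S3 call.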

Fine-Grained Authorization with Polaris RBAC

One of the key strengths of the Polaris catalog is its ability to define and enforce multi-level access control, down to the table, role, and even namespace level. The model separates concerns across:

  • Polaris Users – real identities (e.g., Charlie, Bob, Mark)

  • Principals – internal mappings that Polaris uses to manage permissions

  • Principal Roles – functional groupings (e.g., Data Engineer, Service Admin, Data Scientist)

  • Catalog Roles – actual RBAC bindings that define what actions are permitted in each catalog

  • Catalogs and Namespaces – logical data groupings, enabling isolation for environments like prod, dev, or per region/team.

For example, the same user (Bob) may assume multiple roles, such as both Streaming Data Loader and Spark ETL User, each of which is tied to a well-scoped Data Engineer role. That role can then be mapped to different Catalog Roles (e.g., Contributor in Bronze, Admin in Silver and Gold), allowing for differentiated access across the data lifecycle.

This model also introduces namespaces, which serve as logical schema boundaries inside a catalog. For example, prod_ns and staging_ns can be used to cleanly separate curated datasets from raw ingests, with access policies scoped independently.

Together, these constructs enable the implementation of proper least-privilege access, eliminating the need to replicate policies across engines or services.
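The layering described in this section can be sketched as a small lookup chain: principal, then principal roles, then catalog roles, then grants on a catalog, namespace, and table. The code below is a toy model for illustration only. It does not use the Polaris API, and every role, catalog, table, and privilege name in it is a hypothetical example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One privilege on one table inside a catalog namespace."""
    catalog: str
    namespace: str
    table: str
    privilege: str


# Catalog roles hold the actual grants (all names are hypothetical).
CATALOG_ROLES = {
    "bronze_contributor": {
        Grant("bronze", "prod_ns", "products", "SELECT"),
        Grant("bronze", "prod_ns", "products", "INSERT"),
    },
    "gold_reader": {
        Grant("gold", "prod_ns", "products_gold", "SELECT"),
    },
}

# Principal roles are functional groupings mapped onto catalog roles.
PRINCIPAL_ROLES = {
    "data_engineer": ["bronze_contributor"],
    "data_scientist": ["gold_reader"],
}

# Principals are assigned one or more principal roles.
PRINCIPALS = {
    "bob": ["data_engineer"],
    "charlie": ["data_scientist"],
}


def effective_grants(principal):
    """Union of all grants reachable through the principal's roles."""
    grants = set()
    for principal_role in PRINCIPALS.get(principal, []):
        for catalog_role in PRINCIPAL_ROLES.get(principal_role, []):
            grants |= CATALOG_ROLES.get(catalog_role, set())
    return grants


def can(principal, privilege, catalog, namespace, table):
    """True if the principal holds the privilege on exactly that table."""
    return Grant(catalog, namespace, table, privilege) in effective_grants(principal)


print(can("charlie", "SELECT", "gold", "prod_ns", "products_gold"))
print(can("charlie", "INSERT", "gold", "prod_ns", "products_gold"))
```

Since grants are defined once on catalog roles, changing what a Data Engineer may do in one catalog is a single edit rather than a change repeated in every engine.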

Hands-On: Deploy and Explore the Zero Trust Lakehouse

To bring this architecture to life, we’ve published a self-contained demo repository that walks you through the complete implementation. The only prerequisites are a Ceph cluster with the RGW service running and an OS capable of running Terraform and either Podman or Docker.

🔗 GitHub: Polaris-Ceph Terraform Demo (https://github.com/likid0/polaris-ceph-demo)

This repo automates a full Polaris + Ceph stack using Terraform and Docker Compose.

Terraform handles all the setup, including generating scoped tokens and populating the correct Iceberg catalog configs.
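For orientation, the sketch below shows what an Iceberg catalog configuration for Spark might look like when it points at a REST catalog that vends credentials. It is not taken from the demo repository, and the files Terraform generates there may differ. The property keys are standard Iceberg Spark and REST catalog settings, while the helper name, catalog name, URI, warehouse, and client credentials are placeholders assumed for illustration.

```python
def spark_catalog_conf(name, uri, warehouse, client_id, client_secret):
    """Build Spark properties for an Iceberg REST catalog (illustrative)."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        # OAuth2 client credentials used to obtain a catalog token.
        f"{prefix}.credential": f"{client_id}:{client_secret}",
        f"{prefix}.scope": "PRINCIPAL_ROLE:ALL",
        # Ask the catalog to return short-lived storage credentials.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


# Placeholder values; substitute the ones produced by your own deployment.
conf = spark_catalog_conf(
    name="polaris",
    uri="http://polaris.example.internal:8181/api/catalog",
    warehouse="gold",
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Generating these properties from one function keeps the engine side free of long-lived S3 keys: the engine authenticates to the catalog and receives storage credentials per table.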

Conclusion

The Zero Trust Lakehouse approach—powered by Ceph Object Storage and Polaris—enables structured, enforceable access control for object storage, without compromising performance or agility.

Whether you're building analytics pipelines, enforcing data compliance, or laying the foundation for secure AI/ML workloads, this model offers a proven, scalable approach to governance.

👉 Explore the full implementation here: https://github.com/likid0/polaris-ceph-demo
🎥 Watch the architecture walkthrough from Ceph Days: https://www.youtube.com/watch?v=GrkBFSrZOR4
