Overview
Organizations increasingly want to make real-time Kafka data available for analytics and AI without introducing more pipeline complexity or unnecessary data duplication. In practice, many teams still end up copying the same data across streaming systems, object storage, and warehouses just to make it accessible to different engines and users.
Every second, events flow through Kafka: customer clicks, transactions, sensor signals, application logs. Turning that live stream into something analytics teams and AI systems can use often means building a maze of pipelines and maintaining infrastructure that quickly becomes expensive and hard to manage. What should be a fast path from event to insight turns into a familiar bottleneck.
In this blog, we will walk through how to stream data from Confluent into Apache Iceberg tables and query data directly from IBM watsonx.data with both Presto and Spark without creating additional copies of the data.
By combining Confluent Tableflow with watsonx.data, teams can make real-time streaming data immediately available for analytics and AI while reducing the overhead and complexity of data duplication. Instead of moving data from system to system, analytical engines work directly with Iceberg tables that are continuously updated from streaming sources.
We will walk through how to use Tableflow’s Iceberg REST Catalog endpoint along with an API key to discover those tables and query them directly from watsonx.data, allowing streaming data to be analyzed alongside all your enterprise data in a unified environment.
Get familiar with the technology:
Confluent Tableflow
- Converts Kafka topics → Apache Iceberg tables
- Stores table metadata (snapshots, schemas, manifests)
- Exposes a standards-based Iceberg REST Catalog
- Manages files in object storage (S3, GCS, or other cloud storage)
IBM watsonx.data
- Open, hybrid lakehouse architecture
- Native Iceberg support across Spark + Presto engines
- Federated analytics across object storage, warehouses, and external catalogs
Apache Iceberg
- High-performance open table format
- Supports schema evolution, partitioning, snapshots, and incremental planning
Why this matters:
Most teams use complex pipelines, connectors, or batch jobs to move Kafka data into their analytics systems. This creates delays, extra storage copies, and inconsistent tables across engines. Confluent Tableflow combined with IBM watsonx.data provides a direct, zero-copy way to query Kafka topics as Iceberg tables as soon as Tableflow has materialized them.
- Make Kafka data queryable instantly:
  - Tableflow turns topics into Iceberg tables automatically without building ingestion jobs.
  - Everything stays in Iceberg + object storage, not a proprietary system.
- Eliminate pipelines & copies:
  - No more connectors, batch jobs, or duplicating data into a warehouse/lake.
  - Lower maintenance and operational overhead.
- Unify real-time + historical analytics:
  - Streaming data from Kafka sits next to your existing Iceberg tables.
  - Fresh Kafka data becomes available to downstream analytics immediately.
What we will build:

What you’ll learn:
- How TableFlow materializes Kafka topics into Iceberg tables
- How to connect Spark and Presto to the Iceberg REST Catalog
- How to run SQL queries on streaming data with zero copies
- How to unify real‑time and historical data in watsonx.data
Getting Started
What you’ll need:
- IBM watsonx.data trial
- Confluent Cloud trial
- Confluent Cloud account with a Kafka cluster and topic.
- Tableflow enabled (Managed Storage is simplest for this tutorial; Bring Your Own Storage (BYOS) also works, but the engines must have access to that storage).
- IBM watsonx.data with Spark and Presto engines provisioned.
- Your Iceberg REST Catalog endpoint and a Tableflow API key/secret (from Confluent Cloud → Tableflow page).
Enable TableFlow on Confluent Kafka Topic
In the Confluent dashboard, enable and configure Tableflow so that external systems like watsonx.data can connect to it through its API.

- Create or select a Kafka topic.
- Enable Tableflow on the topic.
- Choose storage:
  - Amazon S3
  - Google Cloud Storage
  - Azure Blob Storage
- Capture the REST Catalog endpoint, which has the format: https://tableflow.{CLOUD_REGION}.aws.confluent.cloud/iceberg/catalog/organizations/{ORG_ID}/environments/{ENV_ID}
- Create a Tableflow API key and secret. Supply them in the format <apikey>:<secret> as the Iceberg REST Catalog credential for Iceberg reader applications or compute engines.
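The endpoint and credential formats above are easy to mistype, so it can help to assemble them in one place. The following is a minimal sketch, not an official Confluent utility: the function names `build_catalog_uri` and `build_credential` and the sample IDs are hypothetical, and the URL template is the AWS form quoted above.

```python
def build_catalog_uri(cloud_region: str, org_id: str, env_id: str) -> str:
    """Fill in the Tableflow REST Catalog URL template shown above (AWS form)."""
    parts = {"cloud_region": cloud_region, "org_id": org_id, "env_id": env_id}
    for name, value in parts.items():
        if not value or not value.strip():
            raise ValueError(f"{name} must not be empty")
    return (
        f"https://tableflow.{cloud_region.strip()}.aws.confluent.cloud"
        f"/iceberg/catalog/organizations/{org_id.strip()}"
        f"/environments/{env_id.strip()}"
    )


def build_credential(api_key: str, api_secret: str) -> str:
    """Join the Tableflow API key and secret as <apikey>:<secret>."""
    if not api_key or not api_secret:
        raise ValueError("both api_key and api_secret are required")
    if ":" in api_key:
        # A colon in the key would make the combined string ambiguous.
        raise ValueError("api_key must not contain ':'")
    return f"{api_key}:{api_secret}"


# Hypothetical placeholder values, for illustration only.
uri = build_catalog_uri("us-east-1", "my-org-id", "my-env-id")
credential = build_credential("MY_API_KEY", "MY_API_SECRET")
print(uri)
```

In practice you would read the key and secret from environment variables or a secret store rather than hard-coding them in a notebook.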

Tableflow is now set up so that external systems such as watsonx.data can read from it, as the next section shows.
Accessing Confluent Tableflow Iceberg tables using watsonx.data Spark
The following Spark configuration registers the Tableflow Iceberg REST Catalog as a Spark catalog named tableflowdemo, enabling Spark to discover the namespaces and tables that Tableflow manages. You can run this configuration directly inside a watsonx.data Spark engine notebook or in any Spark environment that has network access to the Iceberg REST Catalog and the object storage.
- First, connect to the Tableflow REST Catalog using the endpoint and credentials created earlier:
from pyspark.sql import SparkSession

# Create Spark session with Confluent Tableflow configuration
spark = (
    SparkSession.builder
    .appName("Read Confluent Tableflow Table")
    # Register the Tableflow Iceberg REST Catalog as the Spark catalog "tableflowdemo"
    .config("spark.sql.catalog.tableflowdemo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.tableflowdemo.type", "rest")
    # REST Catalog endpoint captured from Confluent Cloud (no leading or trailing spaces)
    .config("spark.sql.catalog.tableflowdemo.uri",
            "https://tableflow.{CLOUD_REGION}.aws.confluent.cloud/iceberg/catalog/organizations/{ORG_ID}/environments/{ENV_ID}")
    # Tableflow API key and secret in the format <apikey>:<secret>
    .config("spark.sql.catalog.tableflowdemo.credential",
            "{KEY}:{SECRET}")
    .config("spark.sql.catalog.tableflowdemo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.tableflowdemo.rest-metrics-reporting-enabled", "false")
    .config("spark.sql.catalog.tableflowdemo.s3.remote-signing-enabled", "true")
    .config("spark.sql.catalog.tableflowdemo.client.region", "{CLOUD_REGION}")
    .getOrCreate()
)
# If Tableflow is enabled with your own AWS S3 bucket as the integrated storage,
# add the following settings to the builder chain above, before .getOrCreate():
#   .config("spark.sql.catalog.tableflowdemo.s3.access-key-id", "<s3-access-key>")
#   .config("spark.sql.catalog.tableflowdemo.s3.secret-access-key", "<s3-secret-key>")
#   .config("spark.sql.catalog.tableflowdemo.s3.region", "<s3-region>")
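Because the optional S3 settings have to be part of the builder chain, one way to keep both variants in a single place is to assemble the catalog properties as a dictionary first and apply them in a loop. This is a sketch under assumptions: the helper names `tableflow_catalog_conf` and `create_session` are hypothetical, and the property keys simply mirror the ones used in the configuration above.

```python
from typing import Dict, Optional


def tableflow_catalog_conf(
    catalog: str,
    uri: str,
    credential: str,
    region: str,
    s3_access_key: Optional[str] = None,
    s3_secret_key: Optional[str] = None,
    s3_region: Optional[str] = None,
) -> Dict[str, str]:
    """Build the spark.sql.catalog.<catalog>.* properties used above."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri.strip(),  # guard against stray whitespace
        f"{prefix}.credential": credential,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
        f"{prefix}.s3.remote-signing-enabled": "true",
        f"{prefix}.client.region": region,
    }
    # Only add explicit S3 credentials when both key and secret are supplied,
    # i.e. when Tableflow writes to your own S3 bucket.
    if s3_access_key and s3_secret_key:
        conf[f"{prefix}.s3.access-key-id"] = s3_access_key
        conf[f"{prefix}.s3.secret-access-key"] = s3_secret_key
        conf[f"{prefix}.s3.region"] = s3_region or region
    return conf


def create_session(conf: Dict[str, str], app_name: str = "Read Confluent Tableflow Table"):
    """Create a Spark session from the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above works without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call such as `create_session(tableflow_catalog_conf("tableflowdemo", uri, credential, "us-east-1"))` would then replace the hand-written chain; the region value here is only a placeholder.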
- If connected successfully, you can discover namespaces and tables:
def read_confluent_table(spark, catalog):
    # Show namespaces
    print(f"\nNamespaces in catalog: {catalog}")
    namespaces_df = spark.sql(f"SHOW NAMESPACES IN {catalog}")
    namespaces_df.show(truncate=False)

    # Loop through namespaces and show tables
    namespaces = [row['namespace'].strip('`') for row in namespaces_df.collect()]
    for ns in namespaces:
        print(f"\nTables in namespace: {ns}")
        tables_df = spark.sql(f"SHOW TABLES IN {catalog}.`{ns}`")
        tables_df.select("namespace", "tableName", "isTemporary").show(truncate=False)

# Usage
catalog = "tableflowdemo"
read_confluent_table(spark, catalog)
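The listing code wraps each namespace in backticks because discovered identifiers may contain characters, such as hyphens, that Spark SQL does not accept unquoted. A small helper can apply the same quoting consistently to catalog, namespace, and table names. This is a sketch; `quote_identifier` and `qualified_table` are hypothetical names, and the sample namespace and table below are made up.

```python
def quote_identifier(name: str) -> str:
    """Wrap a Spark SQL identifier in backticks, doubling any embedded backtick."""
    if not name:
        raise ValueError("identifier must not be empty")
    return "`" + name.replace("`", "``") + "`"


def qualified_table(catalog: str, namespace: str, table: str) -> str:
    """Return a fully qualified, safely quoted table name."""
    return ".".join(quote_identifier(part) for part in (catalog, namespace, table))


# Hypothetical names, for illustration only.
print(qualified_table("tableflowdemo", "my-namespace", "my-topic"))
```

The resulting string can be dropped into any `spark.sql(...)` statement in place of a hand-built `{catalog}.{namespace}.{table}` expression.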
- Then query the table data with SQL:
# Reuses the `spark` session created above
catalog = "tableflowdemo"
namespace = "your_namespace"
table = "your_table"

# Get table schema (backticks protect names that contain hyphens or dots)
print("Table Schema")
print("-----------------------------------")
schema_df = spark.sql(f"DESCRIBE TABLE {catalog}.`{namespace}`.`{table}`")
schema_df.show(truncate=False)

# Sample the data (no filter is applied here; add a WHERE clause as needed)
print("\nSample Data (First 10 Rows)")
print("-----------------------------------")
df = spark.sql(f"""
    SELECT *
    FROM {catalog}.`{namespace}`.`{table}`
    LIMIT 10
""")
df.show(truncate=False)
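Since Iceberg tracks every commit as a snapshot, the same Spark session can also inspect how the table changes as Kafka data arrives. The sketch below only builds the SQL text; it assumes Spark 3.3 or later with the Iceberg runtime (for `VERSION AS OF`) and that the Tableflow catalog exposes Iceberg's standard `snapshots` metadata table, which we have not verified here. The function names and the sample snapshot ID are hypothetical.

```python
def snapshots_sql(qualified_table: str) -> str:
    """SQL that lists the snapshots of an Iceberg table, newest first."""
    return (
        "SELECT committed_at, snapshot_id, operation "
        f"FROM {qualified_table}.snapshots "
        "ORDER BY committed_at DESC"
    )


def time_travel_sql(qualified_table: str, snapshot_id: int, limit: int = 10) -> str:
    """SQL that reads the table as of a given snapshot ID."""
    if isinstance(snapshot_id, bool) or not isinstance(snapshot_id, int):
        raise TypeError("snapshot_id must be an integer")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {qualified_table} VERSION AS OF {snapshot_id} LIMIT {limit}"


# Placeholder table name; with a live session you would run, for example:
#   spark.sql(snapshots_sql(fqn)).show(truncate=False)
fqn = "tableflowdemo.`your_namespace`.`your_table`"
print(snapshots_sql(fqn))
print(time_travel_sql(fqn, 1234567890))
```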
Configure Tableflow as a custom data source and read using Presto
In addition to Spark, watsonx.data offers a custom data source option: you register the data source by supplying the Presto catalog configuration, and Presto can then read the same Tableflow tables.
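As a rough illustration of what such a Presto configuration might contain, the sketch below renders a properties text from the same endpoint and credential used for Spark. The property keys follow the open-source Presto Iceberg connector's REST catalog settings; whether the watsonx.data custom data source form accepts exactly these keys, and which additional object storage settings it needs, is an assumption you should check against the product documentation. The function name `presto_iceberg_rest_properties` is hypothetical.

```python
def presto_iceberg_rest_properties(uri: str, credential: str) -> str:
    """Render Presto Iceberg connector properties for a REST catalog.

    Assumption: keys are taken from the open-source Presto Iceberg connector;
    the watsonx.data custom data source may expect a different set.
    """
    if not uri.strip():
        raise ValueError("uri must not be empty")
    if ":" not in credential:
        raise ValueError("credential must have the form <apikey>:<secret>")
    props = {
        "connector.name": "iceberg",
        "iceberg.catalog.type": "rest",
        "iceberg.rest.uri": uri.strip(),
        "iceberg.rest.auth.type": "OAUTH2",
        "iceberg.rest.auth.oauth2.credential": credential,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Placeholders in the same style as the Spark configuration above.
print(presto_iceberg_rest_properties(
    "https://tableflow.{CLOUD_REGION}.aws.confluent.cloud/iceberg/catalog/organizations/{ORG_ID}/environments/{ENV_ID}",
    "{KEY}:{SECRET}",
))
```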
What we proved/learned:
- Unified engine surface: Spark and Presto discover the same tables via the REST Catalog.
- Zero-copy integration: Query operational Kafka data in place as Iceberg tables, with no custom pipelines.
- Open standards: All components (Kafka, Iceberg, REST catalog, object storage) remain open and interoperable.
Conclusion
Integrating Confluent TableFlow with IBM watsonx.data gives enterprises a complete, zero-copy, open-standard path to unify streaming and analytical workloads. With Iceberg as the backbone and watsonx.data as the federated engine layer, teams can:
- Discover streaming datasets instantly
- Query Kafka-backed Iceberg tables alongside historical data
- Apply unified governance across the lakehouse
- Eliminate pipelines, duplication, and connector complexity
- Power real-time analytics, dashboards, and AI
Interested in learning more about watsonx.data? Check out the demo library.