Thank you Jan. My first attempt was with Iceberg, as I understood that some updates and deletes could be required, which is what made me post here in the first place. My understanding is that Parquet files are meant to be immutable, which is ideal for data archival or log analysis, but less so when the data is still changing. It depends on how much it changes, of course, but I want to avoid accumulating too many delete markers and small files.
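From what I can tell, Iceberg's built-in maintenance procedures should keep that under control if they are scheduled regularly. A rough sketch, assuming Spark SQL with the Iceberg runtime (one of the engines Jan mentioned); the catalog/table names and thresholds are just placeholders:

    -- Compact small data files into larger ones
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '134217728')  -- ~128 MB
    );

    -- Merge scattered position-delete files (Iceberg 1.3+)
    CALL my_catalog.system.rewrite_position_delete_files(table => 'analytics.events');

    -- Drop old snapshots (and the files they reference) past a retention window
    CALL my_catalog.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2025-01-22 00:00:00',
        retain_last => 10
    );

So it looks manageable as long as writes are batched rather than committed row by row; I'd still be curious how others schedule this kind of maintenance.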
------------------------------
William Cresanti
------------------------------
Original Message:
Sent: Wed January 29, 2025 03:54 AM
From: JAN MUSIL
Subject: watsonx.data data ingestion patterns
Hi William, it depends on the format of the files (CSV, JSON, TEXT, Parquet, ...). You can copy the files "as is" to an S3 object store (a limited amount can be copied directly to the MinIO storage that is available in the default watsonx.data deployment). After that you can create a "wrapper" table through the Hive connector. Your files are then accessible through SQL using PrestoDB for the requested analysis, and if necessary you can offload them with INSERT ... SELECT into Iceberg format. Whether Iceberg is necessary depends on what you want to do (transactional processing, time travel, or using another SQL engine such as Spark).

I use the "mc" MinIO client command-line utility to copy the files to the MinIO object store. It's fast and doesn't require Python or DataStage. Note that Apache Airflow is also supported, and the dbt tool can be used for the internal transformation (from Hive to Iceberg).
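For illustration, assuming a CSV file was already copied with something like "mc cp events.csv wxd-alias/my-bucket/landing/events/" (the alias, bucket, catalog, schema and column names below are all placeholders), the SQL side looks roughly like this. The Hive connector reads CSV columns as VARCHAR, so the casts happen during the offload:

    -- "Wrapper" table over the raw files via the Hive connector
    -- (assumes the landing and analytics schemas already exist in each catalog)
    CREATE TABLE hive_data.landing.events_raw (
        event_ts  varchar,
        user_id   varchar,
        amount    varchar
    )
    WITH (
        format = 'CSV',
        external_location = 's3a://my-bucket/landing/events/'
    );

    -- Offload into Iceberg (Parquet) with proper types
    CREATE TABLE iceberg_data.analytics.events
    WITH (format = 'PARQUET')
    AS
    SELECT
        CAST(event_ts AS timestamp) AS event_ts,
        user_id,
        CAST(amount AS double)      AS amount
    FROM hive_data.landing.events_raw;

Later batches can then go in with INSERT INTO ... SELECT against the same wrapper table (or a fresh landing prefix).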
------------------------------
JAN MUSIL
Original Message:
Sent: Tue January 28, 2025 09:27 AM
From: William Cresanti
Subject: watsonx.data data ingestion patterns
Hello everyone. I'm interested to learn how organizations are getting data into watsonx.data. I'm new to the product and working out ingestion patterns for my org. Most of the documentation is focused on importing a file (manually) and creating an Iceberg table from its content. That's great in a demo, but not practical. How are folks feeding it data?
One avenue I'm investigating is a near real-time pipeline (using either DataStage or Python) that updates the Iceberg table content directly, but I'm concerned about the volume of files that would produce, given the snapshot and metadata files Iceberg writes with every commit.
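Roughly what I have in mind (placeholder names throughout) is micro-batch commits like the first statement below; my worry is the snapshot history that the second query would show growing with every commit, assuming the engine exposes Iceberg's "$snapshots" metadata table the way Presto/Trino do:

    -- Each small commit creates a new snapshot plus data and metadata files
    INSERT INTO iceberg_data.analytics.events (event_ts, user_id, amount)
    VALUES (TIMESTAMP '2025-01-28 09:00:00', 'user-123', CAST(42.0 AS double));

    -- Snapshot history, one row per commit
    SELECT committed_at, snapshot_id, operation
    FROM iceberg_data.analytics."events$snapshots"
    ORDER BY committed_at DESC;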
Thoughts? Recommendations?
#watsonx.data