Thank you Jan. My first attempt was with Iceberg, as I understood that some updates and deletes could be required, which is what made me post here in the first place. My understanding is that Parquet files are meant to be immutable, which is ideal for data archival or log analysis, but less so when the data is still changing. It depends on how much it changes, of course, but I want to avoid accumulating too many delete markers and small files.
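From what I can tell, Iceberg's built-in maintenance procedures should keep that under control if they are scheduled regularly. A rough sketch, assuming Spark SQL with the Iceberg runtime (one of the engines Jan mentioned); the catalog/table names and thresholds are just placeholders:

    -- Compact small data files into larger ones
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '134217728')  -- ~128 MB
    );

    -- Merge scattered position-delete files (Iceberg 1.3+)
    CALL my_catalog.system.rewrite_position_delete_files(table => 'analytics.events');

    -- Drop old snapshots (and the files they reference) past a retention window
    CALL my_catalog.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2025-01-22 00:00:00',
        retain_last => 10
    );

So it looks manageable as long as writes are batched rather than committed row by row; I'd still be curious how others schedule this kind of maintenance.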
------------------------------
William Cresanti
------------------------------
Original Message:
Sent: Wed January 29, 2025 03:54 AM
From: JAN MUSIL
Subject: watsonx.data data ingestion patterns
Hi William, it depends on the format of the files (CSV, JSON, TEXT, Parquet, ...). You can copy the files "as is" to an S3 object store (a limited amount can be copied directly to the MinIO storage that is available in the default watsonx.data deployment). After that you can create a "wrapper" table through the Hive connector. Your files are then accessible through SQL using PrestoDB for the requested analysis, and if necessary you can offload them with INSERT ... SELECT into Iceberg format. Whether Iceberg is necessary depends on what you want to do (transactional processing, time travel, or using another SQL engine such as Spark).

I use the "mc" MinIO client command-line utility to copy the files to the MinIO object store. It's fast and doesn't require Python or DataStage. Note that Apache Airflow is also supported, and the dbt tool can be used for the internal transformation (from Hive to Iceberg).
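For illustration, assuming a CSV file was already copied with something like "mc cp events.csv wxd-alias/my-bucket/landing/events/" (the alias, bucket, catalog, schema and column names below are all placeholders), the SQL side looks roughly like this. The Hive connector reads CSV columns as VARCHAR, so the casts happen during the offload:

    -- "Wrapper" table over the raw files via the Hive connector
    -- (assumes the landing and analytics schemas already exist in each catalog)
    CREATE TABLE hive_data.landing.events_raw (
        event_ts  varchar,
        user_id   varchar,
        amount    varchar
    )
    WITH (
        format = 'CSV',
        external_location = 's3a://my-bucket/landing/events/'
    );

    -- Offload into Iceberg (Parquet) with proper types
    CREATE TABLE iceberg_data.analytics.events
    WITH (format = 'PARQUET')
    AS
    SELECT
        CAST(event_ts AS timestamp) AS event_ts,
        user_id,
        CAST(amount AS double)      AS amount
    FROM hive_data.landing.events_raw;

Later batches can then go in with INSERT INTO ... SELECT against the same wrapper table (or a fresh landing prefix).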
------------------------------
JAN MUSIL
Original Message:
Sent: Tue January 28, 2025 09:27 AM
From: William Cresanti
Subject: watsonx.data data ingestion patterns
Hello everyone. I'm interested to learn how organizations are getting data into watsonx.data. I'm new to the product and working out ingestion patterns for my org. Most of the documentation is focused on importing a file (manually) and creating an Iceberg table from its content. That's great in a demo, but not practical. How are folks feeding it data?
One avenue I'm investigating is a near real-time pipeline (using either DataStage or Python) that updates the Iceberg table content directly, but I'm concerned about the volume of files that would produce, given the snapshot and metadata files Iceberg writes with every commit.
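Roughly what I have in mind (placeholder names throughout) is micro-batch commits like the first statement below; my worry is the snapshot history that the second query would show growing with every commit, assuming the engine exposes Iceberg's "$snapshots" metadata table the way Presto/Trino do:

    -- Each small commit creates a new snapshot plus data and metadata files
    INSERT INTO iceberg_data.analytics.events (event_ts, user_id, amount)
    VALUES (TIMESTAMP '2025-01-28 09:00:00', 'user-123', CAST(42.0 AS double));

    -- Snapshot history, one row per commit
    SELECT committed_at, snapshot_id, operation
    FROM iceberg_data.analytics."events$snapshots"
    ORDER BY committed_at DESC;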
Thoughts? Recommendations?
#watsonx.data