watsonx.data

  • 1.  watsonx.data data ingestion patterns

    Posted 29 days ago

    Hello everyone. I'm interested in learning how organizations are getting data into watsonx.data. I'm new to the product and working out ingestion patterns for my org. Most of the documentation focuses on manually importing a file and creating an Iceberg table from its content. That's great in a demo, but not practical. How are folks feeding it data?

    One of the avenues I'm investigating is a near real-time pipeline (using either DataStage or Python) to update the Iceberg table content directly, but I'm concerned about the volume of files that approach will produce (given the snapshot and metadata files Iceberg writes with every commit).

    Thoughts? Recommendations?


    #watsonx.data


  • 2.  RE: watsonx.data data ingestion patterns

    Posted 28 days ago

    Hi William, it depends on the format of the files (CSV, JSON, TEXT, Parquet, ...). You can copy the files as-is to an S3-compatible object store (a limited amount can be copied directly to the MinIO storage that is available in the default watsonx.data deployment). After that you can create a "wrapper" table over them through the Hive connector, so the files become accessible through SQL using PrestoDB for the analysis you need, and if necessary you can offload them to Iceberg format with an INSERT ... SELECT. Whether Iceberg is needed depends on what you want to do (transactional processing, time travel, or using another SQL engine such as Spark). I use the "mc" MinIO client command-line utility to copy files to the MinIO object store; it's fast and doesn't require Python or DataStage. Note that Apache Airflow is also supported, and the dbt tool can be used for the internal transformation (from Hive to Iceberg).
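
    That flow can also be driven end to end from a script if you prefer. Below is a minimal sketch in Python, assuming the presto-python-client package; the endpoint, credentials, bucket, catalog names (hive_data, iceberg_data), and schema/table/column names are all placeholders.

    import prestodb

    # Connect to the Presto engine in watsonx.data (endpoint, credentials, and
    # catalog names below are placeholders for your environment).
    conn = prestodb.dbapi.connect(
        host="presto.example.com",
        port=8443,
        user="ingest_user",
        catalog="hive_data",
        schema="staging",
        http_scheme="https",
        auth=prestodb.auth.BasicAuthentication("ingest_user", "api_key"),
    )
    cur = conn.cursor()

    # 1. "Wrapper" table over the CSV files already copied to the bucket.
    #    The Hive connector exposes CSV columns as varchar.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS hive_data.staging.events_raw (
            event_id   varchar,
            event_time varchar,
            payload    varchar
        )
        WITH (format = 'CSV', external_location = 's3a://landing-bucket/events/')
    """)
    cur.fetchall()

    # 2. Target Iceberg table (skip if it already exists).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS iceberg_data.analytics.events (
            event_id   varchar,
            event_time timestamp,
            payload    varchar
        )
        WITH (format = 'PARQUET')
    """)
    cur.fetchall()

    # 3. Offload into Iceberg with INSERT ... SELECT, casting as needed.
    cur.execute("""
        INSERT INTO iceberg_data.analytics.events
        SELECT event_id, CAST(event_time AS timestamp), payload
        FROM hive_data.staging.events_raw
    """)
    cur.fetchall()

    The files themselves can be staged beforehand with the MinIO client, for example mc cp ./exports/*.csv wxd-minio/landing-bucket/events/ (the alias and bucket names are placeholders too).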



    ------------------------------
    JAN MUSIL
    ------------------------------



  • 3.  RE: watsonx.data data ingestion patterns

    Posted 15 days ago

    Thank you Jan. My first attempt used Iceberg, as I understood that some updates and deletes could be required, which is what made me post here in the first place. My understanding is that Parquet files are meant to be immutable, which is ideal for data archival or log analysis, but less so when the data is still changing. It depends on how much the data changes, of course, but I want to avoid accumulating too many delete markers and small files.
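
    For reference, Iceberg's standard table-maintenance procedures are aimed at exactly that concern: compacting small files and expiring old snapshots on a schedule. Below is a minimal sketch, assuming a Spark engine configured against the same Iceberg catalog (Spark being the other engine Jan mentioned); the catalog name lakehouse, the table name, and the retention timestamp are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

    # Compact small data files into fewer, larger ones.
    spark.sql("""
        CALL lakehouse.system.rewrite_data_files(
            table => 'analytics.events',
            options => map('min-input-files', '5')
        )
    """)

    # Expire snapshots older than the retention window so metadata
    # and unreferenced files don't pile up.
    spark.sql("""
        CALL lakehouse.system.expire_snapshots(
            table => 'analytics.events',
            older_than => TIMESTAMP '2024-01-01 00:00:00'
        )
    """)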



    ------------------------------
    William Cresanti
    ------------------------------



  • 4.  RE: watsonx.data data ingestion patterns

    Posted 21 days ago

    Hello William. To feed data into watsonx.data you can use tools like DataStage or Python for near real-time pipelines that update Iceberg tables. To avoid excessive file creation, batch updates into larger chunks and leverage Iceberg features like partitioning and compaction. Consider an event-driven architecture (e.g., Kafka) for efficient data flow management.
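
    A minimal sketch of the batching idea, assuming the kafka-python and presto-python-client packages; the topic, broker, Presto endpoint, credentials, catalog, and table/column names are all placeholders.

    import json

    import prestodb
    from kafka import KafkaConsumer

    BATCH_SIZE = 5000  # larger batches -> fewer Iceberg snapshots and data files

    consumer = KafkaConsumer(
        "events",                                    # placeholder topic
        bootstrap_servers="kafka.example.com:9092",  # placeholder broker
        group_id="wxd-ingest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        enable_auto_commit=False,
    )

    conn = prestodb.dbapi.connect(
        host="presto.example.com", port=8443, user="ingest_user",
        catalog="iceberg_data", schema="analytics", http_scheme="https",
        auth=prestodb.auth.BasicAuthentication("ingest_user", "api_key"),
    )

    def flush(batch):
        # One multi-row INSERT => one Iceberg commit for the whole batch.
        # (Real code should validate/escape the values instead of formatting them in.)
        rows = ", ".join(
            "('{}', '{}', {})".format(r["event_id"], r["event_time"], r["amount"])
            for r in batch
        )
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO events_stream (event_id, event_time, amount) VALUES " + rows
        )
        cur.fetchall()     # drive the statement to completion
        consumer.commit()  # acknowledge Kafka offsets only after the data lands

    batch = []
    for message in consumer:  # kafka-python yields messages as they arrive
        batch.append(message.value)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []

    In practice a time-based flush is needed alongside the size-based one so a quiet topic still gets committed, but the principle holds: one INSERT per batch means one Iceberg snapshot per batch instead of one per event.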



    ------------------------------
    zeus kam
    ------------------------------



  • 5.  RE: watsonx.data data ingestion patterns

    Posted 15 days ago

    Thank you Zeus. I have been working on an event-driven architecture using DataStage and/or Python for my processing. I've learned that Kafka has a message size limitation that will not work for us, but the messages can be carried over MQ (which will work fine). My main concern with the solution is that the data is still too variable, which will create too many files and make querying less efficient. I think I need to keep this data in a different datastore and use Parquet for data archival. If I can thin out our warehouse some, but still have the data available in the Data Lakehouse, I think we'll be good.
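
    A minimal sketch of that archival path, assuming the pyarrow and boto3 packages; the bucket, endpoint, credentials, and column layout are placeholders.

    import boto3
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build the archive file (in practice the rows come from the warehouse extract).
    table = pa.Table.from_pydict({
        "event_id":   ["e-001", "e-002"],
        "event_time": ["2024-05-01 10:00:00", "2024-05-01 10:05:00"],
        "payload":    ['{"k": 1}', '{"k": 2}'],
    })
    pq.write_table(table, "events_2024_05.parquet", compression="snappy")

    # Copy it to the S3-compatible bucket (MinIO or any other object store).
    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.example.com:9000",  # placeholder endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )
    s3.upload_file(
        "events_2024_05.parquet",
        "archive-bucket",
        "events/events_2024_05.parquet",
    )

    Once the files are in the bucket, a Hive table pointed at s3a://archive-bucket/events/ with format = 'PARQUET' makes the archive queryable from PrestoDB without rewriting the files, along the lines Jan described.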



    ------------------------------
    William Cresanti
    ------------------------------