Global AI and Data Science

  • 1.  Watson Studio pipelines running notebook jobs

    User Group Leader
    Posted Sun December 03, 2023 10:24 AM

    Hello, I created a couple of Jupyter notebooks in Python, along with the relevant jobs, and need to create a pipeline that runs notebook1 once, and notebook2 as many times as the number of elements in a list passed from notebook1 to notebook2.

    To do the above I need to pass environment variables from notebook1 to notebook2, and create a loop in the pipeline until all elements in one of the variables passed are processed.

    Could you give me some hints on how to pass environment variables?

    Also, how do I set up a loop in Watson Pipelines?

    Thanks in advance.



    ------------------------------
    Massimo Loaldi
    Advisory Partner Technical Specialist - Automation
    IBM
    Segrate (MI)
    ------------------------------


  • 2.  RE: Watson Studio pipelines running notebook jobs

    Posted Tue December 05, 2023 03:33 PM

    Hello Massimo,

    Checking through the job submission process, it doesn't seem like the jobs can be looped directly.

    But what if the environment variables of notebook1 were saved as a text file in your Cloud Object Storage (COS) using S3 methods? Then queue notebook2 to run after notebook1, and load the text file with the environment variables from COS using S3 methods. Then just use a Python "for loop" in notebook2 to loop through the environment variables.

    Notebook1 is illustrated below, where a simple file is created in the local environment, then moved to COS. Note that all orange blocks would be replaced with the specific information for your COS / bucket, and the Endpoint may also be different in your case from mine.

    Creates a text file in the running environment, moves to Cloud Object Storage (COS)
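
    As a rough sketch of that notebook1 cell (not the exact screenshot), assuming the ibm_boto3 SDK with placeholder credentials, bucket, and endpoint standing in for the orange blocks:

    ```python
    import ibm_boto3
    from ibm_botocore.client import Config

    # Placeholder credentials, endpoint, and bucket -- replace with your own COS details
    cos = ibm_boto3.client(
        "s3",
        ibm_api_key_id="YOUR_API_KEY",
        ibm_service_instance_id="YOUR_SERVICE_INSTANCE_CRN",
        ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
        config=Config(signature_version="oauth"),
        endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
    )

    # Write the values notebook2 will need to a local text file, one per line
    variables = ["item1", "item2", "item3"]
    with open("env_variables.txt", "w") as f:
        f.write("\n".join(variables))

    # Move the file from the local notebook environment into the COS bucket
    cos.upload_file("env_variables.txt", "YOUR_BUCKET_NAME", "env_variables.txt")
    ```
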
    Notebook2 is illustrated below, where the simple file is loaded from COS to the local environment. Note that all orange blocks would be replaced with your specific information for your COS / bucket, and the Endpoint may also be different.
    Load the previously saved text file from Cloud Object Storage (COS)
    Load the file from the environment to Python, run a "for loop" until complete.  
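
    A corresponding rough sketch for notebook2, again with placeholder COS details standing in for the orange blocks:

    ```python
    import ibm_boto3
    from ibm_botocore.client import Config

    # Same placeholder COS client setup as in the notebook1 sketch above
    cos = ibm_boto3.client(
        "s3",
        ibm_api_key_id="YOUR_API_KEY",
        ibm_service_instance_id="YOUR_SERVICE_INSTANCE_CRN",
        ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
        config=Config(signature_version="oauth"),
        endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
    )

    # Pull the file notebook1 saved back into the local notebook environment
    cos.download_file("YOUR_BUCKET_NAME", "env_variables.txt", "env_variables.txt")

    # Read the values into Python and loop until every element has been processed
    with open("env_variables.txt") as f:
        variables = [line.strip() for line in f if line.strip()]

    for value in variables:
        print("Processing:", value)  # replace with the real per-element work
    ```
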
    Something like this in the first cell:
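
    The original screenshot isn't reproduced here; as a minimal sketch, assuming the classic Jupyter front end's JavaScript API and placeholder cell numbers:

    ```python
    from IPython.display import Javascript, display

    START_CELL, END_CELL = 2, 10  # placeholder cell numbers to start and end at

    def execute_cell_range(start, end):
        # Ask the classic Jupyter Notebook front end to run the cells in this range
        display(Javascript(f"Jupyter.notebook.execute_cell_range({start},{end})"))

    # The for loop sits just before the Javascript command: re-run the cell range
    # once per element in the list loaded from the environment file above
    for _ in variables:
        execute_cell_range(START_CELL, END_CELL)
    ```
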
    Here the arguments to execute_cell_range() are the cell numbers to start and end at, and the for loop just before the Javascript command executes all of the cells multiple times, based on the environment file loaded.
    Attached is a link to StackOverflow where this and other methods for running a notebook multiple times, without having to combine all of the cells, are described. Link: StackOverflow Notebook In a Loop
    Lastly, I ran both of these notebooks as Jobs and received confirmation that notebook2 was correctly loading the output created by notebook1.
    Job Completed - Load File from COS Correctly.
    I hope this has been useful.


    ------------------------------
    Daniel Morvay
    ------------------------------



  • 3.  RE: Watson Studio pipelines running notebook jobs

    User Group Leader
    Posted Wed December 06, 2023 11:00 AM

    Ciao @Daniel Morvay, thanks for letting me know that the jobs cannot be looped with Watson Pipelines. I was actually looking for some docs on this but couldn't find anything.

    Your suggestion to use .txt files is indeed a good one; I used to pass environment variables differently (either through .ini files or via os.environ[]), and I believe all of these work fine. Regarding the loop, good hint to use execute_cell_range(); I have never used it, but I will certainly do so in the future. Thanks!

     



    ------------------------------
    Massimo Loaldi
    Advisory Partner Technical Specialist - Automation
    IBM
    Segrate (MI)
    ------------------------------



  • 4.  RE: Watson Studio pipelines running notebook jobs

    Posted Fri February 16, 2024 10:48 AM

    I know that I'm a little late to respond here. I had something like this just come up for a customer. I took a slightly different approach, as the customer wanted to use DataStage for part of the processing.

    They wanted to do something like this:

    1. Connect to an FTP server and grab ZIP files in a given folder (waiting to be processed)
    2. Load those ZIP files into the project's bucket for access later
    3. Produce a list of the ZIP files in that bucket location for looping in a Pipeline sequence
    4. For each of the ZIP files waiting to be processed... do this
      1. Unzip the file and grab a specific file within
      2. Load that extracted file into a different bucket location for access
      3. Using DataStage, process that extracted text file
      4. Move the ZIP file into a backup location within the bucket and FTP server

    To do that, I orchestrated everything within a Pipeline... and at the top it looked like this:

    Now that first Notebook just grabs the ZIP from an FTP server and loads it into the project's bucket (this is CPDaaS)... but then at the end, I have it print a line with the list of ZIP files sitting in my bucket location:
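
    The exact cell isn't shown here, but it amounts to something like this (the file names are placeholders):

    ```python
    # zip_files stands in for the list of ZIP keys gathered earlier in the notebook
    zip_files = ["batch_001.zip", "batch_002.zip"]

    # Emit one tagged, pipe-delimited line that a later Pipeline stage can find in the run log
    print("FILE-LIST:" + "|".join(zip_files))
    ```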

    The magic here is that I start the line with a specific tag "FILE-LIST:", which I find later in my parent Pipeline. I found that the "cpdctl" utility is included within the Pipeline runtime... so, I used that within a Bash script stage. I passed in variables for the Notebook stage's job and run information:

    And then within the Bash, I get the Notebook run log:

    Within that, I'm using "grep" to find any lines in the Notebook log that specifically start with "FILE-LIST:"... and then return to standard output just the portion after the ":" when found. That's the pipe-delimited list of the ZIP files waiting to be processed.
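
    The actual stage does this with "grep" in a shell script; purely to illustrate the filtering, an equivalent in Python would be roughly:

    ```python
    # Placeholder lines from the Notebook run log (the real log comes from cpdctl in the Bash stage)
    notebook_log_lines = [
        "Some other log output",
        "FILE-LIST:batch_001.zip|batch_002.zip",
    ]

    for line in notebook_log_lines:
        if line.startswith("FILE-LIST:"):
            # Keep only the portion after the ":" -- the pipe-delimited file list
            print(line.split(":", 1)[1])
    ```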

    I use that in my Pipeline sequence to drive the loop:

    Then to pass the current loop's ZIP file into the other Notebook stages, I use an environment variable:

    Now that ZIP_FILE becomes an environment variable within the context of the Notebook execution. I grab it by using this within my first cell:
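
    The screenshot isn't reproduced here, but presumably the first cell is something along these lines:

    ```python
    import os

    # ZIP_FILE is set by the Pipeline for this particular Notebook run
    zip_file = os.environ.get("ZIP_FILE")
    print("Current loop's ZIP file:", zip_file)
    ```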

    Now I run the Notebook and do the other stuff... specifically, I'm placing the file for DataStage inside of the project's bucket... and changing the name to be consistent. And that works:

    I'm not sure if this is pertinent to your original use case, but hopefully it can be helpful to someone that may find this post in the future.



    ------------------------------
    Gabe Green
    ------------------------------