Data Integration


Searching documentation about DataStage export and DSX or ISX

  • 1.  Searching documentation about DataStage export and DSX or ISX

    Posted Fri December 15, 2023 06:37 AM

    Hello,

    A client asked me to extract dataset lineage information from DataStage.

They sent me an export of their process as a DSX file and showed me how to produce it from their desktop client application.

    They use an on-premise DataStage server.

Now I'm looking for a way to automate this export, ideally by calling an API, otherwise by calling a non-interactive command-line client.

I'm also looking for documentation, or a specification, of the DSX file format.

During my first search, I heard about an ISX export format that is said to be more recent. So I'm also looking for information about its format or any API to produce it. I've already found an istool export command.

However, my client seemed reluctant to use this format for some reason.

    Any link to relevant documentation would be appreciated.

    Thank you,



    ------------------------------
    Jérôme Mainaud
    ------------------------------


  • 2.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Mon December 18, 2023 02:15 AM

    Hi Jerome,

Take a look at https://www.ibm.com/docs/en/iis/11.7?topic=projects-exporting to export a whole DataStage project or parts of it.

    Regards,

    Udo



    ------------------------------
    Udo Neumann
    ------------------------------



  • 3.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Mon December 18, 2023 04:33 AM

    Hi Jérôme,

    Is it not possible for your client to run the data lineage reports themselves directly on their platform?

It'd certainly be quicker, easier, cheaper, and more accurate than asking you to import their entire DataStage landscape into your environment and running the reports there.

    John



    ------------------------------
    John McKeever
    ------------------------------



  • 4.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Mon December 18, 2023 06:55 AM

    Hi John,

    Thank you for your idea. I will look at these reports.

    My goal is to automatically import the lineage in a data catalog every night. Therefore I need to get a parseable result without requiring human action.

    The report can be exported in CSV format, so it is somehow parseable. 

    However I see two limitations for now:

1. I see no automated way to trigger this.
    2. "The default maximum number of nodes that are displayed in a lineage report is 500. If more than this number of nodes is present in your lineage report, the report is truncated."

If I can find a solution for both of them, it could be cheaper.

    I see this is a part of the InfoSphere Information Governance Catalog which offers a REST API. This API seems to be focused on assets, but maybe there is information about lineage.



    ------------------------------
    Jérôme Mainaud
    ------------------------------



  • 5.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Tue December 19, 2023 09:05 AM

You might also look at doing the export in XML format (as opposed to DSX), as then it could be read by Manta (the lineage company IBM purchased a couple of months ago) and have the lineage created from it.

But a better solution would be to look at modernizing the platform to Cloud Pak for Data and DataStage NextGen. That way the DataStage job would reside in a project, and you could schedule the lineage job via the on-board scheduler or even run things via a 3rd-party scheduler. And all of that could run inside the environment (including the lineage!)



    ------------------------------
    Mark Hickok
    ------------------------------



  • 6.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Tue December 19, 2023 09:31 AM

You could analyse the lineage in Information Server if you have IGC licensed. Also, you have to differentiate between design-time and runtime lineage. The maintenance task could also be automated via scripts.



    ------------------------------
    Udo Neumann
    ------------------------------



  • 7.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Thu December 21, 2023 09:39 AM

Thank you for the pointer. I will have to choose between design-time and runtime lineage.



    ------------------------------
    Jérôme Mainaud
    ------------------------------



  • 8.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted 2 days ago

    I realize this is a very old thread and probably not very useful any more, but...

    I just finished writing a home-grown solution that simply queries DSODB for the parameter values passed in to all jobs, uses grep/awk/etc to reduce the noise (removing anything that is formatted as a date or time, anything that resembles a password or token, and anything I could find that is not a table definition, MQ queue name, web site, or other lineage-related information), and then keeps a permanent record of any unique parameters passed to the specific job.  To reduce the load on my database (and since I am keeping a permanent record of unique parameters anyway), I only pull "the last two days" of runtime parameters each day.

    I combine this with extracts from an ISX-converted-to-XML backup of my projects for a "where used" including an attempt at runtime metadata.

    Limitations:

     - DSODB only stores 250-ish characters of any given query in the runtime XML, so if you pass in a query like "select [lots and lots of specifications] from [tablename]" then "[tablename]" will likely not end up in your extract.

     - Obviously filtering out any cruft is going to be a time-consuming bit of coding, and highly specific to your own environment and codebase.  

    My query (note: I am only pulling parallel jobs here - you can easily remove JOBTYPE='PAR' to get sequences as well):

    SQL="SELECT JOBEXEC.JOBNAME, JOBEXEC.PROJECTNAME, JOBRUN.INVOCATIONID, XMLSERIALIZE(JOBRUNPARAMS.PARAMLIST as VARCHAR(30000))
      FROM ((DSODB.JOBRUN JOBRUN
         INNER JOIN DSODB.JOBEXEC JOBEXEC ON (JOBRUN.JOBID = JOBEXEC.JOBID))
         INNER JOIN DSODB.JOBRUNPARAMS JOBRUNPARAMS ON (JOBRUNPARAMS.RUNID = JOBRUN.RUNID))
         WHERE JOBEXEC.JOBTYPE = 'PAR' AND
         JOBRUN.RUNSTARTTIMESTAMP > (CURRENT_DATE - $numberofdays DAYS)"
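For anyone consuming the serialised PARAMLIST, here is a small sketch of parsing it with Python's standard library instead of grep/sed. The XML fragment below is invented to match the params/param name/value structure implied by the cleanup rules later in this post, not taken from DSODB documentation:

```python
import xml.etree.ElementTree as ET

# Hypothetical PARAMLIST fragment, shaped like the <params><param .../></params>
# markup that the sed rules below operate on.
paramlist = ('<params><param name="TARGET_TABLE" value="SALES"/>'
             '<param name="DB_PWD" value="secret"/></params>')

params = {p.get("name"): p.get("value")
          for p in ET.fromstring(paramlist).iter("param")}
# Drop anything that looks like a credential before keeping a record.
lineage = {k: v for k, v in params.items() if "pwd" not in k.lower()}
print(lineage)  # {'TARGET_TABLE': 'SALES'}
```

Parsing the XML properly sidesteps some of the quoting pitfalls of regex-based cleanup, though the filtering rules would still be environment-specific.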
    Some sample code for removing the cruft (restructured into valid shell: a trailing backslash cannot be followed by a comment, so each stage now ends with `|`, which lets full-line comments sit between stages; the stray `p` substitution flags were also dropped, since in sed they print lines twice):

    cat $tempfile |
         grep -v "job_name_I_don't_want_to_track" |
         grep -v "another_job_name_I_don't_want_to_track" |
         perl -pe 's|<param name=\"A_PARAMETER_SET_I_ALWAYS_WANT_TO_REMOVE.*?/>||g' |
         # Turn the "SQL was too long" ellipses into "FROM SQL_too_long" so we
         # have a marker of what we might have missed.
         sed 's|\.\.\."/><param| FROM SQL_too_long"/><param|g' |
         # Convert all casings of "SeLeCt ... fRoM" to a bare "SELECT FROM".
         perl -pe 's|[Ss][Ee][Ll][Ee][Cc][Tt].*?[Ff][Rr][Oo][Mm]|SELECT FROM|g' |
         # Extract just the table name from INSERT INTO / UPDATE / DELETE FROM
         # statements and throw the rest away.
         perl -pe 's/"INSERT INTO ([^ (]*) .*?\/>/"INSERT INTO \1 "\/>/g' |
         perl -pe 's/"UPDATE ([^ (]*) .*?\/>/"UPDATE \1 "\/>/g' |
         perl -pe 's/"DELETE FROM ([^ (]*) .*?\/>/"DELETE FROM \1 "\/>/g' |
         # The following lines convert <params> <param name="name" value="value"/>
         # to name=value to save space and improve readability.
         sed 's/<param name="//g' |
         sed 's/" value="/=/g' |
         sed 's/"\/>/ /g' |
         sed 's/<params>//g' |
         sed 's/<\/params>//g' |
         # The following find a number of date/time formats and change them to
         # generic "YYMMDD"-style placeholders in the same shape as the data found.
         sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2}).([0-9]{6})/YYYY-MM-DD HH:MM:SS.mmmmmm/g' |
         sed -E 's/20([0-9]{6})_([0-9]{8})-([0-9]{8})/YYYYMMDD_HHMMSS-HHMMSS/g' |
         sed -E 's/20([0-9]{6})_([0-9]{6})/YYYYMMDD_HHMMSS/g' |
         sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2})[T ]([0-9]{2}):([0-9]{2}):([0-9]{2})[:.]([0-9]{2,6})-([0-9]{2}):([0-9]{2})/YYYY-MM-DDTHH:MM:SSS+TZ/g' |
         sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2})[T ]([0-9]{2}):([0-9]{2}):([0-9]{2})-([0-9]{2}):([0-9]{2})/YYYY-MM-DDTHH:MM+TZ/g' |
         sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})/YYYY-MM-DD HH:MM:SS/g' |
         sed -E 's/20([0-9]{2})_([0-9]{2})_([0-9]{2})/YYYY_MM_DD/g' |
         sed -E 's/20([0-9]{6})_([0-9]{4})/YYMMDD_HHMM/g' |
         sed -E 's/([=. _])20([0-9]{12})([. _])/(YYMMDDHHMMSS)/g' |
         sed -E 's/20([0-9]{10})/YYMMDDHHMM/g' |
         sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2})/YYYY-MM-DD/g' |
         sed -E 's|([0-9]{2})/([0-9]{2})/20([0-9]{2})|MM/DD/YYYY|g'
    Then I take the results of all of this mess and append them to a permanent file for the specific project/job/instance name and sort -u the permanent file to get rid of duplicates.
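    The append-then-dedup bookkeeping described here can be sketched as follows (the file name and parameter values are placeholders, not the author's actual files):

```python
import pathlib
import tempfile

# Placeholder path standing in for the per-project/job permanent record.
perm = pathlib.Path(tempfile.mkdtemp()) / "PROJECT_JOB_params.txt"
perm.write_text("TABLE=SALES\nTABLE=ORDERS\n")

# Freshly cleaned parameters from today's DSODB pull (invented sample).
todays = ["TABLE=SALES", "TABLE=CUSTOMERS"]

# Equivalent of appending and then `sort -u`-ing the permanent file:
merged = sorted(set(perm.read_text().split()) | set(todays))
perm.write_text("\n".join(merged) + "\n")
print(merged)  # ['TABLE=CUSTOMERS', 'TABLE=ORDERS', 'TABLE=SALES']
```

Because the record is a set, re-pulling overlapping days is harmless, which is what makes the "last two days only" optimisation safe.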
    Is it as good as Metadata Workbench / IGC / WKC?  Absolutely not. Those are good tools.
    But I have tried to get them working for 12+ years and never got consistent results in our specific environment (we pass a lot of things like table names in via parameters and things get weird).  I built this in three weeks and it is solving our problem.


    ------------------------------
    Nate Hoy
    ------------------------------



  • 9.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Wed December 20, 2023 10:50 AM

    For design-time metadata, I just use an ISX.  An ISX is basically a ZIP file (you can literally rename it to end in .gz or .zip and whatever Zip tool you prefer opens it right up) with all your objects in individual XML files.  I back up all of my objects automatically on a daily basis, then I re-extract all the job/sequence/etc. descriptions out of the ISX and throw ".txt" on the end of each filename.  I then put all of those files out on a fileshare and people can use Agent Ransack to search for impact analysis stuff.

    We could probably do a lot better, but the point is that an ISX already contains your entire project, neatly bundled up and easy to understand via anything that can handle hierarchical files.  This all happens in batch.
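    Since an ISX is zip-compatible, Python's standard zipfile module can read one directly. This sketch builds a tiny stand-in archive in memory rather than opening a real export (the .pjb entry name and contents are invented); in practice you would pass the path of the .isx produced by istool export:

```python
import io
import zipfile

# Build a stand-in "ISX": a zip archive containing one XML member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("MyJob.pjb",
               "<Job><Description>Loads daily sales</Description></Job>")

# Reading it back works exactly as it would on a real .isx file.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    text = z.read("MyJob.pjb").decode()
print(names)  # ['MyJob.pjb']
```

From here, pulling the descriptions out into searchable .txt files is just a walk over `namelist()` plus whatever XML or text extraction suits your objects.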

    I created an authorization file to protect my passwords, the syntax I use is:

     $ISHOME/Clients/istools/cli/istool.sh export -domain "$DS_SERVICES_HOSTNAME" -af "$AUTHFILE"  -archive "/datastore/backup/proj_backup/$PROJECTNAME.isx" -datastage ' -incexec "'$DS_ENGINE_HOSTNAME'/'$PROJECTNAME'/*/*.*" '

    This is in 11.7.1.4 but I have been using the same command since 8.7.

    Then you take the resulting ISX and simply extract the comments. Different object types have different extensions (.pjb = parallel job, etc.).



    ------------------------------
    Nate Hoy
    ------------------------------



  • 10.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Thu December 21, 2023 06:59 AM

    As Udo has already pointed out, basing lineage on an export (no matter whether it is DSX, XML, or ISX) will give you design-time lineage, not runtime lineage. That means that if your project uses shared containers (subflows in CP4D), RCP, or has e.g. table names parameterized in your jobs, then you are likely not to get the full data lineage.

    KR Ralf



    ------------------------------
    Ralf Martin
    Principal Consultant
    Infologistix GmbH
    Bregenz
    ------------------------------



  • 11.  RE: Searching documentation about DataStage export and DSX or ISX

    Posted Thu December 21, 2023 10:50 AM

    Good points about design-time vs. runtime. BUT you should NOT have to CHOOSE between them. With IBM Knowledge Catalog and Manta you can have BOTH!



    ------------------------------
    Mark Hickok
    ------------------------------