I realize this is a very old thread and probably not very useful any more, but...
I just finished writing a home-grown solution. It queries DSODB for the parameter values passed in to all jobs, uses grep/awk/etc. to reduce the noise (removing anything formatted as a date or time, anything that resembles a password or token, and anything else I could identify that is not a table definition, MQ queue name, web site, or other lineage-related information), and then keeps a permanent record of every unique parameter passed to each specific job. To reduce the load on my database (and since I am keeping a permanent record of unique parameters anyway), I only pull the last two days of runtime parameters each day.
I combine this with extracts from an ISX-converted-to-XML backup of my projects to build a "where used" report that includes an attempt at runtime metadata.
Limitations:
- DSODB only stores 250-ish characters of any given query in the runtime XML, so if you pass in a query like "select [lots and lots of specifications] from [tablename]" then "[tablename]" will likely not end up in your extract.
- Obviously filtering out any cruft is going to be a time-consuming bit of coding, and highly specific to your own environment and codebase.
My query (note: I am only pulling parallel jobs here - you can easily remove the JOBTYPE = 'PAR' predicate to pick up sequences as well):
SQL="SELECT JOBEXEC.JOBNAME, JOBEXEC.PROJECTNAME, JOBRUN.INVOCATIONID, XMLSERIALIZE(JOBRUNPARAMS.PARAMLIST as VARCHAR(30000))
FROM ((DSODB.JOBRUN JOBRUN
INNER JOIN DSODB.JOBEXEC JOBEXEC ON (JOBRUN.JOBID = JOBEXEC.JOBID))
INNER JOIN DSODB.JOBRUNPARAMS JOBRUNPARAMS ON (JOBRUNPARAMS.RUNID = JOBRUN.RUNID))
WHERE JOBEXEC.JOBTYPE = 'PAR' AND
JOBRUN.RUNSTARTTIMESTAMP > (CURRENT_DATE - $numberofdays DAYS)"
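For completeness, here is a minimal sketch of how I drive that query from a shell script, assuming the DB2 command-line processor is on the path. The database alias, credentials, and paths below are placeholders, not my real values:
#!/bin/bash
# Sketch only - DSODB alias, user, and paths below are placeholders.
numberofdays=2                        # "the last two days" of runtime parameters
tempfile=/tmp/dsodb_params_raw.txt

SQL="... the SELECT above, with \$numberofdays expanded by the shell ..."

db2 connect to DSODB user db2inst1 using "$DB2_PASSWORD"   # hypothetical credentials
db2 -x "$SQL" > "$tempfile"           # -x suppresses column headers
db2 connect reset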
Some sample code for removing the cruft:
cat $tempfile |
grep -v "job_name_I_don't_want_to_track" |
grep -v "another_job_name_I_don't_want_to_track" |
perl -pe 's|<param name="A_PARAMETER_SET_I_ALWAYS_WANT_TO_REMOVE.*?/>||g' |
# Turn the "SQL was too long" ellipses into "FROM SQL_too_long" so we have a marker of what we might have missed.
sed 's|\.\.\."/><param| FROM SQL_too_long"/><param|g' |
# Convert every casing of "SeLeCt ... fRoM" to a bare "SELECT FROM".
perl -pe 's|select.*?from|SELECT FROM|gi' |
# Extract just the table name from "INSERT INTO" and throw the rest away.
perl -pe 's/"INSERT INTO ([^ (]*) .*?\/>/"INSERT INTO \1 "\/>/g' |
# Same as above, but for UPDATE...
perl -pe 's/"UPDATE ([^ (]*) .*?\/>/"UPDATE \1 "\/>/g' |
# ...and for DELETE FROM.
perl -pe 's/"DELETE FROM ([^ (]*) .*?\/>/"DELETE FROM \1 "\/>/g' |
The following lines will convert <params> <param name="name" value="value"/> to name=value to save space and improve readability:
sed 's/<param name="//g' |
sed 's/" value="/=/g' |
sed 's/"\/>/ /g' |
sed 's/<params>//g' |
sed 's/<\/params>//g' |
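For example, a parameter list that comes out of DSODB as the first line below is flattened to the second (values are made up; the date then gets masked by the rules in the next chunk):
<params><param name="pTableName" value="MYSCHEMA.MYTABLE"/><param name="pRunDate" value="2023-12-21"/></params>
pTableName=MYSCHEMA.MYTABLE pRunDate=2023-12-21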
The following will find a bunch of date/time formats and change them to generic "YYYYMMDD"-style placeholders that keep the shape of the data found:
sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})\.([0-9]{6})/YYYY-MM-DD HH:MM:SS.mmmmmm/g' |
sed -E 's/20([0-9]{6})_([0-9]{8})-([0-9]{8})/YYYYMMDD_HHMMSS-HHMMSS/g' |
sed -E 's/20([0-9]{6})_([0-9]{6})/YYYYMMDD_HHMMSS/g' |
sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2})[T ]([0-9]{2}):([0-9]{2}):([0-9]{2})[:.]([0-9]{2,6})-([0-9]{2}):([0-9]{2})/YYYY-MM-DDTHH:MM:SS.ffffff-TZ/g' |
sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2})[T ]([0-9]{2}):([0-9]{2}):([0-9]{2})-([0-9]{2}):([0-9]{2})/YYYY-MM-DDTHH:MM:SS-TZ/g' |
sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})/YYYY-MM-DD HH:MM:SS/g' |
sed -E 's/20([0-9]{2})_([0-9]{2})_([0-9]{2})/YYYY_MM_DD/g' |
sed -E 's/20([0-9]{6})_([0-9]{4})/YYYYMMDD_HHMM/g' |
sed -E 's/([=. _])20([0-9]{12})([. _])/\1YYYYMMDDHHMMSS\3/g' |
sed -E 's/20([0-9]{10})/YYYYMMDDHHMM/g' |
sed -E 's/20([0-9]{2})-([0-9]{2})-([0-9]{2})/YYYY-MM-DD/g' |
sed -E 's|([0-9]{2})/([0-9]{2})/20([0-9]{2})|MM/DD/YYYY|g' |
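Again, a made-up before/after to show the masking in action:
Before: pRunDate=2023-12-21 09:39:00 pFileSuffix=20231221_093900
After:  pRunDate=YYYY-MM-DD HH:MM:SS pFileSuffix=YYYYMMDD_HHMMSS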
Then I take the results of all of this mess, append them to a permanent file for the specific project/job/instance name, and sort -u the permanent file to get rid of duplicates.
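The tail end of the script looks something like this sketch (variable names and the file-naming convention are illustrative, not my real ones):
# $project, $jobname, and $invocationid come from the extract; paths are made up.
permfile="/lineage/params/${project}.${jobname}.${invocationid}.txt"
cat "$cleaned" >> "$permfile"           # $cleaned holds the filtered pipeline output
sort -u -o "$permfile" "$permfile"      # dedupe in place, keeping each unique line once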
Is it as good as Metadata Workbench / IGC / WKC? Absolutely not. Those are good tools.
But I have tried to get them working for 12+ years and never got consistent results in our specific environment (we pass a lot of things like table names in via parameters and things get weird). I built this in three weeks and it is solving our problem.
------------------------------
Nate Hoy
------------------------------
Original Message:
Sent: Thu December 21, 2023 09:39 AM
From: Jérôme Mainaud
Subject: Searching documentation about DataStage export and DSX or ISX
Thank you for the pointer. I will have to choose between design-time and runtime lineage.
------------------------------
Jérôme Mainaud
Original Message:
Sent: Tue December 19, 2023 09:30 AM
From: Udo Neumann
Subject: Searching documentation about DataStage export and DSX or ISX
You could analyse the lineage in Information Server if you have IGC licensed. Also, you have to differentiate between design-time and runtime lineage. The maintenance task could also be automated via scripts.
------------------------------
Udo Neumann
Original Message:
Sent: Tue December 19, 2023 09:04 AM
From: Mark Hickok
Subject: Searching documentation about DataStage export and DSX or ISX
You might also look at doing the export in XML format (as opposed to DSX) - as then it could be read by Manta (the lineage company IBM purchased a couple of months ago) and have the lineage created.
But a better solution would be to look at modernizing the platform to Cloud Pak for Data and DataStage NextGen. That way the DataStage job would reside in a project and you could schedule the lineage job via the on-board scheduler or even run things via a 3rd-party scheduler. And all of that could run inside of the environment (including the lineage!)
------------------------------
Mark Hickok
Original Message:
Sent: Mon December 18, 2023 06:55 AM
From: Jérôme Mainaud
Subject: Searching documentation about DataStage export and DSX or ISX
Hi John,
Thank you for your idea. I will look at these reports.
My goal is to automatically import the lineage in a data catalog every night. Therefore I need to get a parseable result without requiring human action.
The report can be exported in CSV format, so it is somehow parseable.
However I see two limitations for now:
- I see no automated ways to trigger this.
- "The default maximum number of nodes that are displayed in a lineage report is 500. If more than this number of nodes is present in your lineage report, the report is truncated."
If I can find a solution for both of them, it could be cheaper.
I see this is part of the InfoSphere Information Governance Catalog, which offers a REST API. The API seems to be focused on assets, but maybe there is information about lineage.
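If I explore that route, a first probe might look like the sketch below - the host, port, credentials, and endpoint paths are assumptions I still need to verify against the IGC REST documentation for our version:
# Assumed IGC REST base URL - adjust host/port for your services tier.
IGC="https://services-host:9445/ibm/iis/igc-rest/v1"
# List the asset types the catalog exposes (to spot lineage-related ones).
curl -k -u isadmin:password "$IGC/types"
# Search for a DataStage job by name.
curl -k -u isadmin:password "$IGC/search?types=dsjob&text=MyJobName"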
------------------------------
Jérôme Mainaud
Original Message:
Sent: Mon December 18, 2023 04:33 AM
From: John McKeever
Subject: Searching documentation about DataStage export and DSX or ISX
Hi Jérôme,
Is it not possible for your client to run the data lineage reports themselves directly on their platform?
It'd certainly be quicker, easier, cheaper and more accurate than asking you to import their entire DataStage landscape into your environment and running the reports there,
John
------------------------------
John McKeever
Original Message:
Sent: Thu December 14, 2023 01:56 PM
From: Jérôme Mainaud
Subject: Searching documentation about DataStage export and DSX or ISX
Hello,
A client asked me to extract dataset lineage information from DataStage.
They sent me an export of their process as a DSX file and showed me how to export it from their desktop client application.
They use an on-premise DataStage server.
Now I'm looking for a solution to automate this export, if possible by calling an API, otherwise by calling a non-interactive command-line client.
I'm also looking for documentation or a specification of the DSX file format.
During my first search, I heard about an ISX export format that is said to be more recent, so I'm also looking for information about its format or any API to produce it. I've already found an istool export command.
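From what I could find so far, the invocation would look something like this - the domain host, credentials, archive path, and asset path are placeholders I have not validated, and the exact options should be checked against the istool documentation for the Information Server version in use:
istool export -domain services-host:9443 -username isadmin -password secret \
  -archive /tmp/project_export.isx \
  -datastage ' "engine-host/MYPROJECT/*/*.*" '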
However, my client seemed reluctant to use this format for some reason.
Any link to relevant documentation would be appreciated.
Thank you,
------------------------------
Jérôme Mainaud
------------------------------