Recently, I have been working on a project where I need to process incredibly large volumes of data. In IBM Planning Analytics/TM1 we can leverage TurboIntegrator (TI) Processes which are incredibly fast at Extract, Transform and Load (ETL) tasks. But, no matter how fast a process is - the more data you throw at it, the longer it will take.
It doesn't take a genius to know that users want to see their results quickly and it's our job as developers to streamline these processes to make them as fast as possible and get the users analysing instead of waiting.
On this particular project, I applied my usual bag of tricks to make the processes run faster. The results were good, it processes about 20 times more data than their old application and ran in around a third of the time. But I had a gut feeling it could still be faster, so I started looking at the things I wouldn't usually consider. I went through the processes line by line and function by function. Swapping out every function in there with an alternative method and testing the speed impact. Most changes were negligible or actually made things worse. But I found a couple of surprising results that saw big improvements, things that I hadn't seen mentioned before when talking about processing time.
One of them was the Expand function.
What is The Expand Function?
The Expand Function
in TI processes is used to convert Variable names enclosed in % symbols to the contents of the variable at run time.
So if I had a process that had the variable vsCountry; I could use the below code (on my metadata or data tabs) to export the Country variable for the data record that is being processed:
TextOutput ( cFileName, Expand ( %vsCountry% ) );
The Expand function also converts its entire contents into 1 concatenated string, this becomes particularly useful when trying to do something like exporting a large file that contains both String (text) and Numeric data variables. The TextOutput
functions only accept Strings, so you can either convert all of your Numeric variables to String using a function like NumberToString
or use Expand to do it for you... so the 2 examples below would do the same thing:
TextOutput ( cFileName, vsCountry, vsStore, vsProduct, NumberToString ( vnVolume), NumberToString ( vnRevenue), NumberToString ( vnCost ), NumberToString ( vnProfit ) );
TextOutput ( cFileName, Expand ( %vsCountry%, %vsStore%, %vsProduct%, %vnVolume%, %vnRevenue%, %vnCost%, %vnProfit% ) );
And likewise, the example in the Expand reference guide entry from IBM is showing how to use Expand on an ODBCOutput to insert a row of data into a database table. But this also has an alternative as shown below where you concatenate the string up manually:
ODBCOutPut( 'TransData', Expand( 'INSERT INTO SALES( MONTH, PRODUCT, SALES )
VALUES ( "%V0%", "%V1%",%V2% )' ) );
ODBCOutPut( 'TransData', 'INSERT INTO SALES( MONTH, PRODUCT, SALES )
VALUES ( "' | V0 | '", "' | V1 | '", | NumberToString ( V2 ) | ' )' );
You can start to see how the Expand function can often make the code easier to read and hence why it is used extensively in TI Processes.
When doing my analysis, I went through a TI process that exported circa 1 billion rows and approximately 30 columns (with a mixture of String and Numeric data.)
Initially, I had used the Expand function to create my export string, so I tried a simple swap and just using the variable names directly with NumberToString where required.
The altered process exported in approximately 50% of the time it had been taking with Expand.
After seeing the saving in my real-world example, I tested the theory by running some extremely simple TI code:
cOutputFile = 'C:\test.txt';
sString = 'String';
nNumeric = 1;
iCount = 1;
iMax = 10000000;
While ( iCount <= iMax );
#TextOutput ( cOutputFile, Expand ( '%sString%, %nNumeric%' ) );
#TextOutput ( cOutputFile, sString, NumberToString ( nNumeric ) );
#TextOutput ( cOutputFile, Expand ( '%sString%' ) );
#TextOutput ( cOutputFile, sString );
#TextOutput ( cOutputFile, Expand ( '%nNumeric%' ) );
#TextOutput ( cOutputFile, NumberToString ( nNumeric ) );
#TextOutput ( cOutputFile, Expand ( '%nNumeric%, %nNumeric%, %nNumeric%' ) );
#TextOutput ( cOutputFile, NumberToString ( nNumeric ), NumberToString ( nNumeric ), NumberToString ( nNumeric ) );
iCount = iCount + 1;
I unhashed each TextOutput one at a time and executed the process 10 times for that snippet, then compiled an average of the results.
||AVG Time with Expand (s)
||AVG Time without Expand (s)
|1 String, 1 Numeric
When exporting 1 string and 1 numeric column; the average time taken for the process was 44.64 seconds when using the Expand function compared with only 13.23 seconds without it. The difference of 31.41 seconds equating to 70% of the runtime.
The test when only exporting a string column was a bit quicker without the Expand function, but with a much less significant time saving, this indicates that the big difference here is derived from the way that the Expand function converts a numeric value into text being less efficient than the simple NumberToString function.
The test is a very simplified example of what you are likely to be doing in the real world but the results highlight a promising potential time saving for many systems and processes.#PlanningAnalyticswithWatson
The Expand function is extremely useful, I will probably continue using it for smaller to medium data quantities where the difference would be negligible. There are also other use cases for Expand that I haven't discussed here, such as utilising it to load data from a dynamic data source (see here.) So it certainly has its place.
But, if you have long-running TI processes that use Expand on large volumes of data - I would certainly suggest doing some trial and error testing to see whether getting rid can win your users back some time.