Optimizing pipeline and formula performance for large data 

Wed February 13, 2019 12:23 AM

Applies to: Apptio R.11 and Apptio TBM Studio R12.x

 

Introduction

This document provides TBMAs and other users who modify Apptio configuration with best practices for building table pipelines and formulas that perform well. These practices matter most when data tables are very large, and they can be especially important when dealing with workspace performance issues (see also Managing performance in your development environment). If you are editing in Studio and your changes touch a massive table, the tips in this document may be critical to keeping your workspace responsive.

 

Recommended best practices

Make sure Data Refresh Cycles are appropriate

When a user creates a new table and uploads data, the Data Refresh Cycle defaults to "Ad-hoc updates". This is a poor choice for very large data because that setting causes the data to "carry forward": if no data is loaded for an open time period, the system reuses the data loaded in the most recent period. The result can be that the system processes huge amounts of data in periods to which the data does not actually apply.

If the data applies to only one period, set the Data Refresh Cycle to "Monthly versions; Update every month" or another setting that appropriately limits the periods to which the data applies.

If the data applies to more than one period, partition it by date as early in the pipeline as possible so that only the data for a given period is processed. See the next section for more on where to place the Date Partition step in the pipeline.
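
To see the effect in rough numbers, here is a back-of-the-envelope sketch in Python (illustrative only; the row and period counts are made up, and this is not Apptio syntax):

# Illustrative arithmetic: carry-forward vs. date-partitioned data volumes.
rows_loaded = 1_000_000    # rows in the uploaded file (hypothetical)
open_periods = 12          # open months in the project (hypothetical)

# "Ad-hoc updates": the same rows carry forward into every open period.
carry_forward_rows = rows_loaded * open_periods    # 12,000,000 rows seen by downstream steps

# Date-partitioned or monthly-versioned: each period processes only its own slice.
partitioned_rows = rows_loaded                     # ~83,000 rows in each of the 12 periods

print(f"carry forward: {carry_forward_rows:,} rows touched by every downstream formula")
print(f"partitioned:   {partitioned_rows:,} rows touched in total")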

Do row or column limiting operations before formulas wherever possible

When planning the steps in a pipeline, limit row and column counts before you perform formula operations. This keeps the pipeline as efficient as possible.

Therefore, try to place the following types of pipeline steps before any Formulas steps. If one of these steps depends on a formula, put only the formulas it needs before it and move the remaining formulas later in the pipeline, after these step types:

  • Hide and Rename
  • Filter
  • Date Partition
  • Group

By doing this you limit the amount of work the system must do to execute the formulas. The savings can be huge on really large tables.

Here are a few examples that show how much this matters (a code sketch of the same idea follows these examples):

  • Let's say you have a table with 3 million rows. It has a Formulas step with 10 formulas, followed by a Filter step. The Filter relies on one, and only one, of the formulas. After the Filter executes, the table has only 1 million rows. By placing the other 9 formulas before the Filter, you are asking the system to perform 9 * 2,000,000 = 18,000,000 more formula evaluations than it would if you moved those 9 formulas into a separate Formulas step after the Filter.
  • Let's say you have a table with 1 million rows that is configured for ad-hoc uploads, which means the most recently uploaded data projects forward through time. The pipeline has a Formulas step before a Date Partition step, and after the Date Partition step there are ~83,000 rows in any given period. By having the Formulas step before the Date Partition step, you are asking the system to calculate the formulas for all 1 million rows in every open period. With 12 open periods, that is approximately 12 million calculations for every formula in the step, instead of roughly 1 million distributed over the 12 periods.
  • Let's say you have a table with 1 million rows but 250 columns, and a Formulas step running 10 formulas, yet you only use 30 of the columns in any modeling or reporting. Depending on the data volumes in the other 220 columns, you might be asking the system to load hundreds of megabytes, or even gigabytes, of data into memory just to start processing those formulas.
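
Here is that sketch. In the illustrative pandas example below (the column names and row counts are made up; this is not Apptio syntax), filtering before computing a derived column means the formula only runs over the rows that survive the filter:

# Illustrative pandas sketch, not Apptio syntax: limit rows before computing formulas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": np.random.choice(["AMER", "EMEA", "APAC"], size=3_000_000),
    "cost": np.random.rand(3_000_000),
})

# Inefficient: compute the derived column for all 3 million rows, then filter.
slow = df.assign(cost_with_tax=df["cost"] * 1.2)
slow = slow[slow["region"] == "AMER"]                 # roughly 1 million rows survive

# Better: filter first, so the formula only touches the surviving rows.
fast = df[df["region"] == "AMER"].copy()
fast["cost_with_tax"] = fast["cost"] * 1.2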

Use Remove Duplicates instead of Group when possible

If a large table has a column you want other tables to look up against, and a Group step is used so that the column contains only one instance of each value, consider using Remove Duplicates instead. Grouping is a memory-intensive operation; Remove Duplicates is not.
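
For comparison outside Apptio, here is an illustrative pandas sketch (made-up table and column names) of the difference between simply deduplicating a key column and running a full group-and-aggregate when all you need is one row per key:

# Illustrative pandas sketch: build a one-row-per-key lookup without grouping.
import pandas as pd

costs = pd.DataFrame({
    "app_id": ["A1", "A1", "A2", "A3", "A3"],
    "owner": ["ops", "ops", "dev", "sec", "sec"],
})

# Remove Duplicates equivalent: keep the first row seen for each app_id.
lookup = costs.drop_duplicates(subset="app_id")

# Group equivalent: aggregates every column for every key, which costs more
# memory and CPU and is unnecessary if one representative row per key suffices.
grouped = costs.groupby("app_id", as_index=False).first()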

Do not group before a model step when there is nothing between the two and no transforms

Grouping is memory intensive, and model steps group on their identifier columns by default. Therefore, if a Group step immediately precedes a model step and there are no transforms of the table, the Group step duplicates effort with no added value and can simply be removed.

Hide unnecessary columns

If you have a table with a very large number of columns but you are only using a subset of them, hide the columns you do not need. This is especially relevant for data from third-party systems or JSON data that may expand to hundreds or thousands of columns.
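
Outside Apptio the same principle looks like the illustrative pandas sketch below (the column names are made up): select the handful of columns you actually need before doing anything else, so the unused columns never travel through the rest of the pipeline.

# Illustrative pandas sketch: keep only the columns the model and reports actually use.
import numpy as np
import pandas as pd

# A wide frame standing in for a 250-column export from a third-party system.
wide = pd.DataFrame(np.random.rand(1_000, 250),
                    columns=[f"col_{i}" for i in range(250)])

needed = ["col_0", "col_1", "col_2"]   # the subset modeling and reporting actually use

# Selecting the subset up front keeps the other 247 columns out of every later step.
narrow = wide[needed].copy()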

Avoid complex nesting of if() statements

Nesting if() statements becomes inefficient as complexity increases. If you are nesting more than two if() statements, or if you are using other functions like lookup() inside the if statement (see Avoid lookups inside if statements below), consider using the TableMatch function (help-gov link). TableMatch is more efficient and, once you become familiar with it, easier to understand than nested if statements.

 

  • See the examples section of the TableMatch function documentation for more detail, but borrowing from that documentation, the following complex if statement ...

 

if(Acct No=100 AND Code IN("RSF","RSG"),"Software",if(Acct No=200 AND
Code="RRT","Network",if(Acct No=300 AND Code IN("CAC","CAD"),"Servers",if(Acct No=400 AND Code
IN("CAC","CAD"),"Support",if(Acct No IN(500,600) AND Code="SIS","Data Centers","Unknown")))))

 

... can be replaced by this TableMatch table:

 

Acct No    Code       Cost Center
100        RSF,RSG    Software
200        RRT        Network
300        CAC,CAD    Servers
400        CAC,CAD    Support
500,600    SIS        Data Centers
                      Unknown

 

It is more efficient and easier to read.
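
For intuition, here is a rough analogue of the TableMatch approach in Python (illustrative only; this is not how Apptio implements TableMatch): express the mapping as a rules table and test each row against it, instead of nesting conditionals.

# Illustrative sketch: drive the mapping from a rules table instead of nested ifs.
# Mirrors the TableMatch table above; this is not Apptio's implementation.
rules = [
    # (account numbers, codes, cost center)
    ({100}, {"RSF", "RSG"}, "Software"),
    ({200}, {"RRT"}, "Network"),
    ({300}, {"CAC", "CAD"}, "Servers"),
    ({400}, {"CAC", "CAD"}, "Support"),
    ({500, 600}, {"SIS"}, "Data Centers"),
]

def cost_center(acct_no: int, code: str) -> str:
    """Return the first matching cost center, or 'Unknown' when nothing matches."""
    for accts, codes, center in rules:
        if acct_no in accts and code in codes:
            return center
    return "Unknown"

print(cost_center(300, "CAD"))   # Servers
print(cost_center(999, "XYZ"))   # Unknown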

 

Avoid lookups inside if statements

Nesting lookup() functions within if statements can be very inefficient, especially if you are doing something like the following:

 

Column X =if(Lookup(Column A,Table A,Column A,Column B) = "Value X",Lookup(Column A,
Table A,Column A,Column B),Lookup(Column A,Table B,Column A,Column B))

 

The formula above performs a lookup once; if the result of that lookup is "Value X", it performs the same lookup again, and if the result isn't "Value X", it performs a different lookup. If the table containing this formula has 100,000 rows, it is performing a minimum of 200,000 lookup operations. The size of the tables the lookups read from also affects performance.

 

The best thing to do in this scenario is to perform the lookups in their own columns, as follows:

 

Column A =Lookup(Column A,Table A,Column A,Column B)
Column B =Lookup(Column A,Table B,Column A,Column B)
Column X =if(Column A="Value X",Column A,Column B)

 

This reduces the number of lookup operations by up to one third because you are performing the lookup to Table A only once.
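
Expressed outside Apptio, the refactor looks like the illustrative pandas sketch below (the table and column names are made up): perform each lookup once into its own column, then branch on the stored results.

# Illustrative pandas sketch: perform each lookup once, then branch on the results.
import numpy as np
import pandas as pd

facts = pd.DataFrame({"key": ["k1", "k2", "k3"]})
table_a = pd.Series({"k1": "Value X", "k2": "Other", "k3": "Value X"})
table_b = pd.Series({"k1": "B1", "k2": "B2", "k3": "B3"})

# One lookup per table, stored as ordinary columns.
facts["from_a"] = facts["key"].map(table_a)
facts["from_b"] = facts["key"].map(table_b)

# The conditional works on already-computed values, with no further lookups.
facts["result"] = np.where(facts["from_a"] == "Value X", facts["from_a"], facts["from_b"])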

 

Use search() and find() appropriately

To best manage memory and achieve optimal performance, Apptio uses dictionaries to represent strings; imagine this as giving each string an index number. The search() and find() functions can be inefficient with large data sets because the system must work with the actual string rather than its dictionary representation, which requires much more memory. The system must also examine every character of the string, and sometimes make more than one pass, when matching complex patterns. With large data sets, this becomes inefficient.

 

While there are times when search() or find() are the only option, in some cases, right() or left() can be used and are faster. For example, let's say we want to find the labor cost pools in the following list:

 

  • External Labor
  • Facilities & Power
  • Hardware
  • Internal Labor
  • Other
  • Outside Services
  • Software
  • Telecom

 

We could use =if(search("Labor", Cost Pool)>0,"Labor","") to find them. However, a more efficient way would be to use =if(right(Cost Pool,5)="Labor","Labor","") because right() doesn't have to look at every character. Instead, it can just grab the last five characters and do the comparison. For a large data set, this can make a significant difference.
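
The access pattern is the same in any language. In the illustrative Python sketch below (not Apptio syntax), the suffix test compares only the last five characters, while the substring search may scan through each string:

# Illustrative sketch: suffix comparison vs. full substring search.
cost_pools = ["External Labor", "Facilities & Power", "Hardware", "Internal Labor",
              "Other", "Outside Services", "Software", "Telecom"]

# search()-style: scan each string looking for "Labor" anywhere within it.
via_search = ["Labor" if "Labor" in cp else "" for cp in cost_pools]

# right()-style: compare only the last five characters of each string.
via_suffix = ["Labor" if cp[-5:] == "Labor" else "" for cp in cost_pools]

assert via_search == via_suffix   # same answer, cheaper access pattern for the suffix test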

 

Don't duplicate formulas unnecessarily

Sometimes, when multiple users work on tables, you can get unnecessary duplication of formulas. For example, you might discover something like this in two different formula steps for a table, or even in the same formula step:

Column B =Lookup(Column A,Other Table,Column A,Column B)
Column C =Lookup(Column A,Other Table,Column A,Column B)&&"Other Stuff"

 

In this case, the system is performing twice as many lookups to the "Other Table" as it needs to; the following is better:

Column B =Lookup(Column A,Other Table,Column A,Column B)
Column C =Column B&&"Other Stuff"
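
In an illustrative pandas sketch (made-up table and column names), the same fix is simply to derive the second column from the first instead of repeating the lookup:

# Illustrative pandas sketch: reuse the looked-up column instead of looking up twice.
import pandas as pd

other_table = pd.Series({"k1": "alpha", "k2": "beta"})
df = pd.DataFrame({"key": ["k1", "k2", "k1"]})

df["column_b"] = df["key"].map(other_table)          # one lookup per row
df["column_c"] = df["column_b"] + " Other Stuff"     # derived from the stored result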

 

Don't do string manipulations unnecessarily

Functions that perform string operations are sometimes necessary, for example using upper() or lower() before comparing values in two different columns. However, do not use these functions for cosmetic reasons. A string-manipulation function must interrogate every character in a string, which requires more memory to execute. In workspaces, this can become significant in aggregate.

For a more comprehensive list of string functions, see the "String functions" section of the Annotated list of functions by type (gov link) document.


Consider what functions do

When using formulas, consider carefully what the formula must do.

 

 

How slow is 1+1

• Well… I add two numbers together for each row.

• So it grows linearly with data volume

• O(N), where N is the number of rows.

 

How slow is UPPER(A)

• For every row, I look at every character in column A, and I convert that character to upper case

• So it grows with the number of rows multiplied by the number of characters in column A

• O(M*N), where M is the number of characters in column A and N is the number of rows

 

How slow is SumIf(A,…)

• For every row, I go through every other row and check to see if they match

• So it grows with the square of the number of rows

• O(N^2)
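
For intuition, the illustrative Python sketch below (not how Apptio implements SumIf) shows why the row-by-row reading of SumIf scales with the square of the row count, while a single grouped pass over the data does not:

# Illustrative sketch: a naive per-row SumIf is O(N^2); one grouped pass is roughly O(N).
import pandas as pd

df = pd.DataFrame({
    "cost_center": ["CC1", "CC2", "CC1", "CC3", "CC2"],
    "cost": [10.0, 5.0, 7.5, 3.0, 2.5],
})

# Naive SumIf: for every row, rescan all rows to find the ones that match -> N * N work.
naive = [df.loc[df["cost_center"] == cc, "cost"].sum() for cc in df["cost_center"]]

# Grouped pass: total each group once and broadcast it back -> a single pass over the data.
grouped = df.groupby("cost_center")["cost"].transform("sum")

assert naive == grouped.tolist()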

 






