Db2

Connect with Db2, Informix, Netezza, open source, and other data experts to gain value from your data, share insights, and solve problems.

View Only

Back to Library

DB2 10: Generated columns - More useful than ever.

Mon March 09, 2020 05:48 PM

Michelle Betbadal

Posted by: Serge Rielau

Motivation

As part of my job I work in a group called the SQL Language council (SLC).
The purpose of that group is, roughly three fold:

Ensure any SQL added to DB2 (and to some extend IDS, Netezza and Derby) remains compatible between the products.
Bring SQL extensions forward for standardization by ANSI and ISO
Ensure that SQL Extensions are added with a vision that extends beyond the immediate product or business requirement.

Needless to say meetings of the SLC can be quite exciting as different code-bases clash.
I like those.

On the flip side meeting scan be excruciatingly boring and be consumed by discussions of minute details.
I have a tendency not to be found in those.

But what is always the most intriguing about the SLC is the conscious attempt to not paint ourselves into a corner when it comes to defining language.

With product cycles of 1 1/2 to 3 years maintaining a vision across multiple releases is quite a challenge.
Therefore I love it when, at last, syntax and ideas that have been in place of over a decade finally become reality as features pile on top of features and they all lock into place adding to more than the sum of their individual parts,

Generated column support is a tale worth re-telling for that reason since it has gained more and more function over the releases culminating in DB2 10..

Setting the stage

As always let's take a look an example to motivate the problem generated columns are meant to solve.
First assume an expensive function.
Perhaps this is a function deriving your credit rating based on numerous properties of your financial standing.
As a proxy for such a real life expensive function I choose the "3a+1" once more.
The function will converge for any (let's hope) positive whole number eventually to 1 and return the number of steps it took.

CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
  RETURNS INTEGER SPECIFIC three_a_plus_one
  CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN
  DECLARE a       INTEGER;
  DECLARE steps   INTEGER DEFAULT 0;
  SET a = n;
  WHILE a <> 1 DO
    SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
                 ELSE 3 * a + 1 END,
        steps = steps + 1;
  END WHILE;
  RETURN steps;
END
/
VALUES three_a_plus_one(27);
          1
-----------
        111

Already for a small number such as 27 it takes 111 steps for the function to converge on 1.
Now, because these values are so expensive to produce we want to pre-compute them and store the result in a table.

CREATE TABLE three_a (n INTEGER NOT NULL PRIMARY KEY, steps INTEGER NOT NULL);

INSERT INTO three_a 
  WITH rec(n, steps) AS (VALUES (1, three_a_plus_one(1))
                         UNION ALL
                         SELECT n + 1, three_a_plus_one(n + 1) FROM rec WHERE n < 10000)
  SELECT * FROM rec;
  
COMMIT;
  
SELECT * FROM three_a ORDER BY n FETCH FIRST 10 ROWS ONLY;
          N       STEPS
----------- -----------
          1           0
          2           1
          3           7
          4           2
          5           5
          6           8
          7          16
          8           3
          9          19
         10           6

10 rows were retrieved.

We can query the table and for any n retrieve the number of steps.

But there are problems with such a table.

The number of steps is really redundant information.
We are being taught not to store redundant information because it has this habit of coming out of synch.
For example if we update N or insert a new row and forget to update STEPS we will have inconsistent data in the database.
We rely on the fact that anyone who is querying the data knows to retrieve STEPS instead of running THREE_A_PLUS_ONE(n).
Anyone application that doesn't know that, or which hasn't been changed after introducing the STEPS column, will not benefit from the pre-computation.

To deal with the first problem we can add a check constraint that ensures that the dependency is always true:

ALTER TABLE three_a ADD CONSTRAINT steps CHECK (steps = three_a_plus_one(n));

INSERT INTO three_a VALUES(300000, 5);
SQL0545N  The requested operation is not allowed because a row does not satisfy the check constraint "SERGE.THREE_A.STEPS".

As long as we add the right values everything is fine:

INSERT INTO three_a VALUES(300000, three_a_plus_one(300000));
SELECT * FROM three_a WHERE n = 300000;
          N       STEPS
----------- -----------
     300000          52

Let's add one more twist to this before taking a closer look.
Having to add a value that is really generated and which we can only do wrong is annoying.
This is a good usage for a before trigger:

CREATE TRIGGER steps_gen BEFORE INSERT OR UPDATE ON three_a
  REFERENCING NEW AS n FOR EACH ROW
  BEGIN
    SET n.steps = three_a_plus_one(n.n);
  END;
/

INSERT INTO three_a(n) VALUES (300001);
SELECT * FROM three_a WHERE n = 300001;
          N       STEPS
----------- -----------
     300001         189

Now, independent of whether we use the trigger to insert the row or we specify the function directly, we are effectively executing this expensive function twice.
The first execution is within the trigger or the VALUES clause.
The second execution is by the check constraint which does nothing else but double check what must be true by design.

The second problem, as previously mentioned is that an existing application cannot exploit the precomputed values without being changed.

SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
          1
-----------
         36

EXPLAIN PLAN FOR SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
    ID TYPE            OBJECT_SCHEMA        OBJECT_NAME          PREDICATE_TEXT                                    
------ --------------- -------------------- -------------------- --------------------------------------------------
     1 RETURN                                                                                                      
     2 GRPBY                                                                                                       
     3 TBSCAN          SERGE                THREE_A              ( "SERGE   "."THREE_A_PLUS_ONE"(Q1.N) = 100)

Expression Generated Columns

Generated columns are meant to solve both problems stated above:

The duplicate execution of an expression caused by automatically computed columns values and the corresponding check constraint
The exploitation of the functionally dependent column without the need to rewrite any queries.

The feature was originally added in DB2 7.1 which is where the story begins.

Definition

Using an expression generated column the above table can get expressed like this:

DROP TABLE three_a;
CREATE TABLE three_a(n     INTEGER NOT NULL PRIMARY KEY,
                     steps INTEGER NOT NULL GENERATED ALWAYS AS (three_a_plus_one(n)));

The syntax is very similar to that of an IDENTITY column (also introduced in DB2 7,1) , except that instead of the identity properties an expression is provided.
Since the generated column acts as a check constraint there are several rules that apply to the expression to ensure the constraint is always true:

The expression must be exclusively based on other columns in the same table.
There must not be any references to variables special registers or subqueries
The expression may contain functions, but these functions must be deterministic, have no external action and cannot read SQL data.

Since there cannot be overrides by the user with different values an expression generated column is ALWAYS generated.
Identity columns by contrast do allow override.

Inserting rows

INSERT INTO three_a(n)
  WITH rec(n) AS (VALUES (1)
                  UNION ALL
                  SELECT n + 1 FROM rec WHERE n < 100000)
  SELECT * FROM rec;

There was no need to specify the function anywhere in the insert.
In fact we did not even need to use the column at all.
DB2 automatically will inject the equivalent of a before trigger to produce the required value.
In contrast to the handcrafted schema we first used there is no need to check consistency of the database since DB2 is fully in control.

INSERT INTO three_a(n) VALUES (300000);
Optimized Statement:
-------------------
INSERT INTO SERGE.THREE_A AS Q3 
  SELECT 300000,
         "SERGE "."THREE_A_PLUS_ONE"(300000) 
    FROM  (VALUES 1) AS Q1

DB2 has injected the SQL function as if it were user specified with no additional overhead.

In the above two INSERT statements the column has been omitted from the INSERT column list altogether.
If a column is omitted from the column list DB2 will default that column.
For a generated column defaulting implies generation of the computed value.
Therefore you can also specify the DEFAULT keyword as a placeholder for the value:

INSERT INTO three_a VALUES (300003, DEFAULT);

You may not, however provide an explicit value, even if that value is correct:

INSERT INTO three_a VALUES (300005, three_a_plus_one(300005));
SQL0798N  A value cannot be specified for column "STEPS" which is defined as GENERATED ALWAYS.

Updating

Very similar to inserting DB2 will maintain the generated column value on updates.
If the a value the generated column depends on is changed the generated column will be adjusted accordingly.

SELECT steps FROM three_a WHERE n = 300003;
      STEPS
-----------
        158

UPDATE three_a SET n = 251111 WHERE n = 300003;
SELECT steps FROM three_a WHERE n = 251111;
      STEPS
-----------
        256

Exploitation

Despite not having defined an explicit check constraint DB2 not only knows of its existence.
It can also use the knowledge to rewrite the query:

SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
          1
-----------
         36

Optimized Statement:
-------------------
SELECT Q3.$C0 
  FROM (SELECT COUNT(*) 
          FROM (SELECT $RID$ 
                  FROM SERGE.THREE_A AS Q1 
                  WHERE (Q1.STEPS = 100)
               ) AS Q2
       ) AS Q3

Note how the optimized SQL returned by db2exfmt shows that the expression has been replaced by the column STEPS.

This opens up an interesting possibility.

Indexing

If the generated column itself is indexed then the optimizer can choose that index after DB2 has exploited the generated column reference.
That is, de-facto, a generated column provides similar function to an index on an expression.

CREATE INDEX steps ON three_a(steps);
SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
          1
-----------
         36

      Rows 
     RETURN
     (   1)
      Cost 
       I/O 
       |
        1 
     GRPBY 
     (   2)
     6.81703 
        1 
       |
     38.8879 
     IXSCAN
     (   3)
     6.81315 
        1 
       |
      10002 
 INDEX: SERGE   
      STEPS
       Q1

The difference to an index on an expression, of course, is that here the expression is actually materialized in the table.
This costs a bit more space. It does however also speed up queries that do table scans, such as a hash join over the expression.

Altering generated columns

So what do you do if you have a functionally dependent column today without using a generated column syntax?
If you want to exploit the functionality you want to make that column a generated column.
Another scenario may be where you want to replace the generated column with another expression, perhaps to fix a logic error.
To achieve that you can use the regular ALTER TABLE statement, but with a twist.
Since the table currently has data you cannot simply replace the expression.
When adding a true check constraint DB2 will never change the data in the table.
In this case however this may be necessary.
DB2 not only needs to validate rows, it also needs to fix them.
Therefore whenever you add a generated column or make a non generated column a generated column, you need to place the table into check pending mode and use SET INTEGRITY to recover from that once the ALTER TABLE is complete.

Altering columns to add or remove the generated column properties (along with identity and default) was introduced in DB2 8.2.

In the following example will replace the three_a_plus_one() function with a faster inlined version:

CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
  RETURNS INTEGER SPECIFIC three_a_plus_one
  CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN ATOMIC
  DECLARE a       INTEGER;
  DECLARE steps   INTEGER DEFAULT 0;
  SET a = n;
  WHILE a <> 1 DO
    SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
                 ELSE 3 * a + 1 END,
        steps = steps + 1;
  END WHILE;
  RETURN steps;
END
/
SQL0478N  DROP, ALTER, TRANSFER OWNERSHIP or REVOKE on object type "FUNCTION" cannot be processed because there is an object "SERGE.THREE_A", of type "TABLE", which depends on it.

The function cannot be replaced while the generated column is in place.
So we need to drop the generated column property, replace the function and then add it back.

ALTER TABLE three_a ALTER COLUMN steps DROP EXPRESSION;

CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
  RETURNS INTEGER SPECIFIC three_a_plus_one
  CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN ATOMIC
  DECLARE a       INTEGER;
  DECLARE steps   INTEGER DEFAULT 0;
  SET a = n;
  WHILE a <> 1 DO
    SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
                 ELSE 3 * a + 1 END,
        steps = steps + 1;
  END WHILE;
  RETURN steps;
END
/

SET INTEGRITY FOR three_a OFF;

ALTER TABLE three_a ALTER COLUMN steps SET GENERATED ALWAYS AS (three_a_plus_one(n));

SET INTEGRITY FOR three_a IMMEDIATE CHECKED FORCE GENERATED;

DB2 10: Hiding generated columns

Remember how I stated above that one of the benefits of generated columns is that they can be exploited without changing the applications?
That was true for queries which use the expression. But any queries of the form SELECT * or INSERT INTO T VALUES will break when a new column is being added.
This has often been brought forward as a reason not to use generated columns.

In DB2 9.5 ROW CHANGE TOKEN, and ROW CHANGE TIMESTAMP columns were added in support of optimistic locking strategies.
These new kinds of columns can be hidden from applications using an IMPLICITLY HIDDEN property.

In DB2 10, with temporal query support this new feature is available to all columns.
This includes generated columns!

You can hide them from select-star syntax as well as from the INSERT-column list.
This is very convenient!

To show the effect let's replay the example from the beginning.
But this time we will add the generated column after the fact.

DROP TABLE three_a;

CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
  RETURNS INTEGER SPECIFIC three_a_plus_one
  CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN
  DECLARE a       INTEGER;
  DECLARE steps   INTEGER DEFAULT 0;
  SET a = n;
  WHILE a <> 1 DO
    SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
                 ELSE 3 * a + 1 END,
        steps = steps + 1;
  END WHILE;
  RETURN steps;
END
/

CREATE TABLE three_a(n INTEGER NOT NULL PRIMARY KEY);

INSERT INTO three_a(n)
  WITH rec(n) AS (VALUES (1)
                  UNION ALL
                  SELECT n + 1 FROM rec WHERE n < 10000)
  SELECT * FROM rec;

INSERT INTO three_a VALUES (300003);

SELECT * FROM three_a WHERE three_a_plus_one(n) > 250;
          N
-----------
       6171
       6943
       7963
       9257

Now we add the generated column with the goal of speeding up the select query.
We ultimately want to use an index on the expression.

SET INTEGRITY FOR three_a OFF; 
ALTER TABLE three_a 
  ADD COLUMN steps INTEGER NOT NULL 
                   GENERATED ALWAYS AS (three_a_plus_one(n)) 
                   IMPLICITLY HIDDEN;

SET INTEGRITY FOR three_a IMMEDIATE CHECKED FORCE GENERATED;

CREATE INDEX steps ON three_a(steps);

INSERT INTO three_a VALUES (300005);
SELECT * FROM three_a WHERE n = 300005;
          N
-----------
     300005

SELECT * FROM three_a WHERE three_a_plus_one(n) > 250;
          N
-----------
       6171
       6943
       7963
       9257

            Rows 
           RETURN
           (   1)
            Cost 
             I/O 
             |
           54.7708 
           FETCH 
           (   2)
           389.755 
           57.5609 
         /---+----\
     54.7708       10002 
     IXSCAN   TABLE: SERGE   
     (   3)       THREE_A
     6.82348        Q1
        1 
       |
      10002 
 INDEX: SERGE   
      STEPS
       Q1

Note that implicitly hidden columns can still be referenced explicitly:

SELECT steps FROM three_a WHERE n = 5678;
      STEPS
-----------
        129

#Db2

Statistics

0 Favorited

2 Views

0 Files

0 Shares

0 Downloads

Db2

Db2

DB2 10: Generated columns - More useful than ever.

Posted by: Serge Rielau

Motivation

Setting the stage

Expression Generated Columns

Definition

Inserting rows

Updating

Exploitation

Indexing

Altering generated columns

DB2 10: Hiding generated columns

Additional
Resources

Office

Quick Links

Db2

Db2

DB2 10: Generated columns - More useful than ever.

Posted by: Serge RielauMotivation

Setting the stage

Expression Generated Columns

Definition

Inserting rows

Updating

Exploitation

Indexing

Altering generated columns

DB2 10: Hiding generated columns

Additional Resources

Office

Quick Links

Posted by: Serge Rielau

Motivation

Additional
Resources