Posted by: Serge Rielau
Motivation
As part of my job I work in a group called the SQL Language council (SLC).
The purpose of that group is, roughly three fold:
- Ensure any SQL added to DB2 (and to some extend IDS, Netezza and Derby) remains compatible between the products.
- Bring SQL extensions forward for standardization by ANSI and ISO
- Ensure that SQL Extensions are added with a vision that extends beyond the immediate product or business requirement.
Needless to say meetings of the SLC can be quite exciting as different code-bases clash.
I like those.
On the flip side meeting scan be excruciatingly boring and be consumed by discussions of minute details.
I have a tendency not to be found in those.
But what is always the most intriguing about the SLC is the conscious attempt to not paint ourselves into a corner when it comes to defining language.
With product cycles of 1 1/2 to 3 years maintaining a vision across multiple releases is quite a challenge.
Therefore I love it when, at last, syntax and ideas that have been in place of over a decade finally become reality as features pile on top of features and they all lock into place adding to more than the sum of their individual parts,
Generated column support is a tale worth re-telling for that reason since it has gained more and more function over the releases culminating in DB2 10..
Setting the stage
As always let's take a look an example to motivate the problem generated columns are meant to solve.
First assume an expensive function.
Perhaps this is a function deriving your credit rating based on numerous properties of your financial standing.
As a proxy for such a real life expensive function I choose the "3a+1" once more.
The function will converge for any (let's hope) positive whole number eventually to 1 and return the number of steps it took.
CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
RETURNS INTEGER SPECIFIC three_a_plus_one
CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN
DECLARE a INTEGER;
DECLARE steps INTEGER DEFAULT 0;
SET a = n;
WHILE a <> 1 DO
SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
ELSE 3 * a + 1 END,
steps = steps + 1;
END WHILE;
RETURN steps;
END
/
VALUES three_a_plus_one(27);
1
-----------
111
Already for a small number such as 27 it takes 111 steps for the function to converge on 1.
Now, because these values are so expensive to produce we want to pre-compute them and store the result in a table.
CREATE TABLE three_a (n INTEGER NOT NULL PRIMARY KEY, steps INTEGER NOT NULL);
INSERT INTO three_a
WITH rec(n, steps) AS (VALUES (1, three_a_plus_one(1))
UNION ALL
SELECT n + 1, three_a_plus_one(n + 1) FROM rec WHERE n < 10000)
SELECT * FROM rec;
COMMIT;
SELECT * FROM three_a ORDER BY n FETCH FIRST 10 ROWS ONLY;
N STEPS
----------- -----------
1 0
2 1
3 7
4 2
5 5
6 8
7 16
8 3
9 19
10 6
10 rows were retrieved.
We can query the table and for any n retrieve the number of steps.
But there are problems with such a table.
- The number of steps is really redundant information.
We are being taught not to store redundant information because it has this habit of coming out of synch.
For example if we update N or insert a new row and forget to update STEPS we will have inconsistent data in the database.
- We rely on the fact that anyone who is querying the data knows to retrieve STEPS instead of running THREE_A_PLUS_ONE(n).
Anyone application that doesn't know that, or which hasn't been changed after introducing the STEPS column, will not benefit from the pre-computation.
To deal with the first problem we can add a check constraint that ensures that the dependency is always true:
ALTER TABLE three_a ADD CONSTRAINT steps CHECK (steps = three_a_plus_one(n));
INSERT INTO three_a VALUES(300000, 5);
SQL0545N The requested operation is not allowed because a row does not satisfy the check constraint "SERGE.THREE_A.STEPS".
As long as we add the right values everything is fine:
INSERT INTO three_a VALUES(300000, three_a_plus_one(300000));
SELECT * FROM three_a WHERE n = 300000;
N STEPS
----------- -----------
300000 52
Let's add one more twist to this before taking a closer look.
Having to add a value that is really generated and which we can only do wrong is annoying.
This is a good usage for a before trigger:
CREATE TRIGGER steps_gen BEFORE INSERT OR UPDATE ON three_a
REFERENCING NEW AS n FOR EACH ROW
BEGIN
SET n.steps = three_a_plus_one(n.n);
END;
/
INSERT INTO three_a(n) VALUES (300001);
SELECT * FROM three_a WHERE n = 300001;
N STEPS
----------- -----------
300001 189
Now, independent of whether we use the trigger to insert the row or we specify the function directly, we are effectively executing this expensive function twice.
The first execution is within the trigger or the VALUES clause.
The second execution is by the check constraint which does nothing else but double check what must be true by design.
The second problem, as previously mentioned is that an existing application cannot exploit the precomputed values without being changed.
SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
1
-----------
36
EXPLAIN PLAN FOR SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
ID TYPE OBJECT_SCHEMA OBJECT_NAME PREDICATE_TEXT
------ --------------- -------------------- -------------------- --------------------------------------------------
1 RETURN
2 GRPBY
3 TBSCAN SERGE THREE_A ( "SERGE "."THREE_A_PLUS_ONE"(Q1.N) = 100)
Expression Generated Columns
Generated columns are meant to solve both problems stated above:
- The duplicate execution of an expression caused by automatically computed columns values and the corresponding check constraint
- The exploitation of the functionally dependent column without the need to rewrite any queries.
The feature was originally added in DB2 7.1 which is where the story begins.
Definition
Using an expression generated column the above table can get expressed like this:
DROP TABLE three_a;
CREATE TABLE three_a(n INTEGER NOT NULL PRIMARY KEY,
steps INTEGER NOT NULL GENERATED ALWAYS AS (three_a_plus_one(n)));
The syntax is very similar to that of an
IDENTITY column (also introduced in DB2 7,1) , except that instead of the identity properties an expression is provided.
Since the generated column acts as a check constraint there are several rules that apply to the expression to ensure the constraint is always true:
- The expression must be exclusively based on other columns in the same table.
There must not be any references to variables special registers or subqueries
- The expression may contain functions, but these functions must be deterministic, have no external action and cannot read SQL data.
Since there cannot be overrides by the user with different values an expression generated column is ALWAYS generated.
Identity columns by contrast do allow override.
Inserting rows
INSERT INTO three_a(n)
WITH rec(n) AS (VALUES (1)
UNION ALL
SELECT n + 1 FROM rec WHERE n < 100000)
SELECT * FROM rec;
There was no need to specify the function anywhere in the insert.
In fact we did not even need to use the column at all.
DB2 automatically will inject the equivalent of a before trigger to produce the required value.
In contrast to the handcrafted schema we first used there is no need to check consistency of the database since DB2 is fully in control.
INSERT INTO three_a(n) VALUES (300000);
Optimized Statement:
-------------------
INSERT INTO SERGE.THREE_A AS Q3
SELECT 300000,
"SERGE "."THREE_A_PLUS_ONE"(300000)
FROM (VALUES 1) AS Q1
DB2 has injected the SQL function as if it were user specified with no additional overhead.
In the above two INSERT statements the column has been omitted from the INSERT column list altogether.
If a column is omitted from the column list DB2 will default that column.
For a generated column defaulting implies generation of the computed value.
Therefore you can also specify the DEFAULT keyword as a placeholder for the value:
INSERT INTO three_a VALUES (300003, DEFAULT);
You may not, however provide an explicit value, even if that value is correct:
INSERT INTO three_a VALUES (300005, three_a_plus_one(300005));
SQL0798N A value cannot be specified for column "STEPS" which is defined as GENERATED ALWAYS.
Updating
Very similar to inserting DB2 will maintain the generated column value on updates.
If the a value the generated column depends on is changed the generated column will be adjusted accordingly.
SELECT steps FROM three_a WHERE n = 300003;
STEPS
-----------
158
UPDATE three_a SET n = 251111 WHERE n = 300003;
SELECT steps FROM three_a WHERE n = 251111;
STEPS
-----------
256
Exploitation
Despite not having defined an explicit check constraint DB2 not only knows of its existence.
It can also use the knowledge to rewrite the query:
SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
1
-----------
36
Optimized Statement:
-------------------
SELECT Q3.$C0
FROM (SELECT COUNT(*)
FROM (SELECT $RID$
FROM SERGE.THREE_A AS Q1
WHERE (Q1.STEPS = 100)
) AS Q2
) AS Q3
Note how the optimized SQL returned by db2exfmt shows that the expression has been replaced by the column STEPS.
This opens up an interesting possibility.
Indexing
If the generated column itself is indexed then the optimizer can choose that index after DB2 has exploited the generated column reference.
That is, de-facto, a generated column provides similar function to an index on an expression.
CREATE INDEX steps ON three_a(steps);
SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100;
1
-----------
36
Rows
RETURN
( 1)
Cost
I/O
|
1
GRPBY
( 2)
6.81703
1
|
38.8879
IXSCAN
( 3)
6.81315
1
|
10002
INDEX: SERGE
STEPS
Q1
The difference to an index on an expression, of course, is that here the expression is actually materialized in the table.
This costs a bit more space. It does however also speed up queries that do table scans, such as a hash join over the expression.
Altering generated columns
So what do you do if you have a functionally dependent column today without using a generated column syntax?
If you want to exploit the functionality you want to make that column a generated column.
Another scenario may be where you want to replace the generated column with another expression, perhaps to fix a logic error.
To achieve that you can use the regular ALTER TABLE statement, but with a twist.
Since the table currently has data you cannot simply replace the expression.
When adding a true check constraint DB2 will never change the data in the table.
In this case however this may be necessary.
DB2 not only needs to validate rows, it also needs to fix them.
Therefore whenever you add a generated column or make a non generated column a generated column, you need to place the table into check pending mode and use SET INTEGRITY to recover from that once the ALTER TABLE is complete.
Altering columns to add or remove the generated column properties (along with identity and default) was introduced in DB2 8.2.
In the following example will replace the three_a_plus_one() function with a faster inlined version:
CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
RETURNS INTEGER SPECIFIC three_a_plus_one
CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN ATOMIC
DECLARE a INTEGER;
DECLARE steps INTEGER DEFAULT 0;
SET a = n;
WHILE a <> 1 DO
SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
ELSE 3 * a + 1 END,
steps = steps + 1;
END WHILE;
RETURN steps;
END
/
SQL0478N DROP, ALTER, TRANSFER OWNERSHIP or REVOKE on object type "FUNCTION" cannot be processed because there is an object "SERGE.THREE_A", of type "TABLE", which depends on it.
The function cannot be replaced while the generated column is in place.
So we need to drop the generated column property, replace the function and then add it back.
ALTER TABLE three_a ALTER COLUMN steps DROP EXPRESSION;
CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
RETURNS INTEGER SPECIFIC three_a_plus_one
CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN ATOMIC
DECLARE a INTEGER;
DECLARE steps INTEGER DEFAULT 0;
SET a = n;
WHILE a <> 1 DO
SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
ELSE 3 * a + 1 END,
steps = steps + 1;
END WHILE;
RETURN steps;
END
/
SET INTEGRITY FOR three_a OFF;
ALTER TABLE three_a ALTER COLUMN steps SET GENERATED ALWAYS AS (three_a_plus_one(n));
SET INTEGRITY FOR three_a IMMEDIATE CHECKED FORCE GENERATED;
DB2 10: Hiding generated columns
Remember how I stated above that one of the benefits of generated columns is that they can be exploited without changing the applications?
That was true for queries which use the expression. But any queries of the form SELECT * or INSERT INTO T VALUES will break when a new column is being added.
This has often been brought forward as a reason not to use generated columns.
In DB2 9.5 ROW CHANGE TOKEN, and ROW CHANGE TIMESTAMP columns were added in support of optimistic locking strategies.
These new kinds of columns can be hidden from applications using an IMPLICITLY HIDDEN property.
In DB2 10, with temporal query support this new feature is available to all columns.
This includes generated columns!
You can hide them from select-star syntax as well as from the INSERT-column list.
This is very convenient!
To show the effect let's replay the example from the beginning.
But this time we will add the generated column after the fact.
DROP TABLE three_a;
CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER)
RETURNS INTEGER SPECIFIC three_a_plus_one
CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION
BEGIN
DECLARE a INTEGER;
DECLARE steps INTEGER DEFAULT 0;
SET a = n;
WHILE a <> 1 DO
SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2
ELSE 3 * a + 1 END,
steps = steps + 1;
END WHILE;
RETURN steps;
END
/
CREATE TABLE three_a(n INTEGER NOT NULL PRIMARY KEY);
INSERT INTO three_a(n)
WITH rec(n) AS (VALUES (1)
UNION ALL
SELECT n + 1 FROM rec WHERE n < 10000)
SELECT * FROM rec;
INSERT INTO three_a VALUES (300003);
SELECT * FROM three_a WHERE three_a_plus_one(n) > 250;
N
-----------
6171
6943
7963
9257
Now we add the generated column with the goal of speeding up the select query.
We ultimately want to use an index on the expression.
SET INTEGRITY FOR three_a OFF;
ALTER TABLE three_a
ADD COLUMN steps INTEGER NOT NULL
GENERATED ALWAYS AS (three_a_plus_one(n))
IMPLICITLY HIDDEN;
SET INTEGRITY FOR three_a IMMEDIATE CHECKED FORCE GENERATED;
CREATE INDEX steps ON three_a(steps);
INSERT INTO three_a VALUES (300005);
SELECT * FROM three_a WHERE n = 300005;
N
-----------
300005
SELECT * FROM three_a WHERE three_a_plus_one(n) > 250;
N
-----------
6171
6943
7963
9257
Rows
RETURN
( 1)
Cost
I/O
|
54.7708
FETCH
( 2)
389.755
57.5609
/---+----\
54.7708 10002
IXSCAN TABLE: SERGE
( 3) THREE_A
6.82348 Q1
1
|
10002
INDEX: SERGE
STEPS
Q1
Note that implicitly hidden columns can still be referenced explicitly:
SELECT steps FROM three_a WHERE n = 5678;
STEPS
-----------
129
#Db2