Motivation
Fact: "John's birthday is Oct 12, 1965"
- If you know John and you know his birthday to be Oct 12, 1965, then never mind.
- If you know John and you thought his birthday was a different one, then there is a conflict.
- If you know John and you didn't know his birthday, then you just learned something new about John.
- If you didn't know John then you now know of his existence and his birthday.
How does this relate to databases, DB2 and SQL?
When ingesting new data into a table it is very common that you encounter duplicates and conflicting data.
Over the years different vendors have come up with various schemes to handle this.
in this post I'll discuss some the the legacy statements and the MERGE statement which is meant to solve the problem in ANSI SQL.
UPDATE FROM
Imagine a product inventory. There are product ids, names and inventory levels.
Once a day sales are reported and inventory levels need to be adjusted for all products.
CREATE TABLE product(id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(20),
inventory INTEGER);
INSERT INTO product VALUES
(1, 'Car' , 10),
(2, 'TV' , 22),
(3, 'House' , 4),
(4, 'Dog' , 13),
(5, 'Diapers', 34);
A variety store, presumably....
Here are our sales for the day:
CREATE TABLE sales(id INTEGER NOT NULL PRIMARY KEY,
sold INTEGER);
INSERT INTO sales VALUES
(2, 4),
(3, 1),
(5, 12);
MS SQL Server and IBM IDS both support a feature called UPDATE FROM which allows to update the product table based on sales:
UPDATE product
SET inventory = product.inventory - sales.sold
FROM product JOIN sales
ON product.id = sales.id;
SQL0104N An unexpected token "UPDATE PRODUCT SET INVENTORY = PRODUCT.INVE"
was found following "BEGIN-OF-STATEMENT".
Expected tokens may include: "<space>".
The SQL standard and DB2 do not support this syntax. In ANSI SQL such a statement gets written like this:
UPDATE product
SET inventory = (SELECT inventory - sold
FROM sales WHERE sales.id = product.id)
WHERE EXISTS(SELECT 1 FROM sales
WHERE sales.id = product.id);
SELECT * FROM product ORDER BY id;
ID NAME INVENTORY
----------- -------------------- -----------
1 Car 10
2 TV 18
3 House 3
4 Dog 13
5 Diapers 22
I can't count how often I have explained that the ANSI SQL statement on DB2 is just as efficient as the proprietary UPDATE FROM syntax.
Despite the fact that the update selects twice from sales, DB2 will actually rewrite the query to exactly what UPDATE FROM does.
One can, however hardly argue with the fact that the standard SQL version of this update is hard to understand and a lot more to type.
So why did DB2 not implement UPDATE FROM and bring the syntax forward in the standard?
The reason is that UPDATE FROM only deals with a fraction of the overall problem.
UPDATE FROM assumes that only updates are being done.
But what if we do not only sell inventory, but we also buy things?
Or what if, once the last item is sold, the product should be deleted?
in that case we need a statement that combines UPDATE, DELETE and INSERT into one.
The MERGE statement is meant to do that.
MERGE
The MERGE statement was introduced in DB2 V8.1 FP1.
It was co-proposed by Oracle and IBM for ANSI SQL and has since been extended beyond the standard and implemented by other database vendors.
Rather than detailing the syntax here, let's build up examples.
UPDATE FROM using MERGE
At the core of the MERGE statement are four constructs:
- Merge target
The target of the merge statement is the table that you intend to update.
This does not necessarily have to be a base table.
It can be a view or even an updatable query.
- Merge source
The source is a any table, view, values-clause or a query.
Basically anything that you can put into the FROM clause of a query can be a merge source.
- Merge join condition
The join condition specifies how the rows in the source and the target are to be matched. Typically the join condition is an equal condition on the primary keys.
- Merge operation
This construct tells MERGE what to do if a source row does or does not match a a target row.
Let's recode the UPDATE FROM example from above using MERGE
MERGE INTO product AS T
USING sales AS S
ON S.id = T.id
WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold;
SELECT * FROM product ORDER BY id;
ID NAME INVENTORY
----------- -------------------- -----------
1 Car 10
2 TV 14
3 House 2
4 Dog 13
5 Diapers 10
Personally I like to always use correlation names for target ("T") and source ("S"). Obviously there is a high likely hood for column name collisions in the join condition. But also on the right hand side of the UPDATE operation's SET clause you need to ensure to qualify column names since both source and target are within scope. The SET-clause can be of the same complexity as a regular SET clause in an UPDATE statement. You can set multiple column, use row and scalar sub queries and complex expressions.
This statement looks much denser and it is hopefully much easier to read than the UPDATE with the EXISTS predicate. But is it efficient?
Rows
RETURN
( 1)
Cost
I/O
|
3
UPDATE
( 2)
33.8903
NA
/---+----\
3 5
^MSJOIN TABLE: SERGE
( 3) PRODUCT
13.5799 Q1
NA
/-------+-------\
5 0.6
FETCH FILTER
( 4) ( 6)
6.79033 6.78823
NA NA
/----+----\ |
5 5 3
IXSCAN TABLE: SERGE TBSCAN
( 5) PRODUCT ( 7)
0.0183508 Q2 6.78823
NA NA
| |
5 3
INDEX: SYSIBM SORT
SQL120322101103640 ( 8)
Q2 6.78774
NA
|
3
TBSCAN
( 9)
6.78672
NA
|
3
TABLE: SERGE
SALES
Q3
This is, in fact, an UPDATE FROM plan. DB2 chose a merge join, and it picked a SORT over a table scan instead of exploiting the index to get order.
But given the size of the source table it is quite possible that this is the cheapest option. So: Yes, this is an efficient way to update the target table.
Duplicate Handling
Next let's investigate what DB2 does when there are duplicates in the source by adding an extra diaper into the mix and dropping the primary key.
ALTER TABLE sales DROP PRIMARY KEY;
INSERT INTO sales VALUES(5, 1);
MERGE INTO product AS T
USING sales AS S
ON S.id = T.id
WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold;
SQL0788N The same row of target table "SERGE.PRODUCT" was identified more
than once for an update, delete or insert operation of the MERGE statement.
DB2 did not like that! The semantic of MERGE is that any target row can only be set once. From a relational perspective this makes sense. If we can update the diaper inventory twice the it is indeterminate which of the source row would win. Also note that the changes do not accumulate.
It is important to understand how DB2 detects the duplicate. The plan will show:
Rows
RETURN
( 1)
Cost
I/O
|
2.88
UPDATE
( 2)
52.5818
7.76
/---+----\
2.88 5
FETCH TABLE: SERGE
( 3) PRODUCT
33.0845 Q1
4.88
/---+----\
2.88 5
FILTER TABLE: SERGE
( 4) PRODUCT
13.5851
2
|
3
FILTER
( 5)
13.5846
2
|
3
TBSCAN
( 6)
13.5781
2
|
3
SORT
( 7)
13.5772
2
|
3
HSJOIN^
( 8)
13.5754
2
/-----+------\
5 3
TBSCAN TBSCAN
( 9) ( 10)
6.78802 6.78672
1 1
| |
5 3
TABLE: SERGE TABLE: SERGE
PRODUCT SALES
Q2 Q3
That is a very different plan! Aside form the fact that DB2 chose a hash join, the operators 4-7 are new.
With the primary key on SALES.ID and a join condition on that key DB2 was certain that only one row in the source could match to any row in the target.
Without the primary key DB2 cannot be sure.
So what it does is to count (using ROW_NUMBER) the number of rows with the same target rowid produced by the hash join.
If there is more than one for a given rowid the error is raised.
The FETCH operator (3) is there to reposition the cursor after having lost position due to the SORT.
Now, what can you do when you have duplicates in the source?
You need to either eliminate the duplicates or aggregate them.
In this case, since we are providing delta sales, we can aggregate the duplicates:
MERGE INTO product AS T
USING (SELECT id, sum(sold) AS sold
FROM sales GROUP BY id) AS S
ON S.id = T.id
WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold;
SELECT * FROM product ORDER BY id;
ID NAME INVENTORY
----------- -------------------- -----------
1 Car 10
2 TV 6
3 House 0
4 Dog 13
5 Diapers -15
That worked. How did the aggregation change the plan?
Rows
RETURN
( 1)
Cost
I/O
|
3
UPDATE
( 2)
33.8924
5
/---+----\
3 5
^MSJOIN TABLE: SERGE
( 3) PRODUCT
13.582 Q1
2
/-------+-------\
5 0.6
FETCH FILTER
( 4) ( 6)
6.79033 6.79033
1 1
/----+----\ |
5 5 3
IXSCAN TABLE: SERGE GRPBY
( 5) PRODUCT ( 7)
0.0183508 Q2 6.78966
0 1
| |
5 3
INDEX: SYSIBM TBSCAN
SQL120322101103640 ( 8)
Q2 6.78927
1
|
3
SORT
( 9)
6.78841
1
|
3
TBSCAN
( 10)
6.78672
1
|
3
TABLE: SERGE
SALES
Q3
The plan is pretty again! Why is that?
DB2 knows that the GROUP BY id will enforce uniqueness on ID and therefore the join will be safe.
You should always try to enforce uniqueness across the columns in the merge join condition.
GROUP BY is the best way to do that.
But, it seems we have oversold diapers. That was bound to happen eventually.
UPSERT using MERGE
Now is a good time to think about ordering in new items.
In a sense a negative sales constitutes a buy, so we can recycle the same table and simply use negative numbers.
But we also want to widen the selection in our store and buy new things from a catalog of available products.
For that we need to insert new rows into the product table.
We could do this in a separate insert statement.
But that would mean we have to scan the sales table twice: First do all the updates, then do the deletes.
Finally this is where the power of MERGE can be brought to bear.
DELETE FROM sales;
INSERT INTO sales VALUES
(5,-100),
(5, 8),
(3, -4),
(4, 1),
(6, -15);
CREATE TABLE catalog(id INTEGER, name VARCHAR(20));
INSERT INTO catalog VALUES
(1, 'Car'),
(2, 'TV'),
(3, 'House'),
(4, 'Dog'),
(5, 'Diapers'),
(6, 'Milk'),
(7, 'Book');
MERGE INTO product AS T
USING (SELECT sales.id, sum(sold) AS sold, max(catalog.name) as name
FROM sales, catalog WHERE sales.id = catalog.id GROUP BY sales.id) AS S
ON S.id = T.id
WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold
WHEN NOT MATCHED THEN INSERT VALUES(S.id, S.name, -S.sold);
SELECT * FROM product ORDER BY id;
ID NAME INVENTORY
----------- -------------------- -----------
1 Car 10
2 TV 6
3 House 4
4 Dog 12
5 Diapers 77
6 Milk 15
Note the query composing the source. By grouping on sales.id we ensure DB2 derives uniqueness to simplify the merge access plan.
To comply with GROUP BY rules a MAX has been added to the product name although we only max on by value.
Alternatively we could have added the name look-up as a scalar subquery in the insert values clause.
Generally, to get the best plan it is also recommended to specify INSERT always as the last phase.
Here is the access plan:
Rows
RETURN
( 1)
Cost
I/O
|
0.32
INSERT
( 2)
31.5008
4.64
/---+----\
0.32 6
TBSCAN TABLE: SERGE
( 3) PRODUCT
24.7309 Q17
3.64
|
0.32
TEMP
( 4)
24.7186
3.64
|
0.32
UPDATE
( 5)
24.7141
3.64
/---+----\
0.32 6
FETCH TABLE: SERGE
( 6) PRODUCT
22.5478 Q1
3.32
/---+----\
0.32 6
NLJOIN TABLE: SERGE
( 7) PRODUCT
20.3807
3
/------------+------------\
4 0.08
HSJOIN< UNION
( 8) ( 16)
20.3731 0.00172145
3 0
/---+----\ /------+------\
6 4 0.04 0.04
TBSCAN GRPBY FILTER FILTER
( 9) ( 10) ( 17) ( 19)
6.78867 13.5826 0.000685844 0.000685844
1 2 0 0
| | | |
6 5 1 1
TABLE: SERGE TBSCAN TBSCAN TBSCAN
PRODUCT ( 11) ( 18) ( 20)
Q6 13.5819 4.5345e-005 4.5345e-005
2 0 0
| | |
5 1 1
SORT TABFNC: SYSIBM TABFNC: SYSIBM
( 12) GENROW GENROW
13.5808
2
|
5
HSJOIN
( 13)
13.5784
2
/-----+------\
7 5
TBSCAN TBSCAN
( 14) ( 15)
6.78932 6.78802
1 1
| |
7 5
TABLE: SERGE TABLE: SERGE
CATALOG SALES
Q2 Q3
The TEMP(4) operator prevents Halloween style phenomenon similar to what I described in my post on
INSENSITIVE cursors.
The truly interesting part however is the inner of NLJOIN(7)
That's a UNION ALL over two VALUES constructs.
One represents the UPDATE SET clause the other the INSERT VALUES clause.
If the outer join HSJOIN(8) produced a NULL row for the target the INSERT branch of the UNION ALL is executed.
If the join matches a row then the UPDATE branch is being executed.
The UPDATE and INSERT operators themselves also execute conditionally.
"UPDELSERT"
Most items in our store are selling well and we keep expanding the range of products.
But car's seem not to be our thing. Not everyone is cut to be a car sales man.
This brings up the question how we can shrink our product table.
Obviously we can simply DELETE items, but can it be done with MERGE?
For example, if we exactly sell out of a product (the stock drops to zero) we could delete the entry altogether.
When MERGE matches a source and target row we can DELETE that row.
But that collides with UPDATE.
DB2 supports an extension to the merge-when-clause for that purpose which allows us to refine the behavior:
DELETE FROM sales;
INSERT INTO sales VALUES
(1, 10),
(5, 3),
(2, -4);
MERGE INTO product AS T
USING (SELECT sales.id, sum(sold) AS sold, max(catalog.name) as name
FROM sales, catalog WHERE sales.id = catalog.id GROUP BY sales.id) AS S
ON S.id = T.id
WHEN MATCHED AND T.inventory = S.sold
THEN DELETE
WHEN MATCHED AND T.inventory < S.sold
THEN SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Oversold: ' || S.name
WHEN MATCHED
THEN UPDATE SET inventory = T.inventory - S.sold
WHEN NOT MATCHED
THEN INSERT VALUES(S.id, S.name, -S.sold);
SELECT * FROM product ORDER BY id;
ID NAME INVENTORY
----------- -------------------- -----------
2 TV 10
3 House 4
4 Dog 12
5 Diapers 74
6 Milk 15
In addition to allowing deletes the previous statement also introduced an option to raise errors.
Note that we chose not to extend the WHEN MATCHED predicate for the UPDATE. The merge when clauses operate like a case-expression in that the first condition to be true for a row results in the execution of the matching then clause. So MERGE will only execute the UPDATE branch of the statement for rows that match and neither caused an error or a delete. The optimizer plan now shows another more branch in the UNION ALL reflecting the DELETE. In addition there is an extra JOIN which models the conditional SIGNAL:
Rows
RETURN
( 1)
Cost
I/O
|
0.36
INSERT
( 2)
36.9278
5.44
/---+----\
0.36 5
TBSCAN TABLE: SERGE
( 3) PRODUCT
30.1578 Q25
4.44
|
0.36
TEMP
( 4)
30.1456
4.44
|
0.36
UPDATE
( 5)
30.1411
4.44
/---+----\
0.36 5
FETCH TABLE: SERGE
( 6) PRODUCT
27.7039 Q1
4.08
/---+----\
0.36 5
TBSCAN TABLE: SERGE
( 7) PRODUCT
25.2663
3.72
|
0.36
TEMP
( 8)
25.254
3.72
|
0.36
DELETE
( 9)
25.2495
3.72
/---+----\
0.36 5
FETCH TABLE: SERGE
( 10) PRODUCT
22.8123 Q2
3.36
/---+----\
0.36 5
NLJOIN TABLE: SERGE
( 11) PRODUCT
20.3747
3
/---------------+----------------\
3 0.12
HSJOIN< NLJOIN
( 12) ( 20)
20.3695 0.00173513
3 0
/---+----\ /-------------+-------------\
5 3 1 0.12
TBSCAN GRPBY TBSCAN UNION
( 13) ( 14) ( 21) ( 22)
6.78802 13.58 4.5345e-005 0.00116514
1 2 0 0
| | | +-----------------+-----------------+
5 3 1 0.04 0.04 0.04
TABLE: SERGE TBSCAN TABFNC: SYSIBM FILTER FILTER FILTER
PRODUCT ( 15) GENROW ( 23) ( 25) ( 27)
Q7 13.5796 0.0002135 0.0002135 0.0002135
2 0 0 0
| | | |
3 1 1 1
SORT TBSCAN TBSCAN TBSCAN
( 16) ( 24) ( 26) ( 28)
13.5787 4.5345e-005 4.5345e-005 4.5345e-005
2 0 0 0
| | | |
3 1 1 1
HSJOIN TABFNC: SYSIBM TABFNC: SYSIBM TABFNC: SYSIBM
( 17) GENROW GENROW GENROW
13.5769
2
/-----+------\
7 3
TBSCAN TBSCAN
( 18) ( 19)
6.78932 6.78672
1 1
| |
7 3
TABLE: SERGE TABLE: SERGE
CATALOG SALES
Q3 Q4
Single row MERGE using VALUES
so far we have been been using a query as a source. This is the most common scenario in a warehousing environment. But MERGE is also commonly used in OLTP type of scenarios. In that case A single row of variables represents the source. We refine our MERGE now to singleton modifications:
VARIABLE id INTEGER;
VARIABLE sold INTEGER;
VARIABLE name VARCHAR(20);
BEGIN
SET (:id, :name, :sold) = (3, 'House', 5);
END;
/
MERGE INTO product AS T
USING (VALUES(CAST(:id AS INTEGER), CAST(:name AS VARCHAR(20)), CAST(:sold AS INTEGER)) )
AS S(ID, NAME, SOLD)
ON S.id = T.id
WHEN MATCHED AND T.inventory = S.sold
THEN DELETE
WHEN MATCHED AND T.inventory < S.sold
THEN SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Oversold: ' || S.name
WHEN MATCHED
THEN UPDATE SET inventory = T.inventory - S.sold
WHEN NOT MATCHED
THEN INSERT VALUES(S.id, S.name, -S.sold);
SQL0438N Application raised error or warning with diagnostic text: "Oversold: House".
DB2 had difficulty deriving the parameter types, so the input variables were CAST.
Looking at the plan we will see some simplifications. Specifically the TEMP below the INSERT has disappeared. The reason is that DB2 knows only one row will be processed. It will never try to match a second row which eliminates the Halloween scenario.
Rows
RETURN
( 1)
Cost
I/O
|
0.12
INSERT
( 2)
15.1827
2.24
/---+----\
0.12 5
UPDATE TABLE: SERGE
( 3) PRODUCT
8.41275 Q23
1.24
/---+----\
0.12 5
DELETE TABLE: SERGE
( 4) PRODUCT
7.60036 Q1
1.12
/---+----\
0.12 5
NLJOIN TABLE: SERGE
( 5) PRODUCT
6.78797 Q2
1
/--------------------+--------------------\
1 0.12
NLJOIN UNION
( 6) ( 12)
6.78661 0.00116514
1 0
/---------+----------\ +-----------------+-----------------+
1 1 0.04 0.04 0.04
>NLJOIN TBSCAN FILTER FILTER FILTER
( 7) ( 11) ( 13) ( 15) ( 17)
6.78623 4.5345e-005 0.0002135 0.0002135 0.0002135
1 0 0 0 0
/-------+--------\ | | | |
1 1 1 1 1 1
TBSCAN FETCH TABFNC: SYSIBM TBSCAN TBSCAN TBSCAN
( 8) ( 9) GENROW ( 14) ( 16) ( 18)
4.5345e-005 6.78619 4.5345e-005 4.5345e-005 4.5345e-005
0 1 0 0 0
| /----+----\ | | |
1 1 5 1 1 1
TABFNC: SYSIBM IXSCAN TABLE: SERGE TABFNC: SYSIBM TABFNC: SYSIBM TABFNC: SYSIBM
GENROW ( 10) PRODUCT GENROW GENROW GENROW
0.0155866 Q5
0
|
5
INDEX: SYSIBM
SQL120322101103640
Q5
Note how the operators are stacked. No temps, no sorts, no repositioning. This is an optimal access path.
Good to know rules
Note that MERGE is a composite statement, very much in the same way as an inline atomic compound.
- Despite the pipelining we have seen in these examples DB2 enforced the following semantics:
- All rows to be merged are first classified towards the merge when clause that applies to them.
- Once 1. is complete for all rows DB2 will execute Before triggers for the set of rows matching the first when clause.
- Once 2. is complete for all matching rows DB2 will execute the THEN clause for all rows.
That is DB2 will actually INSERT/UPDATE/DELETE that subset of rows
- Once 3 is complete DB2 will apply RI and CHECK constraints towards the modification
- Once 4 is complete AFTER TRIGGERs will be fired on the set of rows
- Repeat steps 2 through 5 for the second to last merge THEN clauses is turn
- MERGE is an ATOMIC operation. Any error will undo the entire statement.
- Since DB2 has obtain update (U) locks on any row touched in the target it is advisable to avoid hash joins in environments where the target table may be modified by other concurrent sessions.
This is because a hash join will end up U-locking the entire table.
Nested loop joins are best here.
you can disable hash joins by lowering the optimization level to three.
- Always place the INSERT operation after DELETE and UPDATE to avoid extra TEMP operators.
This has been perhaps my longest blog entry so far.
My apologies to those who awaited a middle-of-the week posting.
This one was long in the making.