Posted by: Serge Rielau
Background
Roberto from Italy sent me the following question:
Is there something like the STATS_MODE() function in DB2?
He says STATS_MODE() is supposed to return the most frequently occurring value from a multi-set of values.
A bit of rifling through the internet, steering clear of zones forbidden to me, reveals that STATS_MODE is an aggregate function similar to SUM() or AVG().
The function returns the most frequent value within a group.
If there are two equally frequent values one of them will be picked with no specific rule.
DB2 does not have a function by that name (It's a non intuitive name to me I must say).
Neither does DB2 have a function under another name that does the same thing.
But how hard can it be to solve the same problem using common SQL structures? Let's take a look.
Scenario
I couldn't resist the subject line, so my scenario will, for once, not use employees.
Imagine a bunch of kids and suburbs.
We are social workers (or social marketing experts staring at Facebook...) trying to find the key influencers in each neighborhood.
CREATE TABLE kids(name VARCHAR(20), suburb VARCHAR(20), likes VARCHAR(20));
INSERT INTO kids VALUES
('Anton', 'Scarborough', 'Mona'),
('Brenda', 'Cabbagetown','Charley'),
('Charley', 'Danforth', 'Norbert'),
('Dora', 'Uptown', 'Olga'),
('Edwin', 'Scarborough', 'Laura'),
('Fleur', 'Regent Park', 'Kyle'),
('George', 'Scarborough', 'Anton'),
('Hong', 'Cabbagetown', 'Mona'),
('Ida', 'Danforth', 'Norbert'),
('Jay', 'Uptown', 'Olga'),
('Kyle', 'Regent Park', 'Quentin'),
('Laura', 'Scarborough', 'Piotr'),
('Mona', 'Cabbagetown', 'Stacy'),
('Norbert', 'Danforth', 'Charley'),
('Olga', 'Uptown', 'Vera'),
('Piotr', 'Scarborough', 'Anton'),
('Quentin', 'Regent Park', 'Olga'),
('Rita', 'Scarborough', 'Anton'),
('Stacy', 'Cabbagetown', 'Mona'),
('Ulf', 'Danforth', 'Charley'),
('Vera', 'Uptown', 'Olga'),
('Wendy', 'Regent Park', 'Quentin'),
('Xavier', 'Scarborough', 'Mona'),
('Yvette', 'Cabbagetown', 'Stacy'),
('Zoe', 'Danforth', 'Charley');
Step by Step
To find the most frequently occurring names under "LIKES" per "SUBURB" we need to first count each occurrence:
SELECT suburb, likes, count(*)
FROM kids
GROUP BY suburb, likes;
SUBURB LIKES 3
-------------------- -------------------- -----------
Scarborough Anton 3
Cabbagetown Charley 1
Danforth Charley 3
Regent Park Kyle 1
Scarborough Laura 1
Cabbagetown Mona 2
Scarborough Mona 2
Danforth Norbert 2
Regent Park Olga 1
Uptown Olga 3
Scarborough Piotr 1
Regent Park Quentin 2
Cabbagetown Stacy 2
Uptown Vera 1
14 rows were retrieved.
Interesting...Anyone who ever thought GROUP BY implies ORDER BY.. think again...anyway, lets sort by suburb and then the count:
SELECT suburb, likes, count(*) as cnt
FROM kids GROUP BY suburb, likes
ORDER BY suburb, cnt desc;
SUBURB LIKES CNT
-------------------- -------------------- -----------
Cabbagetown Mona 2
Cabbagetown Stacy 2
Cabbagetown Charley 1
Danforth Charley 3
Danforth Norbert 2
Regent Park Quentin 2
Regent Park Kyle 1
Regent Park Olga 1
Scarborough Anton 3
Scarborough Mona 2
Scarborough Laura 1
Scarborough Piotr 1
Uptown Olga 3
Uptown Vera 1
14 rows were retrieved.
Things are clearing up a bit.
Looking at the result we know we want: Mona, Charley, Quentin, Anton and Olga.
How can we filter them out?
We need to get the first row of this result set for each suburb.
For such tasks I always like the ROW_NUMBER() OVER() function.
SELECT suburb, likes, cnt, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn
FROM (SELECT suburb, likes, count(*) as cnt
FROM kids
GROUP BY suburb, likes);
SUBURB LIKES CNT RN
-------------------- -------------------- ----------- --------------------
Cabbagetown Mona 2 1
Cabbagetown Stacy 2 2
Cabbagetown Charley 1 3
Danforth Charley 3 1
Danforth Norbert 2 2
Regent Park Quentin 2 1
Regent Park Kyle 1 2
Regent Park Olga 1 3
Scarborough Anton 3 1
Scarborough Mona 2 2
Scarborough Laura 1 3
Scarborough Piotr 1 4
Uptown Olga 3 1
Uptown Vera 1 2
14 rows were retrieved.
This is pretty close. Now we just need to filter out all but the first rows using "RN" in a predicate.
SELECT suburb, likes
FROM (SELECT suburb, likes, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn
FROM (SELECT suburb, likes, count(*) as cnt
FROM kids
GROUP BY suburb, likes))
WHERE rn = 1;
SUBURB LIKES
-------------------- --------------------
Cabbagetown Mona
Danforth Charley
Regent Park Quentin
Scarborough Anton
Uptown Olga
Tadah!
Granted this is longer than:
SELECT suburb, STATS_MODE(likes)
FROM kids
GROUP BY suburb;
So STATS_MODE() seems like a handy function if you do this sort of thing a lot.
Now, is the solution we have above a workaround, an emulation perhaps which is slower than the real deal would be?
For that we should look at the explain and see what DB2 does with this query
#Db2