Db2

Connect with Db2, Informix, Netezza, open source, and other data experts to gain value from your data, share insights, and solve problems.

View Only

Back to Library

Who's the most popular kid on the block?

Mon March 09, 2020 05:27 PM

Michelle Betbadal

Posted by: Serge Rielau
Background

Roberto from Italy sent me the following question:
Is there something like the STATS_MODE() function in DB2?
He says STATS_MODE() is supposed to return the most frequently occurring value from a multi-set of values.

A bit of rifling through the internet, steering clear of zones forbidden to me, reveals that STATS_MODE is an aggregate function similar to SUM() or AVG().
The function returns the most frequent value within a group.

If there are two equally frequent values one of them will be picked with no specific rule.

DB2 does not have a function by that name (It's a non intuitive name to me I must say).
Neither does DB2 have a function under another name that does the same thing.

But how hard can it be to solve the same problem using common SQL structures? Let's take a look.

Scenario

I couldn't resist the subject line, so my scenario will, for once, not use employees.

Imagine a bunch of kids and suburbs.
We are social workers (or social marketing experts staring at Facebook...) trying to find the key influencers in each neighborhood.

CREATE TABLE kids(name VARCHAR(20), suburb VARCHAR(20), likes VARCHAR(20));
INSERT INTO kids VALUES
('Anton', 'Scarborough', 'Mona'),
('Brenda', 'Cabbagetown','Charley'),
('Charley', 'Danforth', 'Norbert'),
('Dora', 'Uptown', 'Olga'),
('Edwin', 'Scarborough', 'Laura'),
('Fleur', 'Regent Park', 'Kyle'),
('George', 'Scarborough', 'Anton'),
('Hong', 'Cabbagetown', 'Mona'),
('Ida', 'Danforth', 'Norbert'),
('Jay', 'Uptown', 'Olga'),
('Kyle', 'Regent Park', 'Quentin'),
('Laura', 'Scarborough', 'Piotr'),
('Mona', 'Cabbagetown', 'Stacy'),
('Norbert', 'Danforth', 'Charley'),
('Olga', 'Uptown', 'Vera'),
('Piotr', 'Scarborough', 'Anton'),
('Quentin', 'Regent Park', 'Olga'),
('Rita', 'Scarborough', 'Anton'),
('Stacy', 'Cabbagetown', 'Mona'),
('Ulf', 'Danforth', 'Charley'),
('Vera', 'Uptown', 'Olga'),
('Wendy', 'Regent Park', 'Quentin'),
('Xavier', 'Scarborough', 'Mona'),
('Yvette', 'Cabbagetown', 'Stacy'),
('Zoe', 'Danforth', 'Charley');

Step by Step

To find the most frequently occurring names under "LIKES" per "SUBURB" we need to first count each occurrence:

SELECT suburb, likes, count(*)
  FROM kids
  GROUP BY suburb, likes;

SUBURB               LIKES                          3
-------------------- -------------------- -----------
Scarborough          Anton                          3
Cabbagetown          Charley                        1
Danforth             Charley                        3
Regent Park          Kyle                           1
Scarborough          Laura                          1
Cabbagetown          Mona                           2
Scarborough          Mona                           2
Danforth             Norbert                        2
Regent Park          Olga                           1
Uptown               Olga                           3
Scarborough          Piotr                          1
Regent Park          Quentin                        2
Cabbagetown          Stacy                          2
Uptown               Vera                           1
14 rows were retrieved.

Interesting...Anyone who ever thought GROUP BY implies ORDER BY.. think again...anyway, lets sort by suburb and then the count:

SELECT suburb, likes, count(*) as cnt 
  FROM kids GROUP BY suburb, likes 
  ORDER BY suburb, cnt desc;

SUBURB               LIKES                        CNT
-------------------- -------------------- -----------
Cabbagetown          Mona                           2
Cabbagetown          Stacy                          2
Cabbagetown          Charley                        1
Danforth             Charley                        3
Danforth             Norbert                        2
Regent Park          Quentin                        2
Regent Park          Kyle                           1
Regent Park          Olga                           1
Scarborough          Anton                          3
Scarborough          Mona                           2
Scarborough          Laura                          1
Scarborough          Piotr                          1
Uptown               Olga                           3
Uptown               Vera                           1

14 rows were retrieved.

Things are clearing up a bit.
Looking at the result we know we want: Mona, Charley, Quentin, Anton and Olga.
How can we filter them out?
We need to get the first row of this result set for each suburb.
For such tasks I always like the ROW_NUMBER() OVER() function.

SELECT suburb, likes, cnt, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn
  FROM (SELECT suburb, likes, count(*) as cnt 
          FROM kids 
          GROUP BY suburb, likes);

SUBURB               LIKES                        CNT                   RN
-------------------- -------------------- ----------- --------------------
Cabbagetown          Mona                           2                    1
Cabbagetown          Stacy                          2                    2
Cabbagetown          Charley                        1                    3
Danforth             Charley                        3                    1
Danforth             Norbert                        2                    2
Regent Park          Quentin                        2                    1
Regent Park          Kyle                           1                    2
Regent Park          Olga                           1                    3
Scarborough          Anton                          3                    1
Scarborough          Mona                           2                    2
Scarborough          Laura                          1                    3
Scarborough          Piotr                          1                    4
Uptown               Olga                           3                    1
Uptown               Vera                           1                    2

14 rows were retrieved.

This is pretty close. Now we just need to filter out all but the first rows using "RN" in a predicate.

SELECT suburb, likes 
  FROM (SELECT suburb, likes, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn
          FROM (SELECT suburb, likes, count(*) as cnt 
                  FROM kids 
                  GROUP BY suburb, likes))
  WHERE rn = 1;

SUBURB               LIKES               
-------------------- --------------------
Cabbagetown          Mona                
Danforth             Charley             
Regent Park          Quentin             
Scarborough          Anton               
Uptown               Olga

Tadah!
Granted this is longer than:

SELECT suburb, STATS_MODE(likes) 
  FROM kids 
  GROUP BY suburb;

So STATS_MODE() seems like a handy function if you do this sort of thing a lot.
Now, is the solution we have above a workaround, an emulation perhaps which is slower than the real deal would be?
For that we should look at the explain and see what DB2 does with this query

#Db2

Statistics

0 Favorited

3 Views

0 Files

0 Shares

0 Downloads

Db2

Db2

Who's the most popular kid on the block?

Posted by: Serge Rielau
Background

Scenario

Step by Step

Additional
Resources

Office

Quick Links

Db2

Db2

Who's the most popular kid on the block?

Posted by: Serge RielauBackground

Scenario

Step by Step

Additional Resources

Office

Quick Links

Posted by: Serge Rielau
Background

Additional
Resources