Db2

 View Only
  • 1.  distinct vs group by

    Posted Thu December 05, 2024 01:37 PM

    Hi,

    we are moving many workflow from DB2 iSeries to DB2 LUW (with BLU) and we have some queries that dont finish on DB2 LUW

    We have noticed that DB2 BLU some time is not able to execute complex queries with many subqueries that include DISTINCT, query take hours without finishing or fill all the temps TBS. 

    changing the DISTINCT by a GROUP BY solve the probleme.

    is distinct and group by produce differents execution plan ?

    are there any special setup (db2set) so that DB2 execute distinct like a group by.



    ------------------------------
    malek shabou
    ------------------------------


  • 2.  RE: distinct vs group by

    Posted Thu December 05, 2024 04:27 PM

    When queries with multiple subqueries and DISTINCT clauses cause performance issues in Db2 for z/OS, it often points to inefficiencies in how the query is being executed or how the underlying database is structured. Here's an approach to diagnose and address the problem:

    1. Understand the Query Plan

    • Use EXPLAIN to generate the access path for the query. It shows how Db2 plans to retrieve the data.
    • Check for:
      • Excessive table scans.
      • Nested loop joins with large datasets.
      • High-cost operations like sorting due to DISTINCT.

    2. Optimize Subqueries

    • Review subqueries to ensure they aren't repeating unnecessary calculations.
    • Where possible, replace correlated subqueries with joins, as joins are often more efficient.

    3. Index Optimization

    • Ensure appropriate indexes exist for columns involved in DISTINCT, WHERE, and JOIN clauses.
    • Use clustered indexes if sorting is frequently needed.

    4. Reduce Intermediate Results

    • Use common table expressions (CTEs) or temporary tables to break down the query into smaller, manageable parts.
    • Materialize intermediate results that can be reused to avoid recomputation.

    5. Memory and Temp Tablespace Considerations

    • Ensure your temporary tablespaces are adequately sized for the workload. Review TEMP tablespace usage and resize if necessary.
    • Adjust sort and bufferpool settings to optimize memory usage for large queries.

    6. Refactor the Query

    • Sometimes, rewriting the query can significantly improve performance. Simplify where possible:
      • Minimize the use of DISTINCT if it's redundant due to primary keys or unique indexes.
      • Aggregate early (e.g., use GROUP BY instead of DISTINCT in subqueries).

    7. Parallelism and Workload Management

    • Db2 can process queries in parallel. Ensure the system is configured to take advantage of this feature if the workload justifies it.
    • Use workload management (WLM) to allocate sufficient resources to your query without impacting other processes.

    Example Optimization Scenario

    Imagine a query:

    SELECT DISTINCT a.column1, b.column2
    FROM table_a a
    JOIN table_b b ON a.key = b.key
    WHERE a.condition = 'X'
      AND b.condition = 'Y';
    

    Optimizations:

    • Ensure indexes exist on table_a.key, table_b.key, a.condition, and b.condition.
    • Use EXPLAIN to verify Db2 is using indexes.
    • Break the query:
      1. Extract distinct values into a temp table.
      2. Join on the reduced dataset.

    Tools and References

    Would you like specific examples or assistance with a particular part of the query?



    ------------------------------
    Raydo Matthee
    CEO | IBM Certified Facilitator
    Skunkworks (Pty) Ltd
    Johannesburg
    raydo@skunkworks.africa
    ------------------------------



  • 3.  RE: distinct vs group by

    Posted Thu December 05, 2024 05:53 PM

    Hi Raydo,

    Thank you for your detailed response, here some more informations:

    • we are in a dataware use case
    • query are generated by our ETL (OTIC, ex-GENIO )
    • all tables are column organized
    • there is no index
    • there is only HSJOIN in the execution plan
    • the CTQ operator is always on the top of the plan (there is no function or expression that force the exit from the CDE engine

    that's the objectif of my question, i want to know how the CDE engine of DB2 deal with DISTINCT / GROUP BY, what are the difference, may be there is some early filtring with GROUP BY, ....

    PS: we have huge queries with some time 5-10 nested datasourcesn INNER JOIN and LEFT JOIN between subqueries

    Malek,



    ------------------------------
    malek shabou
    ------------------------------



  • 4.  RE: distinct vs group by

    Posted Fri December 06, 2024 03:20 AM

    Hi Malek,

    in general Raydo's suggestions are correct, even if he assumed Db2 for z/OS instead of your environment with Db2 LUW and BLU.

    Mentioning "CTQ is always on top of the plan" sounds like a good sign to me. At least, you know how to create a plan and look at it. However, there should be no other CTQ nodes in the graph below. In some rare cases, Db2 need to transfer data from columnar to row type and back again causing performance losses.

    Regarding "there is no index", I can recommend following two pages Db2 docs on BLU restrictions, SAP blog article about secondary indexes. Secondary indexes are available and can save time even in BLU situations.

    I can also suggest to open a case with IBM to send in your query and get some assitance from IBM.

    And BTW, sometimes, some queries are really huge and can consume a lot of time. 

    Cheers

    Roland



    ------------------------------
    Roland Schock
    IBM Champion and IBM Gold Consultant
    ------------------------------



  • 5.  RE: distinct vs group by

    Posted Fri December 06, 2024 01:26 PM

    Your issue is related to resources.  You mentioned you moved from the "I"  System( mid range computer) to DB2 luw.

    Your DB2 Luw system will need to be beef up with more CPU, Memory, etc to expect the same performance as a I system.    You may even need to use DPF.   

    Also double check your predicates for the query.   Some that may have been Stage 1 on the I System may now be Stage 2 on Luw.   

    Also DISTINCT with Order by on Luw may force a double sort where it did not on I.  



    ------------------------------
    Douglas Partch
    ------------------------------



  • 6.  RE: distinct vs group by

    Posted Sat December 07, 2024 10:44 AM

    Hi,

    there is no resources issue, there server have 64 cores and 512GB of memory. i dont know why DB2 behave like that. some query with DISTINCT in their sub query does not finish, replacing DISTINCT by GROUP BY the query take a couple of secondes.

    an other pain point with CDE engine is that it some Table join with multiple predicates if cardinality estimation is not good, DB2 can swap the join order, and the result is temp full or query never finish :(

    it is always hard to find which join is causing this issue, i need to use column group statistics so that the query can run. 

    it can be good idea if DB2 can log a warning during execution of a query if the cardinality if very far from the estimation, it will make make life easier for DBA :)

    i never had theses issue with other RDBMS (terradata/snowflake/redshift) even with bad plan query finish (slow but they finish) with DB2 uses are very happy (most of time) because BLU is very fast but when it sucks it realy sucks 

    Best regards,



    ------------------------------
    malek shabou
    ------------------------------



  • 7.  RE: distinct vs group by

    Posted Sat December 07, 2024 07:03 PM

    Malek,

    Logically, nearly every DISTINCT could be replaced internally by a GROUP BY on all columns of the result set in most cases.  However DB2 does not do that for us.  It can be that a subquery contains OLAP specs, as example, that may cause GROUP BY to be executed before OLAP, but DISTINCT afterwards on the results of the OLAP, in this case making DISTINCT different to GROUP BY.  Also it can be that the SQL is coded to specify far more columns than necessary to uniquely identify rows.

    Uniqueness actually has less requirements than GROUP BY because it does not need the result data to be in any particular sequence.

    It would be quite hard to be very definitive about particular SQLs as to why GROUP BY performs better without looking at the specific SQL and access path.

    I have noticed that Column Store Databases can have nasty access paths for some things better suited to Row store databases, like subqueries with FETCH FIRST 1 ROWS ONLY as an example.  Things like this may find an optimization for GROUP BY that is not found for DISTINCT.  This may not be issue in your queries, however sometimes queries need to be designed to suit the platform being run on.

    I would comment that DISTINCT is a heavily over used "technique" in SQLs often not really needed, but used in subqueries or Common Table Expressions to force materialise an intermediate result set (applicable to row store databases more so)

    On a column store database, normal use of DISTINCT ought to perform O.K. just doing a sort to workfile, perhaps degraded if far too many columns are included in the  DISTINCT. 

    Having a really bad performing DISTINCT, one needs to analyse the access path to determine why it is so bad, and if there is a reason it might need to behave differently to a GROUP BY in that case.



    ------------------------------
    Michael Hannan
    ------------------------------