Informix

 View Only
  • 1.  index on timestamp & optimizer strategy failure

    Posted Fri December 02, 2022 09:57 AM
    I've got a large events table (~1500 million rows) with 2 timestamps (datetime hour to second) recording a time associated with the event and a time when the row was inserted.
    The event-time may be in the future, but in most of the cases it is near the actual point in time. If the device sending the event was offline for a longer time, the timestamp may be outdated as well. The table rows are kept at least 13 months.
    This columns are indexed and often used to extract a detailed view on a timeslice and often joined with other tables.
    Its only possible to run auto-update-statistics on weekends.

    onstat -hd for the event-time shows a surprising distribution.
    Constructed on 2022-11-27 03:36:07.00000
    High Mode, 0.500000 Resolution
    --- DISTRIBUTION ---
    ( 1999-12-31 20:00:00)
    1: ( 7146507, 1555073, 2021-05-25 09:08:49)
    2: ( 7146507, 1358555, 2021-06-18 16:47:08)
    3: ( 7146507, 1437036, 2021-07-17 00:10:26)
    4: ( 7146507, 1222787, 2021-08-09 21:26:24)
    ...
    198: ( 7146507, 102980, 2022-11-23 14:52:20)
    199: ( 7146507, 104399, 2022-11-24 19:52:20)
    200: ( 7131543, 242304, 9309-07-14 20:18:07)
    Timestamps should not be in that far future. And there are very few rows of them.

    onstat -hd for the creation time:
    Constructed on 2022-11-27 03:21:23.00000
    High Mode, 0.500000 Resolution
    --- DISTRIBUTION ---
    ( 2003-12-29 21:21:49)
    1: ( 7146419, 831315, 2021-05-25 11:39:31)
    2: ( 7146419, 613449, 2021-06-18 19:05:34)
    ...
    199: ( 7146419, 69627, 2022-11-24 21:18:12)
    200: ( 7146419, 96297, 2022-11-27 01:34:35)
    201: ( 2452, 226, 2022-11-27 03:20:07)

    So users often issue queries with a condition selecting a timeslice where the distribution information on the table will suggest only few rows and high selectivity (index on creation time).That's good so far although it's wrong, because meanwhile tons of rows populate the index within the given timeframe and there is often a better strategy because for example the latest history of a specific item is in quest. (the item has for example 50 event rows), the optimizer thinks there are only a few rows within the timeslice and is therefore using the creation-time index instead of the item-id as primary access path.

    On the other hand a event-timeframe is inspected independend for a lot of items for special behaviour. In this case it is mostly a wise decision to use that index. Unfortunately these queries vary in behaviour as well (depending on optimizer decision). I think (without proof) that a false date in far future might rise that behaviour.

    So my questions: would I earn more reliable optimizer strategies if I set resolution to 0.01 ? Would you suggest another value?
    Is it worth the effort?
    For shure it shold be avoided to  have extraordinary timestamps - meanwhile there is some sort of limitation within the application, but I'm not automatically getting rid of those values.  But on the other hand are'nt that values responsible that the optimzer accepts the index for today?
    Should I have at least one future row in each index to influence the last bucket to span over today?

    The thing I want archieve is to have as less as possible optimizer directives. These directives are often too demanding for the users and I think the optimizer is the perfect instrument for choosing the query-strategy. 

    Sorry for that excessive question...

    ------------------------------
    Peter Weigert
    ------------------------------

    #Informix


  • 2.  RE: index on timestamp & optimizer strategy failure

    IBM Champion
    Posted Mon December 05, 2022 01:58 AM
    Hello Peter,

    did you test your SQL without the high or medium statistic? You will have a good distribution of the values and high statistics with a very high resolution is like a simulation of the index with increased effort.

    Best Regards
    Henri

    ------------------------------
    Henri Cujass
    leolo IT, CTO
    Germany
    IBM Champion 2021 and 2022
    ------------------------------



  • 3.  RE: index on timestamp & optimizer strategy failure

    Posted Mon December 05, 2022 08:40 AM
    Hello Henri,

    I did not test against other statistics - that's a productive engine. And - thanks to directives - I can handle most of the cases where the optimizer fails. I'm seeking for a general best practice. Ithink it's a general problem that big tables having an indexed timestamp can't update statistics on dayly base - and if they can, the problem persists for investigation of the latest hours.

    And I'm seeking for a reason why the optimizer sometimes changes strategy when I do not expect this could happen on a table with a extensive history. It's a massive effort to test the optimizer behaviour against tons of scenarios - and once you think you got the clue how it works, a new release might trash all theories. 
    And I'm trying to ease my life to avoid getting asked why some query is so slow, when it was quick yesterday. 

    Best rgds Peter

    ------------------------------
    Peter Weigert
    ------------------------------



  • 4.  RE: index on timestamp & optimizer strategy failure

    IBM Champion
    Posted Mon December 05, 2022 11:52 AM
    Peter:

    The optimizer uses the distributions to guess whether to use an index or scan the table and which index to use. If the distributions are stale, as happens with say an event-time or creation-time column that is only increasing over time the right-most branch of the index on such a column will always have more rows that meet a "> some-time" filter on them and in many "find the new rows from today" queries the optimizer feels that it won't find any rows that match on that criteria. There could be millions of such, but it just doesn't now. The alternative to using the time-based index is to perform a table scan so, if there are other filters supported by indexes, then the engine will choose a potentially less discriminating index because it looks like it will be a better choice. If there are none, or the other filters are not indexed, then it may just elect to scan the table or drive a query off a different table that will create less discriminating join producing millions of possible matches that need to be filtered out.

    As you have surmised, if you do not want to try to change the update statistics as recommended by me and others, then optimizer directives to force the use of the time-based index is the only choice.

    Note that you could copy the table to a test database and play with the update statistics there before applying the best method to production.

    Art

    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 5.  RE: index on timestamp & optimizer strategy failure

    IBM Champion
    Posted Mon December 05, 2022 08:22 AM
    Peter:

    I do not think that a finer resolution (ie more buckets) will improve things at all. I have a few suggestions, the most important of which is that you really need to run update statistics on this table more frequently, perhaps more than once a day. Given the size of the table, you cannot do that using HIGH. I would suggest doing MEDIUM with a finer resolution than the default for MEDIUM (2.5) but not as fine as HIGH uses (0.5), perhaps 1.0, and using the SAMPLING SIZE clause to force the engine to sample between 30-50% of the rows (without a SAMPLING SIZE clause the maximum number of rows that MEDIUM samples is very small. If you have AUTO_STAT_MODE set to '1' you will need to use the FORCE option to force the update statistics command to not be no-op.

    Another option would be to continue doing what you are doing now, but use a courser resolution for those two columns continuing to use HIGH on the weekends. The smaller number of buckets may let the optimizer take account of the new values added during the week better. AUS will maintain whatever level of distributions it finds on the table already when the evaluator kicks off, so if you manually change the resolutions once by running update statistics yourself, AUS will maintain that. 

    Last resort, try using my dostats utility instead of AUS. It just may do a better job creating your distributions for you.  B^)

    Art

    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 6.  RE: index on timestamp & optimizer strategy failure

    Posted Tue December 06, 2022 04:34 AM
    Hi Peter,

    Just to checking I understand the problem correctly: the optimiser is using the index on creation time because the distribution suggests there is hardly any data instead of a more favourable index.

    If so, there is an undocumented fix for this you could try. For the fix to work the creation time column must have a default value of CURRENT (or TODAY).

    Setting SQL_DEF_CTRL in onconfig to 0x2 (or any existing value with 0x2 added) will enable a feature that will extrapolate the distribution into the future by 30 days. There's a section on it in my blog, https://informixdba.wordpress.com/2015/04/30/experience-with-auto-update-statistics-aus/, where it discusses IC91678.

    I had hoped this would become the default optimiser behaviour but it seems it hasn't.

    Ben.

    ------------------------------
    Benjamin Thompson
    ------------------------------



  • 7.  RE: index on timestamp & optimizer strategy failure

    Posted Tue December 06, 2022 05:06 AM
    Hi Ben,

    that's exactly one of the problematic scenarios. And I'm with you, that should be the default behaviour.
    And it should be documented!
    Answers just like yours was exactly what I hoped for :-)

    Of course as soon as I'll get an engine that can hold that much data for testing, I will try Art's suggestions as well.
    I only wanted to avoid testing (like a shot in the dark) without any expert suggestions.

    thx a lot
    Peter

    ------------------------------
    Peter Weigert
    ------------------------------