I've got a large events table (~1,500 million rows) with two timestamps (DATETIME YEAR TO SECOND): one recording a time associated with the event and one recording when the row was inserted.
The event time may lie in the future, but in most cases it is near the actual point in time. If the device sending the event was offline for a longer period, the timestamp may be outdated as well. The table rows are kept for at least 13 months.
These columns are indexed and often used to extract a detailed view of a time slice, and the table is often joined with other tables.
It's only possible to run auto update statistics on weekends.
dbschema -hd for the event time shows a surprising distribution (each bucket line lists the row count, the number of distinct values, and the bucket's upper bound):
Constructed on 2022-11-27 03:36:07.00000
High Mode, 0.500000 Resolution
--- DISTRIBUTION ---
( 1999-12-31 20:00:00)
1: ( 7146507, 1555073, 2021-05-25 09:08:49)
2: ( 7146507, 1358555, 2021-06-18 16:47:08)
3: ( 7146507, 1437036, 2021-07-17 00:10:26)
4: ( 7146507, 1222787, 2021-08-09 21:26:24)
...
198: ( 7146507, 102980, 2022-11-23 14:52:20)
199: ( 7146507, 104399, 2022-11-24 19:52:20)
200: ( 7131543, 242304, 9309-07-14 20:18:07)
Timestamps should not lie that far in the future, and only very few rows carry such values.
dbschema -hd for the creation time:
Constructed on 2022-11-27 03:21:23.00000
High Mode, 0.500000 Resolution
--- DISTRIBUTION ---
( 2003-12-29 21:21:49)
1: ( 7146419, 831315, 2021-05-25 11:39:31)
2: ( 7146419, 613449, 2021-06-18 19:05:34)
...
199: ( 7146419, 69627, 2022-11-24 21:18:12)
200: ( 7146419, 96297, 2022-11-27 01:34:35)
201: ( 2452, 226, 2022-11-27 03:20:07)
So users often issue queries with a condition selecting a time slice for which the distribution information on the table suggests only a few rows and high selectivity (index on creation time). That looks good so far, but it is wrong: meanwhile tons of rows have populated the index within the given time frame, and there is often a better strategy. For example, when the latest history of a specific item is wanted (the item has, say, 50 event rows), the optimizer thinks there are only a few rows within the time slice and therefore uses the creation-time index instead of the item ID as the primary access path.
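A minimal sketch of that pattern (the table and column names events, item_id and created_at are made up for illustration):

  -- Latest history of one item: stale statistics make the created_at
  -- range look almost empty, so the optimizer walks the creation-time
  -- index instead of using item_id as the access path.
  SELECT *
    FROM events e
   WHERE e.item_id = 4711
     AND e.created_at >= CURRENT - INTERVAL(7) DAY TO DAY
   ORDER BY e.created_at DESC;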
On the other hand, an event time frame is often inspected independently across a lot of items, looking for special behaviour. In that case it is mostly a wise decision to use that index. Unfortunately these queries vary in behaviour as well, depending on the optimizer's decision. I think (without proof) that a false date in the far future might give rise to that behaviour.
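To get an idea how many rows actually carry such outliers, a check like this should do (assuming the event-time column is called event_time):

  -- Count events stamped more than a year ahead; anything here is bogus.
  SELECT COUNT(*)
    FROM events
   WHERE event_time > CURRENT + INTERVAL(1) YEAR TO YEAR;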
So my questions: would I get more reliable optimizer strategies if I set the resolution to 0.01? Would you suggest another value?
Is it worth the effort?
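For reference, the statement I have in mind (column names as assumed above; a resolution of 0.01 means 10,000 buckets instead of the current 200, and HIGH mode reads every row, so on ~1,500 million rows MEDIUM with sampling may be the only realistic weekend option):

  UPDATE STATISTICS HIGH FOR TABLE events (event_time, created_at)
      RESOLUTION 0.01 DISTRIBUTIONS ONLY;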
For sure, extraordinary timestamps should be avoided; meanwhile there is some sort of limitation within the application, but I'm not automatically getting rid of the existing values. On the other hand, aren't exactly those values responsible for the optimizer accepting the index for today's data?
Should I keep at least one future row in each index so that the last bucket spans past today?
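If the answer is yes, a sentinel row could look like this (a sketch only; the column list, the dummy item ID and the chosen date are all assumptions):

  -- One far-future row present at each statistics run keeps the last
  -- bucket's upper bound ahead of "today".
  INSERT INTO events (item_id, event_time, created_at)
  VALUES (-1,
          DATETIME(2032-01-01 00:00:00) YEAR TO SECOND,
          CURRENT YEAR TO SECOND);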
What I want to achieve is to get by with as few optimizer directives as possible. These directives are often too demanding for the users, and I think the optimizer is the perfect instrument for choosing the query strategy.
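For context, this is the kind of directive users would otherwise have to write by hand (the index name ix_events_item is made up):

  -- Force the item-ID index regardless of what the stale statistics say.
  SELECT {+ INDEX(e ix_events_item) } *
    FROM events e
   WHERE e.item_id = 4711
     AND e.created_at >= CURRENT - INTERVAL(7) DAY TO DAY;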
Sorry for the excessive question...
------------------------------
Peter Weigert
------------------------------
#Informix