Informix

 View Only
  • 1.  Non-Hashable Data Type

    IBM Champion
    Posted Wed June 08, 2022 08:37 PM
    Informix 12.10.FC14
    Solaris 10  1/13

    I have a table with a smart blob column.  (It is a photo.)  I am interested in determining the total number of photos, and also the number of unique photos.

    To determine the total number of photos, I did the following:

    SELECT * FROM <table> WHERE <photo> IS NOT NULL;

    There are 22 such rows.  (This is a small table for testing purposes.)

    I also did the following query (because this is how I would do it on a large table):

    SELECT COUNT( <photo> ) WHERE <photo> IS NOT NULL;

    Same answer: 22.

    To determine the number of unique photos, I first thought to use ifx_checksum(), figuring I am not absolutely, positively guaranteed that results would be unique, but I am comfortable that "almost certainly" is good enough for present purposes.

    So, I did this:

    SELECT ifx_checksum( <photo>, 0) FROM <table> WHERE <photo> IS NOT NULL;

    The table is small, and I got back the 22 rows for all the existing photos.  By inspection, I observed that there were 22 photos, and 21 distinct photos (as determined by the ifx_checksum hash, anyway).

    But, then, I thought some more, and figured that the LO Handle would likely be close enough for what I want to do, and probably would be less computational work than executing ifx_checksum() for every row in a very large table.

    So, I did these two commands:

    SELECT  <photo> FROM <table> WHERE <photo> IS NOT NULL;
    SELECT DISTINCT <photo> FROM <table> WHERE <photo> IS NOT NULL;  (highlight explained later)

    As I expected, I got 22 rows, and 21 rows from those two queries, respectively.

    Then I figured (for a large table) better to just COUNT the rows-- not necessary to bring them all back.  So, I did this:

    SELECT COUNT( <photo> ) FROM <table> WHERE <photo> IS NOT NULL;
    SELECT COUNT( DISTINCT <photo> ) FROM <table> WHERE <photo> IS NOT NULL;

    I got 22 for the first query, but, instead of 21 for the second query, I got a -9643 Error (BLOB is not hashable).

    Q: Why is <photo> hashable (i.e., "DISTINCT <photo>" works) in the first highlighted query, but not in the second?

    Just for grins, I then explicitly used the ifx_checksum() hash function:

    SELECT COUNT( DISTINCT ifx_checksum( <photo>, 0) ) FROM <table> WHERE <photo> IS NOT NULL;

    Worked like a charm, and returned 21.

    Thank you for any comments about why "DISTINCT" worked in one case, but threw a "non-hashable data type" error in the second example.

    DG

    ------------------------------
    David Grove
    ------------------------------

    #Informix


  • 2.  RE: Non-Hashable Data Type

    IBM Champion
    Posted Thu June 09, 2022 06:40 AM
    Interesting question...

    From the following I'd conclude that already the first query (DISTINCT <photo>) should have received this error:

    select extended_id, name::char(20) name, cannothash from sysxtdtypes where extended_id in (10,11);
    extended_id name cannothash

    10 blob T
    11 clob T

    What I didn't quite get is your comment on LO handles and how they'd be related to uniqueness of those photos?

    BR,
     Andreas

    ------------------------------
    Andreas Legner
    ------------------------------



  • 3.  RE: Non-Hashable Data Type

    IBM Champion
    Posted Thu June 09, 2022 12:56 PM
    Thank you for your comments, Andreas.

    In fact, I, too, actually wondered the exact same thing (why the first command [with "DISTINCT"] runs without error).  I figured either both uses of "DISTINCT" should fail, or both should succeed.  Before I opened a tech support case, I thought I would check in with the Informix cognoscenti to see if I might be ignorant of something that is well-known by other Informic practitioners.  I will now proceed to open a case.

    Regarding LO Handles and uniqueness... I recognize that handles are not unique w.r.t. the actual objects.  This is even "more true" after an import/export, or an operation involving copying an LO.  (I came faced to face with the latter when following IBM's Tech Support Note [or whatever they call them] on how to move LOs from one sbspace to another.  I observed how unique BLOBs [stored once and using multiple instances of the same handle when that BLOB is referenced multiple times] became multiple instances of storage of the same BLOB, each instance with its own handle.)  IOW, I recognize that different handles can reference the same BLOB.

    The way I see it is that there just isn't any true PK (in the Relational Theory sense) for a LO other than the complete object itself.

    In this case, however, the distinct handles do, in fact, map to distinct photos.

    But, I recognize this isn't the case, generally, and will be using ifx_checksum() as a surogate for a PK.  Not theoretically infallible, but I'm willing to use it-- at least until I encounter a problem in practice.

    Again, thank you.

    DG

    ------------------------------
    David Grove
    ------------------------------



  • 4.  RE: Non-Hashable Data Type

    IBM Champion
    Posted Fri June 10, 2022 07:07 PM
    Edited by System Fri January 20, 2023 04:17 PM
    This situation that doesn't make sense to me.  I am referring to being able to SELECT a DISTINCT column that is non-hashable and actually have it execute and return a resultset).  I wonder what is actually being returned.  Is the DISTINCT value returned by selecting a smart blob column some derivative of its handle?   You can't display a handle, per se, as a single value in the resultset, right?   I mean a handle is a data structure (with several component elements).  What I see displayed (for the smart blob column) looks like this (which I am confident we are all familiar with):

    <blob; length = 46582>

    Where, exactly, does that come from?  Is it derived from a component of the handle?  It certainly is not a unique identifier for the handle or blob because there can be multiple different smart blobs with the exact same length.  In fact, is there even such a thing as a unique identifier for the handle of a large object?

    I'm thinking that when I use function 'ifx_checksum()', I am essentially creating a (almost certainly) unique identifier for a smart large object.  Sure, it does not have the data elements that comprise a handle, but, even without that information, it could serve as a surrogate for a unique identifier for the large object, right?  At least to the extent that that hash function is unique.

    Sigh, I am finding it challenging to delve deeply into smart blobs.

    Thank you for any comments.

    Regards,

    DG

    ------------------------------
    David Grove
    ------------------------------



  • 5.  RE: Non-Hashable Data Type

    IBM Champion
    Posted Tue June 14, 2022 08:12 PM
    I still don't know how "<blob; length = 46582>" is represented internally.

    Because I was now dealing with a table that had about 125,000 photos, and the length of all of those photos is limited to 5 figures, that representation cannot possibly be unique with respect to the photos.  But (as I will explain shortly) it is.  So, there is some more detailed representation "behind the scenes".  Probably something to do with the handle.

    In my ignorance, I presumed that the hash function "ifx_checksum()" was likely "more unique" than the <blob; length=46582>" piece of data.

    I did a comparison between number of rows returned by "SELECT DISTINCT <photo column> FROM <table>;" and the value returned by "SELECT COUNT( DISTINCT ifx_checksum(<photo column>, 0)) FROM <table>;"  I figured that the first might produce a count of distinct photos (by counting the number of rows), and the second would almost certainly be unique because of the very, very small likelihood of duplicating a hash value.

    I observed a difference of "1".  The first produced 125,143 rows, and the second produced a COUNT of 125,144.

    I spent much time tracking down the discrepancy.

    Turned out that the first query, which simply depended on the SELECT DISTINCT <photo column> produced one row for each unique photo, and the COUNT from ifx_checksum() function produced a value that was 1 less.  This was because two different photos hashed to the same value.  (The two photos were taken several years apart at locations separated by hundreds of miles, and were of two different races and sexes.)  But, the "plain vanilla" 'DISTINCT <photo column>' correctly preserved the uniqueness of all the images.

    I wonder how that works.  To be 100% certain of uniqueness, (without hashing) wouldn't each of the 125K+ photos have to be compared to every photo?  That would have taken vastly more time than it actually did.  And, If it did hash (even though Informix reports it as non-hashable), well, even then, hash codes are not absolutely, positively, 100% guaranteed in all cases, unique, right?  Whatever the magic, the ifx_checksum() hash function was not as effective as whatever built-in mechanism operated with the 'DISTINCT <photo column>'.

    I had thought it would take millions (or even billions) of photos for 'ifx_checksum()' to have an expectation of producing duplicate values.  But, it only took ~125K in this case.

    So, using DISTINCT in the PROJECT clause (for a value that isn't even hashable according to sysxtdtypes) is "more unique" than using the explicit hash function, 'ifx_checksum()'.

    Still, the use of DISTINCT remains limited because, if I try to form it this way:

    "SELECT COUNT(DISTINCT <photo column>) FROM <table>;"

    it fails with the message that <photo column> is not hashable.

    I have not yet opened a case to try to get an answer because I'm waiting for another case to be resolved first.

    It's just an interesting mystery-- at least to me.

    DG

    ------------------------------
    David Grove
    ------------------------------