future support of UTF-8 character handling

7. RE: future support of UTF-8 character handling

Like

Roland Wintgen

Posted Mon April 28, 2025 09:06 AM

Sebastien,

I'm already familiar with the meaning of SQL_LOGICAL_CHAR, but I think this was more of a workaround to get around the real problem, which is the handling of multibyte characters. I can imagine that the support of multi-byte characters is not trivial and the implementation should also be properly done. I completely agree with you.

So maybe it would be better to introduce an environment variable SQL_CHAR_SEMANTICS (according to https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_localization_022.html) instead of an optional parameter BYTE or CHAR for a CHAR data type, which is used when creating a new database to specify how strings should be handled. This would at least prevent there being several columns within a table that would differ in the handling of single or multibyte characters.

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach
------------------------------

Original Message

Original Message:
Sent: Mon April 28, 2025 08:01 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Hi Roland,

I am maybe missing some details but from my knowledge, SQL_LOGICAL_CHAR is basically a ratio that applies to define the size of CHAR/VARCHAR columns when creating an SQL table.

So when you have SQL_LOGICAL_CHAR=3, it's as if you would do it by yourself, by converting

CREATE TABLE tab ( ... column_xyz VARCHAR(50) ... )

to:

CREATE TABLE tab ( ... column_xyz VARCHAR(150) ... ) -- 50 x 3 = 150 bytes

But all length semantics are still based on byte units...

So SQL_LOGICAL_CHAR is rather just a helper to get enough bytes to store your UTF-8 strings.

Implementing complete support for char length semantics is quite some work!!!

I prefer that Informix team takes the time to do that properly...

About mixing length semantics and character sets, imagine you have a single table with 2 columns, using different length semantics and charsets.

If you fetch your data into a program variable, the language you use must also support the distinction of length semantics and charset AT THE VARIABLE level... ready for that?

BTW, see also:

https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_odiagifx_003.html

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Mon April 28, 2025 07:36 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Sebastien,

thank you for you thoughts.

I already upvoted the RFE, but I doubt that it will be implemented again at some point, even if the status is "future consideration".

The best solution would of course be if Informix would adapt the CHAR data type, so that it would generally only support multibyte characters and you would therefore no longer have to worry about how many bytes a character consists of.

I have no idea why this was not done and instead the workaround via SQL_LOGICAL_CHAR was taken. Maybe they wanted to offload the problem onto the database developer instead of solving the problem right away?

Perhaps I didn't understand your comment correctly. You say that introducing a new data type for multibyte characters would not be a good idea. On the other hand, you don't want to mix single and multibyte characters because that would lead to problems in application development. But wouldn't this mixing be the case if it would depend on whether CHAR columns could be defined as either a single or multibyte character, as described in my proposal?

Do you think there is a better solution?

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach

Original Message:
Sent: Mon April 28, 2025 06:20 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Hello Roland,

+1

I have already suggested this to IBM/HCL a while ago, and someone reported this idea a decade ago:

https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-190

https://ideas.ibm.com/search?query=INFX-I-190

The concept you are looking for is called "byte/char length semantics", to define in which unit a length is expressed, either in a CHAR/VARCHAR size, in subscripts ( col[5,8] ) or functions such as LENGTH(expr)

With other SQL engines, you can define the byte or char(acter) length semantics at various levels (server, database, session, column) - as you discovered with O..

When defined at a global level (session, database), it applies when you create a table with a VARCHAR(50) without specifying any BYTE/CHAR clause.

Some SQL engines ONLY support char length semantics.

To me, introducing new character data types is not a good idea:
You don't want to mix byte-length and char-length semantics: It's a nightmare with application code.
Migrating from a Single Byte Character Set like ISO-8859-15 to UTF-8 should be done for the whole database (all CHAR/VARCHAR columns).
Migration must include application code to handle UTF-8 and the concept of byte/char length semantics.
C-based programs use C lib functions to manage UTF-8 character strings, based on LANG/LC_ALL/locale settings.

If you want more details you have see how to support this with Genero BDL:
https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_localization_038.html

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Fri April 25, 2025 03:39 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Hello Informix community,

the fact that the current implementation of UTF-8 in the Informix database is not optimal and difficult to handle is probably known to everyone who has ever dealt with this topic.

Before I submit an RFE, I would like to know the opinions of other members if they would support my proposal or perhaps have a better idea on how to improve UTF-8 strings in future Informix releases.

The current implementation for supporting UTF-8 character strings requires the use of the SQL_LOGICAL_CHAR environment variable, which is, however, quite inflexible (see also RFE https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-244). If you set the value to 2 and over time discover that a single character in another character set requires 3 or even 4 bytes, this can in the worst case lead to values being truncated when saving to a table column (the fact that neither a warning nor an SQL error is generated in these cases is a different problem, but does not make things any easier; see also the RFE https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-8). This would have a particularly tragic effect when using ENCRYPT_AES(). In this case, the encrypted values may not be recoverable and are therefore lost.

The topic was also discussed at the last Informix Technology Days in Munich and JC Lengyel then suggested that perhaps new data types for characters or character strings should be introduced in future that would be especially suitable for processing UTF-8 characters. If I have understood correctly, IBM Db2 uses the GRAPHIC and VARGRAPHIC data types for this purpose. However, the introduction of new data types is problematic, especially with regard to compatibility with legacy applications, as these may have to be adapted manually, which requires a lot of effort, and there may be cases where access to these applications is not possible, so that they cannot be migrated at all.

During my research on this topic, I may have come across a better and simpler alternative that does not require the introduction of new data types.
When defining character strings in database tables, another large database manufacturer, whose name starts with O, offers the option of marking them as UTF-8 or multibyte via an optional parameter (CHAR) (https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Data-Types.html#GUID-7B72E154-677A-4342-A1EA-C74C1EA928E6). If this option is not used or the value BYTE is specified, the data type behaves like the standard type, which can only store 1-byte characters.

An example could look like this:

CREATE TABLE test_byte
(my_column CHAR(10 BYTE))

In this case, the column my_column can hold exactly 10 bytes (this corresponds to the current behavior, unless SQL_LOGICAL_CHAR is used). If a UTF-8 character string consisting of several bytes is saved, there is a risk that it will be truncated and the data will be incorrect when retrieved.

However, if the CHAR option is used, as in this example

CREATE TABLE test_char
(my_column CHAR(10 CHAR))

the my_column can always store 10 characters, regardless of whether a character consists of 1, 2, 3 or 4 bytes (this may depend on the value of the DB_LOCALE environment variable that was used when the database was created).

All functions that process character strings, e.g. LENGTH() or the substring routines, as well as the functions for encrypting and decrypting table columns (ENCRYPT_AES/ENCRYPT_TDES, DECRYPT_CHAR/DECRYPT_BINARY) would of course have to be adapted and, when specifying lengths, the option in the definition of the data type would have to be taken into account accordingly.

From my point of view, the implementation of an optional CHAR or BYTE parameter for string data types in Informix would be a viable solution to the problem for both database administrators and developers and would avoid the introduction of new data types that are specifically suited to the use of multibyte characters and would even introduce new problems for legacy applications.

What is your opinion on this topic?
Are there any suggestions or improvements or do you think it would make sense to submit my idea as an RFE?

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach
------------------------------

8. RE: future support of UTF-8 character handling

Like

Sebastien FLAESCH

Posted Mon April 28, 2025 10:27 AM

Roland,

Basically agreed.

I assume/hope HCL Informix team has some ideas and resources for that.

The big part of the work is internal stuff I guess (storage, char string functions, etc).

To me support for UTF-8 is a complete migration of the whole database / tables.

So instead an environment variable or new SQL option at the table or column level, I would rather expect a global onconfig parameter (for all databases created by the server), or an option when creating the database with CREATE DATABASE ... WITH LENGTH_SEMANTICS=CHAR/BYTE ( to be used with DB_LOCALE/CLIENT_LOCALE = *.utf8 )

Then dbexport from the ISO-8859-15 database, and dbimport into a new UTF-8 database.

I really doubt someone can manage to have mix of SBCS and UTF-8 in the same DB, with the same client application accessing both kind of char string data.

If it's based on an env var, as you suggest, then it must work just like when using DB_LOCALE defining the db locale at CREATE DATABASE time.
Then, if you use a different setting when connection to the database, Informix should give an error like when using a wrong DB_LOCALE.

Seb

------------------------------
Sebastien FLAESCH
------------------------------

Original Message

Original Message:
Sent: Mon April 28, 2025 09:05 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Sebastien,

I'm already familiar with the meaning of SQL_LOGICAL_CHAR, but I think this was more of a workaround to get around the real problem, which is the handling of multibyte characters. I can imagine that the support of multi-byte characters is not trivial and the implementation should also be properly done. I completely agree with you.

So maybe it would be better to introduce an environment variable SQL_CHAR_SEMANTICS (according to https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_localization_022.html) instead of an optional parameter BYTE or CHAR for a CHAR data type, which is used when creating a new database to specify how strings should be handled. This would at least prevent there being several columns within a table that would differ in the handling of single or multibyte characters.

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach

Original Message:
Sent: Mon April 28, 2025 08:01 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Hi Roland,

I am maybe missing some details but from my knowledge, SQL_LOGICAL_CHAR is basically a ratio that applies to define the size of CHAR/VARCHAR columns when creating an SQL table.

So when you have SQL_LOGICAL_CHAR=3, it's as if you would do it by yourself, by converting

CREATE TABLE tab ( ... column_xyz VARCHAR(50) ... )

to:

CREATE TABLE tab ( ... column_xyz VARCHAR(150) ... ) -- 50 x 3 = 150 bytes

But all length semantics are still based on byte units...

So SQL_LOGICAL_CHAR is rather just a helper to get enough bytes to store your UTF-8 strings.

Implementing complete support for char length semantics is quite some work!!!

I prefer that Informix team takes the time to do that properly...

About mixing length semantics and character sets, imagine you have a single table with 2 columns, using different length semantics and charsets.

If you fetch your data into a program variable, the language you use must also support the distinction of length semantics and charset AT THE VARIABLE level... ready for that?

BTW, see also:

https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_odiagifx_003.html

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Mon April 28, 2025 07:36 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Sebastien,

thank you for you thoughts.

I already upvoted the RFE, but I doubt that it will be implemented again at some point, even if the status is "future consideration".

The best solution would of course be if Informix would adapt the CHAR data type, so that it would generally only support multibyte characters and you would therefore no longer have to worry about how many bytes a character consists of.

I have no idea why this was not done and instead the workaround via SQL_LOGICAL_CHAR was taken. Maybe they wanted to offload the problem onto the database developer instead of solving the problem right away?

Perhaps I didn't understand your comment correctly. You say that introducing a new data type for multibyte characters would not be a good idea. On the other hand, you don't want to mix single and multibyte characters because that would lead to problems in application development. But wouldn't this mixing be the case if it would depend on whether CHAR columns could be defined as either a single or multibyte character, as described in my proposal?

Do you think there is a better solution?

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach

Original Message:
Sent: Mon April 28, 2025 06:20 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Hello Roland,

+1

I have already suggested this to IBM/HCL a while ago, and someone reported this idea a decade ago:

https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-190

https://ideas.ibm.com/search?query=INFX-I-190

The concept you are looking for is called "byte/char length semantics", to define in which unit a length is expressed, either in a CHAR/VARCHAR size, in subscripts ( col[5,8] ) or functions such as LENGTH(expr)

With other SQL engines, you can define the byte or char(acter) length semantics at various levels (server, database, session, column) - as you discovered with O..

When defined at a global level (session, database), it applies when you create a table with a VARCHAR(50) without specifying any BYTE/CHAR clause.

Some SQL engines ONLY support char length semantics.

To me, introducing new character data types is not a good idea:
You don't want to mix byte-length and char-length semantics: It's a nightmare with application code.
Migrating from a Single Byte Character Set like ISO-8859-15 to UTF-8 should be done for the whole database (all CHAR/VARCHAR columns).
Migration must include application code to handle UTF-8 and the concept of byte/char length semantics.
C-based programs use C lib functions to manage UTF-8 character strings, based on LANG/LC_ALL/locale settings.

If you want more details you have see how to support this with Genero BDL:
https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_localization_038.html

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Fri April 25, 2025 03:39 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Hello Informix community,

the fact that the current implementation of UTF-8 in the Informix database is not optimal and difficult to handle is probably known to everyone who has ever dealt with this topic.

Before I submit an RFE, I would like to know the opinions of other members if they would support my proposal or perhaps have a better idea on how to improve UTF-8 strings in future Informix releases.

The current implementation for supporting UTF-8 character strings requires the use of the SQL_LOGICAL_CHAR environment variable, which is, however, quite inflexible (see also RFE https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-244). If you set the value to 2 and over time discover that a single character in another character set requires 3 or even 4 bytes, this can in the worst case lead to values being truncated when saving to a table column (the fact that neither a warning nor an SQL error is generated in these cases is a different problem, but does not make things any easier; see also the RFE https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-8). This would have a particularly tragic effect when using ENCRYPT_AES(). In this case, the encrypted values may not be recoverable and are therefore lost.

The topic was also discussed at the last Informix Technology Days in Munich and JC Lengyel then suggested that perhaps new data types for characters or character strings should be introduced in future that would be especially suitable for processing UTF-8 characters. If I have understood correctly, IBM Db2 uses the GRAPHIC and VARGRAPHIC data types for this purpose. However, the introduction of new data types is problematic, especially with regard to compatibility with legacy applications, as these may have to be adapted manually, which requires a lot of effort, and there may be cases where access to these applications is not possible, so that they cannot be migrated at all.

During my research on this topic, I may have come across a better and simpler alternative that does not require the introduction of new data types.
When defining character strings in database tables, another large database manufacturer, whose name starts with O, offers the option of marking them as UTF-8 or multibyte via an optional parameter (CHAR) (https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Data-Types.html#GUID-7B72E154-677A-4342-A1EA-C74C1EA928E6). If this option is not used or the value BYTE is specified, the data type behaves like the standard type, which can only store 1-byte characters.

An example could look like this:

CREATE TABLE test_byte
(my_column CHAR(10 BYTE))

In this case, the column my_column can hold exactly 10 bytes (this corresponds to the current behavior, unless SQL_LOGICAL_CHAR is used). If a UTF-8 character string consisting of several bytes is saved, there is a risk that it will be truncated and the data will be incorrect when retrieved.

However, if the CHAR option is used, as in this example

CREATE TABLE test_char
(my_column CHAR(10 CHAR))

the my_column can always store 10 characters, regardless of whether a character consists of 1, 2, 3 or 4 bytes (this may depend on the value of the DB_LOCALE environment variable that was used when the database was created).

All functions that process character strings, e.g. LENGTH() or the substring routines, as well as the functions for encrypting and decrypting table columns (ENCRYPT_AES/ENCRYPT_TDES, DECRYPT_CHAR/DECRYPT_BINARY) would of course have to be adapted and, when specifying lengths, the option in the definition of the data type would have to be taken into account accordingly.

From my point of view, the implementation of an optional CHAR or BYTE parameter for string data types in Informix would be a viable solution to the problem for both database administrators and developers and would avoid the introduction of new data types that are specifically suited to the use of multibyte characters and would even introduce new problems for legacy applications.

What is your opinion on this topic?
Are there any suggestions or improvements or do you think it would make sense to submit my idea as an RFE?

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach
------------------------------

9. RE: future support of UTF-8 character handling

Like

Roland Wintgen

Posted Mon April 28, 2025 11:34 AM

Sebastien,

Agreed. Let's keep our fingers crossed that maybe a future version 16.0 will offer this feature.

As the owner, IBM should actually be able to provide sufficient financial and human resources. Now that the restrictions on storage limits have been removed, this would be an appropriate follow-up project to tackle.

I'm surprised that not many more customers have the need to deal with UTF-8 character strings and wonder how they handle them. After all, the world is becoming more and more globalized and it is actually sad that a professional database does not offer appropiate support out of the box. In this aspect, some open-source databases actually seem to have an advantage.

Maybe someone will take up the subject before we retire.

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach
------------------------------

Original Message

Original Message:
Sent: Mon April 28, 2025 10:27 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Roland,

Basically agreed.

I assume/hope HCL Informix team has some ideas and resources for that.

The big part of the work is internal stuff I guess (storage, char string functions, etc).

To me support for UTF-8 is a complete migration of the whole database / tables.

So instead an environment variable or new SQL option at the table or column level, I would rather expect a global onconfig parameter (for all databases created by the server), or an option when creating the database with CREATE DATABASE ... WITH LENGTH_SEMANTICS=CHAR/BYTE ( to be used with DB_LOCALE/CLIENT_LOCALE = *.utf8 )

Then dbexport from the ISO-8859-15 database, and dbimport into a new UTF-8 database.

I really doubt someone can manage to have mix of SBCS and UTF-8 in the same DB, with the same client application accessing both kind of char string data.

If it's based on an env var, as you suggest, then it must work just like when using DB_LOCALE defining the db locale at CREATE DATABASE time.
Then, if you use a different setting when connection to the database, Informix should give an error like when using a wrong DB_LOCALE.

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Mon April 28, 2025 09:05 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Sebastien,

I'm already familiar with the meaning of SQL_LOGICAL_CHAR, but I think this was more of a workaround to get around the real problem, which is the handling of multibyte characters. I can imagine that the support of multi-byte characters is not trivial and the implementation should also be properly done. I completely agree with you.

So maybe it would be better to introduce an environment variable SQL_CHAR_SEMANTICS (according to https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_localization_022.html) instead of an optional parameter BYTE or CHAR for a CHAR data type, which is used when creating a new database to specify how strings should be handled. This would at least prevent there being several columns within a table that would differ in the handling of single or multibyte characters.

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach

Original Message:
Sent: Mon April 28, 2025 08:01 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Hi Roland,

I am maybe missing some details but from my knowledge, SQL_LOGICAL_CHAR is basically a ratio that applies to define the size of CHAR/VARCHAR columns when creating an SQL table.

So when you have SQL_LOGICAL_CHAR=3, it's as if you would do it by yourself, by converting

CREATE TABLE tab ( ... column_xyz VARCHAR(50) ... )

to:

CREATE TABLE tab ( ... column_xyz VARCHAR(150) ... ) -- 50 x 3 = 150 bytes

But all length semantics are still based on byte units...

So SQL_LOGICAL_CHAR is rather just a helper to get enough bytes to store your UTF-8 strings.

Implementing complete support for char length semantics is quite some work!!!

I prefer that Informix team takes the time to do that properly...

About mixing length semantics and character sets, imagine you have a single table with 2 columns, using different length semantics and charsets.

If you fetch your data into a program variable, the language you use must also support the distinction of length semantics and charset AT THE VARIABLE level... ready for that?

BTW, see also:

https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_odiagifx_003.html

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Mon April 28, 2025 07:36 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Sebastien,

thank you for you thoughts.

I already upvoted the RFE, but I doubt that it will be implemented again at some point, even if the status is "future consideration".

The best solution would of course be if Informix would adapt the CHAR data type, so that it would generally only support multibyte characters and you would therefore no longer have to worry about how many bytes a character consists of.

I have no idea why this was not done and instead the workaround via SQL_LOGICAL_CHAR was taken. Maybe they wanted to offload the problem onto the database developer instead of solving the problem right away?

Perhaps I didn't understand your comment correctly. You say that introducing a new data type for multibyte characters would not be a good idea. On the other hand, you don't want to mix single and multibyte characters because that would lead to problems in application development. But wouldn't this mixing be the case if it would depend on whether CHAR columns could be defined as either a single or multibyte character, as described in my proposal?

Do you think there is a better solution?

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach

Original Message:
Sent: Mon April 28, 2025 06:20 AM
From: Sebastien FLAESCH
Subject: future support of UTF-8 character handling

Hello Roland,

+1

I have already suggested this to IBM/HCL a while ago, and someone reported this idea a decade ago:

https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-190

https://ideas.ibm.com/search?query=INFX-I-190

The concept you are looking for is called "byte/char length semantics", to define in which unit a length is expressed, either in a CHAR/VARCHAR size, in subscripts ( col[5,8] ) or functions such as LENGTH(expr)

With other SQL engines, you can define the byte or char(acter) length semantics at various levels (server, database, session, column) - as you discovered with O..

When defined at a global level (session, database), it applies when you create a table with a VARCHAR(50) without specifying any BYTE/CHAR clause.

Some SQL engines ONLY support char length semantics.

To me, introducing new character data types is not a good idea:
You don't want to mix byte-length and char-length semantics: It's a nightmare with application code.
Migrating from a Single Byte Character Set like ISO-8859-15 to UTF-8 should be done for the whole database (all CHAR/VARCHAR columns).
Migration must include application code to handle UTF-8 and the concept of byte/char length semantics.
C-based programs use C lib functions to manage UTF-8 character strings, based on LANG/LC_ALL/locale settings.

If you want more details you have see how to support this with Genero BDL:
https://4js.com/online_documentation/fjs-fgl-manual-html/#fgl-topics/c_fgl_localization_038.html

Seb

------------------------------
Sebastien FLAESCH

Original Message:
Sent: Fri April 25, 2025 03:39 AM
From: Roland Wintgen
Subject: future support of UTF-8 character handling

Hello Informix community,

the fact that the current implementation of UTF-8 in the Informix database is not optimal and difficult to handle is probably known to everyone who has ever dealt with this topic.

Before I submit an RFE, I would like to know the opinions of other members if they would support my proposal or perhaps have a better idea on how to improve UTF-8 strings in future Informix releases.

The current implementation for supporting UTF-8 character strings requires the use of the SQL_LOGICAL_CHAR environment variable, which is, however, quite inflexible (see also RFE https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-244). If you set the value to 2 and over time discover that a single character in another character set requires 3 or even 4 bytes, this can in the worst case lead to values being truncated when saving to a table column (the fact that neither a warning nor an SQL error is generated in these cases is a different problem, but does not make things any easier; see also the RFE https://ibm-data-and-ai.ideas.ibm.com/ideas/INFX-I-8). This would have a particularly tragic effect when using ENCRYPT_AES(). In this case, the encrypted values may not be recoverable and are therefore lost.

The topic was also discussed at the last Informix Technology Days in Munich and JC Lengyel then suggested that perhaps new data types for characters or character strings should be introduced in future that would be especially suitable for processing UTF-8 characters. If I have understood correctly, IBM Db2 uses the GRAPHIC and VARGRAPHIC data types for this purpose. However, the introduction of new data types is problematic, especially with regard to compatibility with legacy applications, as these may have to be adapted manually, which requires a lot of effort, and there may be cases where access to these applications is not possible, so that they cannot be migrated at all.

During my research on this topic, I may have come across a better and simpler alternative that does not require the introduction of new data types.
When defining character strings in database tables, another large database manufacturer, whose name starts with O, offers the option of marking them as UTF-8 or multibyte via an optional parameter (CHAR) (https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Data-Types.html#GUID-7B72E154-677A-4342-A1EA-C74C1EA928E6). If this option is not used or the value BYTE is specified, the data type behaves like the standard type, which can only store 1-byte characters.

An example could look like this:

CREATE TABLE test_byte
(my_column CHAR(10 BYTE))

In this case, the column my_column can hold exactly 10 bytes (this corresponds to the current behavior, unless SQL_LOGICAL_CHAR is used). If a UTF-8 character string consisting of several bytes is saved, there is a risk that it will be truncated and the data will be incorrect when retrieved.

However, if the CHAR option is used, as in this example

CREATE TABLE test_char
(my_column CHAR(10 CHAR))

the my_column can always store 10 characters, regardless of whether a character consists of 1, 2, 3 or 4 bytes (this may depend on the value of the DB_LOCALE environment variable that was used when the database was created).

All functions that process character strings, e.g. LENGTH() or the substring routines, as well as the functions for encrypting and decrypting table columns (ENCRYPT_AES/ENCRYPT_TDES, DECRYPT_CHAR/DECRYPT_BINARY) would of course have to be adapted and, when specifying lengths, the option in the definition of the data type would have to be taken into account accordingly.

From my point of view, the implementation of an optional CHAR or BYTE parameter for string data types in Informix would be a viable solution to the problem for both database administrators and developers and would avoid the introduction of new data types that are specifically suited to the use of multibyte characters and would even introduce new problems for legacy applications.

What is your opinion on this topic?
Are there any suggestions or improvements or do you think it would make sense to submit my idea as an RFE?

Kind regards

------------------------------
Roland Wintgen
DBA, Genero/4GL Developer
EVG Martens GmbH & Co. KG
Moenchengladbach
------------------------------

Informix

Informix