JSON and UTF-8

5. RE: JSON and UTF-8

Like

Frank Swarbrick

IBM Champion

Posted Fri May 14, 2021 02:11 PM

I think we're getting closer, but I'm still not quite clear on the current behavior. Using UWIDTH in your example doesn't quite work, in that the final '}' is truncated from the result. Take a look at this. I've commented out cases that give incorrect results.

 IDENTIFICATION DIVISION.                             
 PROGRAM-ID.  'UTF8'.                                 
 ENVIRONMENT DIVISION.                                
 CONFIGURATION SECTION.                               
 REPOSITORY.                                          
     FUNCTION ALL INTRINSIC.                          
 DATA DIVISION.                                       
 WORKING-STORAGE SECTION.                             
 01  doc-data.                                        
     05  l              PIC X(10) VALUE 'é'.          
 01  doc-utf8           PIC U BYTE-LENGTH 40.         
 01  document           REDEFINES doc-utf8            
                        PIC X(40).                    
 01  doc-len            PIC 9(4).                     
 01  doc-width          PIC 9(4).                     
 PROCEDURE DIVISION.                                  
     move low-values to document                      
     JSON GENERATE document                           
          FROM l                                      
          COUNT doc-len                               
     COMPUTE doc-width = ULENGTH(document(1:doc-len)) 
     DISPLAY doc-len '/' doc-width                    
     DISPLAY doc-utf8                                 
     DISPLAY doc-utf8(1:doc-len)                      
**** DISPLAY doc-utf8(1:doc-width)                    
                                                      
**** DISPLAY doc-utf8(1:doc-width)                    
****         UPON CONSOLE                             
**** DISPLAY "'" doc-utf8(1:doc-width) "'"            
****         UPON CONSOLE                             
     DISPLAY doc-utf8(1:doc-len)                      
             UPON CONSOLE                             
     DISPLAY "'" doc-utf8(1:doc-len) "'"              
             UPON CONSOLE                             
                                                      
     DISPLAY DISPLAY-OF(doc-utf8)(1:doc-width)        
     DISPLAY "'" DISPLAY-OF(doc-utf8)(1:doc-width) "'"
                                                      
     GOBACK.                                          
 END PROGRAM 'UTF8'.

So the only case where doc-width (UWIDTH) is needed is in the case where the DISPLAY-OF function is being used. And actually, while the following currently is not allowed, I think if it were allowed then it also would work:

DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))

Currently this results in a compile time error:

IGYPA3092-S   Argument-1 "DOC-UTF8 (ALPHANUMERIC REFERENCE MODIFIED    
              ITEM)" for function "ALPHANUMERIC FUNCTION DISPLAY-OF" is
              not an item of class national.  The statement was        
              discarded.

So it looks like reference modification is still "encoding-unit" based, rather than "character" based (which is why your example truncated the last character). In the case of DISPLAY-OF I am converting the entire declared string (including the trailing low-value bytes) first, and then using the UWIDTH result to ref-mod it. But if DISPLAY-OF were changed to accept a ref-modified UTF-8 field as input I don't think UWIDTH would be needed at all (for what I am trying to do here).

------------------------------
Frank Swarbrick
------------------------------

Original Message

Original Message:
Sent: Thu May 13, 2021 12:45 PM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank

I think I see the problem you are pointing out now.
To work around this, you could use ULENGTH function to obtain the number of utf-8 "characters" (aka UTF code points) and then use that to reference-modify doc-utf8.
See the following example.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'UTF8'.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  doc-data.           05  location       PIC X(10) VALUE 'é'.       01  doc-utf8           PIC U BYTE-LENGTH 256.       01  document           REDEFINES doc-utf8                              PIC X(256).       01  doc-len            PIC 9(4).       01  ddoc-utf8          PIC U DYNAMIC LENGTH.       PROCEDURE DIVISION.           move low-values to document           JSON GENERATE document                FROM location                COUNT doc-len           DISPLAY doc-len           DISPLAY "'" doc-utf8(1:function ulength(document(1:doc-len)))             "'" upon console           GOBACK.       END PROGRAM 'UTF8'.

As things currently stand, the COUNT phrase returns the number of JSON character encoding units (where an encoding unit is 1 byte for generating utf-8 into an alphanumeric item, and 2 bytes for generating utf-16 into national items).
I'm guessing it might be useful to also allow users to have COUNT (or something like it) to return the number of generated utf-8 "characters" (code points)?

------------------------------
Jeffery Shimoda

Original Message:
Sent: Wed May 12, 2021 02:08 AM
From: Frank Swarbrick
Subject: JSON and UTF-8

There actually seem to be two issues. The first is unrelated to UTF-8, and in fact I already have a ticket in to Level II for it. That is, when the NAME IS OMITTED clause is specified, there is a "trailing space" in the result of JSON GENERATE.

That still leaves the multi-byte UTF8 issue. The COUNT result is getting the total number of bytes, rather than the total number of "characters". So when reference modification, which is based on characters rather than bytes, is used it does not work correctly.

I've attached a program used for additional testing.

Result when NAME IS OMITTED is specified and the destination field is set to all UTF-8 'X' characters:

0065
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'

Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 'X' characters:

0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'

Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 space characters:

0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'

As you can see, the only reliable method is to 1) pre-initialize the destination field with all UTF-8 spaces, 2) use the TRIM function to eliminate the trailing spaces.

------------------------------
Frank Swarbrick

Original Message:
Sent: Fri May 07, 2021 10:40 AM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank,
I agree it would be very useful to have the JSON GENERATE and JSON PARSE statements directly support PIC U items (with and without DYNAMIC LENGTH). There is certainly a dissonance right now between the UTF-8/DYNAMIC LENGTH support, and the JSON GENERATE/PARSE statements. The dev team is aware of it and we would like to rectify that.

Regarding your observation about the extra byte at the end, I compiled and ran your program and I'm finding a slightly different problem - I found a space, not a low-value byte. Also, I noticed that the value of doc-len is output as 0067, when in fact I believe it should be 0066. It seems the root cause of the problem is the combination of having 'e' (with the accent character) along with using the NAME IS OMITTED phrase.

I created a cutdown version of your test below that further simplifies the json.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'UTF8'.       ENVIRONMENT DIVISION.       CONFIGURATION SECTION.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  doc-data.           05  location       PIC X(10) VALUE 'é'.       01  doc-utf8           PIC U BYTE-LENGTH 256.       01  document           REDEFINES doc-utf8                              PIC X(256).       01  doc-len            PIC 9(4).       01  ddoc-utf8          PIC U DYNAMIC LENGTH.       PROCEDURE DIVISION.           move low-values to document           JSON GENERATE document      *         FROM doc-data                FROM location                COUNT doc-len      *         NAME doc-data IS OMITTED           DISPLAY doc-len           DISPLAY "'" function hex-of(document(1:doc-len)) "'"           DISPLAY "'" doc-utf8(1:doc-len) "'" upon console           GOBACK.       END PROGRAM 'UTF8'.

When I remove the NAME IS OMITTED phrase and generate directly from LOCATION, the doc-len is 19. When using the NAME IS OMITTED phrase generating from DOC-DATA, the doc-len is 20, and the data contains an superfluous space (utf-8 hex code 0x20) at the end. I'm going to take a look at this because I think it should be 19 in both cases.

------------------------------
Jeffery Shimoda

Original Message:
Sent: Fri April 30, 2021 06:36 PM
From: Frank Swarbrick
Subject: JSON and UTF-8

So it annoyed me from initial implementation that JSON documents had to be in UTF-8 instead of EBCDIC. Upon hearing about Enterprise COBOL supporting a UTF-8 datatype I thought that this would resolve most of my concerns. But I still can't do what it seems to me should be an obvious thing to do. That is, one should be able to do a JSON GENERATE to a UTF-8 data item. And of course conversely do a JSON PARSE of a UTF-8 data item (though I've not even bothered to try this one.)

Without this feature here's an example of my attempts to do a simple display of a generated JSON document.

       IDENTIFICATION DIVISION.
       PROGRAM-ID.  'UTF8'.

       ENVIRONMENT DIVISION. 
       CONFIGURATION SECTION. 
       REPOSITORY. 
           FUNCTION DISPLAY-OF INTRINSIC.
 
       DATA DIVISION. 
       WORKING-STORAGE SECTION. 
       01  doc-data.
           05  userId         PIC X(20) VALUE 'FJS'.
           05  userName       PIC X(30) VALUE 'Frank Swarbrick'.
           05  location       PIC X(10) VALUE 'Café'.

       01  doc-utf8           PIC U BYTE-LENGTH 256.
       01  document           REDEFINES doc-utf8 
                              PIC X(256).
       01  doc-len            PIC 9(4).
       01  ddoc-utf8          PIC U DYNAMIC LENGTH.

       PROCEDURE DIVISION.
           JSON GENERATE document 
                FROM doc-data
                COUNT doc-len 
                NAME doc-data IS OMITTED

           DISPLAY doc-len
           DISPLAY document(1:doc-len)
           DISPLAY doc-utf8(1:doc-len)
      *    DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))
           DISPLAY DISPLAY-OF(doc-utf8)(1:doc-len)
           MOVE doc-utf8(1:doc-len) to ddoc-utf8
           DISPLAY DISPLAY-OF(ddoc-utf8)
           GOBACK.

The first two attempts "work" in as much as they display the document in UTF-8. So if you use "DISPLAY UTF-8" in a browse session, the results are correct. But converting to EBCDIC is problematic, because the 'é' character is two bytes in UTF-8 but only one byte in EBCDIC. So with the code I have here using the DISPLAY-OF function there ends up being an extra byte at the end (low-values in this case, because that's what the field get's "initialized" to.

I'd love to hear someone from the IBM COBOL compiler team address this issue.

------------------------------
Frank Swarbrick
------------------------------

6. RE: JSON and UTF-8

Like

Jeffery Shimoda

Posted Mon May 17, 2021 04:10 PM

Hi Frank,

I was playing around with this a bit more. You are right - when reference modifying doc-utf8 the length is the number of utf-8 encoding units (my apologies for the confusion...I was likely confusing it with reference modifying a 'fixed character' UTF-8 item, where the length is the number of 'characters').

Also, I'm going to bring up the example where you reference modified doc-utf8 within DISPLAY-OF to the team - it does seem like that could avoid an extra call to ULENGTH.

One thing that I wanted to clarify with you - what is your hex encoding of the 'e with an accent' character? In your original post you mentioned it was one byte, but I am seeing it as two bytes in my code (this could be an artifact of copying/pasting from the web?).

When inspecting my source code, the hex encoding of é is X'66B4'. When I convert that into utf-8, I get X'C383C2A9', which seems to be two distinct utf code points, each containing 2 encoding units.

Then, when I pass that back into DISPLAY-OF, it gets converted back into X'66B4' (as far as i can tell), then displayed.

On my terminal (TERM="dtterm" in USS via ssh) this gets displayed as a single é followed immediately by what appears to be a space character, but I believe it is not a real space character - I think my terminal just cannot properly handle the double byte character.

Here's my program for this experiment.

       IDENTIFICATION DIVISION.
       PROGRAM-ID.  'JEFF1'.
       ENVIRONMENT DIVISION.
       CONFIGURATION SECTION.
       REPOSITORY.
           FUNCTION ALL INTRINSIC.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  U1 PIC U BYTE-LENGTH 8.
       PROCEDURE DIVISION.
           MOVE X"66B4" to U1
           DISPLAY 'HEX-OF(U1)=' HEX-OF(U1)
           DISPLAY 'HEX-OF(U1(1:8))=' HEX-OF(U1(1:8))

           DISPLAY 'DISPLAY-OF(U1)="' DISPLAY-OF(U1) '"'
           DISPLAY 'HEX-OF(DISPLAY-OF(U1))=' HEX-OF(DISPLAY-OF(U1))

           DISPLAY 'DISPLAY-OF(U1)(1:2)="' DISPLAY-OF(U1)(1:2) '"'
           DISPLAY 'HEX-OF(DISPLAY-OF(U1)(1:2))='
             HEX-OF(DISPLAY-OF(U1)(1:2))

           GOBACK.
       END PROGRAM 'JEFF1'.

Here's the output of this program:

HEX-OF(U1)=C383C2A920202020
HEX-OF(U1(1:8))=C383C2A920202020
DISPLAY-OF(U1)="é    "
HEX-OF(DISPLAY-OF(U1))=66B440404040
DISPLAY-OF(U1)(1:2)="é"
HEX-OF(DISPLAY-OF(U1)(1:2))=66B4

------------------------------
Jeffery Shimoda
------------------------------

Original Message

Original Message:
Sent: Thu May 13, 2021 06:18 PM
From: Frank Swarbrick
Subject: JSON and UTF-8

I think we're getting closer, but I'm still not quite clear on the current behavior. Using UWIDTH in your example doesn't quite work, in that the final '}' is truncated from the result. Take a look at this. I've commented out cases that give incorrect results.

 IDENTIFICATION DIVISION.                              PROGRAM-ID.  'UTF8'.                                  ENVIRONMENT DIVISION.                                 CONFIGURATION SECTION.                                REPOSITORY.                                               FUNCTION ALL INTRINSIC.                           DATA DIVISION.                                        WORKING-STORAGE SECTION.                              01  doc-data.                                             05  l              PIC X(10) VALUE 'é'.           01  doc-utf8           PIC U BYTE-LENGTH 40.          01  document           REDEFINES doc-utf8                                    PIC X(40).                     01  doc-len            PIC 9(4).                      01  doc-width          PIC 9(4).                      PROCEDURE DIVISION.                                       move low-values to document                           JSON GENERATE document                                     FROM l                                                COUNT doc-len                                    COMPUTE doc-width = ULENGTH(document(1:doc-len))      DISPLAY doc-len '/' doc-width                         DISPLAY doc-utf8                                      DISPLAY doc-utf8(1:doc-len)                      **** DISPLAY doc-utf8(1:doc-width)                                                                          **** DISPLAY doc-utf8(1:doc-width)                    ****         UPON CONSOLE                             **** DISPLAY "'" doc-utf8(1:doc-width) "'"            ****         UPON CONSOLE                                  DISPLAY doc-utf8(1:doc-len)                                   UPON CONSOLE                                  DISPLAY "'" doc-utf8(1:doc-len) "'"                           UPON CONSOLE                                                                                        DISPLAY DISPLAY-OF(doc-utf8)(1:doc-width)             DISPLAY "'" DISPLAY-OF(doc-utf8)(1:doc-width) "'"                                                           GOBACK.                                           END PROGRAM 'UTF8'.

So the only case where doc-width (UWIDTH) is needed is in the case where the DISPLAY-OF function is being used. And actually, while the following currently is not allowed, I think if it were allowed then it also would work:

DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))

Currently this results in a compile time error:

IGYPA3092-S   Argument-1 "DOC-UTF8 (ALPHANUMERIC REFERENCE MODIFIED                  ITEM)" for function "ALPHANUMERIC FUNCTION DISPLAY-OF" is              not an item of class national.  The statement was                      discarded.

So it looks like reference modification is still "encoding-unit" based, rather than "character" based (which is why your example truncated the last character). In the case of DISPLAY-OF I am converting the entire declared string (including the trailing low-value bytes) first, and then using the UWIDTH result to ref-mod it. But if DISPLAY-OF were changed to accept a ref-modified UTF-8 field as input I don't think UWIDTH would be needed at all (for what I am trying to do here).

------------------------------
Frank Swarbrick

Original Message:
Sent: Thu May 13, 2021 12:45 PM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank

I think I see the problem you are pointing out now.
To work around this, you could use ULENGTH function to obtain the number of utf-8 "characters" (aka UTF code points) and then use that to reference-modify doc-utf8.
See the following example.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'UTF8'.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  doc-data.           05  location       PIC X(10) VALUE 'é'.       01  doc-utf8           PIC U BYTE-LENGTH 256.       01  document           REDEFINES doc-utf8                              PIC X(256).       01  doc-len            PIC 9(4).       01  ddoc-utf8          PIC U DYNAMIC LENGTH.       PROCEDURE DIVISION.           move low-values to document           JSON GENERATE document                FROM location                COUNT doc-len           DISPLAY doc-len           DISPLAY "'" doc-utf8(1:function ulength(document(1:doc-len)))             "'" upon console           GOBACK.       END PROGRAM 'UTF8'.

As things currently stand, the COUNT phrase returns the number of JSON character encoding units (where an encoding unit is 1 byte for generating utf-8 into an alphanumeric item, and 2 bytes for generating utf-16 into national items).
I'm guessing it might be useful to also allow users to have COUNT (or something like it) to return the number of generated utf-8 "characters" (code points)?

------------------------------
Jeffery Shimoda

Original Message:
Sent: Wed May 12, 2021 02:08 AM
From: Frank Swarbrick
Subject: JSON and UTF-8

There actually seem to be two issues. The first is unrelated to UTF-8, and in fact I already have a ticket in to Level II for it. That is, when the NAME IS OMITTED clause is specified, there is a "trailing space" in the result of JSON GENERATE.

That still leaves the multi-byte UTF8 issue. The COUNT result is getting the total number of bytes, rather than the total number of "characters". So when reference modification, which is based on characters rather than bytes, is used it does not work correctly.

I've attached a program used for additional testing.

Result when NAME IS OMITTED is specified and the destination field is set to all UTF-8 'X' characters:

0065
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'

Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 'X' characters:

0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'

Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 space characters:

0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'

As you can see, the only reliable method is to 1) pre-initialize the destination field with all UTF-8 spaces, 2) use the TRIM function to eliminate the trailing spaces.

------------------------------
Frank Swarbrick

Original Message:
Sent: Fri May 07, 2021 10:40 AM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank,
I agree it would be very useful to have the JSON GENERATE and JSON PARSE statements directly support PIC U items (with and without DYNAMIC LENGTH). There is certainly a dissonance right now between the UTF-8/DYNAMIC LENGTH support, and the JSON GENERATE/PARSE statements. The dev team is aware of it and we would like to rectify that.

Regarding your observation about the extra byte at the end, I compiled and ran your program and I'm finding a slightly different problem - I found a space, not a low-value byte. Also, I noticed that the value of doc-len is output as 0067, when in fact I believe it should be 0066. It seems the root cause of the problem is the combination of having 'e' (with the accent character) along with using the NAME IS OMITTED phrase.

I created a cutdown version of your test below that further simplifies the json.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'UTF8'.       ENVIRONMENT DIVISION.       CONFIGURATION SECTION.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  doc-data.           05  location       PIC X(10) VALUE 'é'.       01  doc-utf8           PIC U BYTE-LENGTH 256.       01  document           REDEFINES doc-utf8                              PIC X(256).       01  doc-len            PIC 9(4).       01  ddoc-utf8          PIC U DYNAMIC LENGTH.       PROCEDURE DIVISION.           move low-values to document           JSON GENERATE document      *         FROM doc-data                FROM location                COUNT doc-len      *         NAME doc-data IS OMITTED           DISPLAY doc-len           DISPLAY "'" function hex-of(document(1:doc-len)) "'"           DISPLAY "'" doc-utf8(1:doc-len) "'" upon console           GOBACK.       END PROGRAM 'UTF8'.

When I remove the NAME IS OMITTED phrase and generate directly from LOCATION, the doc-len is 19. When using the NAME IS OMITTED phrase generating from DOC-DATA, the doc-len is 20, and the data contains an superfluous space (utf-8 hex code 0x20) at the end. I'm going to take a look at this because I think it should be 19 in both cases.

------------------------------
Jeffery Shimoda

Original Message:
Sent: Fri April 30, 2021 06:36 PM
From: Frank Swarbrick
Subject: JSON and UTF-8

So it annoyed me from initial implementation that JSON documents had to be in UTF-8 instead of EBCDIC. Upon hearing about Enterprise COBOL supporting a UTF-8 datatype I thought that this would resolve most of my concerns. But I still can't do what it seems to me should be an obvious thing to do. That is, one should be able to do a JSON GENERATE to a UTF-8 data item. And of course conversely do a JSON PARSE of a UTF-8 data item (though I've not even bothered to try this one.)

Without this feature here's an example of my attempts to do a simple display of a generated JSON document.

       IDENTIFICATION DIVISION.
       PROGRAM-ID.  'UTF8'.

       ENVIRONMENT DIVISION. 
       CONFIGURATION SECTION. 
       REPOSITORY. 
           FUNCTION DISPLAY-OF INTRINSIC.
 
       DATA DIVISION. 
       WORKING-STORAGE SECTION. 
       01  doc-data.
           05  userId         PIC X(20) VALUE 'FJS'.
           05  userName       PIC X(30) VALUE 'Frank Swarbrick'.
           05  location       PIC X(10) VALUE 'Café'.

       01  doc-utf8           PIC U BYTE-LENGTH 256.
       01  document           REDEFINES doc-utf8 
                              PIC X(256).
       01  doc-len            PIC 9(4).
       01  ddoc-utf8          PIC U DYNAMIC LENGTH.

       PROCEDURE DIVISION.
           JSON GENERATE document 
                FROM doc-data
                COUNT doc-len 
                NAME doc-data IS OMITTED

           DISPLAY doc-len
           DISPLAY document(1:doc-len)
           DISPLAY doc-utf8(1:doc-len)
      *    DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))
           DISPLAY DISPLAY-OF(doc-utf8)(1:doc-len)
           MOVE doc-utf8(1:doc-len) to ddoc-utf8
           DISPLAY DISPLAY-OF(ddoc-utf8)
           GOBACK.

The first two attempts "work" in as much as they display the document in UTF-8. So if you use "DISPLAY UTF-8" in a browse session, the results are correct. But converting to EBCDIC is problematic, because the 'é' character is two bytes in UTF-8 but only one byte in EBCDIC. So with the code I have here using the DISPLAY-OF function there ends up being an extra byte at the end (low-values in this case, because that's what the field get's "initialized" to.

I'd love to hear someone from the IBM COBOL compiler team address this issue.

------------------------------
Frank Swarbrick
------------------------------

7. RE: JSON and UTF-8

Like

Frank Swarbrick

IBM Champion

Posted Tue May 18, 2021 12:27 PM

I'm not exactly sure what's going on with your test, but the accented e, when I cut and paste it from the web, is x'C3A9'. When I paste it in to an ISPF edit session it gets converted to EBCDIC x'51'. I don't have sshd running on my z/OS so I can't test from that perspective. There is some discussion of the C383C2A9 issue if you google "utf8 C383C2A9".
Anyway, I think you have it pretty much in hand. I don't actually have a use case for this, as we're not using UTF-8 at the moment. It was just something I wondered how it might work, so I tested it.

------------------------------
Frank Swarbrick
------------------------------

Original Message

Original Message:
Sent: Sun May 16, 2021 10:56 PM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank,

I was playing around with this a bit more. You are right - when reference modifying doc-utf8 the length is the number of utf-8 encoding units (my apologies for the confusion...I was likely confusing it with reference modifying a 'fixed character' UTF-8 item, where the length is the number of 'characters').

Also, I'm going to bring up the example where you reference modified doc-utf8 within DISPLAY-OF to the team - it does seem like that could avoid an extra call to ULENGTH.

One thing that I wanted to clarify with you - what is your hex encoding of the 'e with an accent' character? In your original post you mentioned it was one byte, but I am seeing it as two bytes in my code (this could be an artifact of copying/pasting from the web?).

When inspecting my source code, the hex encoding of é is X'66B4'. When I convert that into utf-8, I get X'C383C2A9', which seems to be two distinct utf code points, each containing 2 encoding units.

Then, when I pass that back into DISPLAY-OF, it gets converted back into X'66B4' (as far as i can tell), then displayed.

On my terminal (TERM="dtterm" in USS via ssh) this gets displayed as a single é followed immediately by what appears to be a space character, but I believe it is not a real space character - I think my terminal just cannot properly handle the double byte character.

Here's my program for this experiment.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'JEFF1'.       ENVIRONMENT DIVISION.       CONFIGURATION SECTION.       REPOSITORY.           FUNCTION ALL INTRINSIC.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  U1 PIC U BYTE-LENGTH 8.       PROCEDURE DIVISION.           MOVE X"66B4" to U1           DISPLAY 'HEX-OF(U1)=' HEX-OF(U1)           DISPLAY 'HEX-OF(U1(1:8))=' HEX-OF(U1(1:8))           DISPLAY 'DISPLAY-OF(U1)="' DISPLAY-OF(U1) '"'           DISPLAY 'HEX-OF(DISPLAY-OF(U1))=' HEX-OF(DISPLAY-OF(U1))           DISPLAY 'DISPLAY-OF(U1)(1:2)="' DISPLAY-OF(U1)(1:2) '"'           DISPLAY 'HEX-OF(DISPLAY-OF(U1)(1:2))='             HEX-OF(DISPLAY-OF(U1)(1:2))           GOBACK.       END PROGRAM 'JEFF1'.

Here's the output of this program:

HEX-OF(U1)=C383C2A920202020HEX-OF(U1(1:8))=C383C2A920202020DISPLAY-OF(U1)="é    "HEX-OF(DISPLAY-OF(U1))=66B440404040DISPLAY-OF(U1)(1:2)="é"HEX-OF(DISPLAY-OF(U1)(1:2))=66B4

------------------------------
Jeffery Shimoda

Original Message:
Sent: Thu May 13, 2021 06:18 PM
From: Frank Swarbrick
Subject: JSON and UTF-8

I think we're getting closer, but I'm still not quite clear on the current behavior. Using UWIDTH in your example doesn't quite work, in that the final '}' is truncated from the result. Take a look at this. I've commented out cases that give incorrect results.

 IDENTIFICATION DIVISION.                              PROGRAM-ID.  'UTF8'.                                  ENVIRONMENT DIVISION.                                 CONFIGURATION SECTION.                                REPOSITORY.                                               FUNCTION ALL INTRINSIC.                           DATA DIVISION.                                        WORKING-STORAGE SECTION.                              01  doc-data.                                             05  l              PIC X(10) VALUE 'é'.           01  doc-utf8           PIC U BYTE-LENGTH 40.          01  document           REDEFINES doc-utf8                                    PIC X(40).                     01  doc-len            PIC 9(4).                      01  doc-width          PIC 9(4).                      PROCEDURE DIVISION.                                       move low-values to document                           JSON GENERATE document                                     FROM l                                                COUNT doc-len                                    COMPUTE doc-width = ULENGTH(document(1:doc-len))      DISPLAY doc-len '/' doc-width                         DISPLAY doc-utf8                                      DISPLAY doc-utf8(1:doc-len)                      **** DISPLAY doc-utf8(1:doc-width)                                                                          **** DISPLAY doc-utf8(1:doc-width)                    ****         UPON CONSOLE                             **** DISPLAY "'" doc-utf8(1:doc-width) "'"            ****         UPON CONSOLE                                  DISPLAY doc-utf8(1:doc-len)                                   UPON CONSOLE                                  DISPLAY "'" doc-utf8(1:doc-len) "'"                           UPON CONSOLE                                                                                        DISPLAY DISPLAY-OF(doc-utf8)(1:doc-width)             DISPLAY "'" DISPLAY-OF(doc-utf8)(1:doc-width) "'"                                                           GOBACK.                                           END PROGRAM 'UTF8'.

So the only case where doc-width (UWIDTH) is needed is in the case where the DISPLAY-OF function is being used. And actually, while the following currently is not allowed, I think if it were allowed then it also would work:

DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))

Currently this results in a compile time error:

IGYPA3092-S   Argument-1 "DOC-UTF8 (ALPHANUMERIC REFERENCE MODIFIED                  ITEM)" for function "ALPHANUMERIC FUNCTION DISPLAY-OF" is              not an item of class national.  The statement was                      discarded.

So it looks like reference modification is still "encoding-unit" based, rather than "character" based (which is why your example truncated the last character). In the case of DISPLAY-OF I am converting the entire declared string (including the trailing low-value bytes) first, and then using the UWIDTH result to ref-mod it. But if DISPLAY-OF were changed to accept a ref-modified UTF-8 field as input I don't think UWIDTH would be needed at all (for what I am trying to do here).

------------------------------
Frank Swarbrick

Original Message:
Sent: Thu May 13, 2021 12:45 PM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank

I think I see the problem you are pointing out now.
To work around this, you could use ULENGTH function to obtain the number of utf-8 "characters" (aka UTF code points) and then use that to reference-modify doc-utf8.
See the following example.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'UTF8'.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  doc-data.           05  location       PIC X(10) VALUE 'é'.       01  doc-utf8           PIC U BYTE-LENGTH 256.       01  document           REDEFINES doc-utf8                              PIC X(256).       01  doc-len            PIC 9(4).       01  ddoc-utf8          PIC U DYNAMIC LENGTH.       PROCEDURE DIVISION.           move low-values to document           JSON GENERATE document                FROM location                COUNT doc-len           DISPLAY doc-len           DISPLAY "'" doc-utf8(1:function ulength(document(1:doc-len)))             "'" upon console           GOBACK.       END PROGRAM 'UTF8'.

As things currently stand, the COUNT phrase returns the number of JSON character encoding units (where an encoding unit is 1 byte for generating utf-8 into an alphanumeric item, and 2 bytes for generating utf-16 into national items).
I'm guessing it might be useful to also allow users to have COUNT (or something like it) to return the number of generated utf-8 "characters" (code points)?

------------------------------
Jeffery Shimoda

Original Message:
Sent: Wed May 12, 2021 02:08 AM
From: Frank Swarbrick
Subject: JSON and UTF-8

There actually seem to be two issues. The first is unrelated to UTF-8, and in fact I already have a ticket in to Level II for it. That is, when the NAME IS OMITTED clause is specified, there is a "trailing space" in the result of JSON GENERATE.

That still leaves the multi-byte UTF8 issue. The COUNT result is getting the total number of bytes, rather than the total number of "characters". So when reference modification, which is based on characters rather than bytes, is used it does not work correctly.

I've attached a program used for additional testing.

Result when NAME IS OMITTED is specified and the destination field is set to all UTF-8 'X' characters:

0065
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'

Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 'X' characters:

0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'

Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 space characters:

0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'

As you can see, the only reliable method is to 1) pre-initialize the destination field with all UTF-8 spaces, 2) use the TRIM function to eliminate the trailing spaces.

------------------------------
Frank Swarbrick

Original Message:
Sent: Fri May 07, 2021 10:40 AM
From: Jeffery Shimoda
Subject: JSON and UTF-8

Hi Frank,
I agree it would be very useful to have the JSON GENERATE and JSON PARSE statements directly support PIC U items (with and without DYNAMIC LENGTH). There is certainly a dissonance right now between the UTF-8/DYNAMIC LENGTH support, and the JSON GENERATE/PARSE statements. The dev team is aware of it and we would like to rectify that.

Regarding your observation about the extra byte at the end, I compiled and ran your program and I'm finding a slightly different problem - I found a space, not a low-value byte. Also, I noticed that the value of doc-len is output as 0067, when in fact I believe it should be 0066. It seems the root cause of the problem is the combination of having 'e' (with the accent character) along with using the NAME IS OMITTED phrase.

I created a cutdown version of your test below that further simplifies the json.

       IDENTIFICATION DIVISION.       PROGRAM-ID.  'UTF8'.       ENVIRONMENT DIVISION.       CONFIGURATION SECTION.       DATA DIVISION.       WORKING-STORAGE SECTION.       01  doc-data.           05  location       PIC X(10) VALUE 'é'.       01  doc-utf8           PIC U BYTE-LENGTH 256.       01  document           REDEFINES doc-utf8                              PIC X(256).       01  doc-len            PIC 9(4).       01  ddoc-utf8          PIC U DYNAMIC LENGTH.       PROCEDURE DIVISION.           move low-values to document           JSON GENERATE document      *         FROM doc-data                FROM location                COUNT doc-len      *         NAME doc-data IS OMITTED           DISPLAY doc-len           DISPLAY "'" function hex-of(document(1:doc-len)) "'"           DISPLAY "'" doc-utf8(1:doc-len) "'" upon console           GOBACK.       END PROGRAM 'UTF8'.

When I remove the NAME IS OMITTED phrase and generate directly from LOCATION, the doc-len is 19. When using the NAME IS OMITTED phrase generating from DOC-DATA, the doc-len is 20, and the data contains an superfluous space (utf-8 hex code 0x20) at the end. I'm going to take a look at this because I think it should be 19 in both cases.

------------------------------
Jeffery Shimoda

Original Message:
Sent: Fri April 30, 2021 06:36 PM
From: Frank Swarbrick
Subject: JSON and UTF-8

So it annoyed me from initial implementation that JSON documents had to be in UTF-8 instead of EBCDIC. Upon hearing about Enterprise COBOL supporting a UTF-8 datatype I thought that this would resolve most of my concerns. But I still can't do what it seems to me should be an obvious thing to do. That is, one should be able to do a JSON GENERATE to a UTF-8 data item. And of course conversely do a JSON PARSE of a UTF-8 data item (though I've not even bothered to try this one.)

Without this feature here's an example of my attempts to do a simple display of a generated JSON document.

       IDENTIFICATION DIVISION.
       PROGRAM-ID.  'UTF8'.

       ENVIRONMENT DIVISION. 
       CONFIGURATION SECTION. 
       REPOSITORY. 
           FUNCTION DISPLAY-OF INTRINSIC.
 
       DATA DIVISION. 
       WORKING-STORAGE SECTION. 
       01  doc-data.
           05  userId         PIC X(20) VALUE 'FJS'.
           05  userName       PIC X(30) VALUE 'Frank Swarbrick'.
           05  location       PIC X(10) VALUE 'Café'.

       01  doc-utf8           PIC U BYTE-LENGTH 256.
       01  document           REDEFINES doc-utf8 
                              PIC X(256).
       01  doc-len            PIC 9(4).
       01  ddoc-utf8          PIC U DYNAMIC LENGTH.

       PROCEDURE DIVISION.
           JSON GENERATE document 
                FROM doc-data
                COUNT doc-len 
                NAME doc-data IS OMITTED

           DISPLAY doc-len
           DISPLAY document(1:doc-len)
           DISPLAY doc-utf8(1:doc-len)
      *    DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))
           DISPLAY DISPLAY-OF(doc-utf8)(1:doc-len)
           MOVE doc-utf8(1:doc-len) to ddoc-utf8
           DISPLAY DISPLAY-OF(ddoc-utf8)
           GOBACK.

The first two attempts "work" in as much as they display the document in UTF-8. So if you use "DISPLAY UTF-8" in a browse session, the results are correct. But converting to EBCDIC is problematic, because the 'é' character is two bytes in UTF-8 but only one byte in EBCDIC. So with the code I have here using the DISPLAY-OF function there ends up being an extra byte at the end (low-values in this case, because that's what the field get's "initialized" to.

I'd love to hear someone from the IBM COBOL compiler team address this issue.

------------------------------
Frank Swarbrick
------------------------------

COBOL

COBOL

JSON and UTF-8

Frank SwarbrickFri April 30, 2021 06:36 PM

Jeffery ShimodaFri May 07, 2021 01:14 PM

Frank SwarbrickWed May 12, 2021 09:04 AM

Jeffery ShimodaThu May 13, 2021 04:33 PM

Frank SwarbrickFri May 14, 2021 02:11 PM

Jeffery ShimodaMon May 17, 2021 04:10 PM

Frank SwarbrickTue May 18, 2021 12:27 PM

1. JSON and UTF-8

2. RE: JSON and UTF-8

3. RE: JSON and UTF-8

4. RE: JSON and UTF-8

5. RE: JSON and UTF-8

6. RE: JSON and UTF-8

7. RE: JSON and UTF-8

Additional
Resources

Office

Quick Links

COBOL

COBOL

JSON and UTF-8

Frank SwarbrickFri April 30, 2021 06:36 PM

Jeffery ShimodaFri May 07, 2021 01:14 PM

Frank SwarbrickWed May 12, 2021 09:04 AM

Jeffery ShimodaThu May 13, 2021 04:33 PM

Frank SwarbrickFri May 14, 2021 02:11 PM

Jeffery ShimodaMon May 17, 2021 04:10 PM

Frank SwarbrickTue May 18, 2021 12:27 PM

1. JSON and UTF-8

2. RE: JSON and UTF-8

3. RE: JSON and UTF-8

4. RE: JSON and UTF-8

5. RE: JSON and UTF-8

6. RE: JSON and UTF-8

7. RE: JSON and UTF-8

Related Content

RE: JSON and UTF-8

date/time arithmetic in pure COBOL

Pitfalls of Invalid Programs and Data

Debug Tool support for COBOL 5.1: LENGTH OF and ADDRESS OF in AUTOMONITOR output

Newer COBOL features, what should next?

Additional Resources

Office

Quick Links

Additional
Resources