I'm not exactly sure what's going on with your test, but the accented e, when I cut and paste it from the web, is x'C3A9'. When I paste it in to an ISPF edit session it gets converted to EBCDIC x'51'. I don't have sshd running on my z/OS so I can't test from that perspective. There is some discussion of the C383C2A9 issue if you google "utf8 C383C2A9".
Anyway, I think you have it pretty much in hand. I don't actually have a use case for this, as we're not using UTF-8 at the moment. It was just something I wondered how it might work, so I tested it.
Original Message:
Sent: Sun May 16, 2021 10:56 PM
From: Jeffery Shimoda
Subject: JSON and UTF-8
Hi Frank,
I was playing around with this a bit more. You are right - when reference modifying doc-utf8 the length is the number of utf-8 encoding units (my apologies for the confusion...I was likely confusing it with reference modifying a 'fixed character' UTF-8 item, where the length is the number of 'characters').
Also, I'm going to bring up the example where you reference modified doc-utf8 within DISPLAY-OF to the team - it does seem like that could avoid an extra call to ULENGTH.
One thing that I wanted to clarify with you - what is your hex encoding of the 'e with an accent' character? In your original post you mentioned it was one byte, but I am seeing it as two bytes in my code (this could be an artifact of copying/pasting from the web?).
When inspecting my source code, the hex encoding of é is X'66B4'. When I convert that into utf-8, I get X'C383C2A9', which seems to be two distinct utf code points, each containing 2 encoding units.
Then, when I pass that back into DISPLAY-OF, it gets converted back into X'66B4' (as far as i can tell), then displayed.
On my terminal (TERM="dtterm" in USS via ssh) this gets displayed as a single é followed immediately by what appears to be a space character, but I believe it is not a real space character - I think my terminal just cannot properly handle the double byte character.
Here's my program for this experiment.
IDENTIFICATION DIVISION. PROGRAM-ID. 'JEFF1'. ENVIRONMENT DIVISION. CONFIGURATION SECTION. REPOSITORY. FUNCTION ALL INTRINSIC. DATA DIVISION. WORKING-STORAGE SECTION. 01 U1 PIC U BYTE-LENGTH 8. PROCEDURE DIVISION. MOVE X"66B4" to U1 DISPLAY 'HEX-OF(U1)=' HEX-OF(U1) DISPLAY 'HEX-OF(U1(1:8))=' HEX-OF(U1(1:8)) DISPLAY 'DISPLAY-OF(U1)="' DISPLAY-OF(U1) '"' DISPLAY 'HEX-OF(DISPLAY-OF(U1))=' HEX-OF(DISPLAY-OF(U1)) DISPLAY 'DISPLAY-OF(U1)(1:2)="' DISPLAY-OF(U1)(1:2) '"' DISPLAY 'HEX-OF(DISPLAY-OF(U1)(1:2))=' HEX-OF(DISPLAY-OF(U1)(1:2)) GOBACK. END PROGRAM 'JEFF1'.
Here's the output of this program:
HEX-OF(U1)=C383C2A920202020HEX-OF(U1(1:8))=C383C2A920202020DISPLAY-OF(U1)="é "HEX-OF(DISPLAY-OF(U1))=66B440404040DISPLAY-OF(U1)(1:2)="é"HEX-OF(DISPLAY-OF(U1)(1:2))=66B4
------------------------------
Jeffery Shimoda
Original Message:
Sent: Thu May 13, 2021 06:18 PM
From: Frank Swarbrick
Subject: JSON and UTF-8
I think we're getting closer, but I'm still not quite clear on the current behavior. Using UWIDTH in your example doesn't quite work, in that the final '}' is truncated from the result. Take a look at this. I've commented out cases that give incorrect results.
IDENTIFICATION DIVISION. PROGRAM-ID. 'UTF8'. ENVIRONMENT DIVISION. CONFIGURATION SECTION. REPOSITORY. FUNCTION ALL INTRINSIC. DATA DIVISION. WORKING-STORAGE SECTION. 01 doc-data. 05 l PIC X(10) VALUE 'é'. 01 doc-utf8 PIC U BYTE-LENGTH 40. 01 document REDEFINES doc-utf8 PIC X(40). 01 doc-len PIC 9(4). 01 doc-width PIC 9(4). PROCEDURE DIVISION. move low-values to document JSON GENERATE document FROM l COUNT doc-len COMPUTE doc-width = ULENGTH(document(1:doc-len)) DISPLAY doc-len '/' doc-width DISPLAY doc-utf8 DISPLAY doc-utf8(1:doc-len) **** DISPLAY doc-utf8(1:doc-width) **** DISPLAY doc-utf8(1:doc-width) **** UPON CONSOLE **** DISPLAY "'" doc-utf8(1:doc-width) "'" **** UPON CONSOLE DISPLAY doc-utf8(1:doc-len) UPON CONSOLE DISPLAY "'" doc-utf8(1:doc-len) "'" UPON CONSOLE DISPLAY DISPLAY-OF(doc-utf8)(1:doc-width) DISPLAY "'" DISPLAY-OF(doc-utf8)(1:doc-width) "'" GOBACK. END PROGRAM 'UTF8'.
So the only case where doc-width (UWIDTH) is needed is in the case where the DISPLAY-OF function is being used. And actually, while the following currently is not allowed, I think if it were allowed then it also would work:
DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))
Currently this results in a compile time error:
IGYPA3092-S Argument-1 "DOC-UTF8 (ALPHANUMERIC REFERENCE MODIFIED ITEM)" for function "ALPHANUMERIC FUNCTION DISPLAY-OF" is not an item of class national. The statement was discarded.
So it looks like reference modification is still "encoding-unit" based, rather than "character" based (which is why your example truncated the last character). In the case of DISPLAY-OF I am converting the entire declared string (including the trailing low-value bytes) first, and then using the UWIDTH result to ref-mod it. But if DISPLAY-OF were changed to accept a ref-modified UTF-8 field as input I don't think UWIDTH would be needed at all (for what I am trying to do here).
------------------------------
Frank Swarbrick
Original Message:
Sent: Thu May 13, 2021 12:45 PM
From: Jeffery Shimoda
Subject: JSON and UTF-8
Hi Frank
I think I see the problem you are pointing out now.
To work around this, you could use ULENGTH function to obtain the number of utf-8 "characters" (aka UTF code points) and then use that to reference-modify doc-utf8.
See the following example.
IDENTIFICATION DIVISION. PROGRAM-ID. 'UTF8'. DATA DIVISION. WORKING-STORAGE SECTION. 01 doc-data. 05 location PIC X(10) VALUE 'é'. 01 doc-utf8 PIC U BYTE-LENGTH 256. 01 document REDEFINES doc-utf8 PIC X(256). 01 doc-len PIC 9(4). 01 ddoc-utf8 PIC U DYNAMIC LENGTH. PROCEDURE DIVISION. move low-values to document JSON GENERATE document FROM location COUNT doc-len DISPLAY doc-len DISPLAY "'" doc-utf8(1:function ulength(document(1:doc-len))) "'" upon console GOBACK. END PROGRAM 'UTF8'.
As things currently stand, the COUNT phrase returns the number of JSON character encoding units (where an encoding unit is 1 byte for generating utf-8 into an alphanumeric item, and 2 bytes for generating utf-16 into national items).
I'm guessing it might be useful to also allow users to have COUNT (or something like it) to return the number of generated utf-8 "characters" (code points)?
------------------------------
Jeffery Shimoda
Original Message:
Sent: Wed May 12, 2021 02:08 AM
From: Frank Swarbrick
Subject: JSON and UTF-8
There actually seem to be two issues. The first is unrelated to UTF-8, and in fact I already have a ticket in to Level II for it. That is, when the NAME IS OMITTED clause is specified, there is a "trailing space" in the result of JSON GENERATE.
That still leaves the multi-byte UTF8 issue. The COUNT result is getting the total number of bytes, rather than the total number of "characters". So when reference modification, which is based on characters rather than bytes, is used it does not work correctly.
I've attached a program used for additional testing.
Result when NAME IS OMITTED is specified and the destination field is set to all UTF-8 'X' characters:
0065
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
'{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 'X' characters:0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 space characters:0077
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
'{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'
As you can see, the only reliable method is to 1) pre-initialize the destination field with all UTF-8 spaces, 2) use the TRIM function to eliminate the trailing spaces.
------------------------------
Frank Swarbrick
Original Message:
Sent: Fri May 07, 2021 10:40 AM
From: Jeffery Shimoda
Subject: JSON and UTF-8
Hi Frank,
I agree it would be very useful to have the JSON GENERATE and JSON PARSE statements directly support PIC U items (with and without DYNAMIC LENGTH). There is certainly a dissonance right now between the UTF-8/DYNAMIC LENGTH support, and the JSON GENERATE/PARSE statements. The dev team is aware of it and we would like to rectify that.
Regarding your observation about the extra byte at the end, I compiled and ran your program and I'm finding a slightly different problem - I found a space, not a low-value byte. Also, I noticed that the value of doc-len is output as 0067, when in fact I believe it should be 0066. It seems the root cause of the problem is the combination of having 'e' (with the accent character) along with using the NAME IS OMITTED phrase.
I created a cutdown version of your test below that further simplifies the json.
IDENTIFICATION DIVISION. PROGRAM-ID. 'UTF8'. ENVIRONMENT DIVISION. CONFIGURATION SECTION. DATA DIVISION. WORKING-STORAGE SECTION. 01 doc-data. 05 location PIC X(10) VALUE 'é'. 01 doc-utf8 PIC U BYTE-LENGTH 256. 01 document REDEFINES doc-utf8 PIC X(256). 01 doc-len PIC 9(4). 01 ddoc-utf8 PIC U DYNAMIC LENGTH. PROCEDURE DIVISION. move low-values to document JSON GENERATE document * FROM doc-data FROM location COUNT doc-len * NAME doc-data IS OMITTED DISPLAY doc-len DISPLAY "'" function hex-of(document(1:doc-len)) "'" DISPLAY "'" doc-utf8(1:doc-len) "'" upon console GOBACK. END PROGRAM 'UTF8'.
When I remove the NAME IS OMITTED phrase and generate directly from LOCATION, the doc-len is 19. When using the NAME IS OMITTED phrase generating from DOC-DATA, the doc-len is 20, and the data contains an superfluous space (utf-8 hex code 0x20) at the end. I'm going to take a look at this because I think it should be 19 in both cases.
------------------------------
Jeffery Shimoda
Original Message:
Sent: Fri April 30, 2021 06:36 PM
From: Frank Swarbrick
Subject: JSON and UTF-8
So it annoyed me from initial implementation that JSON documents had to be in UTF-8 instead of EBCDIC. Upon hearing about Enterprise COBOL supporting a UTF-8 datatype I thought that this would resolve most of my concerns. But I still can't do what it seems to me should be an obvious thing to do. That is, one should be able to do a JSON GENERATE to a UTF-8 data item. And of course conversely do a JSON PARSE of a UTF-8 data item (though I've not even bothered to try this one.)
Without this feature here's an example of my attempts to do a simple display of a generated JSON document.
IDENTIFICATION DIVISION.
PROGRAM-ID. 'UTF8'.
ENVIRONMENT DIVISION.
CONFIGURATION SECTION.
REPOSITORY.
FUNCTION DISPLAY-OF INTRINSIC.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 doc-data.
05 userId PIC X(20) VALUE 'FJS'.
05 userName PIC X(30) VALUE 'Frank Swarbrick'.
05 location PIC X(10) VALUE 'Café'.
01 doc-utf8 PIC U BYTE-LENGTH 256.
01 document REDEFINES doc-utf8
PIC X(256).
01 doc-len PIC 9(4).
01 ddoc-utf8 PIC U DYNAMIC LENGTH.
PROCEDURE DIVISION.
JSON GENERATE document
FROM doc-data
COUNT doc-len
NAME doc-data IS OMITTED
DISPLAY doc-len
DISPLAY document(1:doc-len)
DISPLAY doc-utf8(1:doc-len)
* DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))
DISPLAY DISPLAY-OF(doc-utf8)(1:doc-len)
MOVE doc-utf8(1:doc-len) to ddoc-utf8
DISPLAY DISPLAY-OF(ddoc-utf8)
GOBACK.
The first two attempts "work" in as much as they display the document in UTF-8. So if you use "DISPLAY UTF-8" in a browse session, the results are correct. But converting to EBCDIC is problematic, because the '
é' character is two bytes in UTF-8 but only one byte in EBCDIC. So with the code I have here using the DISPLAY-OF function there ends up being an extra byte at the end (low-values in this case, because that's what the field get's "initialized" to.
I'd love to hear someone from the IBM COBOL compiler team address this issue.
------------------------------
Frank Swarbrick
------------------------------