COBOL

COBOL

COBOL

COBOL is responsible for the efficient, reliable, secure, and unseen day-to-day operations of the world's economy.

 View Only
  • 1.  JSON and UTF-8

    Posted Fri April 30, 2021 06:36 PM
    So it annoyed me from initial implementation that JSON documents had to be in UTF-8 instead of EBCDIC.  Upon hearing about Enterprise COBOL supporting a UTF-8 datatype I thought that this would resolve most of my concerns.  But I still can't do what it seems to me should be an obvious thing to do.  That is, one should be able to do a JSON GENERATE to a UTF-8 data item.  And of course conversely do a JSON PARSE of a UTF-8 data item (though I've not even bothered to try this one.)

    Without this feature here's an example of my attempts to do a simple display of a generated JSON document.

           IDENTIFICATION DIVISION.
           PROGRAM-ID.  'UTF8'.

           ENVIRONMENT DIVISION
           CONFIGURATION SECTION
           REPOSITORY
               FUNCTION DISPLAY-OF INTRINSIC.
     
           DATA DIVISION
           WORKING-STORAGE SECTION
           01  doc-data.
               05  userId         PIC X(20) VALUE 'FJS'.
               05  userName       PIC X(30) VALUE 'Frank Swarbrick'.
               05  location       PIC X(10) VALUE 'Café'.

           01  doc-utf8           PIC U BYTE-LENGTH 256.
           01  document           REDEFINES doc-utf8 
                                  PIC X(256).
           01  doc-len            PIC 9(4).
           01  ddoc-utf8          PIC DYNAMIC LENGTH.

           PROCEDURE DIVISION.
               JSON GENERATE document 
                    FROM doc-data
                    COUNT doc-len 
                    NAME doc-data IS OMITTED

               DISPLAY doc-len
               DISPLAY document(1:doc-len)
               DISPLAY doc-utf8(1:doc-len)
          *    DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))
               DISPLAY DISPLAY-OF(doc-utf8)(1:doc-len)
               MOVE doc-utf8(1:doc-len) to ddoc-utf8
               DISPLAY DISPLAY-OF(ddoc-utf8)
               GOBACK.

    The first two attempts "work" in as much as they display the document in UTF-8.  So if you use "DISPLAY UTF-8" in a browse session, the results are correct.  But converting to EBCDIC is problematic, because the 'é' character is two bytes in UTF-8 but only one byte in EBCDIC.  So with the code I have here using the DISPLAY-OF function there ends up being an extra byte at the end (low-values in this case, because that's what the field get's "initialized" to.

    I'd love to hear someone from the IBM COBOL compiler team address this issue.


    ------------------------------
    Frank Swarbrick
    ------------------------------


  • 2.  RE: JSON and UTF-8

    Posted Fri May 07, 2021 01:14 PM
    Hi Frank,
    I agree it would be very useful to have the JSON GENERATE and JSON PARSE statements directly support PIC U items (with and without DYNAMIC LENGTH). There is certainly a dissonance right now between the UTF-8/DYNAMIC LENGTH support, and the JSON GENERATE/PARSE statements.  The dev team is aware of it and we would like to rectify that.

    Regarding your observation about the extra byte at the end, I compiled and ran your program and I'm finding a slightly different problem - I found a space, not a low-value byte. Also, I noticed that the value of doc-len is output as 0067, when in fact I believe it should be 0066. It seems the root cause of the problem is the combination of having 'e' (with the accent character) along with using the NAME IS OMITTED phrase.

    I created a cutdown version of your test below that further simplifies the json.

           IDENTIFICATION DIVISION.
           PROGRAM-ID.  'UTF8'.
           ENVIRONMENT DIVISION.
           CONFIGURATION SECTION.
           DATA DIVISION.
           WORKING-STORAGE SECTION.
    
           01  doc-data.
               05  location       PIC X(10) VALUE 'é'.
           01  doc-utf8           PIC U BYTE-LENGTH 256.
           01  document           REDEFINES doc-utf8
                                  PIC X(256).
           01  doc-len            PIC 9(4).
           01  ddoc-utf8          PIC U DYNAMIC LENGTH.
    
           PROCEDURE DIVISION.
               move low-values to document
               JSON GENERATE document
    
          *         FROM doc-data
                    FROM location
    
                    COUNT doc-len
          *         NAME doc-data IS OMITTED
    
               DISPLAY doc-len
               DISPLAY "'" function hex-of(document(1:doc-len)) "'"
               DISPLAY "'" doc-utf8(1:doc-len) "'" upon console
    
               GOBACK.
           END PROGRAM 'UTF8'.
    ​

    When I remove the NAME IS OMITTED phrase and generate directly from LOCATION, the doc-len is 19. When using the NAME IS OMITTED phrase generating from DOC-DATA, the doc-len is 20, and the data contains an superfluous space (utf-8 hex code 0x20) at the end. I'm going to take a look at this because I think it should be 19 in both cases.



    ------------------------------
    Jeffery Shimoda
    ------------------------------



  • 3.  RE: JSON and UTF-8

    Posted Wed May 12, 2021 09:04 AM
      |   view attached
    There actually seem to be two issues.  The first is unrelated to UTF-8, and in fact I already have a ticket in to Level II for it.  That is, when the NAME IS OMITTED clause is specified, there is a "trailing space" in the result of JSON GENERATE.

    That still leaves the multi-byte UTF8 issue.  The COUNT result is getting the total number of bytes, rather than the total number of "characters".  So when reference modification, which is based on characters rather than bytes, is used it does not work correctly.  

    I've attached a program used for additional testing.

    Result when NAME IS OMITTED is specified and the destination field is set to all UTF-8 'X' characters:
    0065
    '{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
    '{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
    '{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} X'
    '{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"} XXXXXXXXXXXXXXX'
    Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 'X' characters:
    0077
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}X'
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}XXX'
    Result when NAME IS OMITTED is not specified, and the destination field is set to all UTF-8 space characters:
    0077
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}} '
    '{"doc-data":{"userId":"FJS","userName":"Frank Swarbrick","location":"Café"}}'

    As you can see, the only reliable method is to 1) pre-initialize the destination field with all UTF-8 spaces, 2) use the TRIM function to eliminate the trailing spaces.

    ------------------------------
    Frank Swarbrick
    ------------------------------

    Attachment(s)

    txt
    utf8.cob.txt   1 KB 1 version


  • 4.  RE: JSON and UTF-8

    Posted Thu May 13, 2021 04:33 PM
    Hi Frank

    I think I see the problem you are pointing out now.
    To work around this, you could use ULENGTH function to obtain the number of utf-8 "characters" (aka UTF code points) and then use that to reference-modify doc-utf8.
    See the following example.
           IDENTIFICATION DIVISION.
           PROGRAM-ID.  'UTF8'.
           DATA DIVISION.
           WORKING-STORAGE SECTION.
           01  doc-data.
               05  location       PIC X(10) VALUE 'é'.
           01  doc-utf8           PIC U BYTE-LENGTH 256.
           01  document           REDEFINES doc-utf8
                                  PIC X(256).
           01  doc-len            PIC 9(4).
           01  ddoc-utf8          PIC U DYNAMIC LENGTH.
           PROCEDURE DIVISION.
               move low-values to document
               JSON GENERATE document
                    FROM location
                    COUNT doc-len
               DISPLAY doc-len
               DISPLAY "'" doc-utf8(1:function ulength(document(1:doc-len)))
                 "'" upon console
               GOBACK.
           END PROGRAM 'UTF8'.
    ​

    As things currently stand, the COUNT phrase returns the number of JSON character encoding units (where an encoding unit is 1 byte for generating utf-8 into an alphanumeric item, and 2 bytes for generating utf-16 into national items).
    I'm guessing it might be useful to also allow users to have COUNT (or something like it) to return the number of generated utf-8 "characters" (code points)?

    ------------------------------
    Jeffery Shimoda
    ------------------------------



  • 5.  RE: JSON and UTF-8

    Posted Fri May 14, 2021 02:11 PM
    I think we're getting closer, but I'm still not quite clear on the current behavior.  Using UWIDTH in your example doesn't quite work, in that the final '}' is truncated from the result.  Take a look at this.  I've commented out cases that give incorrect results.
     IDENTIFICATION DIVISION.                             
     PROGRAM-ID.  'UTF8'.                                 
     ENVIRONMENT DIVISION.                                
     CONFIGURATION SECTION.                               
     REPOSITORY.                                          
         FUNCTION ALL INTRINSIC.                          
     DATA DIVISION.                                       
     WORKING-STORAGE SECTION.                             
     01  doc-data.                                        
         05  l              PIC X(10) VALUE 'é'.          
     01  doc-utf8           PIC U BYTE-LENGTH 40.         
     01  document           REDEFINES doc-utf8            
                            PIC X(40).                    
     01  doc-len            PIC 9(4).                     
     01  doc-width          PIC 9(4).                     
     PROCEDURE DIVISION.                                  
         move low-values to document                      
         JSON GENERATE document                           
              FROM l                                      
              COUNT doc-len                               
         COMPUTE doc-width = ULENGTH(document(1:doc-len)) 
         DISPLAY doc-len '/' doc-width                    
         DISPLAY doc-utf8                                 
         DISPLAY doc-utf8(1:doc-len)                      
    **** DISPLAY doc-utf8(1:doc-width)                    
                                                          
    **** DISPLAY doc-utf8(1:doc-width)                    
    ****         UPON CONSOLE                             
    **** DISPLAY "'" doc-utf8(1:doc-width) "'"            
    ****         UPON CONSOLE                             
         DISPLAY doc-utf8(1:doc-len)                      
                 UPON CONSOLE                             
         DISPLAY "'" doc-utf8(1:doc-len) "'"              
                 UPON CONSOLE                             
                                                          
         DISPLAY DISPLAY-OF(doc-utf8)(1:doc-width)        
         DISPLAY "'" DISPLAY-OF(doc-utf8)(1:doc-width) "'"
                                                          
         GOBACK.                                          
     END PROGRAM 'UTF8'.                                  
    ​
    So the only case where doc-width (UWIDTH) is needed is in the case where the DISPLAY-OF function is being used.  And actually, while the following currently is not allowed, I think if it were allowed then it also would work:
    DISPLAY DISPLAY-OF(doc-utf8(1:doc-len))​

    Currently this results in a compile time error:
    IGYPA3092-S   Argument-1 "DOC-UTF8 (ALPHANUMERIC REFERENCE MODIFIED    
                  ITEM)" for function "ALPHANUMERIC FUNCTION DISPLAY-OF" is
                  not an item of class national.  The statement was        
                  discarded.                                               ​

    So it looks like reference modification is still "encoding-unit" based, rather than "character" based (which is why your example truncated the last character).  In the case of DISPLAY-OF I am converting the entire declared string (including the trailing low-value bytes) first, and then using the UWIDTH result to ref-mod it.  But if DISPLAY-OF were changed to accept a ref-modified UTF-8 field as input I don't think UWIDTH would be needed at all (for what I am trying to do here).





    ------------------------------
    Frank Swarbrick
    ------------------------------



  • 6.  RE: JSON and UTF-8

    Posted Mon May 17, 2021 04:10 PM
    Hi Frank,

    I was playing around with this a bit more. You are right - when reference modifying doc-utf8 the length is the number of utf-8 encoding units (my apologies for the confusion...I was likely confusing it with reference modifying a 'fixed character' UTF-8 item, where the length is the number of 'characters').

    Also, I'm going to bring up the example where you reference modified doc-utf8 within DISPLAY-OF to the team - it does seem like that could avoid an extra call to ULENGTH.

    One thing that I wanted to clarify with you - what is your hex encoding of the 'e with an accent' character? In your original post you mentioned it was one byte, but I am seeing it as two bytes in my code (this could be an artifact of copying/pasting from the web?).

    When inspecting my source code, the hex encoding of é is X'66B4'. When I convert that into utf-8, I get X'C383C2A9', which seems to be two distinct utf code points, each containing 2 encoding units.

    Then, when I pass that back into DISPLAY-OF, it gets converted back into X'66B4' (as far as i can tell), then displayed.

    On my terminal (TERM="dtterm" in USS via ssh) this gets displayed as a single é followed immediately by what appears to be a space character, but I believe it is not a real space character - I think my terminal just cannot properly handle the double byte character.

    Here's my program for this experiment.
           IDENTIFICATION DIVISION.
           PROGRAM-ID.  'JEFF1'.
           ENVIRONMENT DIVISION.
           CONFIGURATION SECTION.
           REPOSITORY.
               FUNCTION ALL INTRINSIC.
           DATA DIVISION.
           WORKING-STORAGE SECTION.
           01  U1 PIC U BYTE-LENGTH 8.
           PROCEDURE DIVISION.
               MOVE X"66B4" to U1
               DISPLAY 'HEX-OF(U1)=' HEX-OF(U1)
               DISPLAY 'HEX-OF(U1(1:8))=' HEX-OF(U1(1:8))
    
               DISPLAY 'DISPLAY-OF(U1)="' DISPLAY-OF(U1) '"'
               DISPLAY 'HEX-OF(DISPLAY-OF(U1))=' HEX-OF(DISPLAY-OF(U1))
    
               DISPLAY 'DISPLAY-OF(U1)(1:2)="' DISPLAY-OF(U1)(1:2) '"'
               DISPLAY 'HEX-OF(DISPLAY-OF(U1)(1:2))='
                 HEX-OF(DISPLAY-OF(U1)(1:2))
    
               GOBACK.
           END PROGRAM 'JEFF1'.
    ​

    Here's the output of this program:

    HEX-OF(U1)=C383C2A920202020
    HEX-OF(U1(1:8))=C383C2A920202020
    DISPLAY-OF(U1)="é    "
    HEX-OF(DISPLAY-OF(U1))=66B440404040
    DISPLAY-OF(U1)(1:2)="é"
    HEX-OF(DISPLAY-OF(U1)(1:2))=66B4


    ------------------------------
    Jeffery Shimoda
    ------------------------------



  • 7.  RE: JSON and UTF-8

    Posted Tue May 18, 2021 12:27 PM
    I'm not exactly sure what's going on with your test, but the accented e, when I cut and paste it from the web, is x'C3A9'.  When I paste it in to an ISPF edit session it gets converted to EBCDIC x'51'.  I don't have sshd running on my z/OS so I can't test from that perspective.  There is some discussion of the C383C2A9 issue if you google "utf8 C383C2A9".
    Anyway, I think you have it pretty much in hand.  I don't actually have a use case for this, as we're not using UTF-8 at the moment.  It was just something I wondered how it might work, so I tested it.



    ------------------------------
    Frank Swarbrick
    ------------------------------