
Reading large files of multi-line messages

Tue July 14, 2020 10:15 AM

In Integration Bus, the FileInput node can read very large files by streaming in messages one at a time. The Record detection property on the Records and Elements tab specifies how the node breaks the file up into messages. If the messages are fixed length, set the property to ‘Fixed Length’. If the messages are separated by a unique character string, set the property to ‘Delimited’. If neither technique can be used because the messages are variable length and not separated by a unique string, set the property to ‘Parsed Record Sequence’. The parser specified on the Input Message Parsing tab is then used to read the messages one at a time. This technique can be used with the XMLNSC, DFDL and MRM parsers, all of which can stream messages in this manner.
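
As a rough illustration of what the first two Record detection settings mean, here is a minimal Python sketch. It is not Integration Bus code and says nothing about how the FileInput node is implemented; the function names and the chunk size are hypothetical. The point is only that both styles of splitting can be done without holding the whole file in memory.

```python
import io
from typing import BinaryIO, Iterator


def fixed_length_records(stream: BinaryIO, length: int) -> Iterator[bytes]:
    """Yield consecutive records of `length` bytes (hypothetical helper).

    The final record may be shorter if the file does not divide evenly.
    """
    while True:
        record = stream.read(length)
        if not record:
            return
        yield record


def delimited_records(
    stream: BinaryIO, delimiter: bytes, chunk_size: int = 4096
) -> Iterator[bytes]:
    """Yield records separated by `delimiter`, reading the stream in chunks."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete record currently held in the buffer.
        while True:
            index = buffer.find(delimiter)
            if index < 0:
                break
            yield buffer[:index]
            buffer = buffer[index + len(delimiter):]
    if buffer:
        # Trailing data with no final delimiter is treated as a last record.
        yield buffer


if __name__ == "__main__":
    print(list(fixed_length_records(io.BytesIO(b"AAAABBBBCCCC"), 4)))
    print(list(delimited_records(io.BytesIO(b"one|two|three"), b"|")))
```

Neither helper can cope with the third case, where record boundaries are only discoverable by understanding the data, which is why a parser is needed there.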

When the messages are non-XML text records then DFDL should be used as the parser, and a DFDL schema created to model the message. This is straightforward to create if each message is one record, but when a message consists of multiple records then the schema needs more careful design so that the parser knows how to find the end of a message without running into the next one. Take the following as an example:

HDR,001,abc,83737
DTL,001,abcde
DTL,002,xyz
HDR,002,abc,83738
DTL,001,abcde
HDR,003,abc,837379
DTL,001,abcde
DTL,002,lmnop
DTL,003,xyz

Each message consists of a header record starting ‘HDR’ and an unbounded number of detail records starting ‘DTL’. Each record ends with a new line, there is no trailer record to identify the end of the message, and there is no record count in the header record. How can you design the DFDL schema so the parser knows when it has finished a message?

There are two ways to do this in DFDL:

  1. Model ‘HDR’ and ‘DTL’ as initiators and use dfdl:initiatedContent="yes" on the parent sequence.
    Define initiators

    Set Initiated Content
  2. Model ‘HDR’ and ‘DTL’ as ‘RecordType’ elements and use a discriminator on the elements.
    Define discriminators

In both cases, the parser reads one HDR record and then keeps reading DTL records until it hits an error: it has read the HDR record of the next message, which does not match a DTL record. The parser then backtracks to the start of that HDR record and ends parsing for the current message. On the next read the parser is correctly positioned at the start of the HDR record for the next message, and the cycle repeats.
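
The read-until-mismatch cycle described above can be imitated in a few lines of Python. This is only a sketch of the behaviour, not of the DFDL parser itself: the function name `read_messages` is hypothetical, and a one-line lookahead stands in for the parser's backtracking.

```python
import io
from typing import Iterator, List, TextIO


def read_messages(stream: TextIO) -> Iterator[List[List[str]]]:
    """Yield one message at a time: a HDR record followed by its DTL records."""
    # `pending` holds a line that has been read but not yet consumed.
    # Keeping it for the next message plays the role of backtracking.
    pending = stream.readline()
    while pending:
        header = pending.rstrip("\n").split(",")
        if header[0] != "HDR":
            raise ValueError(f"expected a HDR record, got: {pending!r}")
        message = [header]
        pending = stream.readline()
        # Keep consuming DTL records; anything else ends this message.
        while pending and pending.startswith("DTL,"):
            message.append(pending.rstrip("\n").split(","))
            pending = stream.readline()
        yield message


if __name__ == "__main__":
    sample = (
        "HDR,001,abc,83737\n"
        "DTL,001,abcde\n"
        "DTL,002,xyz\n"
        "HDR,002,abc,83738\n"
        "DTL,001,abcde\n"
    )
    for message in read_messages(io.StringIO(sample)):
        print(message)
```

Because the function is a generator, only one message is held in memory at a time, which mirrors the streaming behaviour of ‘Parsed Record Sequence’.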

It’s also useful to provide a message that models the whole file. Although the FileInput node does not use it when streaming, the DFDL editor can use it to test a smaller file with the Test Parse feature. Simply create a message that contains an unbounded reference to the element that models a single message.

Whole file message
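
To make the idea of a ‘whole file’ model concrete, the following Python sketch treats a small test file as an unbounded repetition of the single-message structure and returns everything at once. It is an analogy rather than DFDL; `parse_whole_file` is a hypothetical name, and the approach assumes the file is small enough to hold in memory.

```python
from typing import List


def parse_whole_file(text: str) -> List[List[List[str]]]:
    """Parse an entire small file into a list of messages.

    Each message is a list of records; each record is a list of fields.
    """
    messages: List[List[List[str]]] = []
    for line in text.splitlines():
        fields = line.split(",")
        if fields[0] == "HDR":
            # A header always starts a new message.
            messages.append([fields])
        elif fields[0] == "DTL" and messages:
            # A detail record belongs to the most recent header.
            messages[-1].append(fields)
        else:
            raise ValueError(f"unexpected record: {line!r}")
    return messages


if __name__ == "__main__":
    test_file = "HDR,001,abc,83737\nDTL,001,abcde\nHDR,002,abc,83738\nDTL,001,abcde\n"
    print(parse_whole_file(test_file))
```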

5 comments on "Reading large files of multi-line messages"

  1. stevehanson March 13, 2017

    See the last paragraph of my article. Use the ‘whole file’ model when specifying the Message in the FileInput node, and specify a ‘Record Detection’ value of ‘Whole File’. The entire file will be read in before the message tree is propagated down the message flow. Not recommended for large files.

  2. tanmoy barman March 07, 2017

    We have n rows and m columns in a .csv file.
    We want to process the file, send the whole record set to the Compute node, and do some processing on the data set.

    I am able to process the file only one row at a time, but in this case the message flow is called n times.

    I want a solution where the flow is called only once.

    Please advise.

  3. stevehanson January 16, 2017

    Your file consists of multiple messages, each of which consists of 1) a header with an initiator, 2) a varying number of body records without an initiator, and 3) a trailer with an initiator. Because the body records do not have an initiator you can’t use dfdl:initiatedContent="yes". I would use a mixture of initiators and discriminators, as follows:

    1) Header record. Model this as a record with a child element ‘ID’ which has dfdl:initiator="Cust_ID".

    3) Trailer record. Model this as a record with a child element ‘ID’ which has dfdl:initiator="End of Cust_ID".

    2) Body record. Model this as a record (maxOccurs="unbounded") with a child element ‘Line’ of type xs:string with a discriminator {fn:not(fn:contains(., 'Cust_ID'))}.

    The DFDL parser will parse the header using Header, then try to parse each following record as a Body, but will fail when it encounters the trailer because it contains ‘Cust_ID’. It will then backtrack and re-parse that record using Trailer.

  4. Crenie January 13, 2017

    Hi, I have a document that looks like this:
    Cust_ID=xxx
    This is line one.
    Line 2
    line 3
    End of Cust_ID=XXX
    Cust_ID=YYY
    This is line one.
    Line 2
    line 3
    End of Cust_ID=YYY

    What is the best way to parse such a document?

  5. Crenie January 13, 2017

    Thanks for this post. Very informative and confirms what I had been thinking.

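
The header, body and trailer approach suggested in comment 3 can be imitated in Python to show how the pieces fit together. This is a sketch under assumptions, not DFDL: the function name `read_customer_blocks` is hypothetical, the ‘=’ is treated as part of each initiator for simplicity, and a substring check stands in for the discriminator on the body record.

```python
import io
from typing import Dict, Iterator, TextIO

# Assumption: the '=' is folded into the initiators to keep the sketch short.
HEADER_INITIATOR = "Cust_ID="
TRAILER_INITIATOR = "End of Cust_ID="


def read_customer_blocks(stream: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one header/body/trailer block at a time."""
    lines = (line.rstrip("\n") for line in stream)
    for line in lines:
        if not line.startswith(HEADER_INITIATOR):
            raise ValueError(f"expected a header, got: {line!r}")
        block = {"id": line[len(HEADER_INITIATOR):], "lines": [], "end_id": None}
        for body in lines:
            # Stand-in for the discriminator: a body line must not
            # contain 'Cust_ID'. If it does, try it as a trailer instead.
            if "Cust_ID" in body:
                if not body.startswith(TRAILER_INITIATOR):
                    raise ValueError(f"expected a trailer, got: {body!r}")
                block["end_id"] = body[len(TRAILER_INITIATOR):]
                break
            block["lines"].append(body)
        else:
            # The inner loop ran out of input without finding a trailer.
            raise ValueError("file ended before the trailer record")
        yield block


if __name__ == "__main__":
    sample = "Cust_ID=xxx\nThis is line one.\nLine 2\nEnd of Cust_ID=XXX\n"
    for block in read_customer_blocks(io.StringIO(sample)):
        print(block)
```

As in the comment, the body check would misfire on any body line that happens to contain ‘Cust_ID’, so the approach relies on that string never appearing in the free text.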