Informix


14.1 WE - fastest way to load 61Mio lines CSV file

  • 1.  14.1 WE - fastest way to load 61Mio lines CSV file

    Posted Wed November 10, 2021 05:25 AM
    Hi all,

    I have a question about loading CSV files into Informix. Currently we're getting really huge CSV files from our ERP systems (61 million lines in one file), and it takes hours to load them via dbload.
    How would you do it? The base is Informix 14.10 Workgroup Edition on RHEL 8, 64-bit.

    ------------------------------
    Marc Demhartner
    ------------------------------

    #Informix


  • 2.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    IBM Champion
    Posted Wed November 10, 2021 05:37 AM
    Hi Marc,

    You should use an external table. You can even parallelize the load by splitting the CSV file.
    Check the documentation for the syntax to use, but the external route is your best way. The target table should be of type RAW if possible (that way it will not use the logical logs). After loading, change the table from RAW back to STANDARD.

    Khaled Bentebal (sent from my phone)
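    Splitting the file for a parallel load can be done with standard OS tools. A minimal sketch, assuming GNU split; the file names are illustrative, and the small generated sample just stands in for the real 61-million-line export:

    ```shell
    # Generate a small pipe-delimited stand-in for the real CSV export.
    seq 1 1000000 | sed 's/$/|demo|row|/' > somedata.csv

    # Cut it into 4 equal chunks; each chunk can then be listed as its own
    # DATAFILES entry of the external table (or loaded by a separate session).
    split -l 250000 --numeric-suffixes=1 somedata.csv part.

    # part.01 .. part.04, 250000 lines each
    wc -l part.*
    ```
    
    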






  • 3.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    Posted Wed November 10, 2021 05:38 AM
    Hi Marc,
    CSV loading is never a "simple trick".
    dbload is a very good utility that performs several validations to avoid loading incorrect data.
    But you can also try newer methods, e.g. external tables (assuming your data is delimited). That might be a good option.

    I am just curious when you say "several hours": how is your dbload storing that? Directly into a permanent table? Do you have a cluster, or only a single-node instance?

    Alexandre Marini





  • 4.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    Posted Wed November 10, 2021 05:47 AM
    Thanks for your feedback. Yes, we're loading it into a freshly created "loading table" (tl_marc, for example) without indexes. The CSV file is 100% valid; we convert it at OS level before loading it into the DB.

    Yes, it's a single-node instance.

    @Khaled Bentebal: WE does not allow parallelizing.

    Can you explain that "external table" thing? 


    ------------------------------
    Marc Demhartner
    ------------------------------



  • 5.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    Posted Wed November 10, 2021 06:25 AM
    Hi,

    1) CSV file "C:\TMP\somedata.csv" contains:

    1|Dog|Haf |
    2|Cat|Mnau|
    3|Snake|Ssss|


    2) External table:

    create external table myanimalscsv (
    id integer,
    animal varchar(10),
    sound varchar(10)
    ) using (  DATAFILES ("DISK:/tmp/somedata.csv")  );

    select * from myanimalscsv;  -- works


    3) Raw table:
    create raw table myanimals (
     id integer,
     animal varchar(10),
     sound varchar(10)
     );

    Loading:
    insert into myanimals select * from myanimalscsv;

    Change to normal table:
    alter table myanimals type (standard);


    Leos
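    One hedged variation on the example above: the sample file is pipe-delimited, which is the Informix default, so no FORMAT clause is needed. If the source file were truly comma-separated, the external table would need the delimiter spelled out explicitly (same columns as above; only the FORMAT/DELIMITER clauses are added):

    ```sql
    -- variation for a comma-separated file
    create external table myanimalscsv (
        id integer,
        animal varchar(10),
        sound varchar(10)
    ) using (
        FORMAT "DELIMITED",
        DELIMITER ",",
        DATAFILES ("DISK:/tmp/somedata.csv")
    );
    ```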






  • 6.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    IBM Champion
    Posted Wed November 10, 2021 07:24 AM
    Marc:

    Myschema can generate external-table unload and load scripts for your tables (note that I'm using the latest, November 4th, release of myschema; the October 28th release was the first to include changing the table to RAW before the load and back to STANDARD after). Here's an example:


    art@Elbereth:~/GoogleDrive/MyConsulting/test$ myschema -d art -t extents --myexport-express --myexport-delimiter >/dev/null
    art@Elbereth:~/GoogleDrive/MyConsulting/test$ ls -laF
    total 28
    drwxrwxr-x  3 art art  4096 Nov 10 07:15 ./
    drwxrwxr-x 75 art art 12288 Oct 29 06:27 ../
    -rw-rw-r--  1 art art   431 Nov 10 07:15 art.extents.myexport.sql
    -rw-rw-r--  1 art art   502 Nov 10 07:15 art.extents.myimport.sql
    art@Elbereth:~/GoogleDrive/MyConsulting/test$ cat art.extents.myimport.sql
    CREATE EXTERNAL TABLE extents_myimport (  
           dbsname "informix".lvarchar(10000),
           tabname "informix".lvarchar(10000),
           chunk INTEGER,
           offset INTEGER,
           size INTEGER,
           truth "informix".boolean
    ) USING ( FORMAT "DELIMITED",  
       DATAFILES ( "disk:/home/art/GoogleDrive/MyConsulting/test/exten01930.unl" ),  
       DELIMITER ",", EXPRESS, ESCAPE ON );  
    ALTER TABLE extents TYPE( RAW );
    INSERT INTO extents  
       SELECT * FROM extents_myimport;
    DROP TABLE extents_myimport;
    ALTER TABLE extents TYPE( STANDARD );

    You would just have to change the DATAFILES option to list the actual path of your CSV file.
    If you are not familiar with myschema, you can always download the latest release of my utils2_ak package which includes myschema, from my web site at: My Utilities

    BTW, you can speed up the dbload run somewhat by altering the table to RAW before the load. It won't be as fast as external tables, but may be fast enough for you.
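    That dbload variant might look like this. A hedged sketch: the table, database, and file names are illustrative, not taken from Marc's actual setup:

    ```sql
    -- before the load: skip logical logging for the target table
    ALTER TABLE tl_marc TYPE( RAW );

    -- run dbload from the shell, e.g. (commit every 50000 rows):
    --   dbload -d mydb -c load.cmd -l load.err -n 50000

    -- after the load: back to a fully logged table
    ALTER TABLE tl_marc TYPE( STANDARD );
    ```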

    Art



    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 7.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    IBM Champion
    Posted Wed November 10, 2021 07:36 AM

    Marc,

    As people have suggested, you can use external tables.

    Personally I would use the High Performance Loader, but mainly because I have it all scripted and ready to run.

    Out-of-the-box thinking ...

    • Add a timestamp and use the TimeSeries loader
    • Write a simple UDR, mmap the file, scan, and load.

    Cheers

    Paul






  • 8.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    IBM Champion
    Posted Wed November 10, 2021 08:15 AM
    I'm with Paul - if this is a load you are doing frequently and need it to be as fast as possible, it is worth testing HPL.  My testing against Informix 12 (I have not repeated it against 14) shows better performance loading with HPL than with external tables, and better still if you can split the source file.  Of course, you may have very different results with your data.  Be sure to make the target table RAW (as in the external table example) for the best performance.

    But if you haven't used HPL before, there is a bit of a learning curve.  The "onpladm" utility certainly makes it easier though.
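    For anyone who has not scripted HPL before, the non-graphical route via onpladm looks roughly like this. This is a hedged sketch from memory, not a tested recipe: the job name, database, table, and file path are all illustrative, and the exact options should be verified against the onpladm documentation:

    ```shell
    # create a load job for a delimited file (names are illustrative)
    onpladm create job marc_load \
        -d /data/somedata.csv \
        -D mydb -t tl_marc \
        -fl

    # run the load job
    onpload -j marc_load -fl
    ```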


    ------------------------------
    Mike Walker
    xDB Systems, Inc
    www.xdbsystems.com
    ------------------------------



  • 9.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    IBM Champion
    Posted Thu November 11, 2021 03:28 AM

    Hi Marc,

    very nice challenge :-)

    It depends on the process requirements: single table / transaction needed / HDR cluster installation / constraint checks / insert triggers / isolation level, and so on.


    For a RAW table load, the typical performance ranges from 200,000 records/second (ARM NVIDIA Jetson) to 450,000 records/second (a modern Intel x86 with enough cores for WE). All based on SD/flash drives for the chunks and files.

    My estimate for this task is between 2 and 10 minutes. So you will feel the great speed of Informix: "wait less, work more" :-)
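    A quick back-of-the-envelope check of that estimate (pure shell arithmetic; the rows/second rates are the numbers quoted above, not measured here):

    ```shell
    # 61 million rows at the two quoted RAW-load rates (rows/second)
    rows=61000000
    for rate in 450000 200000; do
      secs=$(( rows / rate ))
      echo "$rate rows/s -> $(( secs / 60 )) min $(( secs % 60 )) s"
    done
    # -> about 2 min 15 s at the fast end, 5 min 5 s at the slow end
    ```
    
    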

    Have fun
    Henri



    ------------------------------
    Henri Cujass
    leolo IT, CTO
    Germany
    IBM Champion 2021
    ------------------------------



  • 10.  RE: 14.1 WE - fastest way to load 61Mio lines CSV file

    Posted Mon November 15, 2021 10:52 AM
    Hi
    HPL is my favourite for this case (create the table without indexes, load the data with HPL only, then create the indexes with max PDQPRIORITY).
    If you are using AIX with Exceed, you can also use HPL through ipload (the same, but graphical and easy to use :) )

    ------------------------------
    John Smith
    ------------------------------