Pentaho

  • 1.  Pentaho performs very slow when unzipping files in a mounted samba share

    Posted 07-05-2022 14:05
    Hi,

    I am migrating tasks from Windows to Linux (carte).

    I have run into a performance problem while unzipping ZIP files on a mounted Samba share.
    Pentaho performs far worse than the same operation run from the console.

    This is the log for a ZIP file that contains two files:
    2022/07/05 09:27:08 - S_Unzip_File - Starting job entry
    2022/07/05 09:27:08 - S_Unzip_File - Target folder [/mnt/driveN/Ficheros/504] exists
    2022/07/05 09:27:08 - S_Unzip_File - The Zip file [/mnt/driveN/FicherosFtp/EMPRESA/504] exists
    2022/07/05 09:27:08 - S_Unzip_File - Processing file [file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip] ...
    2022/07/05 09:27:08 - S_Unzip_File - Processing zipped entry [zip:file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip!/CABECERA.json] from file [file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip] ...
    2022/07/05 09:27:08 - S_Unzip_File - We can find a file called [/mnt/driveN/Ficheros/504//CABECERA_20220705_092708151.json]. It will be extracted
    2022/07/05 09:27:08 - S_Unzip_File - Extracting entry [zip:file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip!/CABECERA.json] to [/mnt/driveN/Ficheros/504//CABECERA_20220705_092708151.json]
    2022/07/05 09:27:43 - S_Unzip_File - Processing zipped entry [zip:file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip!/DETALLE.json] from file [file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip] ...
    2022/07/05 09:27:43 - S_Unzip_File - We can find a file called [/mnt/driveN/Ficheros/504//DETALLE_092743526_092743526.json]. It will be extracted
    2022/07/05 09:27:43 - S_Unzip_File - Extracting entry [zip:file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip!/DETALLE.json] to [/mnt/driveN/Ficheros/504//DETALLE_092743526_092743526.json]
    2022/07/05 09:28:18 - S_Unzip_File - File [file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip] was moved to [/mnt/driveN/Ficheros/504]
    2022/07/05 09:28:18 - S_Unzip_File - =======================================
    2022/07/05 09:28:18 - S_Unzip_File - Nr errors : 0
    2022/07/05 09:28:18 - S_Unzip_File - Nr unzipped files : 1
    2022/07/05 09:28:18 - S_Unzip_File - =======================================

    If I do the same thing by hand inside the Carte container, the performance is much better:

    pentaho@75e631ccf11c:/mnt/driveN/FicherosFtp/EMPRESA/504/juan$ unzip -l 3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip
    Archive: 3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip
      Length Date Time Name
    --------- ---------- ----- ----
       317560 2022-07-05 09:17 CABECERA.json
       124047 2022-07-05 09:17 DETALLE.json
    --------- -------
       441607 2 files
    pentaho@75e631ccf11c:/mnt/driveN/FicherosFtp/EMPRESA/504/juan$ time unzip 3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip -d /mnt/driveN/Ficheros/504/juan/
    Archive: 3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip
      inflating: /mnt/driveN/Ficheros/504/juan/CABECERA.json
      inflating: /mnt/driveN/Ficheros/504/juan/DETALLE.json

    real 0m0.040s
    user 0m0.005s
    sys 0m0.003s

    From the console it takes 0.04 seconds, while Pentaho takes (09:28:18 - 09:27:08) = 70 seconds.
    So for this case Pentaho is about 1750 times slower.
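
As a cross-check of the arithmetic above, here is a minimal Python sketch that recomputes the elapsed times from the log timestamps. The event labels are illustrative; the timestamps and the 0.040 s console figure are copied from the output in this post.

```python
from datetime import datetime

# Timestamps copied from the S_Unzip_File log; labels are illustrative.
LOG_FORMAT = "%Y/%m/%d %H:%M:%S"
events = [
    ("start extracting CABECERA.json", "2022/07/05 09:27:08"),
    ("start extracting DETALLE.json", "2022/07/05 09:27:43"),
    ("zip moved, job entry finished", "2022/07/05 09:28:18"),
]


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)
    return delta.total_seconds()


# Gap between consecutive events, i.e. roughly the time spent per zipped entry
# (the second gap also includes moving the zip file).
gaps = [elapsed_seconds(a[1], b[1]) for a, b in zip(events, events[1:])]
total = elapsed_seconds(events[0][1], events[-1][1])
console_seconds = 0.040  # "real" time reported by `time unzip`

print(gaps)                            # [35.0, 35.0]
print(total)                           # 70.0
print(round(total / console_seconds))  # 1750
```

Both entries take about 35 seconds even though their sizes differ (317560 vs 124047 bytes), which hints at a fixed per-entry delay rather than a throughput limit. That is only an inference from this single log.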

    When the ZIP contains more files, it is even worse.

    Any idea of what may be going on?

    Thanks a lot for your time


    ------------------------------
    Juan Sierra Pons
    Systems Engineer
    Juan Sierra Pons
    ------------------------------


  • 2.  RE: Pentaho performs very slow when unzipping files in a mounted samba share

    Posted 07-06-2022 01:41
    Hi,

    I have been able to reproduce it in Kettle, so it is not related to Carte or Docker.

    I have also tested Samba performance and all is OK. Creating a 500 MB file on the Samba share takes only 4 seconds:

    XXXXX@XXXXXX:/SERVER/driveN/Ficheros/juanTests$ time dd if=/dev/zero of=./test bs=512 count=1000000
    1000000+0 records in
    1000000+0 records out
    512000000 bytes (512 MB, 488 MiB) copied, 4.01275 s, 128 MB/s

    real 0m4.018s
    user 0m0.722s
    sys 0m1.329s
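
The `dd` run above measures sequential throughput for one large file. A complementary check, which is not part of the original test, is the per-file cost of creating, writing, closing and deleting many small files. The sketch below is a hypothetical probe for that; the function name, file names and default sizes are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def time_small_files(directory: str, count: int = 100, size: int = 4096) -> float:
    """Create, write, close and delete `count` small files.

    Returns the mean number of seconds spent per file.
    """
    payload = b"\0" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"latency_probe_{i}.tmp")
        with open(path, "wb") as handle:
            handle.write(payload)
        os.remove(path)
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    # Local baseline; run it again with a directory on the mounted share to compare.
    with tempfile.TemporaryDirectory() as local_dir:
        per_file = time_small_files(local_dir)
        print(f"local: {per_file * 1000:.3f} ms per file")
```

If the share shows a per-file time close to the local baseline, raw Samba latency is unlikely to explain the delay on its own.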

    My suspicion is that it is related to the VFS layer:
    2022/07/05 09:27:08 - S_Unzip_File - Extracting entry [zip:file:///mnt/driveN/FicherosFtp/EMPRESA/504/3f9353c9-cec5-4568-8b6e-d23708506713_2022057091756.zip!/CABECERA.json] to [/mnt/driveN/Ficheros/504//CABECERA_20220705_092708151.json]

    Best regards

    ------------------------------
    Juan Sierra Pons
    Systems Engineer
    Juan Sierra Pons
    ------------------------------



  • 3.  RE: Pentaho performs very slow when unzipping files in a mounted samba share
    Best Answer

    Posted 07-06-2022 04:54

    Even though this is an old post, it seems we are still in the same situation: https://forums.pentaho.com/threads/98127-Using-Samba-smbclient-in-Kettle/

    I have been able to work around the slowness by splitting the unzip step.

    The red path is the slow one.
    By splitting the step, unzipping the files locally first and then moving them to the Samba share, the performance is as expected.
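
The same idea can be illustrated outside Pentaho. The following is a minimal Python sketch, not the actual job configuration: it extracts the archive into a local temporary directory and only then moves the results to the target directory, which would be the mounted share. The function name `unzip_via_local_staging` is hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_via_local_staging(zip_path: str, target_dir: str) -> list:
    """Extract zip_path into a local temp directory, then move the files to target_dir.

    Returns the list of destination paths as strings.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    with tempfile.TemporaryDirectory() as staging:
        # Step 1: extract on the local file system.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(staging)
        # Step 2: move each extracted file to the (possibly remote) target.
        for item in sorted(Path(staging).rglob("*")):
            if item.is_file():
                destination = target / item.relative_to(staging)
                destination.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(item), str(destination))
                moved.append(str(destination))
    return moved


# Example call (paths are placeholders):
# unzip_via_local_staging("/tmp/incoming/sample.zip", "/mnt/driveN/Ficheros/504")
```

In a Pentaho job the equivalent is two entries: one that unzips to a local folder and one that moves the files to the share.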



    ------------------------------
    Juan Sierra Pons
    Systems Engineer
    Juan Sierra Pons
    ------------------------------



  • 4.  RE: Pentaho performs very slow when unzipping files in a mounted samba share

    Posted 07-30-2022 16:22
    Juan
    Are you still experiencing this issue? Looks like you are unzipping a json file into a samba drive? Is the file usually 500MB in size? Are you processing multiple files at once or just one?

    ------------------------------
    Carlos Lopez
    Application Architecture Engineering - Expert
    Hitachi Vantara
    ------------------------------



  • 5.  RE: Pentaho performs very slow when unzipping files in a mounted samba share

    Posted 08-01-2022 03:51

    Hi @Carlos Lopez,

    I am still experiencing this. It seems to be an Apache Commons VFS library issue, as stated by Matt Casters in this old post: https://forums.pentaho.com/threads/98127-Using-Samba-smbclient-in-Kettle/

    The workaround I posted works perfectly but it is not nice :( just a workaround.

    I have found a similar problem when downloading files with the SFTP download step into a Samba-mounted file system.
    First, Carte creates the empty file and then starts filling it. This is not efficient and takes a lot of time.

    So it seems to be the same problem: unzipping and downloading into a mounted Samba file system both perform very badly. Both have Apache Commons VFS in common, so the solution probably lies there.

    Thanks for your time

    Best regards



    ------------------------------
    Juan Sierra Pons
    Systems Engineer
    Juan Sierra Pons
    ------------------------------



  • 6.  RE: Pentaho performs very slow when unzipping files in a mounted samba share

    Posted 02-27-2023 04:55

    Hi @Carlos Lopez 

    Is there any plan to upgrade the Apache Commons VFS library?

    It seems that this upgrade would fix this kind of slowness problem.

    http://web.archive.org/web/20221205205204/https://forums.pentaho.com/threads/98127-Using-Samba-smbclient-in-Kettle/

    "There are some things going on in the Apache Commons VFS library that are not really efficient.
    We're planning to upgrade to a more recent version but the testing and migration takes a while."



    ------------------------------
    Juan Sierra Pons
    Systems Engineer
    Juan Sierra Pons
    ------------------------------



  • 7.  RE: Pentaho performs very slow when unzipping files in a mounted samba share

    Posted 03-02-2023 11:44

    @Juan Sierra Pons it appears 9.3 is using the common-vfs-2.7.0.jar. The latest version on their site appears to be 2.9.0.

    Let me check their release notes to see whether the latest version improves performance.



    ------------------------------
    Carlos Lopez
    Application Architecture Engineering - Expert
    Hitachi Vantara
    ------------------------------