Pentaho

  • 1.  Add Checksum output problem when changing Java or Pentaho version

    Posted 08-14-2022 06:18
    Hello, I have a problem with the Add Checksum step producing different MD5 hashes. I use the MD5 hash to detect changes in data from source systems that I regularly load into my data warehouse. I am on the old Pentaho version 6 with Java 1.8.0_231, and I want to upgrade both to the newest versions. After upgrading just one of them, I get a lot of "changed" rows (I have a very big DWH) because the MD5 hash of the same data differs even though the values in the rows did not change. What causes this, and what is the best way to upgrade without reloading millions of "changed" rows?

    ------------------------------
    Adam Makara
    Systems Engineer
    DWH
    ------------------------------


  • 2.  RE: Add Checksum output problem when changing Java or Pentaho version

    Posted 08-14-2022 21:21

    Hi Adam

    I'd very carefully check that the data is coming through in exactly the same way in the old install and the new one. If you are including floats in the data for the hash, then CPU/OS factors may make them vary slightly. You might try hashing a row after forcing the floats to definite values and see if the difference still exists.
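
    A minimal sketch of that test, written in Python rather than PDI. The helper names `md5_hex` and `fixed` are hypothetical, and two decimal places is an assumed precision. It only illustrates that MD5 sees the rendered text, so the same number printed two ways gives two hashes until a fixed format is forced.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_UP

def md5_hex(text: str) -> str:
    """MD5 of the UTF-8 bytes of text, as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def fixed(value, places: int = 2) -> str:
    """Render a number with a fixed number of decimals (hypothetical helper)."""
    quantum = Decimal(10) ** -places  # places=2 gives Decimal('0.01')
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    return format(rounded, "f")       # plain notation, never scientific

# The same number rendered two ways hashes differently.
print(md5_hex("1234.5") == md5_hex("1234.50"))  # False

# After forcing a fixed format, both renderings hash identically.
print(md5_hex(fixed("1234.5")) == md5_hex(fixed("1234.50")))  # True
```

    If the hashes agree after normalising, the difference is in how the values are rendered, not in the data itself.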



    ------------------------------
    Andrew Cave
    Systems Engineer
    BizCubed Pty Ltd
    Australia
    ------------------------------



  • 3.  RE: Add Checksum output problem when changing Java or Pentaho version

    Posted 08-15-2022 09:57

    I have exactly the same issue, but with SHA-256. It happens when using version 9.3, so I decided not to update and stay with 9.2 for now.

    Step: Add a checksum
    Type: SHA-256
    ResultType: Hexadecimal
    Field Separator: -
    Number of fields: 7

    Same as Adam, I have a table with about 6M records. My incoming data doesn't contain a UID, so I use the checksum to calculate one.
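
    For illustration only, a Python sketch of a checksum-derived UID with the settings listed above (SHA-256, hexadecimal, "-" separator, 7 fields). How the step actually renders and joins values is an assumption here, and `row_uid` and the sample row are hypothetical. The point is that the UID depends on the exact text of every field.

```python
import hashlib

SEPARATOR = "-"  # matches the "Field Separator" setting quoted above

def row_uid(fields, separator=SEPARATOR, encoding="utf-8"):
    """SHA-256 hex digest of the fields joined by the separator.

    Assumption: None becomes an empty string and every other value is
    converted with str(); the real step may render values differently.
    """
    text = separator.join("" if f is None else str(f) for f in fields)
    return hashlib.sha256(text.encode(encoding)).hexdigest()

# Hypothetical 7-field row.
row = ["A100", "Smith", "John", "1980-02-29", "Toronto", "ON", 42]
uid = row_uid(row)
print(len(uid))  # 64

# Same data, but the date rendered in another format: a different UID.
row_other = ["A100", "Smith", "John", "1980/02/29", "Toronto", "ON", 42]
print(uid == row_uid(row_other))  # False
```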



    ------------------------------
    Gert Wieland
    Application Services Manager
    UHN
    ------------------------------



  • 4.  RE: Add Checksum output problem when changing Java or Pentaho version

    Posted 02-01-2024 05:39
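    Hey Adam, I feel your pain with the MD5 hash chaos. MD5 itself returns the same digest for the same bytes on any Java version, so when the hash changes after a Java or Pentaho upgrade, it is the input fed to the hash (how numbers, dates, or strings are rendered and encoded) that has changed, and that causes the unnecessary changes in your data warehouse. I upgraded too and faced the same hash head-scratcher.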
     
    Here's what worked for me: Add a checksum column before the upgrade, populate it, then do the upgrade dance. Post-upgrade, compare the new checksum with the old MD5. If they mismatch, there's your culprit.
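
    A rough sketch of that comparison in Python, assuming the old and new checksums have been exported as key-to-hash mappings. The function `changed_keys`, the keys, and the shortened hash values are all made up for illustration.

```python
def changed_keys(old_hashes: dict, new_hashes: dict) -> list:
    """Keys present in both mappings whose stored hash values differ."""
    shared = old_hashes.keys() & new_hashes.keys()
    return sorted(k for k in shared if old_hashes[k] != new_hashes[k])

# Hypothetical sample: business key -> checksum, before and after the upgrade.
before = {1: "9e107d9d", 2: "e4d909c2", 3: "d41d8cd9"}
after = {1: "9e107d9d", 2: "0cc175b9", 3: "d41d8cd9"}

mismatched = changed_keys(before, after)
print(mismatched)  # [2]
```

    Looking at the raw field values of the mismatched keys in both installs should show which field is being rendered differently.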
     
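    Also, check the release notes of the newer Java version for changes in number or date formatting or in the default character encoding, since those change the bytes that get hashed. Pentaho upgrades might also change how the step builds its input, so test on a smaller dataset first.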
     
    Lastly, consider using a more robust hashing method like SHA-256. It's a bit heavier and has far fewer collision concerns than MD5, though as Gert's post above shows, switching algorithms does not by itself prevent version-related mismatches.
     
    Happy hashing, let me know.


    ------------------------------
    Valeri Bakop
    Others
    Freelance
    ------------------------------