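Hey Adam, I feel your pain with the MD5 hash chaos. MD5 itself is deterministic, so the same bytes always give the same hash. What a Java or Pentaho upgrade can change is how field values are turned into bytes before hashing (character encoding, number and date formatting, and so on), and that is enough to make unchanged rows look changed in your data warehouse. I upgraded too and faced the same hash head-scratcher.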
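Here's what worked for me: before the upgrade, add a checksum column and populate it with the current version. After the upgrade, recompute the checksum on the same rows and compare it with the stored pre-upgrade value. Any row that mismatches even though its data did not change points you at the field type or value that is being hashed differently.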
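A rough sketch of that comparison, in Python purely for illustration (this is not a Pentaho transformation). The function name, the business keys, and the shortened hash values are all made up; in practice both maps would come from queries against your own tables.

```python
def find_hash_mismatches(old_hashes, new_hashes):
    """Return the row keys whose hash differs between the two runs.

    Both arguments map a (hypothetical) business key to a hex hash string.
    """
    mismatches = []
    for key, old_value in old_hashes.items():
        new_value = new_hashes.get(key)
        # Compare case-insensitively: hex digests may be stored upper or lower case.
        if new_value is None or new_value.lower() != old_value.lower():
            mismatches.append(key)
    return sorted(mismatches)


# Made-up sample: hashes stored before the upgrade vs. recomputed after it.
before = {"cust-1": "0cc175b9c0f1b6a8", "cust-2": "92eb5ffee6ae2fec"}
after = {"cust-1": "0CC175B9C0F1B6A8", "cust-2": "4a8a08f09d37b737"}

suspects = find_hash_mismatches(before, after)
print(suspects)  # ['cust-2']
```

Once you have the suspect keys, looking at which column types those rows have in common (dates, decimals, non-ASCII text) usually narrows things down quickly.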
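Also, check the release notes of the newer Java and Pentaho versions for changes that affect the input to MD5 rather than MD5 itself, for example the default character encoding or the default formatting of numbers and dates. Pentaho upgrades can also change how the Add Checksum step behaves, so test on a smaller dataset first.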
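To show why the input matters more than the algorithm, here is a small Python sketch (illustrative only, I have not checked which of these cases applies to your Pentaho setup). The helper name md5_hex and the sample values are assumptions for the demo.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


name = "Müller"

# Same text, different character encodings: the bytes differ, so the hashes differ.
utf8_hash = md5_hex(name.encode("utf-8"))
latin1_hash = md5_hex(name.encode("latin-1"))
print(utf8_hash == latin1_hash)  # False

# Same number, different text formatting: again different bytes.
print(md5_hex(b"1.0") == md5_hex(b"1"))  # False

# Identical bytes always give the identical hash, whatever the runtime version.
print(md5_hex(b"abc") == md5_hex(b"abc"))  # True
```

So on the small test dataset it is worth including rows with non-ASCII text, decimals, and dates, since those are the values most likely to be serialized differently.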
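Lastly, consider using a more robust hashing method like SHA-256. It's a bit heavier, and on its own it won't stop version-related differences in the input, but combined with explicit formatting of the hashed fields it might save you from these headaches in the future.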
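If you do go that way, something along these lines is what I have in mind, again as a Python sketch rather than a Pentaho step. The separator, the null marker, and the row_hash name are my own choices, and it assumes you format every field to a string yourself before hashing, so nothing depends on runtime defaults.

```python
import hashlib

FIELD_SEPARATOR = "\x1f"  # ASCII unit separator, assumed not to occur in the data
NULL_MARKER = "\x00"      # assumed stand-in so NULL and empty string hash differently


def row_hash(fields, algorithm="sha256"):
    """Hash a row given as a list of already-formatted strings (or None)."""
    parts = [NULL_MARKER if field is None else field for field in fields]
    # Encode explicitly as UTF-8 instead of relying on a platform default.
    payload = FIELD_SEPARATOR.join(parts).encode("utf-8")
    return hashlib.new(algorithm, payload).hexdigest()


digest = row_hash(["42", "Müller", "2022-08-14", None])
print(len(digest))  # 64 hex characters for SHA-256
```

The separator keeps ["ab", "c"] and ["a", "bc"] from hashing to the same value, and the same function still gives you MD5 via row_hash(fields, "md5") if you want to compare both during the transition.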
Happy hashing, and let me know how it goes.
------------------------------
Valeri Bakop
Others
Freelance
------------------------------
Original Message:
Sent: 08-14-2022 06:18
From: Adam Makara
Subject: Add Checksum output problem when changing Java or Pentaho version
Hello, I have a problem with differing MD5 hash output from the Add Checksum step. I use the MD5 hash to detect changes in data from source systems that I regularly load into my data warehouse. I am using the old Pentaho version 6 and Java version 1.8.0_231, but I want to upgrade both to the newest versions. After upgrading just one of them, I get a lot of changed data rows (I have a very big DWH), because the MD5 hashes of the same data differ even though the values in the rows did not change. What causes this, and what is the best way to do the upgrade without loading millions of "changed" rows?
------------------------------
Adam Makara
Systems Engineer
DWH
------------------------------