
 String encoding within a transformation - UTF-16LE problem

A Pentaho User posted 01-19-2023 11:47

I need to SHA256-hash UTF-16 encoded content (due to character set requirements) within a transformation. The hashes have to be independently validated on other systems that also use UTF-16, so the hashing step within Pentaho needs to operate on UTF-16 bytes.

The problem I have come across is that Pentaho seems to process all strings as UTF-8. I have tried using the 'Select values' step to alter the metadata so the strings are encoded as UTF-16LE (the particular scheme I need to use), but when the output is passed to the 'Add a checksum' step to generate the SHA256 hash, it gets converted to UTF-8 and then hashed.

How do I know this? In the attached KTR file, the hashes generated by both flows are the same as an independent UTF-8 hash, irrespective of whether 'Select values' uses UTF-8 or UTF-16LE encoding. Is there a way to generate a SHA256 hash on UTF-16LE content without shelling out to the underlying OS and running iconv and sha256sum? If not, what is the point of altering the format in 'Select values' if Pentaho simply converts it back to UTF-8 automatically?

Output from the transformation (the HASH_UTF16LE value should differ from HASH_UTF8, but it does not):

ID	TXT	    HASH_UTF8	                                                        HASH_UTF16LE
1	Wibble	c8fe3173e2bb48858c0c0930caa43df7cc216121d62d9076689cf4d700104466	c8fe3173e2bb48858c0c0930caa43df7cc216121d62d9076689cf4d700104466

Output from a shell. The first command hashes the UTF-8 bytes and matches the output from Pentaho; the second converts to UTF-16LE first and, as expected, generates a different hash:

$ echo -n 1Wibble | sha256sum

$ echo -n 1Wibble | iconv -f UTF-8 -t UTF-16LE | sha256sum

Thanks in advance,


John Craig

One possibility, though maybe not a very good one, is to use either a User-defined Java Class or Java Expression step to convert the text to an array of bytes and then create the SHA256 hash on that. I haven't tried this, but if you specify String.getBytes( "UTF-16LE" ), that should give you the input you need for the SHA256 algorithm. Now, will that work with the MessageDigest.digest method? I'm not at all sure. The algorithm may choke on the 0x00 bytes you'll have in the UTF-16LE byte array. All the info I found in a quick search shows converting to UTF-8 before calling the digest method. See for example:

But that may just be a convention. I'm sorry not to take the time to test it out, but I'll leave that to you.
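For what it's worth, here is a minimal standalone sketch of the idea in plain Java, outside Pentaho. MessageDigest.digest takes an arbitrary byte array, so the 0x00 bytes in UTF-16LE output should not be a problem. The class name Utf16Sha256 and the helper sha256Hex are made up for illustration, and wiring this into a User-defined Java Class step (reading the field and writing the result inside processRow) is not shown and would need adapting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class: not part of Pentaho, names chosen for illustration.
class Utf16Sha256 {

    // Encode the text with the given charset, hash the bytes with SHA-256,
    // and return the digest as a lowercase hex string.
    static String sha256Hex(String text, Charset charset) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(text.getBytes(charset));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same input as the shell example: ID and TXT concatenated.
        String input = "1Wibble";
        System.out.println("UTF-8:    " + sha256Hex(input, StandardCharsets.UTF_8));
        // StandardCharsets.UTF_16LE writes no byte order mark, which should
        // correspond to what iconv -t UTF-16LE produces.
        System.out.println("UTF-16LE: " + sha256Hex(input, StandardCharsets.UTF_16LE));
    }
}
```

If the UTF-16LE line matches the output of the iconv | sha256sum pipeline on your system, the same getBytes and digest calls should be usable from a Java step in the transformation.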

Hope this may help!