
Hitachi Content Platform

Sample HCI Plugin (OCR)

Data Conversion posted 02-01-2018 18:06

We are new to HCI development. We have developed a sample OCR stage plugin and want to share it with the community.

This plugin has a specific use case: it processes scanned image files and converts them to text using the Tesseract OCR library. It then finds specific text or fields within the document and attaches them to HCI document metadata fields. The matching can be configured to suit the user's needs by providing a regular expression.
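A minimal sketch of the regular-expression step described above, assuming the OCR text is already available as a string. This is not the plugin's actual code: the class name `OcrFieldExtractor`, the field names, and the sample patterns are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the plugin source.
final class OcrFieldExtractor {

    // Applies each configured pattern to the OCR text and stores the first
    // capture group of the first match under the given metadata field name.
    static Map<String, String> extractFields(String ocrText, Map<String, Pattern> fieldPatterns) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : fieldPatterns.entrySet()) {
            Matcher m = entry.getValue().matcher(ocrText);
            if (m.find() && m.groupCount() >= 1) {
                String value = m.group(1);
                if (value != null) {
                    fields.put(entry.getKey(), value.trim());
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample text standing in for Tesseract output (assumed, not real data).
        String ocrText = "ACME Bank\nInvoice No: INV-20417\nDate: 2018-02-01\n";
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("invoiceNumber", Pattern.compile("Invoice No:\\s*(\\S+)"));
        patterns.put("invoiceDate", Pattern.compile("Date:\\s*(\\d{4}-\\d{2}-\\d{2})"));
        System.out.println(extractFields(ocrText, patterns));
    }
}
```

Each entry in the returned map would then be written to the HCI document as a metadata field; how that is done depends on the HCI plugin SDK and is left out here.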

The entire source code, the JAR, and the setup document can be downloaded from GitHub: https://github.com/hemantaudax/hci-ocr. We hope this helps the team; feel free to share your suggestions.

Note: For better results, use images scanned at 300 DPI.

#HitachiContentIntelligenceHCI

Alan Bryant posted 02-01-2018 18:16

This is a great start!

You may want to package the additional files within the plugin jar as resources and extract them to the temp directory you get from PluginCallback.getTempDirectory() so your users don't have to download them separately. That should avoid all the extra steps, and people would be able to just use the plugin normally.
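A rough sketch of that packaging idea, assuming the extra file is bundled in the plugin jar as a classpath resource. The class `ResourceExtractor` and its method are hypothetical. The return type of `PluginCallback.getTempDirectory()` is not shown in this thread, so the sketch simply takes a `java.nio.file.Path` as the target directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the HCI SDK or the plugin source.
final class ResourceExtractor {

    // Copies one classpath resource (a file, not a directory) into targetDir
    // and returns the path of the extracted copy.
    static Path extractResource(ClassLoader loader, String resourceName, Path targetDir)
            throws IOException {
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resourceName);
            }
            Files.createDirectories(targetDir);
            // Keep only the file name so nested resource paths land directly in targetDir.
            String fileName = resourceName.substring(resourceName.lastIndexOf('/') + 1);
            Path target = targetDir.resolve(fileName);
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        }
    }
}
```

In a plugin, `loader` would typically be the plugin class's own class loader, and `targetDir` would be derived from the temp directory the callback provides.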

Data Conversion posted 02-01-2018 18:23

Thanks a lot for the suggestion, Alan. We will do this in the next iteration (hopefully soon).

Christie Nguyen posted 02-01-2018 23:08

A very useful tool for the image analytics use case. Thanks for sharing!

Eckhard Roeser posted 02-02-2018 07:51

Excellent news!!!! Got many questions from customers about having OCR capabilities in HCI. Will download and test.

Thanks for this!

Eckhard Roeser posted 02-02-2018 07:57

Agree with Alan. It would be great to get one single download package with the jar file, setup doc, and Readme included (or whatever else is required). That would make things much easier. But anyway, a very good start.

Data Conversion posted 02-09-2018 06:45

Hi Alan,

We are facing some issues fetching resources from the jar. The following code works fine when run as a plain Java executable jar, but it always returns an empty enumeration when the jar is uploaded as a plugin.

Enumeration<URL> resources = ExampleStagePlugin.class.getClassLoader().getResources("linux-x86-64");

Alan Bryant posted 02-09-2018 11:40

I think resources have to be files, not directories... so you could put each file in and extract them all, or you could put them in as a zip or tar archive and extract that after you pull it out as a resource.
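The archive approach could look roughly like the sketch below, assuming the native libraries are bundled as a single zip resource. The class name `ZipResourceUnpacker` is hypothetical, and the check against entries escaping the target directory is an addition of this sketch, not something discussed in the thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the HCI SDK or the plugin source.
final class ZipResourceUnpacker {

    // Unpacks every entry of a zip stream below targetDir and returns
    // the number of regular files written.
    static int unzip(InputStream zipStream, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        int files = 0;
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would escape targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("Zip entry outside target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Reads only up to the end of the current entry.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }
}
```

The input stream would come from something like `getResourceAsStream("native-libs.zip")` on the plugin's class loader, where the archive name is an assumption for this example.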

Data Conversion posted 02-12-2018 10:36

Thanks Alan,

The plugin now automatically copies the library files into the HCI temporary location. Users only have to copy the language file manually: the language data is large, so copying it from plugin code would impact performance.

Sandeep Rakshe posted 02-21-2018 09:31

Hi Deepak,

I have already started testing OCR in our HCI environment, and I have some questions about testing. Can we discuss? Please reply to me by email at sandeep.rakshe@hitachivantara.com

Eckhard Roeser posted 04-30-2018 04:36

Hi Deepak,

I have now tested the plugin a bit. It works great for JPG and BMP files. Investigating these types of documents will definitely be a very good use case.

But what can we do with images contained in, e.g., PDFs or DOCs? I tested the plugin against a scanned file that was stored as a PDF, but the stage did not work with this kind of document and returned an error. This is one other use case (scanned documents stored as PDF).

Another one is the proper investigation of images embedded in otherwise text-based documents (alongside regular text). For that we would need some kind of extraction stage that finds images within text documents, extracts them (temporarily), and applies the OCR stage to them.

Have you already considered developing such a stage as well?

Thanks and rgds,

Eckhard

Data Conversion posted 05-11-2018 07:31

Hi Eckhard,

Thanks for the feedback. We can definitely add the capability to read images from scanned PDFs. I would love to work with you to understand the specific use case and take this to another level.

Let me know a convenient time to discuss.

Thanks,

Deepak

Eckhard Roeser posted 05-11-2018 15:04

Hi Deepak,

Good to hear. We currently have a use case in Denmark where we will need to deal with exactly that type of document. There is a POC coming within the next 3 weeks, and we would very much appreciate having something ready to work with. The customer is a bank, and they have lots of scanned contracts from which we have to pick a lot of data and merge it with data obtained from other sources. So if you could make an effort to provide us with some code within the next 3 weeks, it would be very much appreciated.

Thanks and brgds,

Ecky

Eckhard Roeser posted 05-14-2018 06:57

Hi Deepak,

We can have a chat today if you like. Can you please send me your details via email? I cannot find you in the HV email directory.

My email is eckhard.roeser@hitachivantara.com

Thanks.
