Hitachi Content Platform

 Sample HCI Plugin (OCR)

  • Object Storage
  • Hitachi Content Intelligence HCI
Data Conversion posted 02-01-2018 18:06

We are new to HCI development. We have developed a sample OCR stage plugin and want to share it with the community.

This plugin has a specific use case: it processes scanned image files and converts them to text using the Tesseract OCR library. It then finds specific text or fields within the document and attaches them to the HCI document as metadata fields. What to look for can be configured to suit the user's needs by providing regular expressions.
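
As an illustration of the regular-expression step described above, here is a minimal sketch. It is not taken from the plugin source: the class name `OcrFieldExtractor`, the method `extractFields`, and the field names and patterns are all hypothetical, and the OCR text is a hard-coded stand-in for whatever the OCR engine returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls named fields out of OCR text using user-supplied regular expressions. */
final class OcrFieldExtractor {

    /**
     * For each (fieldName -> regex) entry, looks for the first match in the OCR text.
     * If the regex has a capturing group, group 1 becomes the value; otherwise the whole match does.
     * Fields whose pattern does not match are simply left out of the result.
     */
    static Map<String, String> extractFields(String ocrText, Map<String, String> fieldPatterns) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldPatterns.entrySet()) {
            Matcher m = Pattern.compile(entry.getValue(), Pattern.CASE_INSENSITIVE).matcher(ocrText);
            if (m.find()) {
                String value = m.groupCount() >= 1 ? m.group(1) : m.group();
                if (value != null) {
                    metadata.put(entry.getKey(), value.trim());
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Stand-in for text returned by the OCR engine (assumption, not real plugin output).
        String ocrText = "Invoice No: INV-2018-0042\nDate: 02-01-2018\nTotal: 1,250.00 USD";

        // Hypothetical field names mapped to user-configured regular expressions.
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("invoiceNumber", "Invoice No:\\s*(\\S+)");
        patterns.put("invoiceDate", "Date:\\s*(\\d{2}-\\d{2}-\\d{4})");

        System.out.println(extractFields(ocrText, patterns));
    }
}
```

In a real stage, each entry of the returned map would be attached to the HCI document as a metadata field; that part depends on the HCI plugin SDK and is left out here.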

The entire source code, the JAR, and the setup document can be downloaded from here (GitHub). We hope this helps the team; feel free to share your suggestions.

Note: For better results, use images scanned at 300 DPI.


#HitachiContentIntelligenceHCI
Alan Bryant

This is a great start!

You may want to package the additional files within the plugin jar as resources and extract them to the temp directory you get from PluginCallback.getTempDirectory() so your users don't have to download them separately. That should avoid all the extra steps, and people would be able to just use the plugin normally.
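
A rough sketch of that idea, using only standard Java calls: copy one bundled resource out of the JAR into a directory on disk. The class name `BundledResourceExtractor` and its `extract` method are hypothetical. In the plugin, the target directory would presumably come from PluginCallback.getTempDirectory(); here it is passed in as a plain `Path`, since the exact return type of that call is not shown in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: copies a single resource bundled in the JAR to a directory on disk. */
final class BundledResourceExtractor {

    /**
     * Looks up resourceName relative to the owner class (or from the classpath root
     * if the name starts with '/') and copies it into tempDir.
     * Returns the path of the extracted file.
     */
    static Path extract(Class<?> owner, String resourceName, Path tempDir) throws IOException {
        try (InputStream in = owner.getResourceAsStream(resourceName)) {
            if (in == null) {
                // getResourceAsStream returns null when the resource is missing.
                throw new IOException("Resource not found: " + resourceName);
            }
            Files.createDirectories(tempDir);
            // Use only the last path segment as the file name on disk.
            String fileName = resourceName.substring(resourceName.lastIndexOf('/') + 1);
            Path target = tempDir.resolve(fileName);
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        }
    }
}
```

The plugin could call something like this once per bundled file during startup, so that users only need to upload the plugin JAR.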

Data Conversion

Thanks a lot for the suggestion, Alan. We will do this in the next iteration (hopefully soon).

Christie Nguyen

Very useful tool for Image Analytics use case.  Thanks for sharing!

Eckhard Roeser

Excellent news!!!! Got many questions from customers about having OCR capabilities in HCI. Will download and test.

Thanks for this!

Eckhard Roeser

Agree with Alan. It would be great to get one single download package with the JAR file, setup doc, and README included (or whatever else is required). That would make things much easier. But anyway, a very good start.

Data Conversion

Hi Alan,

We are facing some issues while fetching resources from the JAR. The following code works fine when run from a plain Java executable JAR, but it always returns an empty enumeration when the JAR is uploaded as a plugin.

Enumeration<URL> resources = ExampleStagePlugin.class.getClassLoader().getResources("linux-x86-64");

Alan Bryant

I think resources have to be files, not directories... so you could put each file in and extract them all, or you could put them in as a zip or tar archive and extract that after you pull it out as a resource.
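
For the archive route, something along these lines might work. It is only a sketch with hypothetical names (`ResourceArchiveExtractor`, `unzip`): it takes the stream of a ZIP archive (for example, one obtained with getResourceAsStream on a bundled zip file) and unpacks it into a target directory, rejecting entries that would land outside that directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: unpacks a ZIP archive (read from a stream) into a directory. */
final class ResourceArchiveExtractor {

    /** Extracts every entry of the archive under targetDir and returns the number of files written. */
    static int unzip(InputStream archive, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        int files = 0;
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Guard against entries such as "../x" that would escape the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Reads up to the end of the current entry only; the zip stream stays open.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }
}
```

With this approach the whole linux-x86-64 directory could be shipped as a single zip resource and unpacked into the plugin's temp directory on first use.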

Data Conversion

Thanks Alan,

The plugin now automatically copies the library files into the HCI temporary location. The user only has to copy the language file manually: the language data is large, so copying it from the plugin code would impact performance.

Sandeep Rakshe

Hi Deepak,

I have already started testing OCR in our HCI environment, and I have some questions about testing. Can we discuss? Please reply to me by email at sandeep.rakshe@hitachivantara.com

Eckhard Roeser

Hi Deepak,

I have now tested the plugin a bit. It works great for JPG and BMP files. Investigating these types of documents will definitely be a very good use case.

But what can we do with images embedded in other documents, e.g., PDFs or DOCs? I tested the plugin against a scanned file that was stored as a PDF, but the stage did not work with this kind of document and returned an error. This is another use case (scanned documents stored as PDF).

Another one is the proper investigation of images inserted into text-based documents (together with regular text). For this we would need a kind of extraction stage that finds images within text documents, extracts them (temporarily), and applies the OCR stage to them.

Have you already considered developing such a stage as well?

Thanks and rgds,

Eckhard

Data Conversion

Hi Eckhard,

Thanks for the feedback. We can definitely add the capability to read images from scanned PDFs. I would love to work with you to understand the specific use case and take this to another level.

Let me know a convenient time to discuss.

Thanks,

Deepak

Eckhard Roeser

Hi Deepak,

Good to hear. We currently have a use case in Denmark where we will need to deal with exactly that type of document. There is a POC coming within the next 3 weeks, and we would very much appreciate having something ready to work with. The customer is a bank with lots of scanned contracts, from which we have to pick a lot of data and merge it with data obtained from other sources. So, if you could make an effort to provide us with some code within the next 3 weeks, it would be very much appreciated.

Thanks and brgds,

Ecky

Eckhard Roeser

Hi Deepak,

We can have a chat today if you like. Can you please send me your details via email? I cannot find you in the HV email directory.

My email is eckhard.roeser@hitachivantara.com

Thanks.