Hitachi Content Platform

 Isolate Pre-Processing pipeline to a specific instance.

  • Object Storage
  • Hitachi Content Intelligence HCI
Clifford Grimm posted 02-27-2020 16:33

I am planning a rather complex pre-processing pipeline that will include a custom stage. This custom stage will read a file in the /opt/hci folder to provide it with run-time configuration during execution. As an example, let's say the file contains instructions on how to parse various document paths to extract metadata; the point of the file is to treat the stage as a "template" executor and to avoid hard coding or unnecessarily error-prone configuration when the step is created.
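Purely for illustration (a real HCI custom stage would be built with the plugin SDK, and every file name, field name, and rule format below is hypothetical), the file-driven "template executor" pattern described above might look roughly like this:

```python
import json
import os

# Hypothetical location of the run-time config file on the instance.
CONFIG_PATH = "/opt/hci/path_parsing_rules.json"

def load_runtime_config(path=CONFIG_PATH):
    """Read path-parsing rules from a local file; fall back to an
    empty rule set if the file is not present on this instance."""
    if not os.path.exists(path):
        return {"rules": []}
    with open(path) as f:
        return json.load(f)

def extract_metadata(document_path, config):
    """Apply each configured rule: when its path prefix matches the
    document path, emit that rule's metadata fields. The stage logic
    stays generic; only the file contents change per deployment."""
    metadata = {}
    for rule in config["rules"]:
        if document_path.startswith(rule["prefix"]):
            metadata.update(rule["fields"])
    return metadata
```

The catch, of course, is that this only works on an instance where /opt/hci holds the file, which is exactly what motivates the question below.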

 

So the question is whether there is a way to "assign" the execution of a pre-processing pipeline to a specific instance or group of instances with more granularity than just which nodes the workflow agent runs on. Being able to do this would help minimize the ad hoc configuration needed on the HCI instances to hold this stage configuration file, and might reduce the manual work required to "scale" HCI with more instances for other scaling needs.


#HitachiContentIntelligenceHCI
Alan Bryant

Why not make that config file regular stage configuration? StagePlugins are generally supposed to be configurable, so the "run-time config" use case should be covered without external files.

Clifford Grimm

I realize I didn't really give a very solid reason/example for doing such a thing. I certainly understand your point here.

 

At the risk of making this discussion more complex/confusing: one of the pain points of pipeline development is that it is not possible to copy an individual step with its configuration and reuse it. While most steps have very simple configuration, where creating a new step and configuring it is feasible, those with far more configuration can be painful.

 

Still trying to keep this simple, let's say there is a step whose task is to generate some new fields based on other fields, using the lengthy configuration as a guide. This step also has a boolean configuration option that triggers an "update" of some stateful information (like a "batch number"). On its initial use in the pipeline, the "update" boolean would be false and the step would just use the current value. Further down the line, however, a logic decision may require calling the same step again to regenerate field values, this time with the "update" boolean set to "true", which increments and updates the "batch number".

 

Creating the subsequent "update" step with all of the complex configuration would be very painful just to change the "update" boolean from false to true. Keeping the configuration elsewhere allows it to be changed once for all steps configured to use it, without finding and changing each of those steps.
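As a sketch of the pattern being described (names here are made up; in HCI this logic would live inside the stage plugin), the same step logic is invoked twice and differs only in the "update" flag, while the batch number itself is state held outside the step configuration:

```python
class BatchNumberState:
    """Stand-in for stateful information kept outside the step config,
    e.g. a batch number stored somewhere shared."""
    def __init__(self, start=0):
        self.batch_number = start

    def get(self, update=False):
        # update=True increments and stores the new batch number;
        # update=False just returns the current value.
        if update:
            self.batch_number += 1
        return self.batch_number

def generate_fields(doc, state, update=False):
    """One step logic, used at two points in the pipeline; only the
    'update' boolean differs between the two step configurations."""
    doc = dict(doc)
    doc["batch_number"] = state.get(update=update)
    return doc
```

With the rest of the complex configuration externalized, the two step definitions in the pipeline really would differ only by that one boolean.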

 

Then consider what happens if you need to change the complex configuration itself, which requires finding all the places it is used across the pipelines defined in the deployment.

 

Regardless, I have decided to put the information into an external database instead, which avoids any concern about where the pipeline stage is running and removes the need to keep flat files synchronized across the instances.
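A rough sketch of that approach (sqlite3 stands in here for whatever external database is actually chosen; the table and key names are invented): the shared configuration and the batch number live centrally, so it no longer matters which instance happens to run the stage:

```python
import sqlite3

def init_db(conn):
    """Create a simple key/value table for externalized stage state;
    the parsing-rules config could be stored here as JSON too."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stage_config (key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO stage_config VALUES ('batch_number', '0')")
    conn.commit()

def get_config(conn, key):
    """Any instance running the stage reads the same current value."""
    row = conn.execute(
        "SELECT value FROM stage_config WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

def increment_batch_number(conn):
    """The 'update' variant of the step bumps the shared batch number."""
    current = int(get_config(conn, "batch_number"))
    conn.execute(
        "UPDATE stage_config SET value = ? WHERE key = 'batch_number'",
        (str(current + 1),),
    )
    conn.commit()
    return current + 1
```

Because the state is in one place, changing the complex configuration once is enough for every step that uses it, with no per-instance files to track down.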

 

Jordan Diehl

We do not support the granularity you originally described. The Job Driver, which runs the pre-processing pipeline, has a single instance running per workflow, and it picks where it will run from the instance pool on which the workflow agent is configured to run. So the solution you have described sounds like the right thing to do.