
 Scope of StagePlugin Session data

Clifford Grimm posted 03-02-2020 15:39

When developing a stage, there are hooks to set/use session data for the step. What I am looking to understand is the scope of the session information, taking into consideration that a stage can be used in pre-processing or Workflow-Agent modes. Then also consider a stage that is introduced in a pipeline multiple times. Under all these situations, what is considered a "session"? Is a session owned by a single configured step? Is the session shared by all steps in a pipeline that are of the same type? Then, when executing in Workflow-Agent mode, will each executing instance of a step have its own session, regardless of what and how many HCI batches and/or instances it could be executed on? And considering all this, what considerations should be made around locking, if any, to ensure the session is properly updated?

 

For those who are not aware of what this is: there is a PluginSession interface that can be implemented in a class and that allows for storing various information that can be used by the step. The session can be initially configured via the startSession method that can be implemented as part of the StagePlugin. Then, during execution against each document, the session can be accessed to either read or record information. This can be handy for things like establishing a connection to another resource, such as a database, to obtain additional information the stage may require during execution.
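
To make that concrete, here is a minimal sketch of a session class that holds a database connection, in line with the description above. It is illustrative only: a real plugin would implement the SDK's PluginSession interface, whose package and exact methods are not shown here, and the connection details are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch only: in a real plugin this class would implement the SDK's
// PluginSession interface; the interface's exact methods are not shown here.
public class DatabaseLookupSession /* implements PluginSession */ {

    private final Connection connection;

    private DatabaseLookupSession(Connection connection) {
        this.connection = connection;
    }

    // Typically called from the stage's startSession hook so the expensive
    // connection is opened once and reused for every document.
    public static DatabaseLookupSession open(String jdbcUrl, String user, String password)
            throws SQLException {
        return new DatabaseLookupSession(DriverManager.getConnection(jdbcUrl, user, password));
    }

    // Used while processing each document to look up extra information.
    public Connection connection() {
        return connection;
    }

    // Called when the session is torn down (e.g. the workflow stops).
    public void close() {
        try {
            connection.close();
        } catch (SQLException ignored) {
            // Best effort: the session is going away regardless.
        }
    }
}
```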


#HitachiContentIntelligenceHCI
Jordan Diehl

Plugin sessions were designed to hold resources that are expensive to set up and tear down (e.g. database connections). In the case of a workflow, the preprocessing and connector plugin sessions will be started when the workflow starts running and remain open until the workflow stops running (i.e. when it completes, pauses, or halts). The Workflow-Agent starts its own sessions on every partition that runs, and those sessions will be closed when the partition has completed.

 

The sessions are not intended to store state. They could be used to hold a connection to some kind of external database and store/read the information there, but separate sessions are created on separate nodes/threads, so you might need some kind of external cluster lock as well if you attempt to store state this way. This seems to be the solution you have already described in your other question though, so you are already doing what we would have suggested.
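
One way to get that locking is to let the database itself serialize the increment with a row lock, rather than a separate cluster lock. A hedged sketch of that idea, assuming a simple counters table (columns name, value) that you create and seed yourself; nothing here is part of the HCI SDK:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative helper: atomically bumps a named counter row and returns the
// new value. The SELECT ... FOR UPDATE row lock makes concurrent sessions on
// other nodes/threads queue up behind each other.
public final class ExternalCounter {

    public static long incrementAndGet(Connection conn, String counterName) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try {
            long next;
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT value FROM counters WHERE name = ? FOR UPDATE")) {
                select.setString(1, counterName);
                try (ResultSet rs = select.executeQuery()) {
                    rs.next();
                    next = rs.getLong(1) + 1;
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE counters SET value = ? WHERE name = ?")) {
                update.setLong(1, next);
                update.setString(2, counterName);
                update.executeUpdate();
            }
            conn.commit();
            return next;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```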

Clifford Grimm

Thanks Jordan. I am struggling with implementing a data migration mechanism that puts content onto HCP with a substantially variable folder structure, where each folder holds only a fixed number of entries. In the very simplest example, I need to change the folder when a certain number of objects have been written. For example, start off with Folder_001, then when it reaches 10,000 items, switch to Folder_002.

 

What I am contemplating is having a pre-processing pipeline that essentially has a step that starts with an initial value of 1 and increments the value for every document that passes through. The most efficient approach would be to store the value in the stage session and increment the in-memory value on every call. From what I can tell, it seems to work pretty well using the pre-processing pipeline. However, I am not really sure when "instances" of the step are created, so I want to make sure the behavior is predictable.
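
The in-memory counter itself is small; a sketch of what that session-held state might look like (the class name and folder-naming pattern are made up for illustration, and the 10,000 limit is taken from this thread; nothing here is an HCI SDK class):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the in-session counter idea. AtomicLong is not strictly
// needed if the pipeline is single-threaded, but it is a cheap safety net.
public class FolderCounterSession /* implements PluginSession */ {

    private static final long ITEMS_PER_FOLDER = 10_000;

    private final AtomicLong documentsSeen = new AtomicLong();

    // Called by the stage for every document that passes through the
    // pre-processing pipeline.
    public String folderForNextDocument() {
        long count = documentsSeen.incrementAndGet();             // 1, 2, 3, ...
        long folderIndex = ((count - 1) / ITEMS_PER_FOLDER) + 1;  // Folder_001 holds docs 1..10,000
        return String.format("Folder_%03d", folderIndex);
    }
}
```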

 

I am still going to be taking the external database approach for most stuff, but still need to consider if I can effectively do any caching in a stage session. I would hate to contemplate updating a database every time the counter is incremented.

 

So for session behaviors, my guess is that the following determines when, and how many, separate instances/sessions would be created:

1) Each occurrence of the step in a processing pipeline. So if it is in a pipeline twice, there will be separate "sessions" for each occurrence.

2) For each parallel job, there will be a multiple of the "N" occurrences from item 1 above. So if there are 2 instances in a pipeline and 2 parallel jobs, there would be a total of 4 instances of the step, with separate sessions.

 

Remaining questions:

1) What is the lifecycle of any session? Is it destroyed and recreated from scratch when a task is paused, stopped, or sleeps between "Check for updates"? Any other times? Thus I am assuming that when the task starts back up, new sessions will be started, and in this example the counter will restart from 1?

2) Does the task performance configuration for parallel jobs impact pipelines configured for pre-processing execution? Or does it only impact Workflow-Agent mode?

 

 

Jordan Diehl

1) The sessions for the pre-processing pipeline stages are created when the workflow task starts, and will only be stopped when the task is paused or stops. They will remain open during the sleeps between "Check for updates". So the workflow being paused, completed, or halted should be the only times when you will lose the in-memory information in the session.

 

2) No, the task performance parallel jobs setting does not impact the pre-processing pipelines. The pre-processing pipeline elements are executed serially and single-threaded.

 

One approach you could take is a bit of a combination of the two things we've talked about. You could use the counter in the pre-processing session to count up to 10,000, then increment a value in an external database. That would mean that if the workflow ever stopped, you would still have the folder number saved externally. The main negative is that if the workflow were to ever stop, you would not know exactly how many objects were put into the folder.
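
As an illustration of that combination, here is a sketch that keeps the per-document count in the session and only touches the external database when the folder number changes. It reuses the hypothetical ExternalCounter helper sketched earlier; the class and field names are made up, not HCI SDK names.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the hybrid approach: in-memory counting within a folder, with the
// folder index persisted externally once per 10,000 documents.
public class HybridFolderSession /* implements PluginSession */ {

    private static final long ITEMS_PER_FOLDER = 10_000;

    private final Connection conn;
    private final AtomicLong documentsInCurrentFolder = new AtomicLong();
    private volatile long currentFolderIndex;

    public HybridFolderSession(Connection conn) throws SQLException {
        this.conn = conn;
        // Start in a fresh folder so a restart never reuses a partially filled one.
        this.currentFolderIndex = ExternalCounter.incrementAndGet(conn, "folderIndex");
    }

    // Called for every document; returns the folder the document should go to.
    public String folderForNextDocument() throws SQLException {
        if (documentsInCurrentFolder.incrementAndGet() > ITEMS_PER_FOLDER) {
            // Crossed the boundary: move to the next folder and persist that fact.
            currentFolderIndex = ExternalCounter.incrementAndGet(conn, "folderIndex");
            documentsInCurrentFolder.set(1);
        }
        return String.format("Folder_%03d", currentFolderIndex);
    }
}
```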

 

So if you are just trying to spread the objects somewhat evenly across the directories, you can just increment the folder value on session startup (to avoid ever putting more than 10,000 objects in a folder) and accept that stopping the workflow will always result in a folder with fewer than 10,000 objects. If you have a hard requirement to have exactly 10,000 objects in every folder, then this won't work though.