The need for simple compliance search
Companies often need to produce documents and emails as part of legal disputes. Perhaps an employee deleted emails from the server, possibly years ago, or an external partner claims the contents of a mailed document were different. Quite often the work is then passed to the IT team: find the necessary emails. This usually means restoring from backup, digging through multiple archives, finding and searching PST files and so on.
Without the proper tools this is almost guaranteed to be a waste of time, and it is not a task IT should be handling in the first place. And with the next request it starts all over again...
The simple solution
A proper and lightweight tool is the combination of Hitachi Content Intelligence and Hitachi Content Platform to create a tailored search, in this case specifically for email. The high level architecture looks something like this:
Using the SMTP journaling mechanism Hitachi Content Platform (HCP) can ingest email directly from the mail server or gateway without the need for any 3rd party software. Hitachi Content Intelligence (HCI) then processes the ingested emails, extracts the relevant information and creates a searchable index.
The built-in search app of HCI is the front end for the data consumers (e.g. the legal or compliance team), who can carry out their own searches without involving IT.
A good search design is crucial to make it fast, intuitive and last but not least accurate. Who says search has to be cumbersome? The search app is very flexible and can easily be adapted, without coding of course, towards email search, which this blog entry is all about.
How to set it up
This guide assumes the following:
- HCP and HCI are already up and running.
- Your email server / gateway is configured to forward incoming and outgoing email to HCP via SMTP.
- Authentication is gracefully ignored to keep it simple. HCI does support very fine grained access control to make sure sensitive data (as an index of your email data would be) can only be accessed by authorized users. For a deep dive on this topic have a look at this excellent summary: HCI: Authentication and Authorization
Step 0: The development dataset
I have used the widely known Enron email dataset, specifically the one provided by EDRM and Nuix (available here: https://www.edrm.net/resources/data-sets/edrm-enron-email-data-set/ ), which includes the attachments, is cleansed from PII data and is provided as a set of PST files.
For my purposes I extracted the PST files into separate .eml files with readpst, part of the libpst package. This mimics the real life scenario, since HCP will be set up to ingest each email as an .eml object including any attachments. All email data shown in this blog is from this dataset.
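To get a feel for what each ingested .eml object contains, here is a minimal Python sketch that builds a small made-up message with one attachment and reads it back using the standard library email package. The addresses, subject and file name are invented for illustration and are not taken from the Enron dataset. HCI does its own parsing in the pipeline, so this only shows the kind of information that is available.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_eml() -> bytes:
    """Create the bytes of a small .eml file (all values are made up)."""
    msg = EmailMessage()
    msg["From"] = "Jane Doe <jane.doe@example.com>"
    msg["To"] = "John Roe <john.roe@example.org>"
    msg["Cc"] = "audit@example.com"
    msg["Subject"] = "Quarterly numbers"
    msg["Date"] = "Mon, 15 Oct 2001 09:30:00 -0500"
    msg.set_content("Please find the spreadsheet attached.")
    # The attachment is stored inline, as part of the same .eml object.
    msg.add_attachment(
        b"dummy spreadsheet bytes",
        maintype="application",
        subtype="vnd.ms-excel",
        filename="numbers.xls",
    )
    return msg.as_bytes()


def summarize_eml(raw: bytes) -> dict:
    """Extract the header fields, body text and attachment names."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "cc": str(msg["Cc"]),
        "subject": str(msg["Subject"]),
        "body": body.get_content().strip() if body else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


summary = summarize_eml(build_sample_eml())
print(summary["subject"], summary["attachments"])
```

The same summarize_eml function could be pointed at the bytes of any file produced by readpst to check what a given email looks like before it is indexed.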
Step 1: Think about the user experience
Before actually doing anything like configuring the user interface or adapting index workflows, pause to think about the different ways emails should be searched for. I came up with the following ways a user might search for email:
- free text search by sender (from), recipients (to) or carbon copy (cc)
- faceting for sender, recipient or carbon copy
- free text search by subject
- free text search by content (email body and attachments)
- searches refined by date or date range
- sort by date
- searches refined by attachment type
Those should cover most of the requests like "find the email Jane Doe sent to our competitor with the confidential Excel spreadsheet sometime last month". They would also make it possible to quickly produce the complete conversation between a sender and a recipient for a certain time or a certain subject, or indeed between two companies based on domains. Simple searches for content would also be possible of course.
A simple search flow might be: look for a certain time range, pick a sender and recipient and then look for specific subjects or contents. We want to make it easy for the end user to create those kinds of searches and will tailor the setup to enable just that.
Step 2: Prepare HCP
The first step is to set up HCP the proper way. There are multiple ways in which HCP can store emails ingested through SMTP. First we need a namespace where SMTP is enabled and the correct options for storing are set. You can do so in the protocols section of the admin interface:
By choosing .eml with inline attachments, each email will be stored as a single entity, and later referencing is easier and more granular.
Step 3: Set up the data connection to HCP
After you have verified emails are ingested on HCP, HCI needs to know where to look. HCI supports HCP right out of the box in multiple ways. For this use case the HCP MQE (Metadata Query Engine) data connection is used:
With this connection, the indexing job only needs to pick up the changes since the last run. Given the WORM nature of the stored emails, we don't need to look at indexed objects twice, since we can be sure they have not been changed.
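The idea of picking up only what is new since the last run can be sketched in a few lines of Python. This is purely conceptual: StoredObject and select_new_objects are hypothetical names, and the snippet does not call the HCP MQE API or reflect how HCI implements it internally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical stand-in for an object in the namespace."""
    path: str
    ingested_at: datetime


def select_new_objects(objects, last_run):
    """Return only the objects ingested after the previous indexing run.

    Because the objects are immutable (WORM), anything ingested before
    last_run has already been indexed and never needs a second look.
    """
    return [obj for obj in objects if obj.ingested_at > last_run]


objects = [
    StoredObject("mail-0001.eml", datetime(2001, 10, 1, tzinfo=timezone.utc)),
    StoredObject("mail-0002.eml", datetime(2001, 10, 20, tzinfo=timezone.utc)),
]
last_run = datetime(2001, 10, 10, tzinfo=timezone.utc)
print([obj.path for obj in select_new_objects(objects, last_run)])
```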
Step 4: Develop the pipelines
This is the part where we adapt the pipeline to create a meaningful index for the email search use case. The HCI default pipeline provided with the system is a very good starting point. Basically it will do the following:
- detect the MIME type
- expand any archives / documents like PST, zip, mbox, ...
- expand emails into separate parts: this is where all the attachments are separated from the email body and processed separately
- extract text and metadata: this creates fields with the extracted text and special fields like Message_From and Creation_Date
- snippet extraction: this creates a short preview of the content for display in the search results
- date conversion: one of the most important stages as it will bring all dates to a common format (ISO 8601)
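To illustrate why the date conversion matters, here is a small Python sketch that turns the date format found in email headers into an ISO 8601 string. It uses only the standard library. Normalizing to UTC is my assumption for the example; the actual HCI stage may handle time zones differently.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_iso8601_utc(header_date: str) -> str:
    """Convert an email Date header value to an ISO 8601 UTC timestamp."""
    parsed = parsedate_to_datetime(header_date)
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC here.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_iso8601_utc("Mon, 15 Oct 2001 09:30:00 -0500"))  # 2001-10-15T14:30:00Z
```

Once every date is in the same format, range refinements and sorting by date become simple comparisons.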
For this particular use case I added only two processing stages to the default pipeline:
- one mapping stage to create renamed copies of certain fields to make the end user experience better, e.g. Message_From to from (more on that later)
- one tagging stage to tag emails with attachments
I created a second processing pipeline to create additional tags for file type icons (this makes search results nicer to look at) and map the gazillion different content types to more meaningful values. For example "image/gif" and "image/jpeg" and so on are all mapped to type "Image". Again this makes it easier for the end user, who will prefer to work with meaningful terms instead of cryptic IT speak.
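The content type mapping boils down to a lookup table. The following Python sketch shows the principle. Only the "Image" mapping comes from the description above; the other labels and the fallback value "Other" are hypothetical examples, and in HCI this is configured in pipeline stages rather than written as code.

```python
# Exact matches are checked first, then prefixes such as "image/".
# Apart from "Image", all labels here are made up for illustration.
EXACT_TYPES = {
    "application/pdf": "PDF",
    "application/vnd.ms-excel": "Spreadsheet",
    "message/rfc822": "Email",
}
PREFIX_TYPES = {
    "image/": "Image",
    "text/": "Text",
}


def friendly_type(content_type: str) -> str:
    """Map a raw MIME content type to a label end users can relate to."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    base = content_type.split(";", 1)[0].strip().lower()
    if base in EXACT_TYPES:
        return EXACT_TYPES[base]
    for prefix, label in PREFIX_TYPES.items():
        if base.startswith(prefix):
            return label
    return "Other"


for raw in ("image/gif", "image/jpeg", "text/plain; charset=utf-8"):
    print(raw, "->", friendly_type(raw))
```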
If you want to have a detailed look, the pipelines are available in the attached export bundle at the end of the post.
Step 5: Create and kick off the workflow
The workflow is where you tie it all together. You need to provide the input (the data connection we set up), the processing pipelines and the output, an empty index in our case.
Let the workflow run once. It is best to use a small but reasonable sample set for the initial development. I picked a couple of mailboxes from the Enron dataset with about 50,000 mails to start with. This keeps iterating (adapting the index for the use case) fast and should be enough emails so that most types we expect in real life are included. If not (e.g. no emails with presentations), just pick some more.
After the workflow has finished the index is ready and we can set up the look and feel experience for the end users.
Step 6: Adapt the index schema
The index is set up as a collection of fields, each with a certain type. A field's type, together with its attributes, defines how the field is treated in searches.
For email search senders and recipients play a big role. Those are extracted from the email headers and stored in the fields Message_From and Message_To. Unfortunately this is not very well defined and the same address can be written in a multitude of ways. For example for John Doe you might see things like:
Message_From and Message_To are created as type "strings" by the system. Fields of this type are not broken up into separate tokens and in order to produce a hit the search term needs to be exact, including case. Now this is a pain since the exact format and spelling is typically unknown. Besides that typing
into the search field is not very satisfying.
Fortunately HCI has a solution for this. By changing the type from "strings" to "text_hci" we get a tokenized case insensitive search for the fields. Since the Message_From fields were copied to a field called from we can do the changes there and keep the original:
Now our entry into the search field can be as simple as
to produce a match on all of the above variants. Neat, right?
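The difference between the two field types can be illustrated with a toy model in Python. The address variants below are made up for this sketch, and the tokenizer is a deliberately simple stand-in: the real text_hci analyzer may split and normalize values differently.

```python
import re

# Made-up ways the same person might appear in a From header.
VARIANTS = [
    "John Doe <john.doe@example.com>",
    '"Doe, John" <John.Doe@example.com>',
    "JOHN.DOE@EXAMPLE.COM",
]


def exact_match(field_value: str, term: str) -> bool:
    """Untokenized string field: the whole value must match, including case."""
    return field_value == term


def tokens(value: str) -> set:
    """Lowercase the value and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def tokenized_match(field_value: str, term: str) -> bool:
    """Tokenized field: every token of the search term must be present."""
    return tokens(term) <= tokens(field_value)


print([exact_match(v, "doe") for v in VARIANTS])      # no variant matches
print([tokenized_match(v, "doe") for v in VARIANTS])  # every variant matches
```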
The last schema change is to switch the Creation_Date field from type "tdates" to "tdate". An email can only have one date when it was sent anyway, and having a single value allows for sorting by date, which is super useful.
Step 7: Adapt the end user experience
Now that everything is in place, we are just a few steps away from a super efficient search for our end users. The behavior of the search app is adapted within the Query Settings section of a particular index.
The first thing is to allow a download of the search results. The intention here is to provide a list of links into the archive for every hit in the final search.
Next we pretty up the user interface by changing the display names of fields we want to expose. This is achieved in the Fields section. Those Message_xx fields are also the ones we enable as facets. They will show up on the left side of the UI. I have chosen the Creation_Date and Content_Type fields as refinements. Finally the Creation_Date field should also be sortable. All those settings are available in the sections highlighted below:
The last adjustment (Relevancy and Access Control have not been tampered with) is how hits are displayed in the search results, which is configured in the Results section. There are basically two different kinds of search results in this case: email bodies and attachments. So the results are tailored towards those two cases. To summarize, the following will be displayed: the subject as title, the URI to the original object in HCP, a snippet of the content, a file type icon (if it is an attachment), the date, the sender, the recipient, carbon copy and a list of attachments:
Results in the search UI will then look like this for both cases:
The UI now has a couple of features used to iteratively drill down the results. Here they are mapped out (left to right, top to bottom):
- select index (if you have more than one)
- search field
- download button
- enable refinements
- refinements (Content Type and Date)
- facets with separate search field
- results section
- ...more to expand the full contents of each result
On to the last task: finding emails.
Step 8: Test and try out the search UI
Now it's time to put the search features to good use. Of all the different ways to search, here is one example. Let's assume we know the following: one employee received a document sometime in October containing the words "fun run".
One way would be to first limit the time frame with the date picker of the Date refinement. Second, we enter +"fun run" in the search field and get 5 results. Why the plus sign and why the quotes? That's down to syntax. The plus marks the term as required (a logical AND) and the quotes ensure the words are searched for in exactly that order. If in doubt about syntax, HCI has you covered again with a simple built-in help function:
Quickly investigating the result set of 5 shows that we found the document in question and identified it as the "Enron Running Club" document, actually sent to a distribution list. The whole thing took less than a minute to drill down to a couple of hits, without even using the full breadth of capabilities we built. Pretty impressive!
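To make the effect of the quotes in +"fun run" concrete, here is a toy phrase matcher in Python. It only models the "words in exactly that order" part on plain strings, with a simplistic tokenizer of my own; it is not how the HCI index evaluates queries.

```python
import re


def words(text: str) -> list:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text: str, phrase: str) -> bool:
    """Return True if the words of phrase appear in text, adjacent and in order."""
    haystack, needle = words(text), words(phrase)
    size = len(needle)
    if size == 0:
        return False
    return any(
        haystack[start:start + size] == needle
        for start in range(len(haystack) - size + 1)
    )


print(contains_phrase("Sign up for the annual Fun Run today", "fun run"))
print(contains_phrase("We run for fun every Friday", "fun run"))
```

The first text contains the phrase, the second contains both words but not in that order, so only the first would count as a hit for a quoted search.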
I am sure by now you are itching to try this out yourself or see it in action. Get in touch with us so we can schedule a demo with your trusted Solution Consultant!
If you already have HCI you can just download the exported versions of the processing pipelines and index definitions. I have attached them for your convenience. Be sure to optimize them a bit more and remove unnecessary fields to keep the index lean and mean. The workflows and data connections are not included as they are unique to your specific environment.