This post originally published by Bryan Senseman on Wednesday, October 15, 2014
I'm a huge user of Mondrian, the high speed open source OLAP engine behind Pentaho Analyzer, Saiku, and more. While the core Mondrian engine is wonderful, there are times when it doesn't do what I need it to, or exactly what I expect it to. Take this special case that I've encountered at some of my OEM customers, a case I call "Abnormal Genealogy."
Mondrian Needs Certain Data To Show Hierarchies
Mondrian has multiple ways to define hierarchies; traditional, ragged and parent-child. This is the feature that allows you to roll up and roll down metrics for complex parent and child record combinations. Think organization hierarchies, product hierarchies and employee hierarchies. In many cases, you have the IDs and data to support this cleanly, the way Mondrian expects, where parents have IDs that are less than (e.g. 3 < 14) than their associated children IDs.
But having data the way Mondrian expects it isn't always the case, particularly with OEM customers. There can be cases where the IDs of the child are less than than IDs of the parent. For example, my latest client preloads their system with standard/common "items" and then allows the end customer to create new parents (groupings) of these items. As a result, the preloaded items end up with IDs that are less than the newly added, customer-specific parent items. This reasonable load pattern breaks the Mondrian assumption that parent IDs are always less than child IDs.
Parent-Child ID's Need to Follow Certain Rules
When I first saw this, I thought "This is an edge case." But then I saw it again. And again. Clearly, I had to come up with a repeatable way to solve this problem. Fortunately, the ubiquitous FoodMart database can be used to replicate this issue. It contains an employee hierarchy with parent-child relationships that are pre-populated to fit the Mondrian assumption: all parent IDs are less than child IDs.
Out of the box, FoodMart works correctly, but what happens if we simulate the replacement of a supervisor? We need to insert a new supervisor record in the employee table. This results in the new supervisor record receiving a higher employee_id than the employees which will report to it. The simple assignment of IDs for new records causes this peculiarly common case: abnormal genealogy!
To demonstrate this FoodMart scenario, let's look at a Pentaho Analyzer report showing Senior Management and some reporting data before the supervisor replacement:
Sheri Nowmer, inclusive of herself, has 10 Senior Managers assigned to her, which includes Roberta Damstra. All is well and good since the IDs of the parent records (supervisors) are less than the IDs of the child records (managed employees). Now, let's say Roberta retires to sunny Orlando and a new VP is hired, Mr. New Guy, on 1/1/1998. A few simple SQL edits and a Mondrian cache clearing and we should be good to go. We'll assign Mr. New Guy's employee_id with the value 1206. 1206 is much larger than any other employee_id in our database so it will easily stick out when we are reviewing the data. It also follows the usual pattern of a new employee record being assigned an ID which is greater than the maximum ID used.
Running the same report again after clearing the Analyzer and Mondrian caches, provides us with this view:
As expected, Roberta is no longer listed. However, Mr. New Guy also does not appear in this view. We confirm that we inserted his record so that he reports to Sheri Nowmer and that all of Roberta's employees now report to him. So, what happened to Mr. New Guy? His "abnormal genealogy" is causing internal issues for Mondrian. That is, the fact that his ID is greater than the IDs of his reports is causing Mondrian to incorrectly traverse the reporting hierarchy.
Solving the Abnormal Geneaology Problem
So, how can we solve this problem? We could dig into the Mondrian source code. It's open source afterall. But, I'd prefer to leave revising the core of the Mondrian engine to the java wizards at Pentaho and in the community. Perhaps, we can pull off a bit of Pentaho Data Integration (PDI) magic with some minor data structure changes to overcome this gap faster and easier.
The solution here is to use PDI to generate an extra set of ordered parent-child IDs. Using PDI, we need to identify the correct hierarchical order and make this order available to Mondrian. The Transform shown below demonstrates the steps I used to re-sequence the IDs to the correct genealogical order. Using this approach, parents consistently have smaller IDs than all of their children “making Mondrian happy without the need to modify its core code." Since the solution is data-centric we're also better protected for future versions of Mondrian.
So where exactly is the "magic" needed to resequence the IDs? It's in the use of the Closure Generator step, utilizing the existing parent and child IDs (which works regardless of genealogy). In this case, we take the base data and then reorder the ultimate parent rows by distance, as provided from the Closure Generator. Notice (below) that the Mr. New Guy, employee_id number 1206, is now the ninth "oldest" employee, perfect for a parent!
The next part of the solution is to update the dimension table with both ordered IDs as they will be needed in the Mondrian hierarchy configuration. If you're maintaining this with a PDI Dimension Lookup/Update step, I suggest using the Punch Thru option for these two fields.
The final bit of data engineering requires you to create a tweaked version of the closure table that uses the ordered parent and original child IDs. This allows the correct groupings for parent aggregation while linking back into the original fact table with the actual IDs.
Now that our data engineering tasks are complete, all we need is a small adjustment of the hierarchy definition in the Mondrian schema. We simply modify the level definition to utilize the new ordered fields. These changes result in the following dimension definition. (Note we create new tables called employee2 and employee_closure_cube in order to maintain the original foodmart structures in case we want to revert to the out-of-the-box configuration later.)
Back to Analyzer we go, refreshing both caches and...sure enough; Mr. New Guy is official!
When I first encountered this problem, even Google wasn't able to find an effective solution. I only found a few discussions of the problem in semi-related jiras: http://jira.pentaho.com/browse/MONDRIAN-1328 and http://jira.pentaho.com/browse/MONDRIAN-1511. Given my experience with hierarchical data, I expect that this is a fairly common case. I hope you find this approach helpful.