Pentaho

 View Only

 How to determine all levels of parent-child-relationships?

  • Pentaho
  • Kettle
  • Pentaho
  • Pentaho Data Integration PDI
Gert Wieland's profile image
Gert Wieland posted 01-29-2019 16:25

Hi community,

I have a (long) list of child/ parent relationships (like an organigram), basically, it looks like this:

ChildParentA-B-CADCEB

A and B have no parents, i. e. they are root nodes. Then C is child of A, and D is child of C, i. .e A -> C -> D.

E is child of B, i. e. E -> B. Although it's not explicitly listed, E too is a root node because it's nobody's child.

Is there a way so PDI could explore all paths in this table, and return a "flat list" (as shown above, A -> C -> D).

Thanks so much for any help!


#PentahoDataIntegrationPDI
#Pentaho
#Kettle
Roguen Keller's profile image
Roguen Keller

Hi Gert WielandI've moved your question to the Pentaho Data Integration space within the community.

I hope it gets some additional visibility by the right audience for your question. 

Gert Wieland's profile image
Gert Wieland

Thank you Roguen! Sorry I missed that, I'll put it in the right spot next time.

Much appreciated,

Gert

Sparkles Sparkles's profile image
Sparkles Sparkles

How can E be a root node and be "nobody's child" when B is explicitly stated as being the parent of E?

Ana Gonzalez's profile image
Ana Gonzalez

It should be possible using PDI using loops, but for me the easiest way would be to load that information to a table in a database that supports hierarchical queries and solve the problem with a query, for example, in Oracle database there's the CONNECT BY PRIOR that makes it easy to solve the problem, I've read in recent versions of MariaDB there's the WITH RECURSIVE construction. Of course, that's only a solution if you already have access to a database, I'm suggesting it in case you are stuck to try to solve that problem with PDI when another solution might be easier.

Regards

Gert Wieland's profile image
Gert Wieland

Hi Sparkles,

You are right, thanks for pointing that out. See? That's exactly why I need a systematic solution, obviously, a 5-row table is already too complex for my "human interpretation"

Gert Wieland's profile image
Gert Wieland

Hi Ana,

Thank you very much for your answer, that's very helpful. I'd like to stay within PDI, so I'll definitely have a look at loops (seems like a useful technique anyway).

But indeed, the data I'm pulling sits in an Oracle database to which I have full access, so I'll also check out the "connect by prior" feature you mentioned.

Truly, thank you very much, in just 3 sentences you taught me a lot. I really appreciate that

Gert

Sparkles Sparkles's profile image
Sparkles Sparkles

This was actually a fun exercise to do in java! I made a simplified version if we can guarantee some rules:

  • Each node can have 0-1 parent
  • Each node can have 0-1 child
  • No circular references
  • Each row includes exactly 1 unique child and 0-1 unique parent

If you want multiple to support multiple children (tree instead of graph?), this becomes more complex. There's also no guard / security here, so dirty incoming data could lead to unknown results or behavior.

First, make an (inner) class called Node:

class Node {

     String name = null;

     Node parent = null;

     Node child = null;

 

     public Node()

     {

 

     }

     public Node(String name)

     {

          this.name = name;

     }

     boolean hasName()

     {

          return (name != null);

     }

     boolean hasParent()

     {

          return (parent != null);

     }

     boolean hasChild()

     {

          return (child != null);

     }

     void setChild(Node n)

     {

          child = n;

     }

     void setParent(Node n)

     {

          parent = n;

     }

     Node getRootNode()

     {

          if(parent == null)

          {

               return this;

     }

     return parent.getRootNode();

     }

}

 

Then you need define a data structure, for example a map (outside the processRow method):

Map<String, Node> nodes = new HashMap<String, Node>();

 

Then, read all incoming rows and populate the map:

 

String childName = *get child name field here*;

String parentName = *get child name field here*;

Node childNode = new Node(childName);

Node parentNode = new Node(parentName);

if(nodes.containsKey(parentName)) // if parent already exists, use the existing node

{

     parentNode = nodes.get(parentName);

}

if(parentNode.hasName()) //if parent is valid

{

     parentNode.setChild(childNode);

     childNode.setParent(parentNode);

     nodes.put(parentName, parentNode);

}

nodes.put(childName, childNode);

 

Now you're ready to process the results! For example by making a unique collection of rootNodes:

 

Set<String> rootNodes = new HashSet<String>();

for(String nodeName : nodes.keySet())

{

     Node node = nodes.get(nodeName);

     Node rootNode = node.getRootNode();

     rootNodes.add(rootNode.name);

}

 

for(String rootNodeName : rootNodes)

{

     Node rootNode = nodes.get(rootNodeName);

     String result = rootNode.name;

if(rootNode.hasChild())

{

     Node child = rootNode.child;

     result = result + " -> " + child.name;

while(child.hasChild())

{

     child = child.child;

     result = result + " -> " + child.name;

}

}

     System.out.println(result);

}

 

Result is:

A -> C -> D

B -> E

 

In order to incorporate this into kettle, you need to downgrade the java syntax to whatever version supported in your version (java v1.4 in my kettle). Then you need to integrate this code with the kettle row/field methods, to get the input and set the output.

Gert Wieland's profile image
Gert Wieland

Wow - I'm speechless Sparkles, thank you very much!

That's certainly another league, but I'm very interested in testing it out (and understanding how it actually works), most likely I'll do that over the weekend.

Truly, thank you very much, I really appreciate that.

(Once I got it to work, I'll share my findings here.)

Gert

Matthew Casper's profile image
Matthew Casper

Hi Gert,

The blog post below by Bryan Senseman may be worth looking into if you want to keep the logic in PDI - the Closure Generator step is key to this solution.

Parent-Child Hierarchies with Abnormal Genealogy

Thanks,

Matt

Ana Gonzalez's profile image
Ana Gonzalez

Not being a java developer myself, Sparkles solution is out of my range (I knew someone around here would come up with a Java solution, I was expecting a solution involving regular expressions too...), but Bryan Senseman's blog post seems like something I could find useful in the future, I'm going to mark it to build a test experiment.

I too prefer keeping everything in PDI logic when posible, but sometimes, if it's just easier to do things in other tools you have at hand, it's not worth it.

Regards

Sparkles Sparkles's profile image
Sparkles Sparkles

Since you liked it so much, I made a version for you that will work "out of the box". Downgrading java syntax can be quite painful, so I made it ready for you.

Steps:

1) Create a new "User Defined Java Class" step

2) Make sure the previous step produce rows with fields "child" (String) and "parent" (String)

3) Click checkbox "Clear the result fields?" in lower right corner

4) Add an output Fieldname in the "Fields" tab below, Fieldname = hierarchy and Type = String)

5) Copy code below into main "Class code" text box

import java.util.HashMap;

import java.util.HashSet;

import java.util.Iterator;

import java.util.Map;

import java.util.Set;

 

Map nodes = new HashMap();

Set rootNodeNames = new HashSet();

Iterator rootNodeNamesIterator = null;

 

// Instructions:

// 1) Click checkbox "Clear the result fields?" in lower right corner

// 2) Add an output Fieldname in the "Fields" tab below:

// Fieldname = hierarchy

// Type = String

 

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException

{

     Object[] r = null;

     if (first)

     {

          first = false;

 

          while (true) //first read all incoming rows

          {

               r = getRow();

               if (r == null)

               {

                    break;

               }

               String childName = get(Fields.In, "child").getString(r);

               String parentName = get(Fields.In, "parent").getString(r);

               storeRow(childName, parentName);

 

               for (Iterator it = nodes.entrySet().iterator(); it.hasNext();) //make a set of all root nodes

               {

                    Map.Entry pairs = (Map.Entry) it.next();

                    Node node = (Node)pairs.getValue();

                    rootNodeNames.add(node.getRootNode().name);

               }

          }

          rootNodeNamesIterator = rootNodeNames.iterator(); //store iterator for processing output rows

     }

     return setOutputRow();

}

 

private boolean setOutputRow() throws KettleException

{

     if(rootNodeNamesIterator.hasNext())

     {

          String rootNodeName = (String)rootNodeNamesIterator.next();

          Node rootNode = (Node)nodes.get(rootNodeName);

          String result = rootNode.name;

          Node child = rootNode.child;

          while(child != null)

          {

               result = result + "," + child.name;

               child = child.child;

          }

          Object[] r = createOutputRow(null, data.outputRowMeta.size());

          get(Fields.Out, "hierarchy").setValue(r, result);

          putRow(data.outputRowMeta, r);

          return true;

     }

     setOutputDone();

     return false;

}

 

private void storeRow(String childName, String parentName)

{

     Node childNode = new Node(childName);

     Node parentNode = new Node(parentName);

 

     if(nodes.containsKey(parentName)) // if parent already exists, use the existing node

     {

          parentNode = (Node)nodes.get(parentName);

     }

     if(parentNode.hasName()) //if parent is valid

     {

          parentNode.setChild(childNode);

          childNode.setParent(parentNode);

          nodes.put(parentName, parentNode);

     }

     nodes.put(childName, childNode);

}

class Node {

     String name = null;

     Node parent = null;

     Node child = null;

 

     public Node()

     {

           

     }

 

     public Node(String name)

     {

          this.name = name;

     }

     boolean hasName()

     {

     return (name != null);

     }

     boolean hasParent()

     {

          return (parent != null);

     }

     boolean hasChild()

     {

          return (child != null);

     }

     void setChild(Node n)

     {

          child = n;

     }

     void setParent(Node n)

     {

          parent = n;

     }

     Node getRootNode()

     {

          if(parent == null)

          {

               return this;

          }

          return parent.getRootNode();

     }

}