Create a field based on multiple rows (with same Identifer) in UDJC (version PDI 8.0)

Question

Hello,

What I am doing is taking a comma delimited text field as input, and using PDI 8.0 to output a new text file with changes and new fields.

I have used a UDJC to generate a new field based on calculations from other fields, but I've only learned how to look at a single row at a time.

Here is example of rows of data for a single identifier (imagine csv with multiple rows for every ID, and this is one of the IDs):

EmployeeID, FTE, Job ID, Rownum

99,0.25, 3, 101

99,0.5, 3, 102

99,0.25, 7, 103

I'd like to collect data on all the rows, and do a custom calculation, resulting in a "Keep Row" field for one of the rows for each Id having a Y, with the other being an N. We plan a very complex way of removing the duplicates that involves many fields and it seems creating such a field to use "Filter" with later may be the way to do this. Does there exist examples of a similar PDI job I can view (with UDJC that looks at more than 1 row at a time), or can anyone point me in the right direction?

#Pentaho
#Kettle
#PentahoDataIntegrationPDI

Answer

The Row Normalizer step might help get the several rows for an employee onto one row for processing, but I assume the header you show is not repeated for each employee and there may be no way to grab the different rows into fields because the number of rows for a given employee may not always be 3, right?

Inside the UDJC, you can call the getRow() method multiple times and only use the putRow method when you've processed all the rows for a given employee (by checking for a change in the employee ID), but backing up when you've gone too far is tricky.

I suppose you could have a global Object[] that could represent the saved row from the prior processRow invocation.

Perhaps something like this would work. I haven't tried it and I may not have the right name for the get method for retrieving a value from an input row. You could also find the source code for the Row Normalizer step and adapt its pattern, I would think. I'll be interested to know what you find that works.

private Object[] inputRow;

private Object[] savedInputRow = null;

private Object[] outputRow;

private Integer employeeID;

private Integer priorEmpID = 0;

processRow( StepMetaInterface smi, StepDataInterface sdi ) throws KettleException

{

if ( savedInputRow == null ) { // no saved row from prior invocation

// check for end of data stream

if (inputRow == null) {

setOutputDone();

return false;

}

} else { // start with saved row's value

inputRow = savedInputRow; // 1st line of data for employee from last time through processRow

employeeID = getInputRowMeta().getInteger(inputRow, "EmployeeID", null ); // may not be the right get method

priorEmpID = (Integer) employeeID.clone();

}

//

// do stuff with first row; set some values in outputRow

//

while ( employeeID.equals( priorEmpID ) ) {

//

// do stuff with inputRow data; set outputRow values

//

inputRow = getRow();

// check for end of data stream

if ( inputRow == null ) {

// output the last employee's data

putRow(data.outputRowMeta, outputRow)

setOutputDone();

return false;

}

employeeID = getInputRowMeta().getInteger(inputRow, "EmployeeID", null ); // may not be the right get method

} // end of while

// just read first line of info for next employee; save it for next time through processRow

System.arrayCopy( inputRow, 0, savedInputRow, 0, inputRow.length );

putRow(data.outputRowMeta, outputRow);

return true;

} // end of processRow

Pentaho

Create a field based on multiple rows (with same Identifer) in UDJC (version PDI 8.0)

Related Content

PRD-CE 9.3 Identifies TEXT fields as CLOB

multi_part_upload_demo_w_sample_xml

Pentaho - Call Stored procedure with 1 input parameter and Refcursor output (multiple output fields) with multiple rows

RE: Pentaho - Call Stored procedure with 1 input parameter and Refcursor output (multiple output fields) with multiple rows

Overlapping row banding