Pentaho


How to handle huge (50 lakhs) records in Pentaho

Premkumar Manipillai posted 08-29-2018 10:50

Dear all,

I need to fetch a huge number of records (50 lakhs+, i.e. more than 5 million) from an Oracle DB and insert them into an MS SQL Server DB.

Is there any option for this in the Pentaho 8.1 ETL community version, or any other solution? Please guide me.

i connected "Oracle db" with "ms sql server db" and fetching & inserting up to 1 lakhs record successfully using bellow oracle query.


SELECT * FROM MASTER_TABLE M WHERE ROWNUM<=100000

If I use the above query without the WHERE clause, will it just hang?
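
(For reference: if the extract ever does need to be split into ROWNUM-based batches on the Oracle side, a direct filter such as WHERE ROWNUM > 100000 returns no rows, because ROWNUM is assigned only as rows are accepted by the query. A subquery along the following lines is one way to page through batches; the table name and the bounds are only illustrative.)

SELECT *
FROM (
    SELECT M.*, ROWNUM AS rn
    FROM MASTER_TABLE M
    WHERE ROWNUM <= 200000   -- upper bound of this batch
)
WHERE rn > 100000;           -- lower bound of this batch
-- Note: without an ORDER BY in the inner query, batch contents are not
-- guaranteed to be stable across runs.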

Could anyone kindly reply to this?

Thanks


#Pentaho
Dan Keeley

Don't use the DB procedure. Use the bulk loader or Table Output instead; the DB procedure will be EXTREMELY slow.

It won't hang without a WHERE clause. There's no limit to the number of rows PDI can process.

Premkumar Manipillai

Dear Dan Keeley,

Thanks for your valuable information, but I am new to the Pentaho tool and do not know how to use the bulk loader.

I tried it and am getting the error below.

[screenshot: bulk_loader_error2]

My requirement is:

1. Read a huge number of records from multiple tables (Oracle).

2. Insert them into MS SQL Server using bulk loading.

Can you please guide me in detail?

Thanks

John Craig

Some issues here:

1)     The PostgreSQL Bulk loader step only works with a PostgreSQL database

That step cannot connect to a SQL Server database.

2)     The Table Output step is probably your best option

Set the "Commit size" value to a fairly large number (so that PDI doesn't do COMMIT TRANSACTION commands often)

3)     As far as I know, there is no MS SQL Server bulk loader step available in the standard product (7.1 or 8.1 are the versions I've used recently), nor on the Tools > Marketplace. If someone knows of one, I'd like to learn about it too.

4)     If this is a one-time load, then PDI may not be the best way to go. 

There is a command line bulk loader for MS SQL Server (called bcp) and if you can get a file from Oracle in the needed format (or write it out via PDI if not), then bcp will load your data much faster. If you do not need indexes while the rows are loading, it is *much* faster.

I'm not familiar with what magnitude 50 lakhs + rows would be. If you use PDI, I suggest you set it up to allow use of more memory (edit the script spoon.bat if running under Windows, or spoon.sh if not on Windows, to give a larger value for the Java -Xmx maximum-heap parameter).

Hope that helps!

John

Ricardo Miguel Díaz Razo

Hi Premkumar,

Why are you using a DB procedure? Do you maybe need to apply some transformation to the data before inserting it into the target database?

There are different ways to do this, but I need to know exactly what you want to do (with screenshots, if possible) in order to help you.

Matthew Casper

1.  Create a job with a transform entry followed by the "BulkLoad into MSSQL" job entry.

2.  For the transformation entry, point it to your existing transformation pulling from Oracle, but use the Text File Output step to write to a delimited file on a network location accessible by the SQL Server instance.

3.  Configure the "BulkLoad into MSSQL" job entry in the job accordingly.

4.  Another option is, instead of using the "BulkLoad into MSSQL" job entry, to use the SQL script job entry and write the full command to bulk load the delimited file:

BULK INSERT ${table_name}
FROM '${Directory}/${table_name}.csv'
WITH (FIELDTERMINATOR='|', CODEPAGE = 'RAW', TABLOCK);
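
As a concrete illustration of that statement with the PDI variables resolved (the table name, file path, and extra options below are only examples; FIRSTROW and BATCHSIZE are optional tuning settings):

BULK INSERT dbo.MASTER_TABLE
FROM '\\fileserver\etl_export\MASTER_TABLE.csv'  -- hypothetical UNC path written by Text File Output
WITH (
    FIELDTERMINATOR = '|',   -- pipe-delimited file
    CODEPAGE = 'RAW',        -- load bytes as-is, no code-page conversion
    TABLOCK,                 -- table lock, which helps bulk-load performance
    FIRSTROW = 2,            -- skip a header row, if one was written
    BATCHSIZE = 100000       -- commit in chunks instead of one huge transaction
);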


Thanks,

Matt

Premkumar Manipillai

Thank you, Mr. John Craig!

Premkumar Manipillai

Thank you, Mr. Matthew Casper!

Premkumar Manipillai

Dear Mr. Ricardo Díaz,

Thank you very much; I am very happy to have your support :-)

My exact requirement is:

1. Refresh the entire set of DB tables one time from Oracle to SQL Server. Most of the tables hold crores (tens of millions) of records.

2. After that, refresh the SQL DB table records based on the record modified date from Oracle: once the entire table has been refreshed one time from Oracle to SQL Server, we need to check whether any records have been updated in the Oracle DB. If an Oracle record has been updated, we need to delete that record and insert the new version into the SQL DB table (a rough SQL sketch of this step follows below).

How can I do this without using a DB procedure?
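
One possible shape of that delete-and-insert step on the SQL Server side, assuming PDI first lands the changed Oracle rows (filtered by modified date) into a staging table; all table, key, and column names here are hypothetical:

-- Remove the old versions of records that have changed in Oracle
DELETE T
FROM dbo.MASTER_TABLE AS T
INNER JOIN dbo.MASTER_TABLE_STAGE AS S
        ON S.record_id = T.record_id;

-- Re-insert the current versions from the staging table
INSERT INTO dbo.MASTER_TABLE
SELECT *
FROM dbo.MASTER_TABLE_STAGE;

The staging table itself can be loaded with the bulk-load approaches discussed in the other replies, with the Oracle extract filtered by something like WHERE MODIFIED_DATE > (the last refresh time).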

My workflow diagram is like the below:

[screenshot: workflow diagram]

Thanks

Data Conversion

Hi Premkumar,

I would recommend checking "Standards for Lookups, Joins, and Subroutines in PDI" in

"Pentaho Customer Support Portal -> Best Practices"

     Pentaho Data Integration – Pentaho Customer Support Portal

I hope that helps!

Kind regards,

Marin