Pentaho


 Access parquet files via SQL

Thomas Danner posted 07-01-2021 04:42

Hi,

We use PDI 9.1 and want to migrate from Oracle to Parquet files.

For writing data I use the Parquet Output step with Cloudera CDH 6.1.

To read the data, however, I need to join several Parquet files. Do I need to read them with the Parquet Input step and join them in Spoon, or is there a way to access several Parquet files via SQL?


#Kettle
#Pentaho
#PentahoDataIntegrationPDI
Brandon Jackson

PDI itself does not have the kind of SQL capability you are asking about. If you read the files in, you can join the streams, but that could mean holding a lot of data in memory. You are looking for a different tool for this task. For example, you might look at Dremio or Google BigQuery. Both can connect to Parquet files and expose a SQL layer over them, including the kind of cross-file joins you describe.
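
To illustrate the memory concern with joining streams, here is a rough plain-Python sketch of a hash join. It is not PDI code and does not read Parquet. The `hash_join` function, the sample rows, and the column names are all hypothetical stand-ins for rows coming out of two Parquet Input steps. The point is only that one side of the join has to be buffered in memory before the other side can be streamed past it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def hash_join(small: Iterable[Row], large: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner-join two row streams on `key`, buffering only the `small` side."""
    # Build phase: the whole smaller stream is held in memory, indexed by join key.
    index: Dict[object, List[Row]] = defaultdict(list)
    for row in small:
        index[row[key]].append(row)

    # Probe phase: the larger stream is consumed one row at a time.
    for row in large:
        for match in index.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample rows standing in for the contents of two Parquet files.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 3, "amount": 5.0},
]

joined = list(hash_join(customers, orders, "customer_id"))
for row in joined:
    print(row)
```

Memory use here grows with the size of the buffered side, so if both inputs are large this approach gets expensive. That is the case where a SQL engine that works directly on the Parquet files is the better fit.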