Access parquet files via SQL

Question

Hi,

We use PDI 9.1 and want to migrate from Oracle to Parquet files.

For writing data I use the Parquet Output step with Cloudera CDH 6.1.

To read the data, however, I need to join several Parquet files. Do I need to read them with the Parquet Input step and join them in Spoon, or is there a way to access several Parquet files via SQL?

#Kettle
#Pentaho
#PentahoDataIntegrationPDI

Answer

PDI itself does not have the kind of SQL capability you are asking about. You can read the files in with Parquet Input and join the streams, but that could mean holding a lot of data in memory. For this task you are really looking for a different tool. For example, you might look at Dremio or Google BigQuery. Both can connect to Parquet files and expose a SQL layer over them, including the cross-file joins you describe.
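
As a rough illustration of the SQL-layer approach, here is a minimal Python sketch that only builds the SQL text you might run in BigQuery: one external table per set of Parquet files, then an ordinary join. It assumes the files have been copied to Google Cloud Storage, and every dataset, table, bucket, and column name below is hypothetical. It does not connect to anything, and Dremio would need its own setup, which is not shown.

```python
def external_table_ddl(table: str, uri: str) -> str:
    """Build BigQuery DDL exposing the Parquet files at `uri` as an external table."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (format = 'PARQUET', uris = ['{uri}']);"
    )


def join_query(left: str, right: str, key: str) -> str:
    """Build a plain inner join of two tables on a shared key column."""
    return (
        f"SELECT *\n"
        f"FROM `{left}` AS l\n"
        f"JOIN `{right}` AS r USING ({key});"
    )


# Hypothetical names: replace with your own dataset, bucket, and join key.
statements = [
    external_table_ddl("mydataset.orders", "gs://my-bucket/orders/*.parquet"),
    external_table_ddl("mydataset.customers", "gs://my-bucket/customers/*.parquet"),
    join_query("mydataset.orders", "mydataset.customers", "customer_id"),
]

print("\n\n".join(statements))
```

The printed statements could then be run from any BigQuery SQL client. The point is that the join happens inside the SQL engine rather than in PDI's memory.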
