A small data analytics company specializing in retail analytics collected sales transaction data from one of the world’s largest retailers. The data was collected every hour from each of the 10 stores in one geographical region and stored in relational data marts. Every night, the previous day’s data from all the data marts was transformed and pushed to a reporting data warehouse.
The team that handled the daily data ingestion used Pentaho to implement and manage the ETL processes. This approach worked so well that the analytics company expanded the contract to include approximately 150 of the retailer’s stores across the United States, which presented an entirely new challenge: handling data at scale.
To handle the expected increase in data volume, the company decided to use Snowflake, the cloud data platform, for its ability to scale on demand and handle large volumes of data quickly and reliably. Fortunately, Pentaho has built-in components that work with Snowflake, so the team could keep its existing ETL tooling.
This case study details the steps that were taken to bulk load data into Snowflake using Pentaho tools.
The Business Challenge
The ETL process was designed to take data from the data marts and store it in a relational data warehouse, which is used by the reporting and data analytics teams. The biggest challenge was to come up with an architecture capable of handling the anticipated 100x increase in the volume of data processed daily.
Before the scale-up, ETL jobs were pulling from more than 10 data sources, averaging 300,000 rows of data, and it took approximately 60 minutes to process that volume. If the data volume were to grow as predicted, the process would take more than 10 hours to complete, which was unacceptable. After all, there are contractual SLAs that must be met.
To accommodate the increase in data volume, Snowflake’s data warehouse technology was chosen for its scalability, performance, and reliability. For ETL, Pentaho Data Integration (PDI) was selected for a variety of reasons, including familiarity and because it has integration connectors which work seamlessly with Snowflake.
The solution required a redesign of the data processing jobs to enable them to scale. To do this, the following processes were put in place:
- Create compressed flat files and store them on a local file system, instead of writing directly to database tables
- Bulk load the flat files into Snowflake’s managed object store (an internal stage)
- Extend data retention from the current 1 month to 1 year
Step 1: Input data is processed by PDI jobs and transformations and stored in compressed files.
Step 2: The compressed files are processed by PDI and put into Snowflake stage area.
Step 3: The data is bulk loaded by PDI into the Snowflake tables using a Snowflake Virtual Warehouse (managed by Pentaho).
Create a Database in Snowflake
Log into the Snowflake web console
Create a Database
Create a Snowflake Internal Named Stage
NOTE: Internal named stages are recommended for regular data loads that may involve multiple tables and/or users.
Create a Stage
Create a schema in Snowflake
Create a Schema
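The three setup steps above can be expressed as Snowflake DDL run from a worksheet in the console. This is a minimal sketch; the database, schema, and stage names (SALES_DB, RETAIL, SALES_STAGE) are illustrative, not taken from the case study, and the file format matches the output files described later:

```sql
-- Names below are assumptions for illustration.
CREATE DATABASE IF NOT EXISTS SALES_DB;

CREATE SCHEMA IF NOT EXISTS SALES_DB.RETAIL;

-- Internal named stage, with a file format matching the PDI output
-- files: pipe-delimited, gzip-compressed text.
CREATE STAGE IF NOT EXISTS SALES_DB.RETAIL.SALES_STAGE
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP);
```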
Create a VFS connection to Snowflake stage in Pentaho
Create a Database connection in Pentaho
Pentaho Jobs and Transformations
Transformation for Input data processing
The input data is read from a table into the PDI transformation using the Table Input step, and then transformed and written to output text files.
The output is a Text File Output step that has the following data file format characteristics:
- Plain text, UTF-8 encoded
- Fields are string datatype
- Fields are delimited by ‘|’
- Split every 100,000 rows
- No header row
- Compressed using gzip compression
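The same file format can be approximated in plain Python, which may help when validating the output outside Pentaho. This is a sketch, not part of the Pentaho job; the `write_split_gzip` helper and file-naming scheme are assumptions for illustration:

```python
import csv
import gzip
from pathlib import Path

def write_split_gzip(rows, out_dir, prefix="sales", split_every=100_000):
    """Write rows as pipe-delimited, UTF-8, headerless, gzip-compressed
    text files, starting a new file every `split_every` rows (mirroring
    the Text File Output settings above). Returns the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files, writer, handle = [], None, None
    for i, row in enumerate(rows):
        if i % split_every == 0:  # start a new split file
            if handle:
                handle.close()
            path = out_dir / f"{prefix}_{i // split_every:04d}.txt.gz"
            handle = gzip.open(path, "wt", encoding="utf-8", newline="")
            writer = csv.writer(handle, delimiter="|")
            files.append(path)
        writer.writerow([str(v) for v in row])  # all fields as strings
    if handle:
        handle.close()
    return files
```

With `split_every=100_000`, a 300,000-row extract would land in three gzipped files, each safe to upload to the stage independently.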
Job to upload files to Snowflake stage
A Pentaho job is used to create and manage a Snowflake Virtual Warehouse that will be sized to handle the data load volume.
The job uses the ‘Create Snowflake Warehouse’ job entry to create a Virtual Warehouse in Snowflake to handle the load process:
- Auto Resume
- Continue if warehouse exists
The job uses the ‘Start Snowflake warehouse’ job entry to start the Virtual Warehouse.
It then calls the job to bulk load the data into Snowflake tables.
It uses the ‘Stop Snowflake warehouse’ job entry to suspend the Virtual Warehouse after the data has been loaded.
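The warehouse lifecycle managed by these job entries corresponds to SQL like the following. The warehouse name, size, and auto-suspend timeout are assumptions, not values from the case study:

```sql
-- 'Create Snowflake Warehouse': continue if it already exists, auto-resume on.
CREATE WAREHOUSE IF NOT EXISTS LOAD_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_RESUME = TRUE
  AUTO_SUSPEND = 300;

ALTER WAREHOUSE LOAD_WH RESUME;   -- 'Start Snowflake warehouse'

-- ... bulk load job runs here ...

ALTER WAREHOUSE LOAD_WH SUSPEND;  -- 'Stop Snowflake warehouse'
```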
Job to bulk load data to Snowflake table
A Pentaho job is used to create the table, stage the data files and bulk load the data to the table.
First, the job creates the target table in Snowflake, if it does not already exist.
Next, it uploads the files from the output directory on the Pentaho server’s local file system to the Snowflake stage.
Finally, it uses the ‘Bulk load into Snowflake’ job entry to load the files from the stage into the target table(s).
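Under the hood, this sequence maps onto Snowflake’s PUT and COPY INTO commands. A sketch follows; the table definition, object names, and local path are illustrative (the case study does not give them), and all columns are VARCHAR because the output files carry every field as a string:

```sql
-- Illustrative target table; real column names are not given in the case study.
CREATE TABLE IF NOT EXISTS SALES_DB.RETAIL.SALES (
  STORE_ID  VARCHAR,
  SOLD_AT   VARCHAR,
  AMOUNT    VARCHAR
);

-- Upload the gzipped output files to the internal named stage.
-- (PUT runs from a client such as SnowSQL or a connector, not the web UI.)
PUT file:///data/output/sales_*.txt.gz @SALES_DB.RETAIL.SALES_STAGE;

-- Bulk load from the stage into the table.
COPY INTO SALES_DB.RETAIL.SALES
  FROM @SALES_DB.RETAIL.SALES_STAGE
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP);
```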
Once the solution was implemented, the customer was able to:
- Cut down the data processing time from 60 minutes to 10 minutes
- Use Pentaho to leverage Snowflake’s on-demand scaling feature to process larger volumes of data
- Experience quick response time on reports and analytics queries
- Reduce the time it takes to onboard data from new stores that adopt the solution
- Extend the data retention window from one month to one year
- Make the data available in a timely manner to data analysts
Hitachi Vantara has developed PDI to help data engineers analyze data at scale, whether on premises or on a cloud data platform like Snowflake. You can learn more about analyzing data at scale at Hitachi Vantara’s free upcoming virtual conference, DataOps.NEXT, on May 14, 2020.
DataOps.NEXT is a one-day, online event that covers the entire data lifecycle, including onboarding and preparation; governance and agility; data fabric optimization; and analytics and machine learning. The event features presentations from 30 industry experts across four learning tracks. There are even opportunities for 1:1 mentoring sessions with experts after each session.
To register for the event, go to the DataOps.NEXT Conference page.