This post originally published by Chris Deptula on Wednesday, November 19, 2014.
Many of you requested more information on the inner workings of the Sqoop component. Perhaps the best way to explain is via "lessons learned". Here goes...
Use the split-by Option
Sqoop is primarily used to extract and import data from databases into HDFS and vice versa. This was exactly what HDI needed to do. So, I pulled up the Sqoop user manual for a quick refresher. I quickly found the import-all-tables Sqoop option. It seemed like the perfect solution. One command and I can extract every table from the datamart to HDFS. Great! So I tried it...
Almost immediately, Sqoop failed. It turns out that Sqoop, by default, extracts data from tables in chunks, and the chunking is done using each table's primary key. If a table has a single column primary key the extract works. If no primary key exists, it fails. In our experience, most dimension tables follow this design pattern -- using a single-column surrogate key. However, many, if not most, fact tables do not have a primary key, much less one stored in a single column. This was the case in our test database. When Sqoop tried to extract a fact table, the job failed. Time to start over.
It turns out Sqoop has a split-by option specifically designed to override the primary key default. Unfortunately, the split-by option cannot be used with import-all-tables since the split-by column will vary by table. This meant that HDI would have to generate table-specific import statements, each with their own split-by column. Simple enough. I could use Pentaho Data Integration to interrogate the database metadata and generate each table's specific Sqoop import command. Alas, once I began designing, I soon discovered complexity.
When using the split-by option, you should choose a column which contains values that are uniformly distributed. If there is a skew in the column data, then the chunks being processed will also be skewed causing overweighted import job tasks to potentially fail due to lack of compute resource (usually memory). Thus, picking the spit-by column required knowledge of table content that an automated process may not have. We'd have to make educated guesses.
Like the import-all-tables option, the easiest rule to apply is to use a single column primary key if it exists. We'd have to assume that the column is a surrogate key that was populated with a sequence of values. If no primary key column exists, we look for a date column. Our educated guess is that dates represent events and events tend to be more uniformly distributed than not. If no date column exists, we look for an integer column. If no integer, then we simply grab the first column we can find. Clearly this system is "version 1", with much potential for optimization. We have some ideas and plan on iterating the design in the coming months.
With the acknowledged fuzziness of our split-by algorithm, we knew that we'd have to enable HDI configuration so that it only generated code. You can use HDI to interrogate your source database metadata, generate the Sqoop import commands and, then, optionally execute those commands. This gave us more control to manually customize and execute the generated import commands.
Next, with my generated Sqoop import commands, I tried to run them.
Manage Database Fetch Size
The imports were working. I had loaded several dimensions and the first fact table. However, while importing my second, larger fact table, the MapReduce job failed with a GC Overhead limit exceeded error message. After a lot of Google searching and head scratching I determined that the source MySQL database was reading all the rows in one fetch and trying to store everything into memory. I had to tell MySQL to return the data in batches. The following parameters on my jdbc connection string did the trick.
My issue and solution were specific to MySQL, but I suspect that the same problem could occur for other RDBMS and jdbc drivers. You've been warned.
Convert Dates to Strings
With the fetch problem solved, my Sqoop imports finally completed. My next task was to define Hive tables on the newly loaded data and attempt some queries. All was fine until I selected some date fields. They were returned integers instead of dates...ugh! Being an experienced programmer, I quickly realized that the integers were "seconds since epoch." -- which most will agree is not a user friendly way to present dates! I dug through the Sqoop code and recognized that my date values were being treated as timestamps. To prevent Sqoop from converting dates to seconds since epoch, I had to manually map the datatype for each date field. I used Sqoop's map-column-java option to map date columns to the java String class. The generated Sqoop command had to include an option like this:
Once I made the change, the dates were output and queryable in yyyy-MM-dd HH:mm:ss.S format instead of seconds since epoch.
It took quite a bit of trial and error, but I was finally able to develop a Hadoop Datamart Importer. The core Sqoop import command generation module works like a champ! If you would like to learn more about how Pentaho Consulting's Data Engineering services can help you with your Hadoop data ingestion needs, give us a call.