This post was published by Chris Deptula on Wednesday, February 11, 2015.
This is the first in a three-part blog series on working with small files in Hadoop. Hadoop does not work well with lots of small files and instead prefers fewer, larger files. This is probably a statement you have heard before. But why does Hadoop have a problem with large numbers of small files? And what exactly does "small" mean? In the first part of this series I will answer these questions. The subsequent parts will discuss options for solving or working around the small file problem.
What is a small file?
A small file can be defined as any file that is significantly smaller than the Hadoop block size. The Hadoop block size is usually set to 64, 128, or 256 MB, trending toward increasingly larger block sizes. Throughout the rest of this blog, examples will use a 128 MB block size. I use the rule of thumb that if a file is not at least 75% of the block size, it is a small file. However, the small file problem does not just affect literally small files. If a large number of files in your Hadoop cluster are marginally larger than an increment of your block size, you will encounter the same challenges as with small files. For example, if your block size is 128 MB but all of the files you load into Hadoop are 136 MB, you will have a significant number of small 8 MB blocks. The good news is that solving the small block problem is as simple as choosing an appropriate (larger) block size. Solving the small file problem is significantly more complex. Notice I never mentioned number of rows. Although number of rows can impact MapReduce performance, it is much less important than file size when determining how to write files to HDFS.
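The block arithmetic above is easy to sketch. The following is a rough illustration (the function names and the 75% check are my own, not Hadoop APIs) of how a file is divided into full blocks plus one partial tail block:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size used in the examples above
MB = 1024 * 1024

def block_sizes(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the size of each HDFS block a file of file_size bytes occupies.

    A file is split into full blocks plus, if needed, one final partial block.
    The partial block only consumes its actual size on disk, but it still
    costs a full block object on the NameNode and its own map task.
    """
    full_blocks, tail = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([tail] if tail else [])

# A 136 MB file with a 128 MB block size leaves an 8 MB tail block.
print([b // MB for b in block_sizes(136 * MB)])  # → [128, 8]

# Rule of thumb from the text: a block under 75% of the block size is "small".
small_blocks = [b for b in block_sizes(136 * MB) if b < 0.75 * BLOCK_SIZE]
print(len(small_blocks))  # → 1
```

Loading a thousand such 136 MB files would leave a thousand 8 MB blocks, which is why raising the block size alone fixes the small block problem but not the small file problem.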
Why do small files occur?
The small file problem is an issue Pentaho Consulting frequently sees on Hadoop projects. There are a variety of reasons why companies may have small files in Hadoop, including:
- Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
- The source system generates thousands of small files which are copied directly into Hadoop without modification.
- MapReduce jobs are configured with more reducers than necessary, each outputting its own file. Along the same lines, if a skew in the data causes the majority of the data to go to one reducer, the remaining reducers will process very little data and produce small output files.
Why does Hadoop have a small file problem?
There are two primary reasons Hadoop has a small file problem: NameNode memory management and MapReduce performance.

The NameNode memory problem

Every directory, file, and block in Hadoop is represented as an object in memory on the NameNode. As a rule of thumb, each object requires 150 bytes of memory. A single-block file costs two objects, one for the file and one for its block, so if you have 20 million files each requiring one block, your NameNode needs roughly 40 million objects, or 6 GB of memory. This is obviously quite doable, but as you scale up you eventually reach a practical limit on how many files (blocks) your NameNode can handle. A billion files will require 300 GB of memory, and that is assuming every file is in the same folder! Let's consider the impact of a 300 GB NameNode memory requirement...
- When a NameNode restarts, it must read the metadata of every file from a cache on local disk. This means reading 300GB of data from disk -- likely causing quite a delay in startup time.
- In normal operation, the NameNode must constantly track and check where every block of data is stored in the cluster. This is done by listening for data nodes to report on all of their blocks of data. The more blocks a data node must report, the more network bandwidth it will consume. Even with high-speed interconnects between the nodes, simple block reporting at this scale could become disruptive.
The optimization is clear. If you can reduce the number of small files on your cluster, you can reduce the NameNode memory footprint, startup time, and network impact.
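The memory estimates above follow directly from the 150-bytes-per-object rule of thumb. A minimal sketch of that arithmetic (the function and its parameters are illustrative, not a Hadoop API; real heap usage varies by version and configuration):

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb NameNode memory cost per directory, file, or block

def namenode_memory_bytes(files: int, blocks_per_file: int = 1, dirs: int = 1) -> int:
    """Estimate NameNode heap: one ~150-byte object per directory, file, and block."""
    objects = dirs + files + files * blocks_per_file
    return objects * BYTES_PER_OBJECT

# 20 million single-block files: ~40 million objects.
print(round(namenode_memory_bytes(20_000_000) / 1e9, 1))   # → 6.0 (GB)

# A billion single-block files in one folder: ~2 billion objects.
print(round(namenode_memory_bytes(1_000_000_000) / 1e9))   # → 300 (GB)
```

The same billion rows packed into 128 MB files would need only a few million objects, which is the whole argument for consolidation.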
The MapReduce performance problem
Having a large number of small files will degrade the performance of MapReduce processing whether it be Hive, Pig, Cascading, Pentaho MapReduce, or Java MapReduce. The first reason is that a large number of small files means a large amount of random disk IO. Disk IO is often one of the biggest limiting factors in MapReduce performance. One large sequential read will always outperform reading the same amount of data via several random reads. If you can store your data in fewer, larger blocks, the performance impact of disk IO is mitigated.
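The seek-versus-stream trade-off can be made concrete with a back-of-envelope model. The latency and throughput numbers below are assumed, typical spinning-disk figures, not measurements; the point is only that per-read seek overhead grows with file count while transfer time stays fixed:

```python
SEEK_MS = 10            # assumed average seek + rotational latency per random read
THROUGHPUT_MBPS = 100   # assumed sustained sequential transfer rate, MB/s

def read_time_ms(total_mb: int, num_reads: int) -> float:
    """Model total read time as one seek per read plus streaming transfer."""
    seek_time = num_reads * SEEK_MS
    transfer_time = total_mb * 1000 / THROUGHPUT_MBPS
    return seek_time + transfer_time

# Reading 1 GB as a single sequential file vs. a hundred scattered 10 MB files:
print(read_time_ms(1024, 1))    # → 10250.0 (ms)
print(read_time_ms(1024, 100))  # → 11240.0 (ms)
```

Even in this simple model the seek overhead is pure waste, and in practice small files also tend to be scattered across the cluster and interleaved with other jobs' IO, making the real penalty larger.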
The second reason for performance degradation is a bit more complicated, requiring an understanding of how MapReduce processes files and schedules resources. I will use MapReduce version 1 terminology in this explanation as it is easier to explain than YARN's, but the same concepts apply to YARN. When a MapReduce job launches, it schedules one map task per block of data being processed. Each file stored in Hadoop occupies at least one block. If you have 10,000 files each containing 10 MB of data, a MapReduce job will schedule 10,000 map tasks. Usually Hadoop is configured so that each map task runs in its own JVM. Continuing our example, you will have the overhead of spinning up and tearing down 10,000 JVMs!
Your Hadoop cluster only has so many resources. In MapReduce v1, to avoid overloading your nodes, you specify the maximum number of concurrent mappers a node can run. Often that maximum is in the 5 to 20 range. Therefore, to run 10,000 mappers concurrently you would need 500 to 2,000 nodes. Most Hadoop clusters are much smaller than this, causing the JobTracker to queue map tasks as they wait for open slots. If you have a 20-node cluster with a total of 100 slots, your queue will become quite large and your process will take a long time. And don't forget, your job is likely not the only job competing for cluster resources.
If, instead of 10,000 10 MB files, you had 800 128 MB files, you would only need 800 map tasks. This would require an order of magnitude less JVM maintenance time and would result in better disk IO. Even though an individual map task processing 128 MB will take longer than one processing 10 MB, the total processing time will almost always be dramatically shorter when processing the 800 larger files.
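The scheduling arithmetic behind this comparison can be sketched in a few lines (a rough model using the 100-slot cluster from above; it ignores JVM reuse and data locality):

```python
import math

BLOCK_SIZE_MB = 128

def map_tasks(num_files: int, file_size_mb: int) -> int:
    """One map task per HDFS block; every file occupies at least one block."""
    blocks_per_file = max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))
    return num_files * blocks_per_file

def waves(tasks: int, cluster_slots: int = 100) -> int:
    """Number of sequential 'waves' of mappers a cluster with this many slots needs."""
    return math.ceil(tasks / cluster_slots)

# Roughly the same ~100 GB of data, two layouts:
small_layout = map_tasks(10_000, 10)   # 10,000 files x 10 MB
large_layout = map_tasks(800, 128)     # 800 files x 128 MB
print(small_layout, waves(small_layout))  # → 10000 100
print(large_layout, waves(large_layout))  # → 800 8
```

One hundred waves of mostly idle, short-lived JVMs versus eight waves of fully loaded ones is the MapReduce performance problem in miniature.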
What can you do if you have small files?
Now that we have discussed what constitutes a small file and why Hadoop prefers larger files, how do you avoid the small file problem? In my next post, I will discuss solutions to the NameNode memory problem as well as some initial options for solving the MapReduce performance problem. In my third and final blog in this series, I will discuss additional solutions for the performance problem and how to choose the best solution for your situation.