|
|
-
Hadoop Filecrusher! V2 Released!Edward Capriolo 2011-06-01, 19:31
All,
You know the story: You have data files that are created every 5 minutes. You have hundreds of servers. You want to put those files in hadoop. Eventually: You get lots of files and blocks. Your namenode and secondary name node need more memory (BTW JVM's have issues at large Xmx values). Your map reduce jobs start launching too many tasks. A solution: Hadoop File Crusher http://www.jointhegrid.com/hadoop_filecrush/index.jsp How does it work? Hadoop filecrusher uses map reduce to combine multiple smaller files into a single larger one. What was the deal with v1? V1 was great. It happily crushed files, although some datasets presented some challenges. For example, the case where one partition of a hive table was very large and others were smaller. V1 would allocate a reducer per folder and this job would run as long as the biggest folder. Also V1 ALWAYS created one file per directory, which is not optimal if a directory already had maybe 2 largish files and crushing was not necessary. How does v2 deal with this better? V2 is more intelligent in it's job planning. It has tunable parameters which define which files are too small to crush such as. --threshold Percent threshold relative to the dfs block size over which a file becomes eligible for crushing. Must be in the (0, 1]. Default is 0.75, which means files smaller than or equal to 75% of a dfs block will be eligible for crushing. File greater than 75% of a dfs block will be left untouched. --max-file-blocks The maximum number of dfs blocks per output file. Must be a positive integer. Small input files are associated with an output file under the assumption that input and output compression codecs have similar efficiency. Also, a directory containing a lot of data in many small files will be converted into a directory containing a fewer number of large files rather than one super-massive file. With the default value 8, 80 small files, each being 1/10th of a dfs block will be grouped into to a single output file since 8 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into fewer number of larger files where each output file is roughly the same size. Why is file crushing optimal? You can not always control how many files are generated by upstream processes Namenode file and block constraints Jobs have less overhead with less files and run MUCH faster. Usage documentation is found here: http://www.jointhegrid.com/svn/filecrush/trunk/src/main/resources/help.txt Enjoy! |