On Tue, Nov 30, 2010 at 3:21 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <[EMAIL PROTECTED]> wrote:
>> Hey there,
>> I am doing some tests and wondering what the best practices are for dealing
>> with very small files which are continuously being generated (1 MB or even
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>> I see that if I have hundreds of small files in HDFS, Hadoop will automatically
>> create A LOT of map tasks to consume them. Each map task will take 10
>> seconds or less... I don't know if it's possible to change the number of map
>> tasks from Java code using the new API (I know it can be done with the old
>> one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
>> This way, fewer map tasks would be instantiated and each would run for
>> longer.
> Perhaps you need to use MultiFileInputFormat:
> Harsh J
MultiFileInputFormat and CombineFileInputFormat help.
JVM reuse helps.
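
To illustrate why a combining input format cuts the number of map tasks, here
is a minimal sketch of the packing idea in plain Java. It is not Hadoop code:
the class SplitPacker and its pack method are hypothetical names, the 64 MB cap
is an assumed value, and data locality (which the real input format takes into
account) is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Hadoop: groups small files into combined splits.
class SplitPacker {

    /** Greedily packs file sizes (in bytes) into groups of at most maxSplitBytes each. */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A single file larger than the cap still gets a split of its own.
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            sizes.add(mb); // assumed workload: 300 files of 1 MB each
        }
        List<List<Long>> splits = pack(sizes, 64 * mb); // assumed 64 MB cap per split
        System.out.println(sizes.size() + " files -> " + splits.size() + " map tasks");
    }
}
```

With one map task per file this workload would need 300 tasks; packed under a
64 MB cap it needs 5, each doing correspondingly more work.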
The larger problem is that an average NameNode with 4 GB of RAM will start
to suffer JVM pauses at a relatively low number of files/blocks, say
10,000,000. Ten million is not a large number when you are generating
thousands of files a day.
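
As a rough back-of-the-envelope check of that figure, the sketch below uses the
commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace
object (the figure given in the article linked earlier in this thread). The
class name, the one-block-per-file assumption, and the omission of directory
objects are all simplifications for illustration, not measurements.

```java
import java.util.Locale;

// Hypothetical estimator; the 150-byte figure is a rule of thumb, not a measurement.
class NameNodeHeapEstimate {

    static final long BYTES_PER_OBJECT = 150L;

    /** Estimates heap used by file and block objects (directories are ignored). */
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // one object per file plus one per block
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Assumption: 10,000,000 small files, each fitting in a single block.
        long bytes = estimateBytes(10_000_000L, 1L);
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        System.out.println(String.format(Locale.ROOT, "%d bytes (about %.2f GiB)", bytes, gib));
    }
}
```

Under these assumptions ten million single-block files come to 3,000,000,000
bytes, which leaves little headroom in a 4 GB heap.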
We open sourced a tool to deal with this problem.
Essentially it takes a pass over a directory and combines multiple
files into one. On 'hourly' directories we run it after the hour is
V2 (which we should throw over the fence in a week or so) uses the
same techniques but will be optimized for dealing with very large
directories and/or subdirectories of varying sizes by doing more
intelligent planning and grouping of which files an individual mapper
or reducer is going to combine.
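
To give a feel for what "planning and grouping" could mean, here is one
possible approach sketched in plain Java: sort files largest first and hand
each one to the worker with the least data so far. This is only an assumed
illustration; CombinePlanner and plan are hypothetical names, and it is not a
description of how the tool or its V2 actually works.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical planner: balances files across workers, largest file first.
class CombinePlanner {

    /** Returns one group of file sizes per worker, balanced by total bytes. */
    static List<List<Long>> plan(List<Long> fileSizes, int workers) {
        List<Long> sorted = new ArrayList<>(fileSizes);
        sorted.sort(Collections.reverseOrder()); // largest files are placed first

        List<List<Long>> groups = new ArrayList<>();
        long[] load = new long[workers];
        for (int i = 0; i < workers; i++) {
            groups.add(new ArrayList<>());
        }
        for (long size : sorted) {
            // Pick the worker with the smallest load so far (lowest index wins ties).
            int target = 0;
            for (int w = 1; w < workers; w++) {
                if (load[w] < load[target]) {
                    target = w;
                }
            }
            groups.get(target).add(size);
            load[target] += size;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> sizes = new ArrayList<>();
        Collections.addAll(sizes, 8L, 7L, 6L, 5L, 4L); // assumed file sizes, arbitrary units
        System.out.println(plan(sizes, 2));
    }
}
```

Placing the largest files first keeps one oversized file from landing on a
worker that is already heavily loaded, which matters when directories vary a
lot in size.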