Re: small files and number of mappers
On Tue, Nov 30, 2010 at 3:21 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Hey,
>
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <[EMAIL PROTECTED]> wrote:
>>
>> Hey there,
>> I am doing some tests and wondering about the best practices for dealing
>> with very small files that are continuously being generated (1 MB or even
>> less).
>
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
>>
>> I see that if I have hundreds of small files in HDFS, Hadoop will
>> automatically create A LOT of map tasks to consume them. Each map task
>> will take 10 seconds or less... I don't know if it's possible to change
>> the number of map tasks from Java code using the new API (I know it can
>> be done with the old one). I would like to do something like
>> NumMapTasksCalculatedByHadoop * 0.3. This way, fewer map tasks would be
>> instantiated and each would run longer.
>
> Perhaps you need to use MultiFileInputFormat:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> --
> Harsh J
> www.harshj.com
>

MultiFileInputFormat and CombineFileInputFormat help.
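
For example, here is a minimal driver sketch (assuming a newer Hadoop
release where CombineTextInputFormat ships with the new API; on an
0.20-era release you would subclass CombineFileInputFormat and supply
your own RecordReader instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "small-files");
    job.setJarByClass(SmallFilesDriver.class);

    // Pack many small files into each split instead of one split per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at ~128 MB so map tasks run for minutes,
    // not seconds.
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ... set mapper/reducer classes here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}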
JVM reuse helps.
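
With the old API that is a one-line setting (a sketch; under the hood it
sets mapred.job.reuse.jvm.num.tasks):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(JvmReuseExample.class);
    // Reuse each task JVM for an unlimited number of tasks in this job
    // (the default is 1), amortizing JVM startup across many short maps.
    conf.setNumTasksToExecutePerJvm(-1);
    return conf;
  }
}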

The larger problem is that an average NameNode with 4 GB of RAM will
start hitting JVM garbage-collection pauses at a relatively low number
of files/blocks, say 10,000,000. Ten million is not a large number when
you are generating thousands of files a day.
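
For a back-of-the-envelope sense of scale (assuming the roughly 150
bytes of NameNode heap per file, directory, and block object that the
Cloudera post above cites): 10,000,000 files, each with a single block,
is about 20,000,000 namespace objects, or on the order of 3 GB of a
4 GB heap before the JVM does anything else.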

We open-sourced a tool to deal with this problem:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Essentially it takes a pass over a directory and combines multiple
files into one. On 'hourly' directories we run it after the hour is
closed out.
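
The general idea looks something like this (a minimal sketch of the
approach, not filecrush's actual code: roll every file in a directory
into one SequenceFile keyed by the original file name, so the originals
can be deleted afterwards):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirCombiner {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inDir = new Path(args[0]);
    Path outFile = new Path(args[1]);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inDir)) {
        if (stat.isDir()) continue; // skip subdirectories in this sketch
        // The int cast is fine here: this targets files of ~1 MB or less.
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, buf);
        } finally {
          in.close();
        }
        // Key = original file name, value = the raw bytes.
        writer.append(new Text(stat.getPath().getName()),
                      new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}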

V2 (which we should throw over the fence in a week or so) uses the same
techniques but is optimized for dealing with very large directories
and/or subdirectories of varying sizes: it does more intelligent
planning and grouping of which files an individual mapper or reducer
will combine.
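
To make the planning concrete, here is a hypothetical illustration (not
V2's actual code) of the sort of first-fit-decreasing bin packing that
groups files into similarly sized units of work:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CrushPlanner {
  // Group file sizes into bins of at most `target` bytes, so each
  // mapper or reducer combines one reasonably sized bin of files.
  public static List<List<Long>> plan(List<Long> fileSizes, long target) {
    List<Long> sorted = new ArrayList<Long>(fileSizes);
    Collections.sort(sorted, Collections.<Long>reverseOrder());
    List<List<Long>> bins = new ArrayList<List<Long>>();
    List<Long> used = new ArrayList<Long>(); // bytes already in each bin
    for (Long size : sorted) {
      int bin = -1;
      for (int i = 0; i < bins.size(); i++) {
        if (used.get(i) + size <= target) { bin = i; break; }
      }
      if (bin < 0) { // no existing bin fits; open a new one
        bins.add(new ArrayList<Long>());
        used.add(0L);
        bin = bins.size() - 1;
      }
      bins.get(bin).add(size);
      used.set(bin, used.get(bin) + size);
    }
    return bins; // a file bigger than target simply gets its own bin
  }
}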