HDFS >> mail # user >> NameNode heapsize


Re: NameNode heapsize
Each "object" (file, directory, and block) uses about 150 bytes of
memory. If you lower the number of files by having larger ones, you
save a modest amount of memory, depending on how many blocks your
existing files use. The real savings comes from having larger files
and a larger block size. Lets say you start by having some number of
files where each file fits in a single block (64 MB). If you double
the average block usage to two (<=128MB files) by lowering your number
of files in half, you'll save up to 1/3 of your NN memory. If you also
double your block size, you'll cut your usage by another 1/3. This
becomes very significant very quickly.
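
To make the arithmetic concrete, here's a back-of-the-envelope Java
sketch (just the rule-of-thumb figure above applied to the object
counts from the cluster summary quoted below; not an exact model of
NN internals):

  class NNHeapEstimate {
    public static void main(String[] args) {
      long bytesPerObject = 150L;      // rough rule of thumb from above
      long filesAndDirs = 33050825L;   // counts from the cluster summary below
      long blocks = 47708724L;
      long objects = filesAndDirs + blocks;               // 80759549 total
      long mb = objects * bytesPerObject / (1024 * 1024);
      System.out.println("Estimated object metadata: " + mb + " MB");
      // Prints ~11552 MB (~11.3 GB); the live NN heap needs
      // considerable headroom beyond this raw object estimate.
    }
  }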

-Joey

On Fri, Jun 10, 2011 at 12:36 PM, Anh Nguyen <[EMAIL PROTECTED]> wrote:
> On 06/10/2011 04:57 AM, Joey Echeverria wrote:
>>
>> Hi On,
>>
>> The namenode stores the full filesystem image in memory. Looking at
>> your stats, you have ~30 million files/directories and ~47 million
>> blocks. That means that on average, each of your files is only ~1.4
>> blocks in size.  One way to lower the pressure on the namenode would
>> be to store fewer, larger files. If you're able to concatenate files
>> and still parse them, great. Otherwise, Hadoop provides a couple of
>> container file formats that might help.
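>>
>> (That average is just 47708724 blocks / 33050825 files and
>> directories ≈ 1.44, using the counts from the summary below.)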
>>
>> SequenceFiles are Hadoop specific binary files that store key/value
>> pairs. If your data fits that model, you can convert the data into
>> SequenceFiles when you write it to HDFS, including data from multiple
>> input files in a single SequenceFile. Here is a simple example of
>> using the SequenceFile API:
>>
>> http://programmer-land.blogspot.com/2009/04/hadoop-sequence-files.html
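>>
>> As a rough sketch of that pattern (the paths and names here are made
>> up; each small input becomes one key/value record):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.*;
>>   import org.apache.hadoop.io.*;
>>
>>   // Pack many small inputs into one SequenceFile:
>>   // key = original file name, value = that file's raw bytes.
>>   Configuration conf = new Configuration();
>>   FileSystem fs = FileSystem.get(conf);
>>   Path out = new Path("/data/packed.seq");   // hypothetical output path
>>   SequenceFile.Writer writer = SequenceFile.createWriter(
>>       fs, conf, out, Text.class, BytesWritable.class);
>>   try {
>>     byte[] payload = "small file contents".getBytes();  // stand-in bytes
>>     writer.append(new Text("part-0001.log"), new BytesWritable(payload));
>>   } finally {
>>     writer.close();
>>   }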
>>
>> Another option is Hadoop Archive files (HARs). A HAR file lets you
>> combine multiple smaller files into a virtual filesystem. Here are
>> some links with details on HARs:
>>
>> http://developer.yahoo.com/blogs/hadoop/posts/2010/07/hadoop_archive_file_compaction/
>> http://hadoop.apache.org/mapreduce/docs/current/hadoop_archives.html
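>>
>> For example (paths are placeholders), you build a HAR with the
>> archive tool and then read it back through the har:// filesystem:
>>
>>   hadoop archive -archiveName logs.har -p /user/foo logs /user/foo/archives
>>   hadoop fs -ls har:///user/foo/archives/logs.har
>>
>> The first command packs /user/foo/logs into logs.har under
>> /user/foo/archives; the second lists the archived contents.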
>>
>> If you're able to use any of these techniques to grow your average
>> file size, then you can also save memory by increasing the block
>> size. The default block size is 64 MB; most clusters I've been
>> exposed to run at 128 MB.
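>>
>> If you raise it, a minimal hdfs-site.xml sketch (dfs.block.size is
>> the property name in this generation of Hadoop; it only affects
>> files written after the change, so existing files keep their old
>> block size):
>>
>>   <property>
>>     <name>dfs.block.size</name>
>>     <value>134217728</value> <!-- 128 MB -->
>>   </property>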
>>
>> -Joey
>>
>
> Hi Joey,
> The explanation is really helpful, as I've been looking for a way to size
> the NN heap.
> What I'm still not clear on, though, is why changing the average file size
> would save NN heap usage.
> Since the NN holds the entire FS image, wouldn't the total space used be
> the same regardless of file size?
> Or is that not the case because with larger files there is less metadata
> to maintain?
> Thanks in advance for your clarification, particularly for advice on
> sizing the heap.
>
> Anh-
>>
>> On Fri, Jun 10, 2011 at 7:45 AM, [EMAIL PROTECTED]<[EMAIL PROTECTED]>  wrote:
>>
>>>
>>> Dear all,
>>>
>>> I'm looking for ways to reduce the namenode heap usage of an 800-node,
>>> 10 PB test Hadoop cluster that stores around 30 million files.
>>>
>>> Here's some info:
>>>
>>> 1 x namenode:     32 GB RAM, 24 GB heap size
>>> 800 x datanode:   8 GB RAM, 13 TB HDD
>>>
>>> 33050825 files and directories, 47708724 blocks = 80759549 total.
>>> Heap Size is 22.93 GB / 22.93 GB (100%)
>>>
>>> From the cluster summary report, it seems the heap usage is always full
>>> and never drops. Do you guys know of any ways to reduce it? So far I
>>> don't see any namenode OOM errors, so it looks like the memory assigned
>>> to the namenode process is (just) enough. But I'm curious which factors
>>> account for the heap being fully used?
>>>
>>> Regards,
>>> On

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434