Hadoop, mail # user - NN Memory Jumps every 1 1/2 hours


Re: NN Memory Jumps every 1 1/2 hours
Michael Segel 2012-12-22, 15:42
Hey, silly question...

How long have you had 27 million files?

I mean, can you correlate the number of files with the spate of OOMs?

Even without problems, I'd say it would be a good idea to upgrade, since newer releases probably carry a lot of code fixes.

If you're running anything pre-1.x, going to Java 1.7 wouldn't be a good idea. Having said that, outside of MapR, have any of the distros certified themselves on 1.7 yet?
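
A rough way to sanity-check the file-count question above is a back-of-the-envelope heap estimate. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure, the blocks-per-file ratios, and the class and method names are all assumptions for illustration, not numbers from this cluster. Directories are ignored.

```java
public class NameNodeHeapEstimate {
    // Assumption: ~150 bytes of heap per namespace object (rule of thumb only).
    static final long BYTES_PER_OBJECT = 150L;

    /** Estimated heap in bytes for the given file count and average blocks per file. */
    public static long estimateBytes(long files, double blocksPerFile, long bytesPerObject) {
        long blocks = Math.round(files * blocksPerFile);
        return (files + blocks) * bytesPerObject;
    }

    public static void main(String[] args) {
        long files = 27_000_000L; // file count quoted in the thread
        // Hypothetical blocks-per-file ratios; the real ratio is not given in the thread.
        for (double blocksPerFile : new double[] {1.0, 1.5, 2.0}) {
            long bytes = estimateBytes(files, blocksPerFile, BYTES_PER_OBJECT);
            System.out.printf("blocks/file=%.1f -> ~%.1f GB%n", blocksPerFile, bytes / 1e9);
        }
    }
}
```

If the estimate lands well below the configured heap, steady-state namespace size alone is less likely to explain the OOMs, and tracking how the file count changed over time becomes the more useful comparison.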

On Dec 22, 2012, at 6:54 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> I will give this a go. I have actually gone into JMX and manually triggered
> GC, and no memory was returned. So I assumed something was leaking.
>
> On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris <[EMAIL PROTECTED]> wrote:
>
>> I know this will sound odd, but try reducing your heap size. We had an
>> issue like this where GC kept falling behind and we either ran out of heap
>> or were stuck in full GC. By reducing the heap, we forced concurrent mark
>> sweep to run and avoided both full GC and running out of heap space, as
>> the JVM would collect objects more frequently.
>>
>> On Dec 21, 2012, at 8:24 PM, Edward Capriolo <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I have an old Hadoop 0.20.2 cluster. I have not had any issues for a while
>>> (which is why I never bothered to upgrade).
>>>
>>> Suddenly it OOMed last week. Now the OOMs happen periodically. We have a
>>> fairly large NameNode heap (Xmx 17GB) and a fairly large FS, about
>>> 27,000,000 files.
>>>
>>> So the strangest thing is that every 1 1/2 hours the NN memory usage
>>> increases until the heap is full.
>>>
>>> http://imagebin.org/240287
>>>
>>> We tried failing over the NN to another machine. We changed the Java version
>>> from 1.6_23 to 1.7.0.
>>>
>>> I have set the NameNode logs to DEBUG and ALL, and I have done the same with
>>> the data nodes.
>>> The Secondary NN is running, shipping edits, and making new images.
>>>
>>> I am thinking something has corrupted the NN metadata and after enough time
>>> it becomes a time bomb, but this is just a total shot in the dark. Does
>>> anyone have any interesting troubleshooting ideas?
>>
>>
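
The manual check described in the thread (trigger a GC over JMX, then see whether memory comes back) can also be scripted. The following is a minimal sketch using the standard `java.lang.management` API; it inspects the JVM it runs in, and the class and method names are hypothetical. Against a remote NameNode, the same `MemoryMXBean` interface could be obtained through a JMX connection (for example with `ManagementFactory.newPlatformMXBeanProxy`); the host and port for that are not given in the thread and are left out here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class GcCheck {
    /** Returns {usedBefore, usedAfter} heap bytes around a requested GC. */
    public static long[] usedAroundGc(MemoryMXBean memory) {
        long before = memory.getHeapMemoryUsage().getUsed();
        memory.gc(); // requests a collection, like System.gc(); the JVM may not honor it fully
        long after = memory.getHeapMemoryUsage().getUsed();
        return new long[] {before, after};
    }

    public static void main(String[] args) {
        // Local JVM only; a remote NameNode would need an MXBean proxy instead.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long[] used = usedAroundGc(memory);
        System.out.printf("heap used before GC: %d MB%n", used[0] / (1024 * 1024));
        System.out.printf("heap used after GC:  %d MB%n", used[1] / (1024 * 1024));
    }
}
```

If used heap barely drops after a requested collection, the objects are still reachable, which is consistent with either a leak or a live data set that has simply outgrown the heap; a heap histogram or dump would be needed to tell the two apart.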