Flume, mail # user - HDFS Sink Memory Leak


Re: HDFS Sink Memory Leak
DSuiter RDX 2013-11-11, 18:29
David,

It was a mix. Our test pipeline is what would euphemistically be called
"low-velocity" when it comes to data. When we experimented with
rollInterval, we found a lot of lingering .tmp, but we did not have an
idleTimeout set on that config IIRC, since we were testing parameters in
isolation. I feel like we also accidentally tested the default roll
parameters when we first started, because we didn't realize that the roll
settings are all active by default. However, I still have files that are
something like 6 weeks old now, my test cluster VM has been rebooted many
times in the interim, I have spun up dozens of different Flume agent
configs in the weeks in between, and those files are still named .tmp and
show 0 bytes. Like I said, I am sure I can run "hadoop fs -mv
<name.avro.tmp> <name.avro>" and that will change the name; I am just not
sure that, without all the other parts of the Flume pipeline, they would
get properly closed in HDFS, especially because these are from tier 2 of an
Avro tiered-ingest agent config. From what I have read about
serialization/deserialization, a StreamWriter not closing the stream
correctly or exiting properly will cause issues. I guess I'll just give it
a shot, since it's just junk data anyway.
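
For reference, here is a minimal sketch of the sink settings we want to test
together next. This assumes the standard Flume 1.x HDFS sink property names;
the agent and sink names are made up:

  # Hypothetical agent/sink names; only the hdfs.* keys matter here.
  agent1.sinks.hdfsSink.type = hdfs
  agent1.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
  # Roll settings - if left unset, the defaults (30 s / 1024 bytes /
  # 10 events, IIRC) are all active at the same time; 0 disables a roll.
  agent1.sinks.hdfsSink.hdfs.rollInterval = 300
  agent1.sinks.hdfsSink.hdfs.rollSize = 0
  agent1.sinks.hdfsSink.hdfs.rollCount = 0
  # Close (and rename) any bucket that has seen no events for 60 seconds,
  # so idle .tmp files do not linger.
  agent1.sinks.hdfsSink.hdfs.idleTimeout = 60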

Thanks again,
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Mon, Nov 11, 2013 at 11:03 AM, Hari Shreedharan <
[EMAIL PROTECTED]> wrote:

> This is because, like you said, you have too many files open at the same
> time. The HDFS stream classes keep a pretty large buffer (this is HDFS
> client code, not Flume) which is only cleaned up when the file is closed.
> Setting maxOpenFiles to a smaller number is a good way to handle this.
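>
> A quick sketch of that setting, with a made-up sink name (the property
> defaults to 5000 if I remember right):
>
>   agent1.sinks.hdfsSink.hdfs.maxOpenFiles = 50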
>
> On Monday, November 11, 2013, David Sinclair wrote:
>
>> I forgot to mention that map is contained in the HDFSEventSink class.
>>
>> Devin,
>>
>> Are you setting a roll interval? I was using roll intervals, so the .tmp
>> files were getting closed, even if they were idle. They were just never
>> being removed from that hashmap.
>>
>>
>> On Mon, Nov 11, 2013 at 10:10 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>>
>>> David,
>>>
>>> This is insightful - I found the need to place an idleTimeout value in
>>> the Flume config, but we were not running out of memory; we just found
>>> that lots of unclosed .tmp files got left lying around when the roll
>>> occurred. I believe these are also registering as under-replicated
>>> blocks - in my pseudo-distributed testbed I have 5 under-replicated
>>> blocks, even though the replication factor for pseudo mode is "1" - so
>>> we don't want them in the actual cluster.
>>>
>>> Can you tell me, in your research, have you found a good way to close
>>> the .tmp files out so they are properly acknowledged by HDFS/BucketWriter?
>>> Or is simply renaming them sufficient? I've been concerned that the manual
>>> rename approach might leave some floating metadata around, which is not
>>> ideal.
>>>
>>> If you're not sure, don't sweat it, obviously. I was just wondering if
>>> you already knew and could save me some empirical research time...
>>>
>>> Thanks!
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Mon, Nov 11, 2013 at 10:01 AM, David Sinclair <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been investigating an OutOfMemory error when using the HDFS
>>>> event sink. I have determined the problem to be with the
>>>>
>>>> WriterLinkedHashMap sfWriters;
>>>>
>>>> Depending on how you generate your file name/directory path, you can
>>>> run out of memory pretty quickly. You need to either set the
>>>> *idleTimeout* to some non-zero value or cap the number of open files
>>>> with *maxOpenFiles*.
>>>>
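>>>> For example, a bucket path like this made-up one creates one
>>>> BucketWriter per host per minute, so the map grows very quickly unless
>>>> one of those two settings is in place (sink name hypothetical, property
>>>> names as given in the HDFS sink docs):
>>>>
>>>>   agent1.sinks.hdfsSink.hdfs.path = /flume/%{host}/%Y-%m-%d-%H-%M
>>>>   agent1.sinks.hdfsSink.hdfs.idleTimeout = 60
>>>>   agent1.sinks.hdfsSink.hdfs.maxOpenFiles = 50
>>>>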
>>>> The map keeps references to BucketWriters around longer than they are
>>>> needed. I was able to reproduce this consistently and took a heap dump to