Re: hdfs.idleTimeout, what's it used for?
I have been using it and it's a great feature to have.

One question I have, though: what happens when Flume dies unexpectedly?
Does it leave .tmp files behind? How can those be cleaned up and closed
gracefully?

On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
[EMAIL PROTECTED]> wrote:

> It's also useful if you want files to be promptly closed and renamed from
> their .tmp names.
>
> We use it with a setting of around 30 seconds (we have a constant stream
> of data) and hourly bucketing.
>
> There is also the issue that files closed by rollInterval are never
> removed from the internal linked list, so it actually causes a small memory
> leak (which can get big in the long term if you have a lot of files and
> hourly renames). I believe this is what is causing the OOM Mohit is getting
> in FLUME-1850.
>
> So I personally would recommend using it (with a setting that will close
> files before rollInterval does).
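>
> As a rough sketch of that kind of setup (the sink name follows the example
> further down this thread, and the rollInterval value here is only
> illustrative):
>
> # one bucket per hour; a bucket's file is closed ~30s after its events stop
> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H/
> a1.sinks.k1.hdfs.idleTimeout = 30
> a1.sinks.k1.hdfs.rollInterval = 3600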
>
> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>
>> Ah I see. Again, something useful to have in the Flume user guide.
>>
>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[EMAIL PROTECTED]>
>> wrote:
>>
>>> the rollInterval will still cause the last 01-17 file to be closed
>>> eventually. The way the HDFS sink works with the different files is that
>>> each unique path is handled by a separate BucketWriter object. The sink
>>> can hold as many of these objects as specified by hdfs.maxOpenFiles
>>> (default: 5000), and BucketWriters are only removed when you create the
>>> 5001st writer (the 5001st unique path). However, generally once a writer
>>> is closed it is never used again (all of your 01-17 writers will never be
>>> used again). To avoid keeping them in the sink's internal list of writers,
>>> the idleTimeout is a specified number of seconds in which no data is
>>> received by the BucketWriter. After this time, the writer will try to
>>> close itself and will then tell the sink to remove it, thus freeing up
>>> everything used by the BucketWriter.
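>>>
>>> As a hedged illustration (the settings below are real sink parameters,
>>> but the values are just examples): with a per-day path, each day becomes
>>> its own BucketWriter, and idleTimeout is what lets yesterday's writer be
>>> dropped without waiting for the 5001st path to evict it.
>>>
>>> # one BucketWriter per rendered path, i.e. one per day here
>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>> # default cap on cached writers; the eldest is closed only when exceeded
>>> a1.sinks.k1.hdfs.maxOpenFiles = 5000
>>> # close and drop a writer after 10 idle minutes instead of waiting
>>> a1.sinks.k1.hdfs.idleTimeout = 600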
>>>
>>> So the idleTimeout is just a setting to help limit memory usage by the
>>> HDFS sink. The ideal time for it is longer than the maximum time between
>>> events (capped at the rollInterval) - if you know you'll receive a
>>> constant stream of events you might just set it to a minute or something.
>>> Or, if you are fine with having multiple files open per hour, you can set
>>> it to a lower number, maybe just over the average time between events.
>>> For me, just testing, I set it >= rollInterval for the cases when no
>>> events are received in a given hour (I'd rather keep the object alive for
>>> an extra hour than create files every 30 minutes or something).
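>>>
>>> For example (a sketch only, with arbitrary numbers), that testing setup
>>> would look something like:
>>>
>>> # rollInterval closes files that are receiving events;
>>> # idleTimeout >= rollInterval only mops up buckets that went quiet
>>> a1.sinks.k1.hdfs.rollInterval = 3600
>>> a1.sinks.k1.hdfs.idleTimeout = 3600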
>>>
>>> Hope that was helpful,
>>>
>>> - Connor
>>>
>>>
>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> Say if I have
>>>>
>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>
>>>> hdfs.rollInterval=60
>>>>
>>>> Now, if there is a file
>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>> and this file is not ready to be rolled over yet, i.e. 60 seconds are
>>>> not up, and now it's past midnight, i.e. a new day,
>>>> and events start to be written to
>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>
>>>> will the file under 2013-01-17 never be rolled over, unless I have
>>>> something like hdfs.idleTimeout=60 ?
>>>> If so, how do Flume sinks keep track of files they need to roll over
>>>> after idleTimeout?
>>>>
>>>> In short, what's the exact use of the idleTimeout parameter?
>>>>
>>>
>>>
>