Flume user mailing list: hdfs.idleTimeout, what's it used for?


Bhaskar V. Karambelkar - 2013-01-17, 20:07
Connor Woodson - 2013-01-17, 20:29
Bhaskar V. Karambelkar - 2013-01-17, 21:38
Re: hdfs.idleTimeout, what's it used for?
It's also useful if you want files to get promptly closed and renamed
from the .tmp suffix.

We use it with something like a 30-second setting (we have a constant
stream of data) and hourly bucketing.

There is also the issue that files closed by rollInterval are never
removed from the internal linked list, so it actually causes a small
memory leak (which can get big in the long term if you have a lot of
files and hourly renames). I believe this is what is causing the OOM
Mohit is getting in FLUME-1850.

So I personally would recommend using it (with a setting that will
close files before rollInterval does).
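
A minimal sketch of that kind of setup, with hourly bucketing and a
short idle timeout (the agent and sink names, and the exact values,
are illustrative rather than taken from this thread):

a1.sinks.k1.type = hdfs
# one directory (and one BucketWriter) per hour
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
# close a bucket's file once it has received no events for 30 seconds
a1.sinks.k1.hdfs.idleTimeout = 30
# keep rollInterval as a backstop; idleTimeout should normally fire first
a1.sinks.k1.hdfs.rollInterval = 3600

With a constant stream of events, the previous hour's writer goes idle
as soon as the clock rolls over, so its .tmp file is closed and renamed
about 30 seconds later and the writer is dropped from the sink's
internal list, instead of hanging around until rollInterval fires.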

On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
> Ah I see. Again something useful to have in the flume user guide.
>
> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[EMAIL PROTECTED]> wrote:
>> the rollInterval will still cause the last 01-17 file to be closed
>> eventually. The way the HDFS sink works with the different files is that
>> each unique path is handled by a different BucketWriter object. The sink
>> can hold as many of these objects as specified by hdfs.maxOpenFiles
>> (default: 5000), and BucketWriters are only removed when you create the
>> 5001st writer (the 5001st unique path). However, generally once a writer
>> is closed it is never used again (all of your 01-17 writers will never be
>> used again). To avoid keeping them in the sink's internal list of writers,
>> there is the idleTimeout: a specified number of seconds in which no data
>> is received by the BucketWriter. After this time, the writer will try to
>> close itself and will then tell the sink to remove it, thus freeing up
>> everything used by the BucketWriter.
>>
>> So the idleTimeout is just a setting to help limit memory usage by the hdfs
>> sink. The ideal time for it is longer than the maximum time between events
>> (capped at the rollInterval) - if you know you'll receive a constant stream
>> of events you might just set it to a minute or something. Or if you are fine
>> with having multiple files open per hour, you can set it to a lower number;
>> maybe just over the average time between events. For me in just testing, I
>> set it >= rollInterval for the cases when no events are received in a given
>> hour (I'd rather keep the object alive for an extra hour than create files
>> every 30 minutes or something).
>>
>> Hope that was helpful,
>>
>> - Connor
>>
>>
>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>> <[EMAIL PROTECTED]> wrote:
>>> Say if I have
>>>
>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
>>>
>>> hdfs.rollInterval=60
>>>
>>> Now, suppose there is a file
>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>> that is not ready to be rolled over yet (its 60 seconds are not up),
>>> and it's now past midnight, i.e. a new day,
>>> so events start to be written to
>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>
>>> Will the 2013-01-17 file never be rolled over unless I have something
>>> like hdfs.idleTimeout=60?
>>> If so, how do Flume sinks keep track of files they need to roll over
>>> after idleTimeout?
>>>
>>> In short, what's the exact use of the idleTimeout parameter?
>>
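
To tie the question and the answers above together, here is a hedged
sketch of the questioner's configuration with an idle timeout added;
the values are illustrative, hdfs.maxOpenFiles is the bounded writer
cache described above, and only the properties relevant to this
discussion are shown:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.idleTimeout = 30
# default; writers are otherwise only evicted once a 5001st unique path is seen
a1.sinks.k1.hdfs.maxOpenFiles = 5000

Once events start going to the 2013-01-18 bucket, the 2013-01-17
BucketWriter stops receiving data; after the idle timeout it closes
itself, its .tmp file is renamed, and the writer is removed from the
sink's writer list rather than lingering until the maxOpenFiles limit
forces it out. As discussed above, the right idleTimeout depends on
the gaps between your events: too small and you create many small
files per bucket, too large (or unset) and idle writers stay in memory.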
Mohit Anchlia - 2013-01-18, 02:17
Connor Woodson - 2013-01-18, 02:19
Connor Woodson - 2013-01-18, 02:20
Connor Woodson - 2013-01-18, 02:23
Juhani Connolly - 2013-01-18, 02:46
Connor Woodson - 2013-01-18, 03:24
Juhani Connolly - 2013-01-18, 03:39
Connor Woodson - 2013-01-18, 04:18
Mohit Anchlia - 2013-01-18, 05:12
Juhani Connolly - 2013-01-18, 06:37
Juhani Connolly - 2013-01-18, 02:39
Connor Woodson - 2013-01-18, 02:42