Flume user mailing list: hdfs.idleTimeout, what's it used for?


Re: hdfs.idleTimeout, what's it used for?
It's also useful if you want files to be closed promptly and renamed
from .tmp (or whatever suffix is in use).

We use it with a setting of around 30 seconds (we have a constant
stream of data) together with hourly bucketing.

There is also the issue that files closed by rollInterval are never
removed from the sink's internal linked list, which causes a small
memory leak (it can get big in the long term if you have a lot of
files and hourly renames). I believe this is what is causing the OOM
Mohit is getting in FLUME-1850.

So I would personally recommend using it (with a setting that will close
files before rollInterval does).
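
As a rough sketch of the kind of setup I mean: the agent and sink names (a1, k1) are borrowed from the quoted question, and the path and all values are illustrative assumptions rather than our actual config. Note that hdfs.idleTimeout defaults to 0, which disables it.

```properties
a1.sinks.k1.type = hdfs
# Hourly bucketing: every hour resolves to a new path, and so to a new BucketWriter
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
# Close (and rename) a file once it has received no events for 30 seconds
a1.sinks.k1.hdfs.idleTimeout = 30
# Assumed value: longer than one hour plus the idle timeout, so that the
# idle timeout is what closes each hourly file, not the time-based roll
a1.sinks.k1.hdfs.rollInterval = 3700
```

Size-based and count-based rolling (hdfs.rollSize, hdfs.rollCount) are left out here and would need to be set to suit your data.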

On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
> Ah I see. Again something useful to have in the flume user guide.
>
> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[EMAIL PROTECTED]> wrote:
>> The rollInterval will still cause the last 01-17 file to be closed
>> eventually. The way the HDFS sink handles different files is that each
>> unique path is managed by its own BucketWriter object. The sink can
>> hold as many of these as specified by hdfs.maxOpenFiles (default: 5000),
>> and BucketWriters are only removed when you create the 5001st writer (the
>> 5001st unique path). However, once a writer is closed it is generally never
>> used again (none of your 01-17 writers will be used again). To avoid keeping
>> such writers in the sink's internal list, idleTimeout specifies a number of
>> seconds during which the BucketWriter receives no data. After this time,
>> the writer will try to close itself and will then tell the sink
>> to remove it, thus freeing up everything used by the BucketWriter.
>>
>> So the idleTimeout is just a setting to help limit memory usage by the HDFS
>> sink. The ideal value is longer than the maximum time between events
>> (capped at the rollInterval); if you know you'll receive a constant stream
>> of events, you might just set it to a minute or so. Or, if you are fine
>> with having multiple files open per hour, you can set it lower,
>> maybe just over the average time between events. In my own testing, I
>> set it >= rollInterval to cover the case where no events are received in a
>> given hour (I'd rather keep the object alive for an extra hour than create
>> files every 30 minutes or so).
>>
>> Hope that was helpful,
>>
>> - Connor
>>
>>
>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>> <[EMAIL PROTECTED]> wrote:
>>> Say I have
>>>
>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
>>>
>>> a1.sinks.k1.hdfs.rollInterval = 60
>>>
>>> Now, if there is a file
>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>> up and now it's past 12 midnight, i.e. new day
>>> And events start to be written to
>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>
>>> will the 2013-01-17 file never be rolled over unless I have something
>>> like hdfs.idleTimeout=60?
>>> If so, how do Flume sinks keep track of the files they need to roll over
>>> after idleTimeout?
>>>
>>> In short, what's the exact use of the idleTimeout parameter?
>>
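
To make the mechanism Connor describes a bit more concrete, here is a toy model in Java. It is not Flume's actual code: the class WriterRegistrySketch, its methods, and the use of an access-ordered LinkedHashMap are all assumptions made for illustration. Only the ideas come from the thread: one writer per unique path, a cap on open writers like hdfs.maxOpenFiles, and an idle timeout that removes writers that have stopped receiving data.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a sink that tracks one writer per unique path (hypothetical, not Flume code). */
class WriterRegistrySketch {
    private final int maxOpenFiles;         // plays the role of hdfs.maxOpenFiles
    private final long idleTimeoutSeconds;  // plays the role of hdfs.idleTimeout; 0 disables it
    // Access-ordered map: path -> time of the last write, in seconds.
    private final LinkedHashMap<String, Long> lastWrite;

    WriterRegistrySketch(int maxOpenFiles, long idleTimeoutSeconds) {
        this.maxOpenFiles = maxOpenFiles;
        this.idleTimeoutSeconds = idleTimeoutSeconds;
        this.lastWrite = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                // Without an idle timeout, this cap is the only way a writer leaves the map.
                return size() > WriterRegistrySketch.this.maxOpenFiles;
            }
        };
    }

    /** Records an event written to the given path at the given time. */
    void write(String path, long nowSeconds) {
        lastWrite.put(path, nowSeconds);
    }

    /** Removes every writer that has been idle for at least the timeout; returns their paths. */
    List<String> closeIdle(long nowSeconds) {
        List<String> closed = new ArrayList<>();
        if (idleTimeoutSeconds <= 0) {
            return closed; // timeout disabled: nothing is ever closed for idleness
        }
        Iterator<Map.Entry<String, Long>> it = lastWrite.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowSeconds - entry.getValue() >= idleTimeoutSeconds) {
                closed.add(entry.getKey());
                it.remove();
            }
        }
        return closed;
    }

    int openWriters() {
        return lastWrite.size();
    }

    public static void main(String[] args) {
        WriterRegistrySketch registry = new WriterRegistrySketch(5000, 60);
        registry.write("/flume/events/2013-01-17", 0);   // last event of the old day
        registry.write("/flume/events/2013-01-18", 50);  // events move to the new day
        System.out.println("closed: " + registry.closeIdle(70));
        System.out.println("still open: " + registry.openWriters());
    }
}
```

In this model, a registry created with an idle timeout of 0 keeps every writer around until the cap is exceeded, which is the memory behaviour the thread is warning about.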