Flume >> mail # user >> hdfs.idleTimeout ,what's it used for ?


Re: hdfs.idleTimeout ,what's it used for ?
Yeah, I read that first. I find the idleTimeout implementation slightly odd
in that the timeout doesn't persist through the file closing.
On Thu, Jan 17, 2013 at 6:39 PM, Juhani Connolly <
[EMAIL PROTECTED]> wrote:

>  I laid out why it was happening in FLUME-1850.
>
> He has hourly rolls, a rollInterval of 4000 and an idleTimeout of 900
> (both in seconds).
>
> After an hour, 400 seconds remain on the rollInterval, so it fires before
> the 900-second idleTimeout. That triggers close, which cancels all timers,
> including the idleTimeout. Thus the entry in sfWriters remains. His memory
> dump confirms this (he has a huge sfWriters map in memory after 30 days).
> I also confirmed this behaviour of rollInterval when developing the
> idleTimeout feature.
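
The timer arithmetic described above can be sketched as follows. This is a minimal illustration, not Flume code: the function name is hypothetical, and it assumes the rollInterval timer starts when the file is opened and the idleTimeout timer restarts on every write.

```python
def first_timer_to_fire(open_time, last_write_time, roll_interval, idle_timeout):
    """Return (name, time) of whichever timer fires first for one bucket writer.

    Assumption: the roll timer starts when the file is opened and the idle
    timer restarts on every write. All values are in seconds.
    """
    roll_at = open_time + roll_interval
    idle_at = last_write_time + idle_timeout
    if roll_at <= idle_at:
        return ("rollInterval", roll_at)
    return ("idleTimeout", idle_at)


# Hourly bucket opened at t=0, last write at the end of the hour (t=3600).
print(first_timer_to_fire(0, 3600, roll_interval=4000, idle_timeout=900))
# Same bucket with a short idle timeout, as recommended later in the thread.
print(first_timer_to_fire(0, 3600, roll_interval=4000, idle_timeout=30))
```

With the 4000/900 settings the rollInterval fires first (at 4000, versus 4500 for the idle timer), which per the explanation above cancels the idleTimeout and leaves the writer in sfWriters. With a 30-second idleTimeout the idle timer fires first, at 3630.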
>
> You're right about the limit on the size of sfWriters. With a limit of
> 5000, even if the closed ones stay in the list, they shouldn't be that big,
> since their buffers should be cleaned up.
>
> idleTimeout will indeed result in more files if you don't have a steady
> stream of data. It is most useful with a steady stream of data and
> time-bucketed data. In such situations, I might even recommend not using
> rollInterval at all and having a short idleTimeout (or, if you're not in a
> rush to get your file closed, a comfortably long one).
>
>
> On 01/18/2013 11:19 AM, Connor Woodson wrote:
>
>  Whether idleTimeout is lower or higher than rollInterval is a
> preference. Set it lower, and assume you get one message right on the turn
> of the hour: you will then spend part of that hour without any bucket
> writers open, but if you get another message at the end of the hour, you
> will end up with two files instead of one. Set idleTimeout higher and
> you will get just one file, but (in the worst case) you will have twice as
> many bucketwriters open. So it all depends on how many files you want and
> how much memory you have to spare.
>
>  - Connor
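
The file-count side of this tradeoff can be sketched with a small model. It is a simplification under stated assumptions: the function name is hypothetical, only the idle timeout closes a file (rollInterval and the size/count roll triggers are ignored), and all times are in seconds within a single hourly bucket.

```python
def files_for_bucket(write_times, idle_timeout):
    """Count files produced in one bucket when a file is closed after
    idle_timeout seconds without a write (other roll triggers ignored)."""
    files = 0
    last_write = None
    for t in sorted(write_times):
        if last_write is None or t - last_write >= idle_timeout:
            # No file is open (first write, or the previous file was closed
            # as idle), so this write opens a new file.
            files += 1
        last_write = t
    return files


# One message at the turn of the hour, one near the end of the hour.
writes = [0, 3590]
print(files_for_bucket(writes, idle_timeout=900))   # timeout shorter than the gap
print(files_for_bucket(writes, idle_timeout=4000))  # timeout longer than the hour
```

In this model the short timeout yields two files and the long timeout yields one, matching the example in the message above. The cost of the long timeout, more writers held open at once, is not modelled here.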
>
>  An aside:
> bucketwriters, after being closed by rollInterval, aren't really a memory
> leak; they are just very rarely useful to keep around (though if your path
> relies on hostname and you use a rollInterval, those bucketwriters will
> still remain useful). And they will get removed eventually: by default,
> after you've created your 5001st bucketwriter, the first (or whichever was
> used longest ago) will be removed.
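
The eviction behaviour described in this aside resembles a least-recently-used cache. The sketch below is an illustration in Python, not the sink's actual data structure: `WriterCache` and its fields are hypothetical names, and the placeholder objects stand in for real bucket writers.

```python
from collections import OrderedDict


class WriterCache:
    """Hypothetical stand-in for the sink's map of path -> bucket writer."""

    def __init__(self, max_open):
        self.max_open = max_open
        self.writers = OrderedDict()  # ordered from least to most recently used
        self.evicted = []

    def get(self, path):
        if path in self.writers:
            self.writers.move_to_end(path)  # mark as most recently used
        else:
            self.writers[path] = object()   # placeholder for a real writer
            if len(self.writers) > self.max_open:
                # Over the limit: drop whichever writer was used longest ago.
                oldest, _ = self.writers.popitem(last=False)
                self.evicted.append(oldest)
        return self.writers[path]


cache = WriterCache(max_open=5000)
for i in range(5000):
    cache.get(f"/logs/{i}")
cache.get("/logs/0")     # touch the first path so it is no longer the oldest
cache.get("/logs/5000")  # the 5001st unique path triggers one eviction
print(len(cache.writers), cache.evicted)
```

Here the 5001st unique path evicts `/logs/1` rather than `/logs/0`, because `/logs/0` was used more recently. Until that limit is reached, closed writers simply stay in the map, which is the growth discussed in this thread.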
>
>  And I don't think that's the cause behind 1850 as he did have an
> idleTimeout set at 15 minutes.
>
>
> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
> [EMAIL PROTECTED]> wrote:
>
>> It's also useful if you want files to get promptly closed and renamed
>> from the .tmp or whatever.
>>
>> We use it with something like a 30-second setting (we have a constant
>> stream of data) and hourly bucketing.
>>
>> There is also the issue that files closed by rollInterval are never
>> removed from the internal linkedList, so it actually causes a small memory
>> leak (which can get big in the long term if you have a lot of files and
>> hourly renames). I believe this is what is causing the OOM Mohit is getting
>> in FLUME-1850.
>>
>> So I personally would recommend using it (with a setting that will close
>> files before rollInterval does).
>>
>>
>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>
>>> Ah I see. Again something useful to have in the flume user guide.
>>>
>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> the rollInterval will still cause the last 01-17 file to be closed
>>>> eventually. The way the HDFS sink works with the different files is
>>>> that each unique path is handled by a different BucketWriter object.
>>>> The sink can hold as many objects as specified by hdfs.maxOpenFiles
>>>> (default: 5000), and bucketwriters are only removed when you create
>>>> the 5001st writer (5001st unique path). However, generally once a
>>>> writer is closed it is never used again (all of your 01-17 writers
>>>> will never be used again). To avoid keeping them in the sink's
>>>> internal list of writers, the idleTimeout is a