Flume >> mail # user >> hdfs.idleTimeout ,what's it used for ?

Bhaskar V. Karambelkar 2013-01-17, 20:07
Connor Woodson 2013-01-17, 20:29
Bhaskar V. Karambelkar 2013-01-17, 21:38
Juhani Connolly 2013-01-18, 02:08
Mohit Anchlia 2013-01-18, 02:17
Connor Woodson 2013-01-18, 02:19
Connor Woodson 2013-01-18, 02:20
Connor Woodson 2013-01-18, 02:23
Juhani Connolly 2013-01-18, 02:46
Re: hdfs.idleTimeout ,what's it used for ?
The way idleTimeout works right now is that it's another rollInterval; it
works best when rollInterval is not set. So its best use seems to be when
you don't want a rollInterval and just want your bucketwriters to close
when no events are coming through (because of a path change or something
else); you can still roll reliably with either count or size.

As such, perhaps it would be clearer if idleTimeout were renamed to idleRoll or

Then change idleTimeout to count only the seconds since the writer was
closed; a bucketwriter that has been closed for long enough removes itself
automatically. That kind of idle timeout would work well with rollInterval,
while the other one doesn't (idleRoll + rollInterval creates two time-based
rollers; there are certainly times for that, but not always).

- Connor
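
A rough way to see the "another rollInterval" behaviour described above is to model a single bucket path with two timers: one fixed from the moment the file is opened (rollInterval) and one that resets on every event (idleTimeout). The sketch below is only an illustration under those assumptions; simulate_files and its arguments are hypothetical names, not Flume code, and it ignores count/size rolling and the fact that a writer closed by rollInterval stays in sfWriters.

```python
def simulate_files(event_times, roll_interval=0, idle_timeout=0):
    """Return (open_time, close_time) spans for one bucket path.

    Times are in seconds. A value of 0 disables that timer. A close_time of
    None means neither timer is set, so the file is never closed by time.
    """

    def close_deadline(opened_at, last_event_at):
        deadlines = []
        if roll_interval:
            # rollInterval counts from when the file was opened.
            deadlines.append(opened_at + roll_interval)
        if idle_timeout:
            # idleTimeout counts from the most recent event.
            deadlines.append(last_event_at + idle_timeout)
        return min(deadlines) if deadlines else None

    files = []
    opened_at = last_event_at = None
    for t in sorted(event_times):
        if opened_at is not None:
            deadline = close_deadline(opened_at, last_event_at)
            if deadline is not None and deadline <= t:
                # The file was closed before this event arrived.
                files.append((opened_at, deadline))
                opened_at = None
        if opened_at is None:
            opened_at = t  # this event opens a new file
        last_event_at = t
    if opened_at is not None:
        files.append((opened_at, close_deadline(opened_at, last_event_at)))
    return files


# One event at the turn of the hour, one near the end of it.
events = [0, 3500]
shorter_idle = simulate_files(events, roll_interval=3600, idle_timeout=600)
longer_idle = simulate_files(events, roll_interval=3600, idle_timeout=4000)
print(shorter_idle)  # two files
print(longer_idle)   # one file
```

With the idle timer shorter than the roll interval the two events land in separate files; with it longer, the roll interval closes a single file first.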
On Thu, Jan 17, 2013 at 6:46 PM, Juhani Connolly <

>  It seemed neater at the time. It's only an issue because rollInterval
> doesn't remove the entry in sfWriters. We could change it so that close
> doesn't cancel it, and have it check whether or not the writer is already
> closed, but that'd be kind of ugly.
> @Mohit:
> When Flume dies unexpectedly, the .tmp file remains. When it restarts, there
> is some logic in the HDFS sink to recover it (and continue writing from there).
> I'm not actually sure of the specifics. You may want to try and just kill
> -9 a running flume process on a test machine and then start it up, look at
> the logs and see what happens with the output.
> If flume dies cleanly the file is properly closed.
> On 01/18/2013 11:23 AM, Connor Woodson wrote:
> And @ my aside: I hadn't realized that the idleTimeout is canceled by the
> rollInterval occurring. That's annoying. So setting a lower idleTimeout,
> and drastically decreasing maxOpenFiles to at most 2 * possible open files,
> is probably necessary.
> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[EMAIL PROTECTED]>wrote:
>> @Mohit:
>>  For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir
>> property. The default location is /tmp/hadoop-${user.name}. To change
>> this, you can add -Dhadoop.tmp.dir=<path> to your Flume command line call,
>> or you can specify the property in the core-site.xml of wherever your
>> HADOOP_HOME environment variable points to.
>>  - Connor
>> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[EMAIL PROTECTED]>wrote:
>>>  Whether idleTimeout is lower or higher than rollInterval is a
>>> preference. Set it lower, and assume you get one message right on the turn
>>> of the hour: you will then have some part of that hour without any bucket
>>> writers, but if you get another message at the end of the hour, you will
>>> end up with two files instead of one. Set idleTimeout to be longer and
>>> you will get just one file, but (in the worst case) you will also have
>>> twice as many bucketwriters open. So it all depends on how many files you
>>> want and how much memory you have to spare.
>>>  - Connor
>>>  An aside:
>>> bucketwriters, after being closed by rollInterval, aren't really a
>>> memory leak; they just are very rarely useful to keep around (your path
>>> could rely on hostname, and you could use a rollInterval, and then those
>>> bucketwriters will still remain useful). And they will get removed
>>> eventually; by default after you've created your 5001st bucketwriter, the
>>> first (or whichever was used longest ago) will be removed.
>>>  And I don't think that's the cause behind 1850 as he did have an
>>> idleTimeout set at 15 minutes.
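
The eviction behaviour mentioned in the aside above (once the limit is exceeded, the writer used longest ago is dropped) can be sketched as a small least-recently-used cache. This is an illustrative model only, assuming a limit of 5000 as implied by the "5001st bucketwriter" remark; WriterCache, get, and evicted are hypothetical names, and the placeholder object stands in for a real bucket writer.

```python
from collections import OrderedDict


class WriterCache:
    """Toy LRU cache keyed by bucket path (not Flume's actual class)."""

    def __init__(self, max_open_files=5000):
        self.max_open_files = max_open_files
        self._writers = OrderedDict()  # oldest-used entry is first
        self.evicted = []              # paths removed, in eviction order

    def get(self, path):
        """Return the writer for path, creating it if needed."""
        if path in self._writers:
            # Using a writer makes it the most recently used entry.
            self._writers.move_to_end(path)
            return self._writers[path]
        writer = object()  # placeholder for a real bucket writer
        self._writers[path] = writer
        if len(self._writers) > self.max_open_files:
            # Drop whichever writer was used longest ago.
            oldest_path, _ = self._writers.popitem(last=False)
            self.evicted.append(oldest_path)
        return writer

    def __len__(self):
        return len(self._writers)


cache = WriterCache(max_open_files=3)
for path in ["a", "b", "c"]:
    cache.get(path)
cache.get("a")        # "a" is now the most recently used
cache.get("d")        # fourth path exceeds the limit
print(cache.evicted)  # "b" was used longest ago
```

Lowering the limit, as suggested elsewhere in the thread, makes stale writers disappear sooner at the cost of re-creating any that are still needed.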
>>>  On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
>>> [EMAIL PROTECTED]> wrote:
>>>> It's also useful if you want files to get promptly closed and renamed
>>>> from the .tmp or whatever.
>>>> We use it with a setting of something like 30 seconds (we have a
>>>> constant stream of data) and hourly bucketing.
>>>> There is also the issue that files closed by rollInterval are never
Juhani Connolly 2013-01-18, 03:39
Connor Woodson 2013-01-18, 04:18
Mohit Anchlia 2013-01-18, 05:12
Juhani Connolly 2013-01-18, 06:37
Juhani Connolly 2013-01-18, 02:39
Connor Woodson 2013-01-18, 02:42