Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Flume >> mail # user >> hdfs.idleTimeout ,what's it used for ?


+
Bhaskar V. Karambelkar 2013-01-17, 20:07
+
Connor Woodson 2013-01-17, 20:29
+
Bhaskar V. Karambelkar 2013-01-17, 21:38
+
Juhani Connolly 2013-01-18, 02:08
+
Mohit Anchlia 2013-01-18, 02:17
+
Connor Woodson 2013-01-18, 02:19
+
Connor Woodson 2013-01-18, 02:20
+
Connor Woodson 2013-01-18, 02:23
+
Juhani Connolly 2013-01-18, 02:46
+
Connor Woodson 2013-01-18, 03:24
+
Juhani Connolly 2013-01-18, 03:39
Copy link to this message
-
Re: hdfs.idleTimeout ,what's it used for ?
Alright, that makes sense. The takeaway from this conversation for everyone
else:

If you use idleTimeout, be sure to set the rollInterval to 0. And if you
don't use idleTimeout, be sure to lower maxOpenFiles to a number relative
to your expected throughput. To use the least memory, you will want to use
idleTimeout; but the result will be that more files created in hdfs.

- Connor
On Thu, Jan 17, 2013 at 7:39 PM, Juhani Connolly <
[EMAIL PROTECTED]> wrote:

>  That breaks the use case idleTimeout was originally made for: making
> sure the file is closed promptly after data stops arriving. We use this to
> make sure the files ready for our batches which run quite soon after. The
> time that rollInterval will trigger is unpredictable as it will reset every
> time any other type of roll is triggered(event count or size).
>
> By making rollInterval behave properly all of this is a non-issue. My
> recommendation to users woudl be not to use rollInterval if they're
> bucketing by time(it's redundant behavior).
>
> Documentation could definitely be improved. Once we sort out the approach
> we want to take I can write it up to make the difference and usage clearer.
>
>
> On 01/18/2013 12:24 PM, Connor Woodson wrote:
>
> The way idleTimeout works right now is that it's another rollInterval; it
> will work best when rollInterval is not set and so it seems that it's use
> is best for when you don't want to use a rollInterval and just want to have
> your bucketwriters close when no events are coming through (caused by path
> change or something else; and you can still roll reliably with either count
> or size)
>
>  As such, perhaps it is more clear if idleTimeout is renamed to idleRoll
> or such?
>
>  And then change idleTimeout to only count seconds since it was closed;
> if a bucketwriter is closed for long enough it will automatically remove
> itself. This type of idle will then work well with rollInterval, while the
> other one doesn't (idleRoll + rollInterval creates two time-based rollers.
> There are certainly times for that, but not all of the time).
>
>  - Connor
>
>
> On Thu, Jan 17, 2013 at 6:46 PM, Juhani Connolly <
> [EMAIL PROTECTED]> wrote:
>
>>  It seemed neater at the time. It's only an issue because rollInterval
>> doesn't remove the entry in sfWriters. We could change it so that close
>> doesn't cancel it, and have it check whether or not the writer is already
>> closed, but that'd be kind of ugly.
>>
>> @Mohit:
>>
>> When flume dies unexpectedly the .tmp file remains. When it restarts
>> there is some logic in HDFS sink to recover it(and continue writing from
>> there). I'm not actually sure of the specifics. You may want to try and
>> just kill -9 a running flume process on a test machine and then start it
>> up, look at the logs and see what happens with the output.
>>
>> If flume dies cleanly the file is properly closed.
>>
>>
>> On 01/18/2013 11:23 AM, Connor Woodson wrote:
>>
>> And @ my aside: I hadn't realized that the idleTimeout is canceled by the
>> rollInterval occurring. That's annoying. So setting a lower idleTimeout,
>> and drastically decreasing maxOpenFiles to at most 2 * possible open files,
>> is probably necessary.
>>
>>
>> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[EMAIL PROTECTED]>wrote:
>>
>>> @Mohit:
>>>
>>>  For the HDFS Sink, the tmp files are placed based on the
>>> hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}
>>> To change this you can add -Dhadoop.tmp.dir=<path> to your Flume command
>>> line call, or you can specify the property in the core-site.xml of wherever
>>> your HADOOP_HOME environment variable points to.
>>>
>>>  - Connor
>>>
>>>
>>> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[EMAIL PROTECTED]>wrote:
>>>
>>>>  Whether idleTimeout is lower or higher than rollInterval is a
>>>> preference; set it before, and assume you get one message right on the turn
>>>> of the hour, then you will have some part of that hour without any bucket
+
Mohit Anchlia 2013-01-18, 05:12
+
Juhani Connolly 2013-01-18, 06:37
+
Juhani Connolly 2013-01-18, 02:39
+
Connor Woodson 2013-01-18, 02:42