Flume >> mail # user >> hdfs.idleTimeout ,what's it used for ?


Bhaskar V. Karambelkar 2013-01-17, 20:07
Connor Woodson 2013-01-17, 20:29
Bhaskar V. Karambelkar 2013-01-17, 21:38
Juhani Connolly 2013-01-18, 02:08
Mohit Anchlia 2013-01-18, 02:17
Connor Woodson 2013-01-18, 02:19
Connor Woodson 2013-01-18, 02:20
Connor Woodson 2013-01-18, 02:23
Juhani Connolly 2013-01-18, 02:46
Connor Woodson 2013-01-18, 03:24
Juhani Connolly 2013-01-18, 03:39

Re: hdfs.idleTimeout ,what's it used for ?
Alright, that makes sense. The takeaway from this conversation for everyone
else:

If you use idleTimeout, be sure to set rollInterval to 0. And if you
don't use idleTimeout, be sure to lower maxOpenFiles to a number relative
to your expected throughput. To use the least memory, you will want to use
idleTimeout; but the result will be that more files are created in HDFS.
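
A minimal sketch of the two setups described above, written as a Flume agent
properties fragment. The agent name "a1", the sink name "k1", the path and all
numeric values are assumptions chosen for illustration; only the hdfs.*
property names come from the HDFS sink itself.

```properties
# Hypothetical agent "a1" with an HDFS sink "k1" (names and values are illustrative).
a1.sinks.k1.type = hdfs
# Time escapes in the path need a timestamp header on each event.
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H

# Option 1: rely on idleTimeout.
# Close a bucket's file after 60 seconds without events ...
a1.sinks.k1.hdfs.idleTimeout = 60
# ... and disable the time-based roll, as recommended above.
a1.sinks.k1.hdfs.rollInterval = 0

# Option 2: no idleTimeout (leave it at 0, i.e. disabled) and instead cap
# the number of open bucket writers relative to expected throughput.
# a1.sinks.k1.hdfs.idleTimeout = 0
# a1.sinks.k1.hdfs.maxOpenFiles = 50
```

Size-based or count-based rolling (hdfs.rollSize, hdfs.rollCount) can still be
configured alongside either option.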

- Connor
On Thu, Jan 17, 2013 at 7:39 PM, Juhani Connolly <
[EMAIL PROTECTED]> wrote:

>  That breaks the use case idleTimeout was originally made for: making
> sure the file is closed promptly after data stops arriving. We use this to
> make sure the files are ready for our batches, which run quite soon after. The
> time at which rollInterval will trigger is unpredictable, as it resets every
> time any other type of roll (event count or size) is triggered.
>
> By making rollInterval behave properly all of this is a non-issue. My
> recommendation to users would be not to use rollInterval if they're
> bucketing by time (it's redundant behavior).
>
> Documentation could definitely be improved. Once we sort out the approach
> we want to take I can write it up to make the difference and usage clearer.
>
>
> On 01/18/2013 12:24 PM, Connor Woodson wrote:
>
> The way idleTimeout works right now is that it's another rollInterval; it
> will work best when rollInterval is not set, so it seems that its use
> is best for when you don't want to use a rollInterval and just want to have
> your bucketwriters close when no events are coming through (caused by a path
> change or something else); you can still roll reliably with either count
> or size.
>
>  As such, perhaps it would be clearer if idleTimeout were renamed to
> idleRoll or such?
>
>  And then change idleTimeout to only count seconds since it was closed;
> if a bucketwriter is closed for long enough it will automatically remove
> itself. This type of idle will then work well with rollInterval, while the
> other one doesn't (idleRoll + rollInterval creates two time-based rollers.
> There are certainly times for that, but not all of the time).
>
>  - Connor
>
>
> On Thu, Jan 17, 2013 at 6:46 PM, Juhani Connolly <
> [EMAIL PROTECTED]> wrote:
>
>>  It seemed neater at the time. It's only an issue because rollInterval
>> doesn't remove the entry in sfWriters. We could change it so that close
>> doesn't cancel it, and have it check whether or not the writer is already
>> closed, but that'd be kind of ugly.
>>
>> @Mohit:
>>
>> When Flume dies unexpectedly the .tmp file remains. When it restarts
>> there is some logic in the HDFS sink to recover it (and continue writing from
>> there). I'm not actually sure of the specifics. You may want to try and
>> just kill -9 a running Flume process on a test machine and then start it
>> up, look at the logs and see what happens with the output.
>>
>> If Flume dies cleanly the file is properly closed.
>>
>>
>> On 01/18/2013 11:23 AM, Connor Woodson wrote:
>>
>> And @ my aside: I hadn't realized that the idleTimeout is canceled by the
>> rollInterval occurring. That's annoying. So setting a lower idleTimeout,
>> and drastically decreasing maxOpenFiles to at most 2 * possible open files,
>> is probably necessary.
>>
>>
>> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[EMAIL PROTECTED]> wrote:
>>
>>> @Mohit:
>>>
>>>  For the HDFS Sink, the tmp files are placed based on the
>>> hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}.
>>> To change this you can add -Dhadoop.tmp.dir=<path> to your Flume command
>>> line call, or you can specify the property in the core-site.xml under the
>>> directory that your HADOOP_HOME environment variable points to.
>>>
>>>  - Connor
>>>
>>>
>>> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[EMAIL PROTECTED]> wrote:
>>>
>>>>  Whether idleTimeout is lower or higher than rollInterval is a
>>>> preference; set it before, and assume you get one message right on the turn
>>>> of the hour, then you will have some part of that hour without any bucket

Mohit Anchlia 2013-01-18, 05:12
Juhani Connolly 2013-01-18, 06:37
Juhani Connolly 2013-01-18, 02:39
Connor Woodson 2013-01-18, 02:42