

Re: Problem setting the rollInterval for HDFS sink
I get that sometimes, but usually when the HDFS sink header interpreters
make a new directory or the file rolls. I have some .tmp files that are
from two weeks ago - the agent that wrote them is never going to point to
that filepath again, but they are still there. I usually don't sweat them
in my pseudo-cluster testbed, but in our development/quasi-production
cluster I applied the hdfs.idleTimeout parameter. Our test data grows
slowly - lots of small events arriving frequently, but only about 20 MB of
data/day, because they are log entries from a single server. I have the
sink set to create a new directory for each day based on the timestamp
applied by the syslogTCP source, so the first event to hit the source after
midnight creates a new directory and a new file, but it does not close the
previous file. I'm not sure why - I think that is "just how it works" - so
I have a 30-minute idleTimeout in place. This morning, the roll created a
new FlumeData.$TIMESTAMP.avro.tmp, left the previous day's
FlumeData.$TIMESTAMP.avro.tmp open, and then the idleTimeout swooped in at
the 30-minute mark and closed the previous file for me. One caveat: setting
the idleTimeout too short will cause problems if it is shorter than the
average interval between events. It seems like the idleTimeout tells the
HDFS BucketWriter to close the file but does not tell the AvroSink to write
to a new file, so the sink processor heap fills and crashes with an OOME.
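
Roughly, the relevant part of a config like that might look something like
this - the agent and component names, the port, and the HDFS path are just
placeholder assumptions, not from an actual file:

  # Syslog TCP source (names a1/r1/c1/k1 are made up for illustration)
  a1.sources.r1.type = syslogtcp
  a1.sources.r1.port = 5140
  a1.sources.r1.channels = c1

  # HDFS sink bucketed into one directory per day via the event timestamp
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.channel = c1
  a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
  a1.sinks.k1.hdfs.fileSuffix = .avro
  # Close an idle bucket's .tmp file after 30 minutes with no new events
  a1.sinks.k1.hdfs.idleTimeout = 1800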

Hope that helps.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 24, 2013 at 12:46 PM, Christopher Surage <[EMAIL PROTECTED]> wrote:

> Devin,
>
> Did you ever have a problem with the HDFS sink getting stuck on a write? I
> am noticing that it just stops writing files after a certain amount of
> time, but it doesn't seem to be finished - it just stops at a certain .tmp
> file.
>
> regards,
>
> Chris
>
>
> On Thu, Oct 24, 2013 at 11:09 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>
>> No problem! Glad I was able to help!
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[EMAIL PROTECTED]> wrote:
>>
>>> Devin,
>>>
>>> First of all, thank you for your help - the typo was the problem. Second,
>>> the reason I was using DataStream as the file type for my HDFS sink is
>>> that when I had it as a SequenceFile, the sink was adding a lot of garbage
>>> data to the file when it copied to HDFS, which was causing undesired
>>> behavior with the Hive table I created. When I changed to DataStream, it
>>> just put the plain text in the file. With regard to the channels, that is
>>> something I will definitely look at in order to fine-tune performance now
>>> that I have solved this problem. I have fumbled around with the memory
>>> channel, playing with the capacity and transactionCapacity attributes, and
>>> I have run into the channel choking - I just have to read more about it. I
>>> don't know if you have seen this before, but I've been looking at
>>> https://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>>
>>> Thanks for your help,
>>>
>>> Chris
>>>
>>>
>>> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>>>
>>>> Christopher,
>>>>
>>>> I use a very similar setup and had a similar problem for a while. The
>>>> HDFS sink roll defaults are the tricky part - they are all pretty small,
>>>> since they assume a high data velocity, and unless each one is explicitly
>>>> set to 0 (off), it stays on.
>>>>
>>>> So, your HDFS batch size parameter might be the problem. Also, I notice
>>>> you need to capitalize the "S" in the hdfs.roll*S*ize parameter -
>>>> camelCase got me on transactionCapacity once :-) Not sure if this is
>>>> copypasta from your config, but that will cause an issue with the param.
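
To make the quoted advice concrete, here is a minimal sketch of the settings
being discussed - explicit roll parameters, DataStream output, batch size,
and memory channel capacity/transactionCapacity. The agent and component
names and all of the values are illustrative assumptions, not taken from
either poster's actual configuration:

  # Memory channel sizing (capacity / transactionCapacity tuning)
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 10000
  a1.channels.c1.transactionCapacity = 1000

  # HDFS sink: plain-text output and explicit roll settings
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.channel = c1
  a1.sinks.k1.hdfs.fileType = DataStream
  a1.sinks.k1.hdfs.batchSize = 1000
  # Roll on time only; setting rollSize/rollCount to 0 turns them off
  a1.sinks.k1.hdfs.rollInterval = 300
  a1.sinks.k1.hdfs.rollSize = 0
  a1.sinks.k1.hdfs.rollCount = 0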