Flume user mailing list: Problem setting the rollInterval for HDFS sink


Re: Problem setting the rollInterval for HDFS sink
No problem! Glad I was able to help!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[EMAIL PROTECTED]> wrote:

> Devin,
>
> First of all, thank you for your help - the typo was the problem. Second,
> the reason I was using DataStream as the file type for my hdfs sink is
> that when I had it as a SequenceFile, the sink was adding a lot of garbage
> data to the file when it copied to the hdfs, which was causing undesired
> behavior with the hive table I created. When I changed to DataStream, it
> just put the plain text in the file. With regard to the channels, that is
> something I will definitely look at in order to fine-tune the performance
> now that I have solved this problem. I have fumbled around with the memory
> channel, playing with the capacity and transactionCapacity attributes, and
> I have run into choking of the channel, so I just have to read more about
> it. I don't know if you have seen this before, but I've been looking at
> https://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
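>
> For reference, a sketch of those memory-channel knobs (the property names
> are from the Flume docs; the values are illustrative assumptions, not my
> actual settings):
>
> a1.channels.c1.type = memory
> # maximum number of events the channel can hold at once
> a1.channels.c1.capacity = 10000
> # maximum events per put/take transaction; must cover the sink batch size
> a1.channels.c1.transactionCapacity = 1000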
>
> Thanks for your help,
>
> Chris
>
>
> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>
>> Christopher,
>>
>> I use a very similar setup, and I had a similar problem for a while. The
>> HDFS sink roll defaults are the tricky part - they are all pretty small,
>> since they assume a high data velocity - and unless each one is explicitly
>> set to 0 (off), it stays on.
>>
>> So, your HDFS batch size parameter might be the problem. Also, I notice
>> you need to capitalize the "S" in the hdfs.roll*S*ize parameter -
>> camelCase got me on transactionCapacity once :-) I'm not sure if that is
>> a copy-paste artifact or really in your config, but a misspelled parameter
>> is not respected, so the sink would fall back to the 1024-byte default and
>> roll at about 10 lines of text per file.
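>>
>> To illustrate, here is a sketch of the roll settings (the property names
>> are from the Flume HDFS sink documentation; the values are assumptions
>> for illustration, not your config):
>>
>> a1.sinks.k1.type = hdfs
>> # roll a new file every N seconds; 0 turns the time trigger off
>> a1.sinks.k1.hdfs.rollInterval = 300
>> # roll at N bytes - note the capital S; the default is only 1024
>> a1.sinks.k1.hdfs.rollSize = 134217728
>> # roll after N events; the default of 10 is what caps files at ~10 lines
>> a1.sinks.k1.hdfs.rollCount = 0
>> # events flushed to HDFS per batch
>> a1.sinks.k1.hdfs.batchSize = 100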
>>
>> One question about your config, though - I notice you have hdfs.fileType
>> set to DataStream for Avro, but you do not have a serializer of
>> avro_event declared. In what format are your files being put into HDFS -
>> as Avro-contained streams, or as aggregated text bodies with newline
>> delimiters? I ask because this setup has led to us needing to unwrap Avro
>> event files in MapReduce, which is tricky - if you are getting aggregate
>> text, I have some reconfiguring to do.
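>>
>> For comparison, the relevant lines would look something like this (a
>> sketch; the agent and sink names are assumed):
>>
>> a1.sinks.k1.hdfs.fileType = DataStream
>> # with this serializer, events land wrapped in Avro container files;
>> # without it, the default text serializer writes event body + newline
>> a1.sinks.k1.serializer = avro_event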
>>
>> Other things to look out for: make sure the HDFS file being written to
>> doesn't close mid-stream - I have not seen that recover gracefully, and I
>> am getting OutOfMemoryErrors in my testbed right now due to something like
>> that - and make sure the transaction capacity of your channels is high
>> enough through the whole flow; my original setup kept choking on a small
>> transaction capacity between the first channel and the Avro sink.
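>>
>> Concretely, the constraint to check at each hop is something like this (a
>> sketch; both values here are illustrative assumptions):
>>
>> # a channel's transactionCapacity must be >= the batch size of the sink
>> # draining it, or takes from the channel will fail
>> a1.sinks.k1.batch-size = 100
>> a1.channels.c1.transactionCapacity = 100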
>>
>>
>> Good luck!
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[EMAIL PROTECTED]> wrote:
>>
>>> Hello, I am having an issue increasing the size of the files which get
>>> written into my HDFS. I have tried playing with the rollCount attribute of
>>> the hdfs sink, but it seems to cap at 10 lines of text per file, with many
>>> files written to the HDFS directory. One can see why I need to change
>>> this.
>>>
>>> I have 2 boxes running:
>>>
>>> 1) uses a spooldir source to check for new log files copied to a
>>> specific dir. It then sends the events through a memory channel to an
>>> avro sink pointing at the other box, which hosts the HDFS.
>>>
>>> 2) uses an avro source and sends events to the hdfs sink.
>>>
>>>
>>> configurations:
>>>
>>> 1.
>>> # Name the components of the agent
>>> a1.sources = r1
>>> a1.sinks = k1
>>> a1.channels = c1
>>>
>>>
>>> ###############Describe/configure the source#################