Flume >> mail # user >> Problem setting the rollInterval for HDFS sink


Re: Problem setting the rollInterval for HDFS sink
Christopher,

I use a very similar setup, and I had a similar problem for a while. The HDFS
sink roll defaults are the tricky part: they are all pretty small, since they
assume a high data velocity, and unless each one is explicitly set to 0 (off),
it stays on.

So, your HDFS batch size parameter might be part of the problem. Also, I notice
you need to capitalize the "S" in the hdfs.roll*S*ize parameter - camelCase got
me on transactionCapacity once :-) Not sure if this is copypasta from your
config, but that typo will keep the param from being respected, so in your case
it would roll at the default of 1024 bytes, or about 10 lines of text.
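
For the record, what finally behaved for me was explicitly zeroing every roll
trigger I didn't want - something along these lines (the values are just
illustrative; 128 MB happened to suit our volume):

# roll on size only; 0 explicitly disables the time- and count-based triggers
# note the capital S in rollSize - a lowercase s leaves the 1024-byte default active
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.batchSize = 1000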

One question about your config, though - I notice you have hdfs.fileType set to
DataStream for Avro, but you do not have a serializer of avro_event declared.
In what format are your files landing in HDFS - as Avro container streams, or
as aggregated text bodies with newline delimiters? I ask because this setup has
forced us to unwrap Avro event files in MapReduce, which is tricky - if you are
getting aggregate text, I have some reconfiguring to do.
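
For comparison, if you actually want Avro container files in HDFS, my
understanding is you have to declare the serializer on the sink explicitly,
something like:

a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.serializer = avro_event

Without that, I believe the default text serializer just writes event bodies
with newline delimiters.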

Other things to look out for: make sure the HDFS file being written to doesn't
close mid-stream - I have not seen that recover gracefully, and I am getting an
OOME in my testbed right now due to something like that. Also make sure the
transaction capacity in your channels is high enough through the whole flow; my
original setup kept choking on a small transaction capacity from the first
channel to the Avro sink.
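
For what it's worth, the channel sizing that stopped the choking for us looked
roughly like this on both boxes (numbers are just what fit our volume - keep
transactionCapacity at least as large as the sink's batch size):

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000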
Good luck!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[EMAIL PROTECTED]> wrote:

> Hello, I am having an issue increasing the size of the files which get
> written into my HDFS. I have tried playing with the rollCount attribute for
> an hdfs sink, but it seems to cap at 10 lines of text per file, with many
> files written to the hdfs directory. Now one may see why I need to change
> this.
>
> I have 2 boxes running
> 1) uses a spooldir source to check for new log files copied to a specific
> dir. It then sends the events to an avro sink through a mem channel to the
> other box with the hdfs on it.
>
>
>
>
> 2) uses an avro source and sends events to the hdfs sink.
>
>
> configurations:
>
> 1.
>  # Name the components of the agent
> a1.sources = r1
> a1.sinks = k1
> a1.channels = c1
>
>
> ###############Describe/configure the source#################
> a1.sources.r1.type = spooldir
> a1.sources.r1.spoolDir = /u1/csurage/flume_test
> a1.sources.r1.channels = c1
> #a1.sources.r1.fileHeader = true
>
>
> ##############describe the sink#######################
> # file roll sink
> #a1.sinks.k1.type = file_roll
> #a1.sinks.k1.sink.directory = /u1/csurage/target_flume
>
> # Avro sink
> a1.sinks.k1.type = avro
> a1.sinks.k1.hostname = 45.32.96.136
> a1.sinks.k1.port = 9311
>
>
> # Channel the sink connects to
> a1.sinks.k1.channel = c1
>
> ################describe the channel##################
> # use a channel which buffers events in memory
> a1.channels.c1.type = memory
> a1.channels.c1.byteCapacity = 0
>
>
>
> 2. Note: when I change any of the attributes in bold, the roll count stays
> at 10 lines per file written to the hdfs
>
> # Name the components of the agent
> a1.sources = r1
> a1.sinks = k1
> a1.channels = c1
>
>
> ###############Describe/configure the source#################
> a1.sources.r1.type = avro
> a1.sources.r1.bind = 45.32.96.136
> a1.sources.r1.port = 9311
> a1.sources.r1.channels = c1
> #a1.sources.r1.fileHeader = true
>
>
> ##############describe the sink#######################
> # HDFS sink
> a1.sinks.k1.type = hdfs
> a1.sinks.k1.hdfs.path = /user/csurage/hive
> a1.sinks.k1.hdfs.fileType = DataStream
> *a1.sinks.k1.hdfs.rollsize = 0*
> *a1.sinks.k1.hdfs.rollCount = 20   *
> *a1.sinks.k1.hdfs.rollInterval = 0*
>
>
> # Channel the sink connects to
> a1.sinks.k1.channel = c1