Re: What does HDFSSink batch size actually affect?
Okay, so the 0-byte size displayed for the file is in all likelihood
misleading.  That is a bit unfortunate, since for our use case it means we
have to wait an hour to find out how much data is actually in each file.
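
If the hour-long wait for a final file size is the main pain point, one
possible workaround is simply to roll more often.  The fragment below is only
a sketch: it reuses the sink settings from the quoted configuration and
changes rollInterval alone.  The 600-second value is an arbitrary
illustration, and it assumes that ending up with more, smaller files in HDFS
is acceptable for the use case.

```properties
# Hypothetical variation of the sink settings quoted below.
# Only rollInterval differs: roll every 10 minutes instead of every hour,
# so that closed files (with their final sizes) show up sooner.
imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 600
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576
```

This does not change when events are flushed; it only shortens the time
before a file is closed and its size is reported.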

My understanding is that HDFS appends are NOT enabled by default.  I guess
the remaining question is: what, if anything, will turning append on
affect?
On Tue, May 14, 2013 at 2:26 PM, Gary Malouf <[EMAIL PROTECTED]> wrote:

> I've previously posted something similar to this on StackOverflow:
> http://stackoverflow.com/questions/16548358/how-come-flume-ng-hdfs-sink-does-not-write-to-file-when-the-number-of-events-equ
>
> My understanding of batch size from looking at the code in flume-ng 1.3.x
> is that batch size determines at what point data is written to hdfs.  With
> my configuration below, I am not seeing any data written to file until the
> rollInterval has passed.
>
> imp-agent.channels.imp-ch1.type = memory
> imp-agent.channels.imp-ch1.capacity = 40000
> imp-agent.channels.imp-ch1.transactionCapacity = 1000
>
> imp-agent.sources.avro-imp-source1.channels = imp-ch1
> imp-agent.sources.avro-imp-source1.type = avro
> imp-agent.sources.avro-imp-source1.bind = 0.0.0.0
> imp-agent.sources.avro-imp-source1.port = 41414
>
> imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
> imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
> imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
> imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp
>
> imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
> imp-agent.sinks.hdfs-imp-sink1.type = hdfs
> imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
> imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
> imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576
>
> imp-agent.channels = imp-ch1
> imp-agent.sources = avro-imp-source1
> imp-agent.sinks = hdfs-imp-sink1
>
> I bring this up because I want to know that once 'batchSize' messages
> have been sent to Flume, they have been put into HDFS rather than waiting
> for the roll time to do all of the writing.  My strong preference, if
> possible, is to make sure that data is written to the '.tmp' file
> throughout the hour and then rolled after the 'rollInterval' amount of
> time has passed.
>