Flume >> mail # user >> What does HDFSSink batch size actually effect?


Re: What does HDFSSink batch size actually effect?
To add to my previous post: I have also looked into enabling the HDFS
append setting, but its documentation is limited and it is hard to tell
what effect it would have on my logging.
On Tue, May 14, 2013 at 2:26 PM, Gary Malouf <[EMAIL PROTECTED]> wrote:

> I've previously posted something similar to this on StackOverflow:
> http://stackoverflow.com/questions/16548358/how-come-flume-ng-hdfs-sink-does-not-write-to-file-when-the-number-of-events-equ
>
> From reading the flume-ng 1.3.x code, my understanding is that batchSize
> determines the point at which data is written to HDFS.  With my
> configuration below, however, I do not see any data written to the file
> until the rollInterval has passed.
>
> imp-agent.channels.imp-ch1.type = memory
> imp-agent.channels.imp-ch1.capacity = 40000
> imp-agent.channels.imp-ch1.transactionCapacity = 1000
>
> imp-agent.sources.avro-imp-source1.channels = imp-ch1
> imp-agent.sources.avro-imp-source1.type = avro
> imp-agent.sources.avro-imp-source1.bind = 0.0.0.0
> imp-agent.sources.avro-imp-source1.port = 41414
>
> imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
> imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
> imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
> imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp
>
> imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
> imp-agent.sinks.hdfs-imp-sink1.type = hdfs
> imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
> imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
> imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576
>
> imp-agent.channels = imp-ch1
> imp-agent.sources = avro-imp-source1
> imp-agent.sinks = hdfs-imp-sink1
>
> I bring this up because I want to be sure that once 'batchSize' messages
> have been sent to Flume, they have actually been written to HDFS rather
> than held until the roll time.  My strong preference, if possible, is for
> data to be written to the '.tmp' file throughout the hour and for the file
> to be rolled only after 'rollInterval' has passed.
>
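
The behaviour the quoted message hopes for (flush to the open '.tmp' file every 'batchSize' events, roll only when 'rollInterval' has elapsed) can be sketched as a toy model. This only illustrates that expectation. It is not Flume's HDFS sink code and does not confirm how Flume actually behaves. ToySink and all of its fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySink:
    """Toy model of the expected behaviour; NOT Flume's HDFS sink."""
    batch_size: int        # plays the role of hdfs.batchSize
    roll_interval: float   # plays the role of hdfs.rollInterval, in seconds
    pending: List[str] = field(default_factory=list)    # received, not yet flushed
    tmp_file: List[str] = field(default_factory=list)   # flushed to the open '.tmp' file
    rolled_files: List[List[str]] = field(default_factory=list)  # closed files
    opened_at: float = 0.0  # time the current '.tmp' file was opened

    def append(self, event: str, now: float) -> None:
        """Accept one event at time `now` (seconds)."""
        self._maybe_roll(now)
        self.pending.append(event)
        # Flush as soon as a full batch has accumulated.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Move all pending events into the open '.tmp' file."""
        self.tmp_file.extend(self.pending)
        self.pending.clear()

    def _maybe_roll(self, now: float) -> None:
        """Close the '.tmp' file once roll_interval has elapsed."""
        if now - self.opened_at >= self.roll_interval:
            self.flush()
            if self.tmp_file:
                self.rolled_files.append(self.tmp_file)
            self.tmp_file = []
            self.opened_at = now


if __name__ == "__main__":
    # Values mirror batchSize = 10 and rollInterval = 3600 from the config above.
    sink = ToySink(batch_size=10, roll_interval=3600)
    for i in range(25):
        sink.append(f"event-{i}", now=float(i))
    print(len(sink.tmp_file), len(sink.pending), len(sink.rolled_files))
```

In this model, events become visible in the '.tmp' file in steps of 'batchSize' well before the roll happens, which is the outcome the message asks for. Whether a real agent behaves this way depends on the sink implementation and on HDFS flush semantics, neither of which the sketch tries to reproduce.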