Flume >> mail # user >> What does HDFSSink batch size actually effect?

Re: What does HDFSSink batch size actually effect?
The HDFS sink's batch size (hdfs.batchSize) determines the number of events
to take from the channel and send in one go.

These will be split up into multiple files if bucketed, which is worth
considering (how many events will get written to each file? If it's
only a handful, a higher batch size or fewer files may be desirable).

The size reported by hadoop fs -ls will display as 0 while the file is
still open, but if you actually download the file it should contain
everything. Each batch invokes a sync() operation on every BucketWriter.
I'm not entirely sure how not having append enabled might affect this.
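
To make the batching and bucketing point concrete, here is a rough toy model in Python. It is not Flume's code: drain_batches, bucket_of, and the host names are hypothetical, and it assumes that only the writers touched by a batch get synced at the end of that batch.

```python
from collections import defaultdict


def drain_batches(events, batch_size, bucket_of):
    """Toy model of a bucketing sink (an assumption, not Flume's implementation).

    Takes up to batch_size events at a time, routes each event to the
    writer for its bucket, then syncs every writer touched by that batch.
    """
    files = defaultdict(list)  # bucket path -> events written so far
    syncs = defaultdict(int)   # bucket path -> number of sync calls
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        touched = set()
        for event in batch:
            path = bucket_of(event)
            files[path].append(event)
            touched.add(path)
        for path in touched:
            syncs[path] += 1   # one sync per touched writer per batch
    return dict(files), dict(syncs)


# 20 events alternating between two hypothetical hosts, batch size 10.
events = [{"host": "web%d" % (i % 2), "body": i} for i in range(20)]
files, syncs = drain_batches(events, 10, lambda e: "logger=" + e["host"])
for path in sorted(files):
    print(path, len(files[path]), "events,", syncs[path], "syncs")
```

In this model each batch of 10 puts only 5 events into each file before a sync, so the more buckets a batch is spread over, the fewer events each sync covers per file.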

On 05/15/2013 03:26 AM, Gary Malouf wrote:
> I've previously posted something similar to this on StackOverflow:
> http://stackoverflow.com/questions/16548358/how-come-flume-ng-hdfs-sink-does-not-write-to-file-when-the-number-of-events-equ
> My understanding of batch size from looking at the code in flume-ng
> 1.3.x is that batch size determines at what point data is written to
> hdfs.  With my configuration below, I am not seeing any data written
> to file until the rollInterval has passed.
> imp-agent.channels.imp-ch1.type = memory
> imp-agent.channels.imp-ch1.capacity = 40000
> imp-agent.channels.imp-ch1.transactionCapacity = 1000
> imp-agent.sources.avro-imp-source1.channels = imp-ch1
> imp-agent.sources.avro-imp-source1.type = avro
> imp-agent.sources.avro-imp-source1.bind =
> imp-agent.sources.avro-imp-source1.port = 41414
> imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
> imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
> imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
> imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp
> imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
> imp-agent.sinks.hdfs-imp-sink1.type = hdfs
> imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
> imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
> imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
> imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576
> imp-agent.channels = imp-ch1
> imp-agent.sources = avro-imp-source1
> imp-agent.sinks = hdfs-imp-sink1
> I bring this up because I want to know that once 'batchSize' messages
> have been sent to Flume, they have been put into HDFS rather than
> waiting for the log roll time to do all of the writing.  My strong
> preference, if possible, is to make sure that data is being written to
> the '.tmp' file throughout the hour and then rolled after the
> 'rollInterval' amount of time has passed.