Flume, mail # user - What does HDFSSink batch size actually affect?


Gary Malouf 2013-05-14, 18:26
I've previously posted something similar to this on StackOverflow:
http://stackoverflow.com/questions/16548358/how-come-flume-ng-hdfs-sink-does-not-write-to-file-when-the-number-of-events-equ

From reading the code in flume-ng 1.3.x, my understanding is that the
HDFS sink's batchSize determines the point at which data is written to
HDFS.  With the configuration below, however, I do not see any data
written to the file until the rollInterval has passed.

imp-agent.channels.imp-ch1.type = memory
imp-agent.channels.imp-ch1.capacity = 40000
imp-agent.channels.imp-ch1.transactionCapacity = 1000

imp-agent.sources.avro-imp-source1.channels = imp-ch1
imp-agent.sources.avro-imp-source1.type = avro
imp-agent.sources.avro-imp-source1.bind = 0.0.0.0
imp-agent.sources.avro-imp-source1.port = 41414

imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp

imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
imp-agent.sinks.hdfs-imp-sink1.type = hdfs
imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576

imp-agent.channels = imp-ch1
imp-agent.sources = avro-imp-source1
imp-agent.sinks = hdfs-imp-sink1

I bring this up because I want to be sure that once 'batchSize' messages
have been sent to Flume, they have actually been put into HDFS, rather
than all of the writing being deferred until the log roll time.  My strong
preference, if possible, is for data to be written to the '.tmp' file
throughout the hour and for the file to be rolled only after the
'rollInterval' has passed.
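
To make the behavior I am expecting concrete, here is a toy Python model of
it.  This is only a sketch of my assumption, not Flume's actual code: the
class ToyHdfsSink and its methods are made-up names for illustration, and
roll() simply stands in for the rollInterval timer firing.

```python
class ToyHdfsSink:
    """Toy model of the expected behavior; NOT Flume's implementation."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []       # events received but not yet flushed
        self.tmp_file = []      # events already written to the open '.tmp' file
        self.rolled_files = []  # files that have been closed and renamed

    def append(self, event):
        # Buffer the event; flush once a full batch has accumulated.
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Move everything buffered so far into the open '.tmp' file.
        self.tmp_file.extend(self.pending)
        self.pending.clear()

    def roll(self):
        # Stand-in for rollInterval expiring: flush, then close the file.
        self.flush()
        if self.tmp_file:
            self.rolled_files.append(list(self.tmp_file))
            self.tmp_file.clear()


sink = ToyHdfsSink(batch_size=10)
for i in range(25):
    sink.append(i)

# Expectation: two full batches are in the '.tmp' file before any roll.
print(len(sink.tmp_file), len(sink.pending))

sink.roll()
# Expectation: one closed file holding all 25 events.
print(len(sink.rolled_files), len(sink.rolled_files[0]))
```

In this model, with batchSize = 10, the '.tmp' file grows in steps of 10
events during the hour and the roll only closes it.  What I actually observe
looks as if nothing reaches the file until roll() is called.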