Jagadish Bihani 2012-10-22, 11:48
Denny Ye 2012-10-22, 13:38
Jagadish Bihani 2012-10-23, 06:31
Juhani Connolly 2012-10-23, 07:08
Jagadish Bihani 2012-10-22, 13:18
Brock Noland 2012-10-22, 13:59
Brock Noland 2012-10-22, 14:29
I am using Flume 1.2.0.
About the batching: as per the user guide, "exec source" does have a batch
option in 1.2.0 (param name: batchSize, default value: 20), and I
have tried it. Apparently it works fine. The file channel also has a parameter
that is set to 1000 by default. Is that the batch size of the file channel?
Anyway, even with increased batching I couldn't cross 110-150 KB/sec with
Could you please help me understand the questions I asked in the original
mail of this thread about fsync lies? With the disk which "apparently
does fsync lie" I get 3 MB/sec in 1 flow.
I don't know whether it actually does "fsync lie", but there is a
remarkable difference in fsync performance between the 2 machines, which
have almost similar hardware.
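One rough way to compare fsync cost on the two machines is a small script like the sketch below. It is not Flume code and does not reproduce what FileChannel does internally; it only appends fixed-size records to a file and calls fsync once per batch, which is the pattern described in the quoted reply. The function names (`write_events`, `compare`), the record size, and the batch sizes are all assumptions chosen for illustration.

```python
import os
import tempfile
import time


def write_events(path, events, batch_size):
    """Append events to path, calling fsync once per batch_size events.

    Returns (elapsed_seconds, fsync_count).
    """
    fsyncs = 0
    pending = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        for event in events:
            f.write(event)
            pending += 1
            if pending == batch_size:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to push it to the disk
                fsyncs += 1
                pending = 0
        if pending:                    # commit the final partial batch
            f.flush()
            os.fsync(f.fileno())
            fsyncs += 1
    return time.perf_counter() - start, fsyncs


def compare(num_events=2000, event_size=200, batch_sizes=(1, 20, 100, 1000),
            directory=None):
    """Time the same workload at several batch sizes.

    directory: where to create the scratch files (None = system temp dir).
    Returns {batch_size: (elapsed_seconds, fsync_count, kb_per_sec)}.
    """
    events = [b"x" * (event_size - 1) + b"\n"] * num_events
    total_kb = num_events * event_size / 1024
    results = {}
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        for bs in batch_sizes:
            path = os.path.join(scratch, "batch_%d.log" % bs)
            elapsed, fsyncs = write_events(path, events, bs)
            results[bs] = (elapsed, fsyncs, total_kb / max(elapsed, 1e-9))
    return results


if __name__ == "__main__":
    for bs, (elapsed, fsyncs, kbps) in compare().items():
        print("batch=%5d fsyncs=%5d time=%.3fs rate=%.0f KB/sec"
              % (bs, fsyncs, elapsed, kbps))
```

Passing `directory=` a path on the same disk that holds the file channel's data should make the numbers more relevant. If the batch-size-1 run finishes much faster on one machine than on the other, that would be consistent with (though not proof of) a disk or controller acknowledging fsync from a write cache.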
On 10/22/2012 07:59 PM, Brock Noland wrote:
> In this case, it's best to think about FileChannel as if it were a
> database. Let's pretend we are going to insert 1 million rows. If we
> committed on each row, would performance be "good"? No, everyone
> knows that when you are inserting rows in databases, you want to batch
> 100-1000 rows into a single commit, if you want "good" performance.
> (Quoting good because it's subjective based on the scenario, but in
> this case we mean lots of MB/second).
> Part of the reason behind this logic is that when a database does a
> commit, it does an fsync operation to ensure that all data is written
> to disk and that you will not lose data due to a subsequent power loss.
> FileChannel behaves *exactly* the same. If your "batch" is only a
> single event, file channel will:
> write single event
> fsync
> write single event
> fsync
> As such, if you want "good" performance with FileChannel, you must
> increase your batch size, just like a database. If you have a
> batchSize of say 100, then FileChannel will:
> write single event 0
> write single event 1
> ...
> write single event 99
> fsync
> Which will result in much "better" performance. It's worth noting that
> ExecSource in Flume 1.2 does not have a batchSize and as such each
> event is written and then committed. ExecSource in Flume 1.3, which we
> will release soon, does have a configurable batchSize. If you want to
> try that out you can build it from the flume-1.3.0 branch.
> On Mon, Oct 22, 2012 at 8:59 AM, Brock Noland <[EMAIL PROTECTED]> wrote:
> Which version? 1.2 or trunk?
> On Monday, October 22, 2012 at 8:18 AM, Jagadish Bihani wrote:
>> This is the simplistic configuration with which I am getting
>> lower performance.
>> Even with 2-tier architecture (cat source - avro sinks - avro
>> source- HDFS sink)
>> I get the similar performance with file channel.
>> ========
>> adServerAgent.sources = avro-collection-source
>> adServerAgent.channels = fileChannel
>> adServerAgent.sinks = hdfsSink fileSink
>> # For each one of the sources, the type is defined
>> adServerAgent.sources.avro-collection-source.command = cat
>> # The channel can be defined as follows.
>> adServerAgent.sources.avro-collection-source.channels = fileChannel
>> #Define file sink
>> adServerAgent.sinks.fileSink.type = file_roll
>> adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
>> adServerAgent.sinks.fileSink.channel = fileChannel
>> And it is run with :
>> JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote
Juhani Connolly 2012-10-23, 07:26