Flume user mailing list: Configuring flume for better throughput


Pankaj Gupta 2013-07-26, 01:31
Derek Chan 2013-07-26, 07:34
Roshan Naik 2013-07-26, 17:59
Pankaj Gupta 2013-07-26, 20:38

Re: Configuring flume for better throughput
Here is the Flume config of the collector machine. The file channel is
drained by four Avro sinks that send events on to separate hdfs-writer
machines.
agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = /flume1/checkpoint
agent1.channels.ch1.dataDirs = /flume1/data
agent1.channels.ch1.maxFileSize = 375809638400
agent1.channels.ch1.capacity = 75000000
agent1.channels.ch1.transactionCapacity = 4000

agent1.sources.avroSource1.channels = ch1
agent1.sources.avroSource1.type = avro
agent1.sources.avroSource1.bind = 0.0.0.0
agent1.sources.avroSource1.port = 4545
agent1.sources.avroSource1.threads = 16

agent1.sinks.avroSink1-1.type = avro
agent1.sinks.avroSink1-1.channel = ch1
agent1.sinks.avroSink1-1.hostname = hdfs-writer-machine-a.mydomain.com
agent1.sinks.avroSink1-1.port = 4545
agent1.sinks.avroSink1-1.connect-timeout = 300000
agent1.sinks.avroSink1-1.batch-size = 4000

agent1.sinks.avroSink1-2.type = avro
agent1.sinks.avroSink1-2.channel = ch1
agent1.sinks.avroSink1-2.hostname = hdfs-writer-machine-b.mydomain.com
agent1.sinks.avroSink1-2.port = 4545
agent1.sinks.avroSink1-2.connect-timeout = 300000
agent1.sinks.avroSink1-2.batch-size = 4000

agent1.sinks.avroSink1-3.type = avro
agent1.sinks.avroSink1-3.channel = ch1
agent1.sinks.avroSink1-3.hostname = hdfs-writer-machine-c.mydomain.com
agent1.sinks.avroSink1-3.port = 4545
agent1.sinks.avroSink1-3.connect-timeout = 300000
agent1.sinks.avroSink1-3.batch-size = 4000

agent1.sinks.avroSink1-4.type = avro
agent1.sinks.avroSink1-4.channel = ch1
agent1.sinks.avroSink1-4.hostname = hdfs-writer-machine-d.mydomain.com
agent1.sinks.avroSink1-4.port = 4545
agent1.sinks.avroSink1-4.connect-timeout = 300000
agent1.sinks.avroSink1-4.batch-size = 4000
# Add the sink groups; load-balance across the group of sinks, which round-robin between the different hops
agent1.sinkgroups = group1
agent1.sinkgroups.group1.sinks = avroSink1-1 avroSink1-2 avroSink1-3 avroSink1-4
agent1.sinkgroups.group1.processor.type = load_balance
agent1.sinkgroups.group1.processor.selector = ROUND_ROBIN
agent1.sinkgroups.group1.processor.backoff = true
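
The hdfs-writer agents on the receiving side aren't shown in this thread; below is a minimal sketch of what one might look like, assuming an Avro source on port 4545 feeding an HDFS sink (the channel settings, HDFS path and roll settings are illustrative, not the actual config):

writer1.sources = avroIn
writer1.channels = ch1
writer1.sinks = hdfsOut

# Avro source listening on the port the collector's Avro sinks point at
writer1.sources.avroIn.type = avro
writer1.sources.avroIn.channels = ch1
writer1.sources.avroIn.bind = 0.0.0.0
writer1.sources.avroIn.port = 4545
writer1.sources.avroIn.threads = 16

# channel settings are placeholders
writer1.channels.ch1.type = FILE
writer1.channels.ch1.checkpointDir = /flume2/checkpoint
writer1.channels.ch1.dataDirs = /flume2/data

# HDFS sink; path, file type and roll interval are assumptions
writer1.sinks.hdfsOut.type = hdfs
writer1.sinks.hdfsOut.channel = ch1
writer1.sinks.hdfsOut.hdfs.path = hdfs://namenode/flume/events
writer1.sinks.hdfsOut.hdfs.fileType = DataStream
writer1.sinks.hdfsOut.hdfs.batchSize = 4000
writer1.sinks.hdfsOut.hdfs.rollInterval = 300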

On Fri, Jul 26, 2013 at 1:38 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:

> Hi Roshan,
>
> Thanks for the reply. Sorry I worded the first question wrong and confused
> sources with sinks. What I meant to ask was:
> 1. Are the batches from the flume Avro Sink sent to the Avro Source on the
> next machine in a pipelined fashion, or is the next batch only sent once an
> ack for the previous batch is received?
>
> Overall it sounds like adding more sinks would provide more concurrency.
> I'm going to try that.
>
> About the large batch size, in our use case it won't be a big issue as
> long as we can set a timeout after which whatever events are accumulated
> are sent without requiring the batch to be full. Does such a setting exist?
>
> Thanks,
> Pankaj
>
>
>
>
> On Fri, Jul 26, 2013 at 10:59 AM, Roshan Naik <[EMAIL PROTECTED]> wrote:
>
>> Could you provide a sample of the config you are using?
>>
>>
>>    1. Are the batches from the flume source sent to the sink in a pipelined
>>    fashion, or is the next batch only sent once an ack for the previous batch
>>    is received?
>>
>> The source does not send to the sink directly. The source dumps a batch of events
>> into the channel, and the sink picks them up from the channel in batches and
>> writes them to the destination. The sink fetches a batch from the channel, writes it
>> to the destination, then fetches the next batch... and the cycle continues.
>>
>>
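
In config terms, the decoupling described above means the source and the sink each reference only the channel, never each other; a minimal illustration with made-up component names:

agent.sources.src1.channels = ch1            # a source can write to one or more channels
agent.sinks.snk1.channel = ch1               # each sink drains exactly one channel
agent.sinks.snk2.channel = ch1               # several sinks can drain the same channel
# there is no property tying src1 to a sink; the channel is the only hand-off point
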
>>    1. If the batch send is not pipelined, would increasing the
>>    number of sinks draining from the channel help?
>>    The idea behind this is basically to achieve pipelining by having
>>    multiple outstanding requests and thus use the network better.
>>
>> Increasing the number of sinks will increase concurrency.
>>
>>
>>    1. If batch size is very large, e.g. 1 million, would the batch only
>>    be sent once that many events have accumulated or is there a time limit
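
Following Roshan's suggestion above about adding sinks for more concurrency, one more sink draining the same channel is just another sink definition plus an entry in the sink group. A minimal sketch follows; the fifth writer hostname is hypothetical, and the new sink could equally point at one of the existing hosts:

agent1.sinks.avroSink1-5.type = avro
agent1.sinks.avroSink1-5.channel = ch1
agent1.sinks.avroSink1-5.hostname = hdfs-writer-machine-e.mydomain.com
agent1.sinks.avroSink1-5.port = 4545
agent1.sinks.avroSink1-5.connect-timeout = 300000
agent1.sinks.avroSink1-5.batch-size = 4000

# include the new sink in the existing load-balancing group
agent1.sinkgroups.group1.sinks = avroSink1-1 avroSink1-2 avroSink1-3 avroSink1-4 avroSink1-5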

P | (415) 677-9222 ext. 205   F | (415) 677-0895 | [EMAIL PROTECTED]

Pankaj Gupta | Software Engineer

BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com
United States | Canada | United Kingdom | Germany
We're hiring! <http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
Pankaj Gupta 2013-08-01, 02:22
Pankaj Gupta 2013-08-01, 02:24
Hari Shreedharan 2013-08-01, 03:27