Flume, mail # user - Configuring flume for better throughput


Re: Configuring flume for better throughput
Pankaj Gupta 2013-07-26, 21:12
Here is the flume config of the collector machine. The File channel is
drained by 4 flume sinks that send messages to a separate hdfs-writer
machine.
agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = /flume1/checkpoint
agent1.channels.ch1.dataDirs = /flume1/data
agent1.channels.ch1.maxFileSize = 375809638400
agent1.channels.ch1.capacity = 75000000
agent1.channels.ch1.transactionCapacity = 4000

agent1.sources.avroSource1.channels = ch1
agent1.sources.avroSource1.type = avro
agent1.sources.avroSource1.bind = 0.0.0.0
agent1.sources.avroSource1.port = 4545
agent1.sources.avroSource1.threads = 16

agent1.sinks.avroSink1-1.type = avro
agent1.sinks.avroSink1-1.channel = ch1
agent1.sinks.avroSink1-1.hostname = hdfs-writer-machine-a.mydomain.com
agent1.sinks.avroSink1-1.port = 4545
agent1.sinks.avroSink1-1.connect-timeout = 300000
agent1.sinks.avroSink1-1.batch-size = 4000

agent1.sinks.avroSink1-2.type = avro
agent1.sinks.avroSink1-2.channel = ch1
agent1.sinks.avroSink1-2.hostname = hdfs-writer-machine-b.mydomain.com
agent1.sinks.avroSink1-2.port = 4545
agent1.sinks.avroSink1-2.connect-timeout = 300000
agent1.sinks.avroSink1-2.batch-size = 4000

agent1.sinks.avroSink1-3.type = avro
agent1.sinks.avroSink1-3.channel = ch1
agent1.sinks.avroSink1-3.hostname = hdfs-writer-machine-c.mydomain.com
agent1.sinks.avroSink1-3.port = 4545
agent1.sinks.avroSink1-3.connect-timeout = 300000
agent1.sinks.avroSink1-3.batch-size = 4000

agent1.sinks.avroSink1-4.type = avro
agent1.sinks.avroSink1-4.channel = ch1
agent1.sinks.avroSink1-4.hostname = hdfs-writer-machine-d.mydomain.com
agent1.sinks.avroSink1-4.port = 4545
agent1.sinks.avroSink1-4.connect-timeout = 300000
agent1.sinks.avroSink1-4.batch-size = 4000
# Add the sink groups; load-balance between each group of sinks, which
# round-robin between different hops
agent1.sinkgroups = group1
agent1.sinkgroups.group1.sinks = avroSink1-1 avroSink1-2 avroSink1-3 avroSink1-4
agent1.sinkgroups.group1.processor.type = load_balance
agent1.sinkgroups.group1.processor.selector = ROUND_ROBIN
agent1.sinkgroups.group1.processor.backoff = true
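
For a rough sense of why draining the channel with several sinks can help, here is a back-of-envelope sketch in Python. It assumes each sink sends one batch and waits for the ack before fetching the next, as described in the reply quoted below. The function name `max_events_per_second` and the 0.25 s round-trip time are hypothetical, chosen only for illustration; only the batch size of 4000 and the count of 4 sinks come from the config above. It gives an upper bound, not a measurement, and it does not model whether the sinks in group1 are actually scheduled concurrently.

```python
def max_events_per_second(batch_size, round_trip_seconds, concurrent_sinks=1):
    """Upper bound on throughput when every sink has at most one batch in flight.

    batch_size:         events per batch (batch-size in the sink config)
    round_trip_seconds: assumed time to send one batch and receive its ack
    concurrent_sinks:   number of sinks assumed to be sending at the same time
    """
    if batch_size <= 0 or concurrent_sinks <= 0:
        raise ValueError("batch_size and concurrent_sinks must be positive")
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    return concurrent_sinks * batch_size / round_trip_seconds


# Hypothetical round trip of 0.25 s per batch; replace with a measured value.
single_sink = max_events_per_second(4000, 0.25)
four_sinks = max_events_per_second(4000, 0.25, concurrent_sinks=4)
print(f"1 sink:  at most {single_sink:.0f} events/s")
print(f"4 sinks: at most {four_sinks:.0f} events/s")
```

If the measured rate stays near the single-sink bound after adding sinks, the limit is probably somewhere other than the number of outstanding batches, for example the channel's disk or the downstream machine.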

On Fri, Jul 26, 2013 at 1:38 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:

> Hi Roshan,
>
> Thanks for the reply. Sorry I worded the first question wrong and confused
> sources with sinks. What I meant to ask was:
> 1. Are the batches from the Flume Avro Sink sent to the Avro Source on the
> next machine in a pipelined fashion, or is the next batch only sent once an
> ack for the previous batch is received?
>
> Overall it sounds like adding more sinks would provide more concurrency.
> I'm going to try that.
>
> About the large batch size, in our use case it won't be a big issue as
> long as we can set a timeout after which whatever events are accumulated
> are sent without requiring the batch to be full. Does such a setting exist?
>
> Thanks,
> Pankaj
>
>
>
>
> On Fri, Jul 26, 2013 at 10:59 AM, Roshan Naik <[EMAIL PROTECTED]> wrote:
>
>> Could you provide a sample of the config you are using?
>>
>>
>>    1. Are the batches from flume source sent to the sink in a pipelined
>>    fashion or is the next batch only sent once an ack for previous batch is
>>    received?
>>
>> The source does not send to the sink directly. The source puts a batch of
>> events into the channel, and the sink picks them up from the channel in
>> batches and writes them to the destination. The sink fetches a batch from
>> the channel, writes it to the destination, and then fetches the next batch
>> from the channel, and the cycle continues.
>>
>>
>>    2. If the batch send is not pipelined then would increasing the
>>    number of sinks draining from the channel help?
>>    The idea behind this is to basically achieve pipelining by having
>>    multiple outstanding requests and thus use network better.
>>
>> Increasing the number of sinks will increase concurrency.
>>
>>
>>    3. If batch size is very large, e.g. 1 million, would the batch only
>>    be sent once that many events have accumulated or is there a time limit
