Flume, mail # user - Best way to increase throughput of Exec->Memory->Avro agent.


Re: Best way to increase throughput of Exec->Memory->Avro agent.
Roshan Naik 2013-03-12, 22:37
There would be less contention if you could reduce the sharing, so
maybe divide them into 31 per channel. 31 still looks like a
huge number. It would be best if you could consolidate the 31 down to just 1 or 2.
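
To make the split concrete, here is a minimal Python sketch that assigns
each source to exactly one channel and renders the matching Flume binding
lines. The agent name "agent" and the src1..src124 / ch1..ch4 names are
hypothetical placeholders, not taken from your config; the only Flume
detail relied on is the <agent>.sources.<name>.channels property key.

```python
from collections import Counter


def assign_sources(num_sources, num_channels):
    """Map each (hypothetical) source name to exactly one channel name.

    Sources are placed in contiguous groups so that no source is bound
    to more than one channel.
    """
    per_channel = -(-num_sources // num_channels)  # ceiling division
    return {
        f"src{i + 1}": f"ch{i // per_channel + 1}"
        for i in range(num_sources)
    }


def render_bindings(agent, assignment):
    """Render one Flume property line per source-to-channel binding."""
    return [
        f"{agent}.sources.{src}.channels = {ch}"
        for src, ch in assignment.items()
    ]


# 124 sources over 4 channels, as discussed in this thread.
assignment = assign_sources(124, 4)
lines = render_bindings("agent", assignment)
sources_per_channel = Counter(assignment.values())
```

The generated lines can be pasted into the agent's properties file in
place of a binding that lists all four channels on every source.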

Keep in mind there is one thread per sink and one per source (unless
you are spawning more inside your source / sink). A rule of thumb
(actually more like guidance) is 2 to 4 threads per core. So keep an
eye out so that you don't overload your box with too many threads.
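
As a rough check of that guidance, a small Python sketch can compare the
thread count against the 2 to 4 threads per core range. It assumes
exactly one thread per source and one per sink, as described above, and
the core count of 8 is a hypothetical value, since the thread does not
say how many cores the box has.

```python
def thread_budget(cores, low_per_core=2, high_per_core=4):
    """Return the (low, high) thread range for the 2-to-4-per-core guidance."""
    return cores * low_per_core, cores * high_per_core


def agent_threads(num_sources, num_sinks):
    """One thread per source plus one per sink (no extra threads assumed)."""
    return num_sources + num_sinks


def is_overloaded(num_sources, num_sinks, cores):
    """True if the thread count exceeds the upper end of the guidance."""
    _, high = thread_budget(cores)
    return agent_threads(num_sources, num_sinks) > high


# 124 sources and 16 sinks come from this thread; 8 cores is an assumption.
total = agent_threads(124, 16)
low, high = thread_budget(8)
overloaded = is_overloaded(124, 16, 8)
```

With these inputs the agent runs 140 threads against a guideline range
of 16 to 32, which is why consolidating sources matters more than tuning
batch sizes here.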

On Tue, Mar 12, 2013 at 2:55 PM, Chris Neal <[EMAIL PROTECTED]> wrote:
> So, in a 4 channel setup, would I bind each of the 124 sources to all of the
> 4 channels, or divide them up and put 31 sources on each individual channel?
> :)
>
>
> On Tue, Mar 12, 2013 at 4:40 PM, Chris Neal <[EMAIL PROTECTED]> wrote:
>>
>> Beautiful.  Will try 4 channels in one Agent first.
>> Thanks!
>>
>>
>> On Tue, Mar 12, 2013 at 4:35 PM, Roshan Naik <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Even 16 on a single channel might be on the higher side IMHO.
>>>
>>> Try instead splitting into four channels with 4 sinks each, or even
>>> four agents with one channel and 4 sinks each. It will reduce
>>> contention. Be careful to ensure the capacity of each channel is not
>>> too high, since you now have many channels.
>>> -roshan
>>>
>>> On Tue, Mar 12, 2013 at 2:24 PM, Chris Neal <[EMAIL PROTECTED]> wrote:
>>> > Thanks for the reply.  You're definitely on to something with the
>>> > ever-increasing number of sinks.  :)
>>> >
>>> > I scaled it back to 16 AvroSinks, and used a
>>> > MemoryChannel.transactionCapacity of 1000, and AvroSink.batch-size of
>>> > 1000.
>>> > My ExecSource.batchSize is 100 (I chose this smaller number because
>>> > there are so many of them (124), I didn't want 10s of thousands of
>>> > events getting dropped on the MemoryChannel at once, rather just
>>> > 1000s).  With those settings, things are keeping the MemoryChannel
>>> > drained.  Finally getting somewhere! :)
>>> >
>>> > Much appreciate the prompt response.  If anything else comes to mind,
>>> > please do let me know.
>>> >
>>> > Thanks again.
>>> > Chris
>>> >
>>> >
>>> >
>>> > On Tue, Mar 12, 2013 at 4:12 PM, Roshan Naik <[EMAIL PROTECTED]>
>>> > wrote:
>>> >>
>>> >> I meant 640,000, not 64,000.
>>> >>
>>> >> On Tue, Mar 12, 2013 at 2:10 PM, Roshan Naik <[EMAIL PROTECTED]>
>>> >> wrote:
>>> >> > Beyond a certain number of sinks, adding more won't help. My
>>> >> > suspicion is that you may have gone way overboard.
>>> >> >
>>> >> > If your sink-side batch size is that large and you have 64 sinks in
>>> >> > the round-robin, it will take a lot of events (64,000) to be pumped
>>> >> > in by the source before the first event can start trickling out
>>> >> > of any sink. Also, memory consumption will be quite high: each sink
>>> >> > will open a transaction and hold on to 10000 events. This is the
>>> >> > cause of the memory channel filling up. Until the sink-side
>>> >> > transaction is committed (i.e. 10k events are pulled), the memory
>>> >> > reservation on the channel is not relinquished. So your memory
>>> >> > channel size will have to be really high to support so many sinks,
>>> >> > each with such a big batch size.
>>> >> >
>>> >> > My gut feel is that your source-side batch size is not much of an
>>> >> > issue and can be smaller. Increasing the number of sinks will only
>>> >> > help if the sink is indeed the bott
>>> >> >
>>> >> > On Tue, Mar 12, 2013 at 1:43 PM, Chris Neal <[EMAIL PROTECTED]>
>>> >> > wrote:
>>> >> >> Hi all.
>>> >> >>
>>> >> >> I've been working on this for quite some time, and need some advice
>>> >> >> from the experts.  I have a two tiered Flume architecture:
>>> >> >>
>>> >> >> App Tier (all on one server):
>>> >> >>  124 ExecSources -> MemoryChannel -> AvroSinks
>>> >> >>
>>> >> >> HDFS Tier (on two servers):
>>> >> >>   AvroSource -> FileChannel -> HDFSSinks
>>> >> >>
>>> >> >> When I run the agents, the HDFS tier is keeping up fine with the