Flume, mail # user - Best way to increase throughput of Exec->Memory->Avro agent.


Re: Best way to increase throughput of Exec->Memory->Avro agent.
Chris Neal 2013-03-12, 21:24
Thanks for the reply.  You're definitely on to something with the
ever-increasing number of sinks.  :)

I scaled it back to 16 AvroSinks, and used a MemoryChannel.transactionCapacity
of 1000 and an AvroSink.batch-size of 1000.  My ExecSource.batchSize is 100
(I chose this smaller number because there are so many of them (124) and I
didn't want tens of thousands of events landing on the MemoryChannel at once,
just thousands).  With those settings, things are keeping the MemoryChannel
drained.  Finally getting somewhere! :)
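
For reference, a rough sketch of what those revised App Tier settings might
look like as a Flume properties file.  The agent/component names, file path,
hosts, and channel capacity below are made up for illustration; only one of
the 124 exec sources and one of the 16 Avro sinks is shown:

  # Hypothetical agent name "app1"
  app1.sources = src1
  app1.channels = memCh
  app1.sinks = avro1

  # ExecSource: small batches so 124 sources don't flood the channel at once
  app1.sources.src1.type = exec
  app1.sources.src1.command = tail -F /var/log/app/app1.log
  app1.sources.src1.batchSize = 100
  app1.sources.src1.channels = memCh

  # MemoryChannel: transactionCapacity matches the largest batch size in use
  app1.channels.memCh.type = memory
  app1.channels.memCh.capacity = 100000
  app1.channels.memCh.transactionCapacity = 1000

  # AvroSink: batch-size 1000, pointing at one of the HDFS-tier agents
  app1.sinks.avro1.type = avro
  app1.sinks.avro1.hostname = hdfs-agent-1.example.com
  app1.sinks.avro1.port = 4545
  app1.sinks.avro1.batch-size = 1000
  app1.sinks.avro1.channel = memCh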

Much appreciate the prompt response.  If anything else comes to mind,
please do let me know.

Thanks again.
Chris

On Tue, Mar 12, 2013 at 4:12 PM, Roshan Naik <[EMAIL PROTECTED]> wrote:

> i meant 640,000 not 64,000
>
> On Tue, Mar 12, 2013 at 2:10 PM, Roshan Naik <[EMAIL PROTECTED]>
> wrote:
> > Beyond a certain # of sinks it won't help to add more. My suspicion is
> > you may have gone way overboard.
> >
> >  If your sink-side batch size is that large and you have 64 sinks in
> > the round-robin, it will take a lot of events (64,000) to be pumped
> > in by the sources before the first event can start trickling out
> > of any sink.  Also, memory consumption will be quite high: each sink
> > will open a transaction and hold on to 10000 events. This is the cause
> > of the Memory channel filling up. Until the sink-side transaction is
> > committed (i.e. 10k events are pulled), the memory reservation on the
> > channel is not relinquished. So your memory channel size will have to
> > be really high to support so many sinks, each with such a big batch size.
> >
> > My gut feel is that your source-side batch size is not much of an
> > issue and can be smaller. Increasing the number of sinks will only
> > help if the sink is indeed the bottleneck.
> >
> > On Tue, Mar 12, 2013 at 1:43 PM, Chris Neal <[EMAIL PROTECTED]> wrote:
> >> Hi all.
> >>
> >> I've been working on this for quite some time, and need some advice from
> >> the experts.  I have a two-tiered Flume architecture:
> >>
> >> App Tier (all on one server):
> >>   124 ExecSources -> MemoryChannel -> AvroSinks
> >>
> >> HDFS Tier (on two servers):
> >>   AvroSource -> FileChannel -> HDFSSinks
> >>
> >> When I run the agents, the HDFS tier is keeping up fine with the App Tier.
> >> Queue sizes stay between 0 and 10000 (I have a batch size of 10000).  All is
> >> good.
> >>
> >> On the App Tier, when I view the JMX data through jconsole, I watch the
> >> size of the MemoryChannel grow steadily until it reaches the max, then it
> >> starts throwing exceptions about not being able to put the batch on the
> >> channel as expected.
> >>
> >> There seem to be two basic ways to increase the throughput of the App Tier:
> >> 1.  Increase the MemoryChannel's transactionCapacity and the corresponding
> >> AvroSink's batch-size.  Both are set to 10000 for me.
> >> 2.  Increase the number of AvroSinks to drain the MemoryChannel.  I'm up to
> >> 64 sinks now, which round-robin between the two Flume agents on the HDFS
> >> tier.
> >>
> >> Both of those values seem quite high to me (batch size and number of sinks).
> >>
> >> Am I missing something as far as tuning?
> >> Which would allow for a greater increase in throughput: more sinks or a
> >> larger batch size?
> >>
> >> I'm stumped here.  I still think I can get this to work. :)
> >>
> >> Any suggestions are most welcome.
> >> Thanks for your time.
> >> Chris
> >>
>
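
For anyone tuning a similar setup, a minimal sketch of what Roshan's sizing
point implies for the original configuration (64 Avro sinks, each with a
batch-size of 10000).  The agent/component names and exact capacity value are
made up; the point is that every sink can hold an open, uncommitted
transaction of batch-size events, so the channel capacity has to cover at
least sinks x batch-size (64 x 10000 = 640,000) before anything is committed:

  # Hypothetical App Tier agent "app1"
  # 64 sinks x 10000-event open transactions = 640,000 events that the
  # memory channel must be able to hold before any transaction commits
  app1.channels.memCh.type = memory
  app1.channels.memCh.capacity = 700000
  app1.channels.memCh.transactionCapacity = 10000

  # One common way to round-robin between the two HDFS-tier agents is a
  # load-balancing sink group (note: sinks in one group are drained by a
  # single runner thread, so several groups or independent sinks are used
  # when more parallelism is needed)
  app1.sinkgroups = g1
  app1.sinkgroups.g1.sinks = avro1 avro2
  app1.sinkgroups.g1.processor.type = load_balance
  app1.sinkgroups.g1.processor.selector = round_robin
  app1.sinkgroups.g1.processor.backoff = true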