Flume >> mail # user >> Of BatchSize / Channel Capacity / Transaction Capacity


Re: Of BatchSize / Channel Capacity / Transaction Capacity
Published in our wiki:
https://cwiki.apache.org/confluence/display/FLUME/BatchSize,+ChannelCapacity+and+ChannelTransactionCapacity+Properties

- Alex

On Jan 11, 2013, at 6:03 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:

> Bhaskar,
>
> I have created the following jira for this:
> https://issues.apache.org/jira/browse/FLUME-1829
>
> -Jeff
>
>
> On Fri, Jan 11, 2013 at 6:48 AM, Bhaskar V. Karambelkar <[EMAIL PROTECTED]> wrote:
>
>> Thanks Jeff,
>> Clear and detailed explanations. These deserve to be on the wiki, as these
>> parameters have direct implications on the performance of flume nodes.
>>
>> thanks
>> Bhaskar
>>
>>
>> On Tue, Jan 8, 2013 at 9:40 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Bhaskar,
>>>
>>> 1) Batch Size
>>>  1.a) When configured by client code using the flume-core-sdk, to send
>>> events to a Flume Avro source.
>>> The Flume client SDK has an appendBatch method. It takes a list of
>>> events and sends them to the source as a batch. The batch size is the
>>> number of events passed to the source at one time.
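
As a rough illustration of what the client-side batch size means, here is a minimal sketch that splits a list of events into batches. `BatchSplitter` and `split` are hypothetical names, not part of the Flume SDK, and the events are plain strings so the example runs without Flume on the classpath. In a real client, each batch would be handed to the SDK's appendBatch method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Flume): groups events into batches
// of at most batchSize, the unit a client would send in one call.
final class BatchSplitter {
    static <T> List<List<T>> split(List<T> events, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < events.size(); start += batchSize) {
            int end = Math.min(start + batchSize, events.size());
            // Copy the slice so each batch is independent of the input list.
            batches.add(new ArrayList<>(events.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            events.add("event-" + i);
        }
        // With batchSize = 100, the 250 events form batches of 100, 100 and 50.
        for (List<String> batch : split(events, 100)) {
            // A real client would pass this batch to appendBatch here.
            System.out.println("would send a batch of " + batch.size() + " events");
        }
    }
}
```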
>>>
>>>  1.b) When set as a parameter on HDFS sink (or other sinks which support
>>> BatchSize parameter)
>>> This is the number of events written to a file before it is flushed to HDFS.
>>>
>>> 2)
>>>  2.a) Channel Capacity
>>> This is the maximum number of events the channel can hold.
>>>
>>>  2.b) Channel Transaction Capacity.
>>> This is the max number of events stored in the channel per transaction.
>>>
>>> How will setting these parameters to different values affect throughput
>>> and latency in event flow?
>>>
>>> In general you will see better throughput with the memory channel than
>>> with the file channel, at the cost of durability.
>>>
>>> The channel capacity is going to need to be sized such that it is large
>>> enough to hold as many events as will be added to it by upstream agents.
>>> Ideal flow would see the sink draining events from the channel faster than
>>> it is having events added by its source.
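
The sizing argument above can be put into numbers. The sketch below is a back-of-the-envelope estimate, assuming constant source and sink rates (real traffic is burstier); `ChannelFillEstimate` and its parameters are hypothetical names used only for illustration. It shows how long a channel of a given capacity can absorb a source that outpaces the sink.

```java
// Hypothetical estimate (not part of Flume): how long until a channel fills
// when events arrive faster than the sink drains them.
final class ChannelFillEstimate {
    static double secondsUntilFull(long capacity, double eventsInPerSec, double eventsOutPerSec) {
        double netGrowth = eventsInPerSec - eventsOutPerSec;
        if (netGrowth <= 0) {
            // The sink keeps up with (or outpaces) the source: the channel never fills.
            return Double.POSITIVE_INFINITY;
        }
        return capacity / netGrowth;
    }

    public static void main(String[] args) {
        // Capacity 10000, source at 1500 events/sec, sink at 1000 events/sec:
        // the backlog grows by 500 events/sec, so the channel is full after 20 seconds.
        System.out.println(secondsUntilFull(10000, 1500, 1000));
        // Sink faster than source: the channel never fills.
        System.out.println(secondsUntilFull(10000, 800, 1000));
    }
}
```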
>>>
>>> The channel transaction capacity will need to be smaller than the channel
>>> capacity.
>>> e.g. If your channel capacity is set to 10000, then the channel transaction
>>> capacity should be set to something like 100.
>>>
>>> Specifically if we have clients with varying frequency of event
>>> generation, i.e. some clients generating thousands of events/sec, while
>>> others at a much slower rate, what effect will different values of these
>>> params have on these clients ?
>>>
>>> Transaction capacity is what throttles, or limits, how many events the
>>> source can put into the channel in a single transaction. This is going to
>>> vary depending on how many tiers of agents/collectors you have set up.
>>> In general, though, this should probably be equal to whatever you have the
>>> batch size set to in your client.
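
The rules of thumb above can be written as a small sanity check. This is a sketch only: `ChannelSizing` and `check` are hypothetical names, not Flume APIs, and the second rule (a batch should not be larger than the transaction capacity) is an assumption derived from the advice to keep the two equal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sanity check (not part of Flume) for the three sizes discussed above.
final class ChannelSizing {
    static List<String> check(int capacity, int transactionCapacity, int batchSize) {
        List<String> problems = new ArrayList<>();
        if (transactionCapacity > capacity) {
            problems.add("transaction capacity " + transactionCapacity
                    + " exceeds channel capacity " + capacity);
        }
        // Assumption: a batch has to fit into a single channel transaction.
        if (batchSize > transactionCapacity) {
            problems.add("batch size " + batchSize
                    + " exceeds transaction capacity " + transactionCapacity);
        }
        return problems;
    }

    public static void main(String[] args) {
        // The values from the example above: capacity 10000, transaction capacity 100,
        // and a client batch size equal to the transaction capacity.
        System.out.println(check(10000, 100, 100));
        // A batch size larger than the transaction capacity is reported as a problem.
        System.out.println(check(10000, 100, 500));
    }
}
```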
>>>
>>> With regard to the HDFS batch size, the larger your batch size, the
>>> better performance will be. However, keep in mind that if a transaction
>>> fails, the entire transaction will be replayed, which can result in
>>> duplicate events downstream.
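
To make the replay trade-off concrete, here is a toy simulation rather than Flume code: `ReplayDemo`, `deliver` and `failAfter` are hypothetical names, the channel is an in-memory deque, and "downstream" is just a list. It assumes a sink that writes events one by one inside a transaction, fails once partway through, rolls back, and then replays the whole batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;

// Toy model (not Flume code) of a sink transaction that fails once and is replayed.
final class ReplayDemo {
    /** Delivers batchSize events; the first attempt fails after failAfter writes. */
    static List<Integer> deliver(int batchSize, int failAfter) {
        Deque<Integer> channel = new ArrayDeque<>();
        for (int i = 0; i < batchSize; i++) {
            channel.addLast(i);
        }
        List<Integer> downstream = new ArrayList<>();
        boolean failedOnce = false;
        while (!channel.isEmpty()) {
            // Begin transaction: take up to batchSize events from the channel.
            List<Integer> taken = new ArrayList<>();
            while (taken.size() < batchSize && !channel.isEmpty()) {
                taken.add(channel.pollFirst());
            }
            try {
                int written = 0;
                for (Integer event : taken) {
                    if (!failedOnce && written == failAfter) {
                        failedOnce = true;
                        throw new IllegalStateException("simulated write failure");
                    }
                    downstream.add(event); // already visible downstream before commit
                    written++;
                }
                // Commit: the events stay removed from the channel.
            } catch (IllegalStateException e) {
                // Rollback: return every taken event to the channel in original order.
                for (int i = taken.size() - 1; i >= 0; i--) {
                    channel.addFirst(taken.get(i));
                }
            }
        }
        return downstream;
    }

    public static void main(String[] args) {
        List<Integer> out = deliver(100, 60);
        int duplicates = out.size() - new HashSet<>(out).size();
        // 60 events were written before the failure and again on replay.
        System.out.println(out.size() + " events downstream, " + duplicates + " duplicates");
    }
}
```

In this model the number of duplicates is bounded by the batch size, which is the cost side of choosing a larger batch.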
>>>
>>> -Jeff
>>>
>>>
>>>
>>>
>>> On Tue, Jan 8, 2013 at 10:46 AM, Bhaskar V. Karambelkar <[EMAIL PROTECTED]> wrote:
>>>
>>>> Can someone explain the importance of the following:
>>>> 1) Batch Size
>>>>  1.a) When configured by client code using the flume-core-sdk , to send
>>>> events to flume avro source.
>>>>  1.b) When set as a parameter on HDFS sink (or other sinks which
>>>> support BatchSize parameter)
>>>> 2)
>>>>  2.a) Channel Capacity
>>>>  2.b) Channel Transaction Capacity.
>>>>
>>>>
>>>> Under which conditions should these params be set to high values, and
>>>> under which conditions should they be set to low values?
>>>>
>>>>
>>>> How will setting these parameters to different values affect
>>>> throughput and latency in event flow?
>>>> Specifically if we have clients with varying frequency of event
>>>> generation, i.e. some clients generating thousands of events/sec, while

Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF