Flume, mail # user - Avro sink to source is too slow


Re: Avro sink to source is too slow
Anat Rozenzon 2013-10-03, 19:54
Yes, all 3 channels were writing to the same disk. However, we are using
Amazon servers, so I'm not sure their 'separate' disks are really separate.
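
For reference, a minimal sketch of what per-channel disk separation could look like in a Flume properties file. The agent name (collector), channel names (c1 to c3), and mount points are assumptions for illustration and do not come from this thread; checkpointDir and dataDirs are the standard file channel settings.

```properties
# Hypothetical agent name, channel names, and mount points.
collector.channels = c1 c2 c3

# Each file channel keeps its checkpoint and data files on its own disk.
collector.channels.c1.type = file
collector.channels.c1.checkpointDir = /mnt/disk1/flume/c1/checkpoint
collector.channels.c1.dataDirs = /mnt/disk1/flume/c1/data

collector.channels.c2.type = file
collector.channels.c2.checkpointDir = /mnt/disk2/flume/c2/checkpoint
collector.channels.c2.dataDirs = /mnt/disk2/flume/c2/data

collector.channels.c3.type = file
collector.channels.c3.checkpointDir = /mnt/disk3/flume/c3/checkpoint
collector.channels.c3.dataDirs = /mnt/disk3/flume/c3/data
```

Whether those mount points map to physically separate devices on a cloud host is not something this configuration can guarantee.
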
On Thu, Oct 3, 2013 at 6:28 PM, Hari Shreedharan
<[EMAIL PROTECTED]> wrote:

> Yes. Using multiple sinks with no sink groups would give each sink its own
> thread. Each time you add a channel to a source you will take some
> performance hit, because the channels are written to one after the
> other. Also, were these channels sharing disks? Were the checkpoint and data
> files for each of them on separate disks?
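
As an illustration of the "multiple sinks, no sink group" setup described above, here is a minimal sketch. The agent name (agent1), channel name (c1), sink names (k1, k2), and the collector host and port are all assumptions, not values from this thread.

```properties
# Hypothetical agent, channel, sink names, host, and port.
agent1.sinks = k1 k2

# Both sinks drain the same channel.
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545

agent1.sinks.k2.type = avro
agent1.sinks.k2.channel = c1
agent1.sinks.k2.hostname = collector.example.com
agent1.sinks.k2.port = 4545

# Note: no agent1.sinkgroups entry is defined, so the sinks are not grouped.
```
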
>
>
> On Thursday, October 3, 2013, Anat Rozenzon wrote:
>
>> Just a quick update: I found two issues that slowed down Flume:
>> 1. Using 3 replicating file channels on the avro source slowed down the
>> acceptance of Flume events; it takes 5-10 times longer than writing to
>> one channel. So I'm now trying to change the collector's configuration to 1
>> file channel and then a spooldir source that will read out of the
>> collector's file system and into a memory channel for replication.
>> 2. More disturbing is that I see many disconnections in the Avro sink-source
>> pair while the source Flume agent (e.g. the collector) is doing full GCs; the
>> full GCs were also quite long (~15 seconds). Changing Java to a low-pause GC
>> (i.e. G1) solved this issue as well.
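
For context, the slow setup in item 1 (one avro source replicating into three file channels) would look roughly like the sketch below. The agent name (collector), source name (r1), channel names, bind address, and port are assumptions for illustration only.

```properties
# Hypothetical agent, source, and channel names; bind address and port are placeholders.
collector.sources = r1
collector.channels = c1 c2 c3

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545

# Every event is copied to all three channels (replicating is the default selector).
collector.sources.r1.channels = c1 c2 c3
collector.sources.r1.selector.type = replicating

# Channel definitions (type = file, with distinct checkpointDir and dataDirs
# per channel) are omitted from this fragment.
```
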
>>
>> BTW, regarding Mike's question above:
>> What is the correct way to set up multiple threads that will drain a channel
>> quickly?
>> I thought the correct way is simply to configure multiple sinks that use
>> the same channel, without any sink groups. Is that correct?
>>
>> Thanks
>> Anat
>>
>>
>> On Tue, Oct 1, 2013 at 11:10 PM, Roshan Naik <[EMAIL PROTECTED]> wrote:
>>
>>> My thoughts: you have 4 sinks draining the same channel and each has a
>>> batch size of 1000. Since they will contend on the same channel and, *assuming*
>>> events are evenly distributed among the sinks, there is potential for some
>>> starvation in the sinks, as their batch sizes may not be reached
>>> until about 4 batches are inserted by the source. I don't know if there is
>>> a good rule of thumb here.
>>>
>>> Try these:
>>> -  See if a sink batch size of 250 helps.
>>> -  Use a single avro sink with a batch size of 1k instead of 4.
>>> -  Replace the avro sink with the null sink on the first agent and
>>> take a measurement. It would be good to ensure the spool source is not the
>>> bottleneck.
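
A minimal sketch of how two of the suggestions above might be expressed in the first agent's configuration. The agent name (agent1), channel name (c1), and sink names (k1, n1) are assumptions; batch-size is the avro sink's batch setting and null is the built-in sink type that discards events.

```properties
# Hypothetical agent, channel, and sink names.

# Variant A: keep the avro sink(s) but lower the batch size to 250.
# agent1.sinks.k1.batch-size = 250

# Variant B: swap the avro sink for a null sink, so that only the source
# and channel are being measured.
agent1.sinks = n1
agent1.sinks.n1.type = null
agent1.sinks.n1.channel = c1
```

If throughput with the null sink is no better than with the avro sink, that would point at the source or channel rather than the avro hop.
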