Flume, mail # user - flume tail source problem and performance


Re: flume tail source problem and performance
周梦想 2013-02-04, 08:07
Hi JS,

Thank you for your reply. So collecting logs with Flume's tail source has
serious shortcomings. Can I write my own agent that sends logs directly to
the collector server via the Thrift protocol?

Best Regards,
Andy Zhou
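The checkpoint approach JS suggests below can be sketched independently of
the transport: persist the byte offset that has already been shipped, and
resume from it on restart instead of from the end (or the beginning) of the
file. A minimal sketch in Python; the file paths and the ship() callback are
hypothetical, not part of Flume:

```python
import os

def tail_from_checkpoint(log_path, ckpt_path, ship):
    """Read new bytes from log_path starting at the offset saved in
    ckpt_path, pass complete lines to ship(), then persist the new offset."""
    offset = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            offset = int(f.read().strip() or 0)

    with open(log_path, "rb") as f:
        # If the file was truncated or rotated, start over from the top.
        f.seek(0, os.SEEK_END)
        if offset > f.tell():
            offset = 0
        f.seek(offset)
        data = f.read()

    # Ship only up to the last complete line; a partial trailing line
    # stays in the file and is picked up on the next pass.
    last_nl = data.rfind(b"\n")
    if last_nl >= 0:
        for line in data[: last_nl + 1].splitlines():
            ship(line.decode("utf-8", errors="replace"))
        offset += last_nl + 1

    # Persist the offset atomically so a crash cannot corrupt the checkpoint.
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
    os.replace(tmp, ckpt_path)
```

Run in a loop (or from cron): after a restart it resumes from the saved
offset, so nothing is skipped (the startFromEnd problem) and nothing is
resent (the DFO rollback problem).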
2013/2/4 Jeong-shik Jang <[EMAIL PROTECTED]>

>  Hi Andy,
>
> 1. "startFromEnd=true" in your source configuration means data can be
> missed at restart on the tail side, because Flume ignores any events
> generated while the node is down and always starts from the end of the
> file.
> 2. With agentSink, data duplication can happen due to ack delays from
> the master or at agent restart.
>
> I think this is why Flume-NG no longer supports tail and instead lets
> the user handle it with a script or program; tailing is a tricky job.
>
> My suggestion is to use agentBEChain in the agent tier and DFO in the
> collector tier; you can still lose some data during failover when a
> failure occurs.
> To minimize loss and duplication, implementing a checkpoint function in
> tail can also help.
>
> Having a monitoring system that detects failures is just as important,
> so that you can notice a failure and recover quickly.
>
> -JS
>
>
> On 2/4/13 4:27 PM, 周梦想 wrote:
>
> Hi JS,
> We can't accept agentBESink. These logs are important for data analysis,
> so we can't tolerate any errors in the data; neither loss nor
> duplication is acceptable.
> One agent's configuration is:
>   tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)
> Every time this Windows agent restarts, it resends all the data to the
> collector server.
> If for some reason we restart the agent node, we can't recover the mark
> showing how far into the log the agent had already sent.
>
>
> 2013/1/29 Jeong-shik Jang <[EMAIL PROTECTED]>
>
>> Hi Andy,
>>
>> As you set the startFromEnd option to true, the resend is probably
>> caused by the DFO mechanism (agentDFOSink); when you restart a Flume
>> node in DFO mode, all events in intermediate stages (logged, writing,
>> sending and so on) roll back to the logged stage, which means resending
>> and duplication.
>>
>> Also, for better performance, you may want to use agentBESink instead
>> of agentDFOSink.
>> If you have multiple collectors, I recommend agentBEChain for failover
>> in case of a failure in the collector tier.
>>
>> -JS
>>
>>
>> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>>
>>> Hi,
>>>
>>> You could use tail -F, but that depends on the external source, which
>>> Flume has no control over. You can write your own script and include it.
>>>
>>> What is the content of the /tmp/flume/agent/agent*.*/ directories?
>>> Are the sent and sending subdirectories clean?
>>>
>>> - Alex
>>>
>>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello,
>>>> 1. I want to tail a log source and write it to HDFS. Below is the
>>>> configuration:
>>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>>   agentDFOSink("hadoop48",35853) ;]
>>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>>   agentDFOSink("hadoop48",35853) ;]
>>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>>>   "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),
>>>>   collectorSink(
>>>>   "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>>
>>>> I found that if I restart the agent node, it resends the whole content
>>>> of game.log to the collector. Is there a solution that sends only the
>>>> logs that haven't been sent before? Or do I have to keep a mark myself,
>>>> or remove the logs manually, when restarting the agent node?
>>>>
>>>> 2. I tested the performance of Flume and found it a bit slow.
>>>> With the configuration above, I get only 50MB/minute.
>>>> I changed the configuration to the following:
>>>> ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
>>>>   agentDFOSink("hadoop48",35853);
>>>>
>>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>>>   "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),
>>>>   collectorSink(
>>>>   "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>>
>>>> I sent a 300MB log and it took about 3 minutes, so it's about
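The batch(1000) gzip decorators in the second configuration are what lift
the throughput: batching amortizes per-event overhead, and compressing a
batch of similar log lines shrinks the payload considerably. A quick
self-contained illustration of the compression effect (the log line format
below is invented, standing in for game.log entries):

```python
import gzip

# Build a batch of 1000 similar-looking log lines (hypothetical format).
batch = "".join(
    f"2013-01-29 16:29:{i % 60:02d} INFO player{i} logged in\n"
    for i in range(1000)
).encode("utf-8")

compressed = gzip.compress(batch)

print(f"raw batch:     {len(batch)} bytes")
print(f"gzipped batch: {len(compressed)} bytes")
print(f"ratio:         {len(batch) / len(compressed):.1f}x")
```

Repetitive log data typically compresses well, which is why shipping
gzipped batches reduces network and sink load compared with sending events
one at a time.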