Flume >> mail # user >> flume tail source problem and performance


Re: flume tail source problem and performance
Hi JS,
We can't use agentBESink. These logs are important for data analysis, so
we can't tolerate any errors in the data: neither data loss nor duplication
is acceptable.
One agent's configuration is:
tail("H:/game.log", startFromEnd=true)agentSink("hadoop48", 35853)
Every time this Windows agent restarts, it resends all the data to the
collector server.
If we have to restart the agent node for some reason, we have no mark of
the point in the log up to which the agent has already sent.
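
To make clear what kind of "mark" I mean, here is a minimal sketch in Python. It is not part of Flume and does not use any Flume API; the function names, the JSON checkpoint file and the send callback are all hypothetical, chosen only for illustration. It saves the byte offset after every complete line, so a restart continues from the saved offset instead of from the start of the file.

```python
import json
import os


def load_offset(checkpoint_path):
    """Return the last saved byte offset, or 0 if there is no usable checkpoint."""
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            return int(json.load(f)["offset"])
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return 0


def save_offset(checkpoint_path, offset):
    """Persist the offset by writing a temp file and renaming it over the old one."""
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, checkpoint_path)


def read_new_lines(log_path, checkpoint_path, send):
    """Pass every complete line added since the checkpoint to send(); return the count."""
    offset = load_offset(checkpoint_path)
    if offset > os.path.getsize(log_path):
        offset = 0  # the file is shorter than the mark: assume it was truncated or rotated
    sent = 0
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a line that is still being written
            send(line)  # hypothetical delivery step, e.g. hand-off to the collector
            offset = f.tell()
            save_offset(checkpoint_path, offset)
            sent += 1
    return sent
```

Note that this only narrows the problem: if the process dies after send() but before the offset is saved, that one line is sent again after the restart, so it is at-least-once delivery and not exactly-once. Saving the offset per line is also slow; saving per batch would be faster but would widen the possible duplication to one batch.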
2013/1/29 Jeong-shik Jang <[EMAIL PROTECTED]>

> Hi Andy,
>
> Since you set the startFromEnd option to true, the resend might be caused
> by the DFO mechanism (agentDFOSink): when you restart a Flume node in DFO
> mode, all events in the different stages (logged, writing, sending and so
> on) roll back to the logged stage, which means resending and duplication.
>
> And, for better performance, you may want to use agentBESink instead of
> agentDFOSink.
> I recommend using agentBEChain for failover in case of failure in the
> collector tier if you have multiple collectors.
>
> -JS
>
>
> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>
>> Hi,
>>
>> You could use tail -F, but this depends on the external source, which
>> Flume has no control over. You can write your own script and include it.
>>
>> What's the content of:
>> /tmp/flume/agent/agent*.*/ - directories? Are sent and sending clean?
>>
>> - Alex
>>
>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>> 1. I want to tail a log source and write it to HDFS. Below is the
>>> configuration:
>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>> agentDFOSink("hadoop48",35853) ;]
>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>> agentDFOSink("hadoop48",35853) ;]
>>> config [co1, collectorSource( 35853 ),  [collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",
>>> 5000,raw),collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>
>>>
>>> I found that if I restart the agent node, it resends the content of
>>> game.log to the collector. Is there a way to resume sending from the
>>> point I had already reached? Or do I have to keep a mark myself, or
>>> remove the logs manually, when restarting the agent node?
>>>
>>> 2. I tested the performance of Flume and found it a bit slow.
>>> With the configuration above, I get only 50MB/minute.
>>> I changed the configuration to the following:
>>> ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
>>> agentDFOSink("hadoop48",35853);
>>>
>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",
>>> 5000,raw),collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>
>>> I sent a 300MB log and it took about 3 minutes, so that is about
>>> 100MB/minute.
>>>
>>> When I send the log from ag1 to co1 via scp, I get about 30MB/second.
>>>
>>> Can anyone give me any ideas?
>>>
>>> thanks!
>>>
>>> Andy
>>>
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>>
>>
>>
>
> --
> Jeong-shik Jang / [EMAIL PROTECTED]
> Gruter, Inc., R&D Team Leader
> www.gruter.com
> Enjoy Connecting
>
>