Flume, mail # user - flume tail source problem and performance


周梦想 2013-01-29, 07:24
Alexander Alten-Lorenz 2013-01-29, 07:29
Jeong-shik Jang 2013-01-29, 07:41
周梦想 2013-02-04, 07:27
Jeong-shik Jang 2013-02-04, 07:47
周梦想 2013-02-04, 08:07
Jeong-shik Jang 2013-02-04, 08:13
GuoWei 2013-02-04, 11:46
Re: flume tail source problem and performance
周梦想 2013-02-06, 02:47
Thanks to JS and GuoWei, I'll try this.

Best Regards,
Andy Zhou
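Further down the thread, JS suggests that "implementing checkpoint function in tail also can help" to minimize loss and duplication. As a rough illustration of that idea only (a standalone Python sketch, not Flume code; the function and file names here are made up), a tailer can persist the byte offset of the last shipped line, so a restart resumes from the checkpoint instead of resending the whole file (duplication) or jumping to the end (loss):

```python
import os

def tail_with_checkpoint(log_path, ckpt_path, send):
    """Ship new lines of log_path via send(), resuming from a saved offset."""
    # Load the last acknowledged byte offset; default to 0 on first run.
    offset = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            offset = int(f.read().strip() or 0)
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            line = log.readline()
            if not line:
                break
            send(line)                       # ship one event downstream
            offset = log.tell()
            with open(ckpt_path, "w") as f:  # checkpoint after each send
                f.write(str(offset))
    return offset
```

On restart the function continues from the stored offset, so only lines appended since the last run are resent. A real deployment would also have to handle log rotation and checkpoint only after the downstream ack, which this sketch glosses over.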

2013/2/4 GuoWei <[EMAIL PROTECTED]>

>
>  Exactly, it's very easy to write a custom source or sink in Flume. You can
> just follow a default source or sink as an example, export it as a jar file
> into Flume's lib folder, and then configure your own source or sink in the
> Flume conf file.
>
>
> Best Regards
> Guo Wei
>
>
> On 2013-2-4, at 4:13 PM, Jeong-shik Jang <[EMAIL PROTECTED]> wrote:
>
>  Yes, you can; Flume's plugin framework provides an easy way to implement
> and apply your own source, decorator, and sink.
>
> -JS
>
> On 2/4/13 5:07 PM, 周梦想 wrote:
>
> Hi JS,
>
>  Thank you for your reply. So there is a serious shortcoming in collecting
> logs with Flume. Can I write my own agent to send logs via the Thrift
> protocol directly to the collector server?
>
>  Best Regards,
> Andy Zhou
>
>
> 2013/2/4 Jeong-shik Jang <[EMAIL PROTECTED]>
>
>>  Hi Andy,
>>
>> 1. "startFromEnd=true" in your source configuration means data loss can
>> happen at restart on the tail side, because Flume ignores any event
>> generated during the restart and always starts at the end of the file.
>> 2. With agentSink, data duplication can happen due to ack delays from the
>> master or at agent restart.
>>
>> I think this is why Flume-NG no longer supports tail and instead lets the
>> user handle it with a script or program; tailing is a tricky job.
>>
>> My suggestion is to use agentBEChain in the agent tier and DFO in the
>> collector tier; you can still lose some data during failover when a
>> failure occurs.
>> To minimize loss and duplication, implementing a checkpoint function in
>> the tail can also help.
>>
>> Having a monitoring system to detect failures is very important as well,
>> so that you notice a failure and can react quickly.
>>
>> -JS
>>
>>
>> On 2/4/13 4:27 PM, 周梦想 wrote:
>>
>> Hi JS,
>> We can't accept agentBESink, because these logs are important for data
>> analysis and we can't tolerate any errors in the data; neither loss nor
>> duplication is acceptable.
>> One agent's configuration is:
>> tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)
>> Every time this Windows agent restarts, it resends all the data to the
>> collector server.
>> If for some reason we restart the agent node, we can't get the mark of
>> where in the log the agent had sent up to.
>>
>>
>> 2013/1/29 Jeong-shik Jang <[EMAIL PROTECTED]>
>>
>>> Hi Andy,
>>>
>>> As you set the startFromEnd option to true, the resend might be caused by
>>> the DFO mechanism (agentDFOSink); when you restart a Flume node in DFO
>>> mode, all events in the intermediate stages (logged, writing, sending and
>>> so on) roll back to the logged stage, which means resending and
>>> duplication.
>>>
>>> Also, for better performance, you may want to use agentBESink instead of
>>> agentDFOSink.
>>> I recommend using agentBEChain for failover in case of a failure in the
>>> collector tier, if you have multiple collectors.
>>>
>>> -JS
>>>
>>>
>>> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>>>
>>>> Hi,
>>>>
>>>> you could use tail -F, but this depends on the external source, which
>>>> Flume has no control over. You can write your own script and include it.
>>>>
>>>> What's the content of:
>>>> /tmp/flume/agent/agent*.*/ - directories? Are sent and sending clean?
>>>>
>>>> - Alex
>>>>
>>>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>> 1. I want to tail a log source and write it to HDFS. Below is the
>>>>> configuration:
>>>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>>> agentDFOSink("hadoop48",35853) ;]
>>>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>>> agentDFOSink("hadoop48",35853) ;]
>>>>> config [co1, collectorSource( 35853 ),  [collectorSink(
>>>>> "hdfs://hadoop48:54310/user/flume/%y%m/%d
>>>>> ","%{host}-",5000,raw),collectorSink(
>>>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>>>
>>>>>
>>>>> I found that if I restart the agent node, it resends the content of
>>>>> game.log to the collector. There are some solutions to send logs from
>>>>> where I
周梦想 2013-02-04, 07:33
Alexander Alten-Lorenz 2013-02-04, 07:39