Flume, mail # user - seeking help on flume cluster deployment


Re: seeking help on flume cluster deployment
Jeff Lord 2014-01-10, 04:50
Have you taken a look at the load balancing rpc client?
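
As a rough illustration of that suggestion, the sketch below builds the properties that the Flume SDK's load-balancing RPC client is configured with (`client.type = default_loadbalance`, a `hosts` alias list, and one `hosts.<alias>` entry per agent). The class name, the `build` helper, the generated aliases, and the example host names are assumptions made for this sketch, not something from the thread. Note that this spreads events across several receiving agents; it does not by itself provide failover for the custom socket-reading source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class LoadBalancingClientProps {

    // Builds the property set read by the Flume SDK load-balancing RPC client.
    // The aliases h1, h2, ... are generated here purely for illustration.
    static Properties build(List<String> hostPorts) {
        Properties props = new Properties();
        props.setProperty("client.type", "default_loadbalance");

        StringBuilder aliases = new StringBuilder();
        for (int i = 0; i < hostPorts.size(); i++) {
            String alias = "h" + (i + 1);
            if (i > 0) {
                aliases.append(' ');
            }
            aliases.append(alias);
            // One "hosts.<alias> = host:port" entry per downstream agent.
            props.setProperty("hosts." + alias, hostPorts.get(i));
        }
        props.setProperty("hosts", aliases.toString());

        // Rotate across hosts and temporarily skip hosts that fail.
        props.setProperty("host-selector", "round_robin");
        props.setProperty("backoff", "true");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical agent addresses; replace with real Avro source endpoints.
        Properties props = build(Arrays.asList(
                "flume1.example.org:41414",
                "flume2.example.org:41414"));
        props.forEach((k, v) -> System.out.println(k + " = " + v));

        // With flume-ng-sdk on the classpath, the client would be created as:
        //   RpcClient client = RpcClientFactory.getInstance(props);
        //   client.append(EventBuilder.withBody("hello", StandardCharsets.UTF_8));
        //   client.close();
    }
}
```

Each listed host is expected to be a Flume agent with an Avro source listening on that port; if one agent is down, the client backs off from it and sends to the remaining ones.
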
On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <[EMAIL PROTECTED]> wrote:

> Jeff,
> I read this ppt at the beginning, but didn't find a solution to my use
> case. To simplify my case: I only have 1 data source (composed of 5 socket
> servers), and I am looking for a fault-tolerant deployment of Flume that
> can read from this single data source and sink to HDFS in fault-tolerant
> mode: when one node dies, another Flume node can pick up and continue.
> Thanks,
> Chen
>
>
> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
>
>> Chen,
>>
>> Have you taken a look at this presentation on Planning and Deploying
>> Flume from ApacheCon?
>>
>>
>> http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
>>
>> It may have the answers you need.
>>
>> Best,
>>
>> Jeff
>>
>>
>> On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks Saurabh.
>>> If that is the case, I am actually thinking about using a Storm spout to
>>> talk to our socket servers, so that the Storm cluster takes care of
>>> reading from the socket servers. Then on each Storm node, start a Flume agent
>>> listening on an RPC port and writing to HDFS (with failover). Then in the
>>> Storm bolt, simply send the data to the RPC port so that Flume can get it.
>>> What do you think of this setup? It takes care of failover on both the
>>> source (by Storm) and the sink (by Flume), but it looks a little
>>> complicated to me.
>>> Chen
>>>
>>>
>>> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Chen,
>>>>
>>>> I don't think Flume has a way to configure multiple sources pointing
>>>> to the same data source. Of course you can do that, but you will end up with
>>>> duplicate data. Flume offers failover at the sink level.
>>>>
>>>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> OK, so after more research :) it seems that what I need is
>>>>> failover for the agent source (not failover for the sink):
>>>>> if one agent dies, another agent of the same kind starts running.
>>>>> Does Flume support this scenario?
>>>>> Thanks,
>>>>> Chen
>>>>>
>>>>>
>>>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> After reading more docs, it seems that to achieve my goal
>>>>>> I have to do the following:
>>>>>> 1. Have one agent with the custom source running on one node. This
>>>>>> agent reads from those 5 socket servers and writes to some kind of sink (maybe
>>>>>> another socket?).
>>>>>> 2. On one or more other machines, set up collectors that read from
>>>>>> the agent sink in 1 and sink to HDFS.
>>>>>> 3. Have a master node managing the nodes in 1 and 2.
>>>>>>
>>>>>> But this seems to be overkill in my case: in 1, I can already sink
>>>>>> to HDFS. Since data becomes available at the socket servers much faster than the
>>>>>> data translation part can process it, I want to be able to add more nodes later to do the
>>>>>> translation job. So what is the correct setup?
>>>>>> Thanks,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Guys,
>>>>>>> In my environment, the client is 5 socket servers. So I wrote a
>>>>>>> custom source that spawns 5 threads, each reading from one of them indefinitely, and
>>>>>>> the sink is HDFS (a Hive table). This works fine when run with flume-ng
>>>>>>> agent.
>>>>>>>
>>>>>>> But how can I deploy this in distributed mode (as a cluster)? I am
>>>>>>> confused about the 3 tiers (agent, collector, storage) mentioned in the doc.
>>>>>>> Do they apply to my case? How can I separate my agent/collector/storage?
>>>>>>> Apparently I can only have one agent running: multiple agents would result in
>>>>>>> getting duplicates from the socket servers. But I want that if one agent