Re: seeking help on flume cluster deployment
Interestingly enough, I was initially doing #1 too, and had a working version.
But I finally gave it up because in my bolt I had to flush to HDFS either
when the data reached a certain size or when a timer fired, which is exactly what
Flume offers out of the box. It also had some complexity around grouping entries
within the same partition, while with Flume that is a piece of cake.
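
The flush-on-size-or-timeout behavior described above, which Flume's HDFS sink provides through its roll settings, can be sketched with plain JDK types like this (the class name, thresholds, and in-memory sink are illustrative placeholders, not production code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the batching a bolt would otherwise have to hand-roll:
// flush when the buffer reaches maxSize events, or when the oldest
// buffered event is older than maxAgeMillis.
public class BatchFlusher {
    private final int maxSize;                  // flush when this many events are buffered
    private final long maxAgeMillis;            // ...or when the oldest event is this old
    private final Consumer<List<String>> sink;  // placeholder for the real HDFS write
    private final List<String> buffer = new ArrayList<>();
    private long oldestTs = -1;

    public BatchFlusher(int maxSize, long maxAgeMillis, Consumer<List<String>> sink) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
        this.sink = sink;
    }

    public synchronized void add(String event, long nowMillis) {
        if (buffer.isEmpty()) oldestTs = nowMillis;
        buffer.add(event);
        if (buffer.size() >= maxSize || nowMillis - oldestTs >= maxAgeMillis) {
            flush();
        }
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(new ArrayList<>(buffer));   // hand the batch to the sink
        buffer.clear();
    }

    public static void main(String[] args) {
        List<List<String>> flushed = new ArrayList<>();
        BatchFlusher f = new BatchFlusher(3, 1000, flushed::add);
        f.add("a", 0); f.add("b", 10); f.add("c", 20);   // size trigger -> flush
        f.add("d", 30); f.add("e", 2000);                // age trigger -> flush
        System.out.println(flushed.size() + " batches, first has "
                + flushed.get(0).size() + " events");
    }
}
```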

Thank you so much for all of your input. It helped me a lot!

On Thu, Jan 9, 2014 at 10:00 PM, Ashish <[EMAIL PROTECTED]> wrote:

> Got it!
> My first reaction was to use HDFS bolt to write data directly to HDFS, but
> couldn't find an implementation for the same. My knowledge is limited for
> Storm.
> If the data is already flowing through Storm, you have two options:
> 1. Write a bolt to dump data to HDFS
> 2. Write a Flume bolt using the RPC client recommended in this thread, and
> reuse Flume's capabilities.
> If you already have a Flume installation running, #2 is the quickest way to
> get going. Even otherwise, installing and running Flume is like a walk in
> the park :)
> You can also follow the related discussion on
> https://issues.apache.org/jira/browse/FLUME-1286. There is some good info
> in the JIRA.
> thanks
> ashish
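
For option #2 above, Flume's load-balancing RPC client is configured entirely through a Properties file. A minimal sketch might look like this (host names and ports are placeholders for your actual agents):

```properties
# Properties for Flume's LoadBalancingRpcClient (flume-ng-sdk).
client.type = default_loadbalance
hosts = h1 h2
hosts.h1 = flume-agent-1.example.com:41414
hosts.h2 = flume-agent-2.example.com:41414
# Rotate across agents; "random" is the other built-in selector.
host-selector = round_robin
# Back off from a failed agent instead of retrying it immediately.
backoff = true
maxBackoff = 10000
```

The bolt would then obtain the client via RpcClientFactory.getInstance(props) and send events with append(); if one agent goes down, the client fails over to the other host.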
> On Fri, Jan 10, 2014 at 11:08 AM, Chen Wang <[EMAIL PROTECTED]>wrote:
>> Ashish,
>> Since we already use Storm for other real-time processing, I want to
>> reuse it. The biggest advantage of using Storm in this case is that I can
>> use a Storm spout to read from our socket server continuously, and the
>> Storm framework ensures it never stops. Meanwhile, I can also easily
>> filter/translate the data in a bolt before sending it to Flume. For this
>> piece of the data stream, my first step is to get it into HDFS, but I
>> will add real-time processing soon.
>> Does that make sense to you?
>> Thanks,
>> Chen
>> On Thu, Jan 9, 2014 at 9:29 PM, Ashish <[EMAIL PROTECTED]> wrote:
>>> Why do you need Storm? Are you doing any real-time processing? If not,
>>> IMHO, avoid Storm.
>>> You could use something like this:
>>> Socket -> Load Balanced RPC Client -> Flume Topology with HA
>>> What application-level protocol are you using at the socket level?
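
One agent in such an HA Flume topology, with an Avro source feeding an HDFS sink through a durable file channel, could be sketched like this (agent name, port, and paths are placeholders; running identical agents on several hosts gives the client something to fail over between):

```properties
# One Flume agent: Avro source -> file channel -> HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# A file channel survives an agent restart, unlike the memory channel.
a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Roll files on size or time -- the flush behavior the bolt would
# otherwise have to implement by hand.
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```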
>>> On Fri, Jan 10, 2014 at 10:50 AM, Chen Wang <[EMAIL PROTECTED]>wrote:
>>>> Jeff, Joao,
>>>> Thanks for the pointer!
>>>> I think I am getting close here:
>>>> 1. Set up a cluster of Flume agents with redundancy, with an Avro
>>>> source and an HDFS sink.
>>>> 2. Use Storm (not strictly necessary) to read from our socket server,
>>>> then, in the bolt, use the Flume load-balancing RPC client to send the
>>>> events to the agents set up in step 1.
>>>> That way I get all the benefits of both Storm and Flume. Does this
>>>> setup look right to you?
>>>> thank you very much,
>>>> Chen
>>>> On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <[EMAIL PROTECTED]>wrote:
>>>>> Hi Chen,
>>>>> Maybe it would be worth checking this
>>>>> http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
>>>>> Regards,
>>>>> Joao
>>>>> On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
>>>>>> Have you taken a look at the load balancing rpc client?
>>>>>> On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <[EMAIL PROTECTED]> wrote:
>>>>>>> Jeff,
>>>>>>> I read that deck at the beginning, but didn't find a solution to my
>>>>>>> use case. To simplify: I have only one data source (composed of 5
>>>>>>> socket servers), and I am looking for a fault-tolerant deployment of
>>>>>>> Flume that can read from this single source and sink to HDFS in a
>>>>>>> fault-tolerant mode: when one node dies, another Flume node can pick
>>>>>>> up and continue.
>>>>>>> Thanks,
>>>>>>> Chen
>>>>>>> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <[EMAIL PROTECTED]>wrote:
>>>>>>>> Chen,
>>>>>>>> Have you taken a look at this presentation on Planning and