Flume, mail # user - Seeking advice over choice of language and implementation


Re: Seeking advice over choice of language and implementation
Hari Shreedharan 2013-07-19, 19:40
Avro's Python client is unlikely to work, because the Avro Netty RPC does
not have a Python implementation (and it is not compatible with the HTTP
transceiver). At this point, either using the HTTP Source or using Thrift RPC
is your best option.
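
For illustration, posting to an HTTP Source from Python could look roughly
like this (the host, port and header values are placeholders; the source's
default JSONHandler expects a JSON array of events, each with a "headers"
map and a string "body"):

    import json
    import urllib.request  # Python 3 standard library

    record = {"user": "example", "text": "hello"}  # one JSON record from your script

    # The default JSONHandler expects a JSON array of events,
    # each carrying a "headers" map and a string "body".
    events = [{"headers": {"source": "social-media"}, "body": json.dumps(record)}]

    req = urllib.request.Request(
        "http://flume-host:5140/",  # placeholder host/port of the HTTP Source
        data=json.dumps(events).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)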

Like Ashish said, Flume is not meant for batch jobs but rather for streaming
jobs. It would work for sure, but you may have other options.
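
If it does turn out to be a plain once-a-day load, a cron-driven script that
copies the day's output into HDFS may be enough; a rough sketch (the file
name and target directory below are placeholders):

    import subprocess
    from datetime import date

    # Output of the fetch script and the target HDFS directory -- placeholders.
    local_file = "/tmp/social_media_%s.json" % date.today().isoformat()
    hdfs_dir = "/data/social_media/"

    # Equivalent to running "hadoop fs -put <local> <hdfs>" by hand;
    # a daily cron entry could run this after the fetch completes.
    subprocess.check_call(["hadoop", "fs", "-put", local_file, hdfs_dir])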

Thanks,
Hari
On Fri, Jul 19, 2013 at 7:14 AM, Ashish <[EMAIL PROTECTED]> wrote:

>
>
>
> On Thu, Jul 18, 2013 at 11:08 PM, Sunita Arvind <[EMAIL PROTECTED]> wrote:
>
>> Hello friends,
>>
>> I am new to Flume and have written a Python script to fetch some data
>> from social media. The response is JSON. I am seeking help on the following
>> issues:
>> 1. I am finding it hard to make Python and Flume talk. Is it just my
>> ignorance, or is it indeed a long route? AFAIK, I need to understand the
>> Thrift API, Avro, etc. to achieve this. I also read about pipes. Would this
>> be a simple implementation?
>>
>
> Python would work fine. As said, you can use the HTTP Source. Alternatively,
> you can also use the Avro source with Avro's Python client
>
>
>>
>> 2. I am equally comfortable (uncomfortable) in Java. Hence I am wondering
>> if it's better to rewrite my application in Java so that I can easily
>> integrate it with Flume. Are there any advantages to having a Java
>> application, given that all of Hadoop is Java?
>>
>
> The advantage would be that you can use Flume's Client SDK, reducing a lot
> of work. IMHO, it doesn't matter to Flume who is pushing the data.
>
>
>>
>> 3. I need to schedule the agent to run on a daily basis. Which of the
>> above approaches would help me achieve this easily?
>>
>
> It looks like you have a batch job which executes at a particular time of
> day. If that's the case, please re-evaluate whether you need
> Flume. Flume can definitely be used, but you could also load the data onto
> HDFS directly. Again, I cannot conclude based on the information provided.
>
>
>>
>> 4. Going by this -
>> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%[EMAIL PROTECTED]%3E
>> it looks like we need to manually clean up disk space even with Flume. I am
>> not clear on the advantages I would have with Flume over using a simple
>> cron job to do the task. I could manually write statements like "hadoop fs
>> -put <location of output file on local> <location on hdfs>" in the cron job
>> instead.
>>
>
> The ML thread you pointed to is about the RollingFileSink, not the HDFS
> sink, so it is not valid in the context of the HDFS sink.
>
> HTH !
>
>
>>
>> Appreciate your help and guidance
>>
>> regards,
>> Sunita
>>
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>