Flume, mail # user - Seeking advice over choice of language and implementation


Re: Seeking advice over choice of language and implementation
Ashish 2013-07-19, 14:14
On Thu, Jul 18, 2013 at 11:08 PM, Sunita Arvind <[EMAIL PROTECTED]> wrote:

> Hello friends,
>
> I am new to Flume and have written a Python script to fetch some data from
> social media. My response is JSON. I am seeking help on the following issues:
> 1. I am finding it hard to make Python and Flume talk. Is it just my
> ignorance, or is it indeed a long route? AFAIK, I need to understand the Thrift
> API, Avro, etc. to achieve this. I also read about pipes. Would this be a
> simple implementation?
>

Python would work fine. As said, you can use the HTTP Source. Alternatively,
you can use the Avro Source via Avro's Python client.
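
To make the HTTP Source route concrete, here is a minimal sketch in Python. It assumes the agent runs an HTTP Source with its default JSON handler, which expects a JSON array of events, each with a "headers" map of strings and a string "body". The URL (host and port) and the helper names build_flume_payload and post_to_flume are placeholders for illustration, not anything taken from your setup.

```python
import json
import urllib.request


def build_flume_payload(records, headers=None):
    """Wrap each record as one Flume event: string headers plus a string body."""
    events = []
    for record in records:
        events.append({
            "headers": dict(headers or {}),
            # The body must be a string, so the JSON record is serialized again.
            "body": json.dumps(record),
        })
    return json.dumps(events).encode("utf-8")


def post_to_flume(records, url="http://localhost:5140", headers=None, timeout=10):
    """POST the events to the HTTP Source; the URL is an assumed placeholder."""
    request = urllib.request.Request(
        url,
        data=build_flume_payload(records, headers),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical usage, once an agent is listening on the assumed port:
# post_to_flume([{"user": "someone", "text": "hello"}], headers={"origin": "social"})
```

Only the standard library is used here, so the existing script would not need Thrift or Avro bindings for this path.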
>
> 2. I am equally comfortable (uncomfortable) in java. Hence wondering if
> it's better to re-write my application in Java so that I can easily
> integrate it with flume. Are there any advantages of having a java
> application, as all of hadoop is java?
>

The advantage would be that you can use Flume's Client SDK, which saves a lot
of work. IMHO, it doesn't matter to Flume who is pushing the data.
>
> 3. I need to schedule the agent to run on a daily basis. Which of the
> above approaches would help me achieve this easily?
>

It looks like you have a batch job that executes at one point in time
during the day. If that's the case, please reconsider whether you need
Flume. Flume can definitely be used, but you could also load directly into
HDFS. Again, I cannot draw a conclusion from the information provided.
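
If the direct load turns out to be enough, the daily job could call the Hadoop CLI from the same Python script that fetches the data. The sketch below only assumes that the "hadoop" command is on the PATH. The function names and the example paths are hypothetical.

```python
import subprocess


def build_put_command(local_path, hdfs_dir):
    """Build the argument list for 'hadoop fs -put <local> <hdfs>'."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def upload_to_hdfs(local_path, hdfs_dir):
    """Run the upload; raises CalledProcessError if the command fails."""
    subprocess.run(build_put_command(local_path, hdfs_dir), check=True)


# Hypothetical usage at the end of the daily fetch script:
# upload_to_hdfs("/tmp/social_output.json", "/data/social/")
```

Scheduling the script itself from cron would then cover the daily run without a long-running agent.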
>
> 4. Going by this -
> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%[EMAIL PROTECTED]%3E
> looks like we need to manually clean up disk space even with flume. I am
> not clear on the advantages I would have with flume over using a simple
> cron job to do the task. I can manually write statements like "hadoop fs
> -put <location of output file on local> <location on hdfs>" in the cron job
> instead.
>

The mailing-list thread you pointed to concerns the RollingFileSink, not the
HDFS sink, so it does not apply in the context of the HDFS sink.

HTH !
>
> Appreciate your help and guidance
>
> regards,
> Sunita
>

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal