Re: UDF discussion? Here or on the dev list? / Json Loading
Yes, you would have to distribute Ruby (though it's typically
installed by default) as well as the wukong and json libraries to all
the nodes in the cluster. Unfortunately, this isn't something wukong
gives you for free at the moment, though it is planned.

As far as I know, Pig doesn't do anything more complex than launch a
Hadoop streaming job and use the output in the subsequent steps.
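
To make the streaming model concrete, here is a minimal sketch of a map-only JSON parser written as a plain streaming script. It is not wukong and not the code from the blog post; the function name `map_lines` and the field names `id` and `text` are made up for illustration. It assumes one JSON object per input line. A streaming script of this kind is started once per task and then reads its records line by line from stdin, rather than being started once per record.

```python
import json
import sys


def map_lines(lines, fields=("id", "text")):
    """Yield one tab-separated row per valid JSON object in `lines`.

    `fields` is a hypothetical list of keys to extract; adjust to your data.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed records instead of failing the task
        if not isinstance(record, dict):
            continue  # only JSON objects are treated as records here
        # Replace tabs and newlines so each record stays on one output line.
        values = (
            str(record.get(f, "")).replace("\t", " ").replace("\n", " ")
            for f in fields
        )
        yield "\t".join(values)


def main():
    # The streaming framework feeds input records on stdin and collects
    # whatever the script writes to stdout.
    for row in map_lines(sys.stdin):
        print(row)


if __name__ == "__main__":
    main()
```

You can try it locally by piping a file of JSON lines into the script before running it on the cluster. The tab-separated output is chosen because it is easy to load back into Pig in the following step.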

Btw, I write 90% of my MapReduce jobs using either wukong or Pig. Only
when it's absolutely required do I use a language with as much overhead
as Java :)

--jacob
@thedatachef

Sent from my iPhone

On Jan 30, 2011, at 2:09 PM, Alex McLintock <[EMAIL PROTECTED]> wrote:

> On 29 January 2011 13:43, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>>
>> Write a map-only wukong script that parses the JSON as you want it. See
>> the example here:
>>
>>
>> http://thedatachef.blogspot.com/2011/01/processing-json-records-with-hadoop-and.html
>>
>>
> Hi Jacob,
>
> Thanks very much for helping me out. I haven't heard of Wukong before.
> I am a bit concerned, though, about adding Ruby to my tool stack as well as
> Pig. It seems like a step too far.
> Presumably I have to distribute Ruby and Wukong across all my job nodes in
> the same way as if I were writing Perl or C++ streaming programs.
>
> With STREAMing - the script is launched once per file, right, not once per
> record?
>
> Alex