Pig >> mail # user >> updates using Pig


Re: updates using Pig
Hi Srinivas,

You can write a user-defined function (UDF) for this (note that hyphens are
not valid in Pig identifiers, so the fields below use underscores):

feed = UNION feed1, feed2;
feed_grouped = GROUP feed BY trade_key;
result = FOREACH feed_grouped GENERATE
    FLATTEN(your_user_defined_function(feed))
    AS (trade_key, trade_add_date, trade_price);

your_user_defined_function takes the bag of one or more records sharing the
same trade_key as input, and it should output only the latest tuple of
(trade_key, trade_add_date, trade_price).
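For illustration, the keep-latest rule that UDF has to implement can be
sketched in plain Python (a standalone sketch, not real Pig UDF scaffolding;
the MM/DD/YYYY date format is assumed from the sample data):

```python
from datetime import datetime

def latest_trade(records):
    """Given tuples of (trade_key, trade_add_date, trade_price) that all
    share the same trade_key, return the one with the most recent date."""
    return max(records, key=lambda r: datetime.strptime(r[1], "%m/%d/%Y"))

# k1 appears in both feeds; the 06/21/2012 record wins.
merged = [("k1", "05/21/2012", 2000), ("k1", "06/21/2012", 3000)]
print(latest_trade(merged))  # prints ('k1', '06/21/2012', 3000)
```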
By the way, you could also sort these two files by trade_key and then merge
them with a small script; that is much faster than using Pig.
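The dedup step of such a script can be sketched in Python as follows (an
in-memory illustration with the sample records; it assumes the daily feed
should override the history feed on duplicate keys):

```python
def merge_feeds(history, daily):
    """Merge two lists of (trade_key, trade_add_date, trade_price) tuples,
    letting a daily-feed record replace the history record that has the
    same trade_key."""
    merged = {rec[0]: rec for rec in history}      # trade_key -> record
    merged.update({rec[0]: rec for rec in daily})  # daily overrides history
    return sorted(merged.values())                 # sort by trade_key

feed1 = [("k1", "05/21/2012", 2000), ("k2", "04/21/2012", 3000)]
feed2 = [("k5", "06/22/2012", 1000), ("k1", "06/21/2012", 3000)]
for rec in merge_feeds(feed1, feed2):
    print(rec)
```

On real (pre-sorted) files the same effect is achieved by a streaming merge
instead of the in-memory dict used here.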

On Tue, Aug 28, 2012 at 2:36 PM, Srinivas Surasani <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to do updates of records in Hadoop using Pig (I know this is
> not ideal, but I'm trying out a POC).
> The data looks like this:
>
> feed1:  --> the history file; trade-key is unique for each order/record
>
> trade-key    trade-add-date    trade-price
> k1           05/21/2012        2000
> k2           04/21/2012        3000
> k3           03/21/2012        4000
> k4           05/21/2012        5000
>
> feed2:  --> the latest/daily feed
>
> trade-key    trade-add-date    trade-price
> k5           06/22/2012        1000
> k6           06/22/2012        2000
> k1           06/21/2012        3000   --> we can see here that the trade
> with key "k1" appears again, meaning the order with trade key "k1" has
> some update
> Now I'm looking for the output below (merge both files, and for any key
> common to both feeds keep the latest record in the output file):
>
> k1           06/21/2012        3000
> k2           04/21/2012        3000
> k3           06/21/2012        4000
> k4           07/21/2012        5000
> k5           06/22/2012        1000
> k6           06/22/2012        2000
>
> Any help greatly appreciated!!
>
> Regards,
> Srinivas
>