Pig, mail # user - updates using Pig


Re: updates using Pig
Jonathan Coveney 2012-08-28, 06:47
I would do this with a cogroup. Whether or not you need a UDF depends on
whether or not a key can appear more than once in a file.

trade-key    trade-add-date       trade-price

feed_group = cogroup feed1 by trade-key, feed2 by trade-key;
feed_proj = foreach feed_group generate FLATTEN( IsEmpty(feed2) ? feed1 :
feed2 );

and there you go (you may need to tweak the flatten to make it work).

It'd be slightly more complicated if you had multiple key/date pairs.
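(For reference, the effect of the cogroup above — take the feed2 record when the key appears there, otherwise fall back to feed1 — can be sketched as a small Python script. The dict-based representation and field layout here are illustrative, not from the thread:)

```python
# Sketch of the "prefer feed2, fall back to feed1" merge that the
# cogroup/IsEmpty expression performs, assuming each key appears at
# most once per file. Feeds map trade-key -> (trade-add-date, trade-price).
def merge_feeds(feed1, feed2):
    merged = dict(feed1)   # start from the history file
    merged.update(feed2)   # daily feed overrides matching keys
    return merged

feed1 = {"k1": ("05/21/2012", 2000), "k2": ("04/21/2012", 3000)}
feed2 = {"k1": ("06/21/2012", 3000), "k5": ("06/22/2012", 1000)}
print(merge_feeds(feed1, feed2))
```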

2012/8/27 Srini <[EMAIL PROTECTED]>

> Hello  TianYi Zhu,
>
> Thanks !! and will get back..
>
> -->by the way, you can sort these 2 files by trade-key then merge them
> using a small script; that's much faster than using Pig.
> ... Trying out POC on updates in hadoop
>
> Thanks,
> Srinivas
> On Tue, Aug 28, 2012 at 12:55 AM, TianYi Zhu <
> [EMAIL PROTECTED]> wrote:
>
> > Hi Srinivas,
> >
> > you can write a user defined function for this
> >
> > feed = union feed1, feed2;
> > feed_grouped = group feed by trade-key;
> > output = foreach feed_grouped generate
> > flatten(your_user_defined_function(feed)) as (trade-key, trade-add-date,
> > trade-price);
> >
> > your_user_defined_function takes the one or more records with the same
> > trade-key as input, and it should output only the latest tuple of
> > (trade-key, trade-add-date, trade-price)
> >
> >
> > by the way, you can sort these 2 files by trade-key then merge them
> > using a small script; that's much faster than using Pig.
> >
> > On Tue, Aug 28, 2012 at 2:36 PM, Srinivas Surasani <
> [EMAIL PROTECTED]
> > >wrote:
> >
> > > Hi,
> > >
> > > I'm trying to do updates of records in Hadoop using Pig (I know this
> > > is not ideal, but I'm trying out a POC).
> > > The data looks like this:
> > >
> > > *feed1:*
> > > --> here trade key is unique for each order/record
> > > --> this is history file
> > >
> > > trade-key    trade-add-date       trade-price
> > > k1                 05/21/2012            2000
> > > k2                  04/21/2012             3000
> > > k3                 03/21/2012            4000
> > > k4                 05/21/2012             5000
> > >
> > > feed2:  --> this is the latest/daily feed
> > > trade-key    trade-add-date       trade-price
> > > k5                06/22/2012             1000
> > > k6                 06/22/2012            2000
> > > k1                06/21/2012             3000   ---> we can see here,
> > > trade with key "k1" appears again, which means the order with trade
> > > key "k1" has an update
> > > Now I'm looking for the output below (merging both files, and for
> > > keys common to both feeds keeping only the latest record in the
> > > output file):
> > >
> > > k1                06/21/2012             3000
> > > k2                  04/21/2012             3000
> > > k3                 03/21/2012            4000
> > > k4                 05/21/2012             5000
> > > k5                06/22/2012             1000
> > > k6                 06/22/2012            2000
> > >
> > > any help greatly appreciated !!
> > >
> > > Regards,
> > > Srinivas
> > >
> >
>
>
>
> --
> Regards,
> Srinivas
> [EMAIL PROTECTED]
>
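
(TianYi's sort-and-merge suggestion above can be sketched as a short Python script. The filenames and the tab-separated file layout are assumptions for illustration, not from the thread:)

```python
# Hypothetical sketch of the "sort both files by trade-key, then merge"
# approach: load the history and daily feeds keyed by trade-key, let the
# daily feed win on conflicting keys, and emit rows sorted by key.
# Assumed layout: tab-separated columns trade-key, trade-add-date, trade-price.
import csv

def load_feed(path):
    with open(path, newline="") as f:
        return {row[0]: row for row in csv.reader(f, delimiter="\t")}

def merge_files(history_path, daily_path, out_path):
    rows = load_feed(history_path)
    rows.update(load_feed(daily_path))  # daily feed overrides history
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for key in sorted(rows):        # output sorted by trade-key
            writer.writerow(rows[key])
```

This scales to files that fit on one machine; for very large feeds the Pig approaches above keep the work distributed.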