Srinivas Surasani 2012-08-28, 04:36
Hi,
I'm trying to update records in Hadoop using Pig (I know this is not ideal, but I'm trying out a POC). The data looks like this:

feed1 (the history file; the trade key is unique for each order/record):

trade-key   trade-add-date   trade-price
k1          05/21/2012       2000
k2          04/21/2012       3000
k3          03/21/2012       4000
k4          05/21/2012       5000

feed2 (the latest/daily feed):

trade-key   trade-add-date   trade-price
k5          06/22/2012       1000
k6          06/22/2012       2000
k1          06/21/2012       3000   ---> the trade with key "k1" appears again, which means the order with trade key "k1" has an update

Now I'm looking for the output below (merging both files, looking for keys common to both feeds, and keeping the latest record for each key in the output file):

k1          06/21/2012       3000
k2          04/21/2012       3000
k3          06/21/2012       4000
k4          07/21/2012       5000
k5          06/22/2012       1000
k6          06/22/2012       2000

Any help is greatly appreciated!
Regards, Srinivas
TianYi Zhu 2012-08-28, 04:55
Hi Srinivas,
You can write a user-defined function (UDF) for this:
-- field names use underscores here because Pig identifiers cannot contain hyphens
feed = union feed1, feed2;
feed_grouped = group feed by trade_key;
merged = foreach feed_grouped generate flatten(your_user_defined_function(feed)) as (trade_key, trade_add_date, trade_price);
your_user_defined_function takes the one or more records that share a trade-key as input, and it should output only the latest (trade-key, trade-add-date, trade-price) tuple.

By the way, you can sort these two files by trade-key and then merge them using a small script; that's much faster than using Pig.
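As a rough illustration of the sort-then-merge idea, here is a minimal Python sketch. Python is an assumption here, since no script language was named. It also assumes each feed is a list of (trade-key, trade-add-date, trade-price) tuples already sorted by trade-key, with dates in MM/DD/YYYY form as in the sample data. The names parse_date, latest and merge_feeds are made up for the example; latest is the same selection logic the UDF would need.

```python
import heapq
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def parse_date(text):
    """Parse a MM/DD/YYYY string (the format used in the sample feeds)."""
    return datetime.strptime(text, "%m/%d/%Y")


def latest(records):
    """Return the record with the most recent trade-add-date.

    On a tie, max() keeps the first record it sees.
    """
    return max(records, key=lambda rec: parse_date(rec[1]))


def merge_feeds(feed1, feed2):
    """Merge two feeds that are each sorted by trade-key, one record per key."""
    # heapq.merge walks both sorted inputs in key order without re-sorting.
    merged = heapq.merge(feed1, feed2, key=itemgetter(0))
    # Records with the same key are now adjacent, so groupby can collect them.
    return [latest(list(group)) for _, group in groupby(merged, key=itemgetter(0))]


# Sample data from the question; feed2 is sorted by key before merging.
feed1 = [
    ("k1", "05/21/2012", 2000),
    ("k2", "04/21/2012", 3000),
    ("k3", "03/21/2012", 4000),
    ("k4", "05/21/2012", 5000),
]
feed2 = sorted(
    [
        ("k5", "06/22/2012", 1000),
        ("k6", "06/22/2012", 2000),
        ("k1", "06/21/2012", 3000),
    ],
    key=itemgetter(0),
)

result = merge_feeds(feed1, feed2)
for record in result:
    print(*record)
```

For real files you would stream lines from disk instead of building lists, but the grouping and selection steps stay the same.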
Hello TianYi Zhu,
Thanks! I will get back to you.
> by the way, you can sort these 2 files by trade-key then merge them using a small script, that's much faster than using pig.
I'm trying out a POC on updates in Hadoop.
Thanks,
Srinivas
Jonathan Coveney 2012-08-28, 06:47
I would do this with a cogroup. Whether or not you need a UDF depends on whether or not a key can appear more than once in a file.
trade-key trade-add-date trade-price
-- field names use underscores here because Pig identifiers cannot contain hyphens
feed_group = cogroup feed1 by trade_key, feed2 by trade_key;
feed_proj = foreach feed_group generate FLATTEN((IsEmpty(feed2) ? feed1 : feed2));
And there you go (you may need to tweak the flatten to make it work).
It'd be slightly more complicated if you had multiple key/date pairs.
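To make the cogroup logic concrete, here is a small Python sketch that mimics it in memory. This is only an illustration of the semantics, not how Pig executes it. The helper names cogroup and prefer_feed2 are made up, and the feeds are assumed to be lists of (trade-key, trade-add-date, trade-price) tuples. For each key it builds one bag per feed, then keeps feed2's bag when it is non-empty and feed1's bag otherwise.

```python
from collections import defaultdict


def cogroup(feed1, feed2):
    """Map each trade-key to a pair of bags: (records from feed1, records from feed2)."""
    groups = defaultdict(lambda: ([], []))
    for rec in feed1:
        groups[rec[0]][0].append(rec)
    for rec in feed2:
        groups[rec[0]][1].append(rec)
    return groups


def prefer_feed2(feed1, feed2):
    """For each key, keep feed2's records if there are any, otherwise feed1's."""
    groups = cogroup(feed1, feed2)
    merged = []
    for key in sorted(groups):
        bag1, bag2 = groups[key]
        # Same choice as the bincond: an empty feed2 bag falls back to feed1.
        merged.extend(bag1 if not bag2 else bag2)
    return merged


# Sample data from the question.
feed1 = [
    ("k1", "05/21/2012", 2000),
    ("k2", "04/21/2012", 3000),
    ("k3", "03/21/2012", 4000),
    ("k4", "05/21/2012", 5000),
]
feed2 = [
    ("k5", "06/22/2012", 1000),
    ("k6", "06/22/2012", 2000),
    ("k1", "06/21/2012", 3000),
]

merged = prefer_feed2(feed1, feed2)
for record in merged:
    print(*record)
```

Note that the whole chosen bag is kept, so if a key appeared more than once within feed2, all of those records would survive. That is the case where an extra step to pick the latest date would be needed.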
Thank you very much, Jonathan.
pablomar 2012-08-29, 11:04
now I can see it :-) very beautiful place