-
Pig Conditionals (Do I have to use UDFs)?
Eli Finkelshteyn 2011-09-14, 20:27
Hi, I'd like to generate based on exclusive conditions (something like the CASE statement in SQL). An example:
Say I have data that looks like:
(a, 1)
(a, 2)
(b, 2)
(c, 1)
(d, 3)
(d, 4)
And I want to just convert each of the numbers to their written forms to get:
(a, one)
(a, two)
(b, two)
(c, one)
(d, three)
(d, four)
Would I need to write a UDF for that, or is there some simple way to do it using cases? I know I can nest a bunch of conditional (bincond) expressions one inside the other to achieve this, like:
FOREACH rel GENERATE $0, (($1==1) ? 'one' : (($1 == 2) ? 'two' : (($1 == 3) ? 'three' : 'four')));
but that seems too messy. I'd appreciate any advice.
Thanks! Eli
-
Re: Pig Conditionals (Do I have to use UDFs)?
Clay B. 2011-09-14, 20:59
I have done mappings in the past using joins and mapping files too.
E.g. generate a file of mappings, load it as a relation, then join. A rather heavyweight solution, though.
-Clay
-
Re: Pig Conditionals (Do I have to use UDFs)?
Ryan Hoegg 2011-09-14, 21:07
What about putting the mappings into their own relation? I tried this with 0.9.0:
example.txt:
a,1
a,2
b,2
c,1
d,3
d,4

mapping.txt:
1,one
2,two
3,three
4,four

MAPPINGS = LOAD 'mapping.txt' USING PigStorage(',') AS (number:int,name:chararray);
EXAMPLE_SOURCE = LOAD 'example.txt' USING PigStorage(',') AS (item:chararray,number:int);
MAPPED = JOIN EXAMPLE_SOURCE BY number LEFT OUTER, MAPPINGS BY number;
PRETTY = FOREACH MAPPED GENERATE item, name;
DUMP PRETTY;
(a,one)
(c,one)
(a,two)
(b,two)
(d,three)
(d,four)
-- Ryan Hoegg
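
To reason about what the LEFT OUTER join does row by row, here is a minimal plain-Python model of it. This is not Pig code and not how Pig executes a join. The helper name `left_outer_map` is made up for illustration, the data is copied from example.txt and mapping.txt above, and the sketch assumes each number appears at most once in the mapping file.

```python
# Plain-Python model of the LEFT OUTER join above; illustrative only.
# Assumes every number occurs at most once in the mapping (a dict cannot
# represent duplicate keys, which a real join would fan out into extra rows).
example_source = [("a", 1), ("a", 2), ("b", 2), ("c", 1), ("d", 3), ("d", 4)]
mappings = {1: "one", 2: "two", 3: "three", 4: "four"}


def left_outer_map(rows, mapping):
    """Return (item, name) pairs.

    name is None when the number has no entry in the mapping, mirroring the
    null that a left outer join produces for unmatched rows.
    """
    return [(item, mapping.get(number)) for item, number in rows]


pretty = left_outer_map(example_source, mappings)
```

The point of LEFT OUTER here is that a row whose number is missing from the mapping is kept with a null name instead of being dropped.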
-
Re: Pig Conditionals (Do I have to use UDFs)?
Eli Finkelshteyn 2011-09-14, 21:24
Sorry, bad example, I guess. I want something I can do case statements with. In this case I could map instead, but if I wanted to use less straightforward cases (e.g. one case where number == 1, another where number is between 2 and 4, another where number is greater than 5, etc.), it would be much more difficult to do with mapping.
Again, I know this is something I can do with UDFs, but it seemed like something light enough to be built into Pig itself, so I was hoping there was a way to do it without needing to write a UDF every time I have a new transformation to make.
Eli
-
Re: Pig Conditionals (Do I have to use UDFs)?
Ryan Hoegg 2011-09-14, 21:51
What about trying something with SPLIT and UNION:
SPLIT EXAMPLE_SOURCE INTO GOOD IF number>5, BETTER IF (number>=2 AND number<=4), BEST IF (number>=5);
I did a few FOREACH and a UNION, and got this:
(a,6,best)
(b,5,best)
(d,8,best)
(a,6,good)
(d,8,good)
(a,2,better)
(b,2,better)
(c,3,better)
(d,3,better)
(d,4,better)
-- Ryan Hoegg
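
One thing worth noting about the SPLIT above: Pig sends a tuple to every relation whose condition it satisfies, so overlapping conditions (number>5 and number>=5) put the same row into two outputs. That is why (a,6) and (d,8) show up as both good and best. Below is a small plain-Python model of that behavior. The input rows are inferred from the output above, and the helper name `split_rows` is made up for illustration.

```python
# Plain-Python model of Pig's SPLIT semantics; illustrative only.
# Input rows are inferred from the output shown above.
rows = [("a", 6), ("b", 5), ("d", 8), ("a", 2), ("b", 2),
        ("c", 3), ("d", 3), ("d", 4)]

# Same three conditions as in the SPLIT statement.
conditions = {
    "good": lambda n: n > 5,
    "better": lambda n: 2 <= n <= 4,
    "best": lambda n: n >= 5,
}


def split_rows(rows, conditions):
    """Send each row to every branch whose condition holds, not just the first."""
    out = {label: [] for label in conditions}
    for item, number in rows:
        for label, cond in conditions.items():
            if cond(number):
                out[label].append((item, number, label))
    return out


branches = split_rows(rows, conditions)
```

For truly exclusive cases, the conditions themselves would have to be disjoint, since nothing in SPLIT stops at the first match.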
-
Re: Pig Conditionals (Do I have to use UDFs)?
Eli Finkelshteyn 2011-09-14, 21:53
Ah, neat! That would do the trick. Seems like a lot of extra steps, but I'll take it if that's how it's done in Pig. Thanks!
-
Re: Pig Conditionals (Do I have to use UDFs)?
Dmitriy Ryaboy 2011-09-14, 21:55
There's a fair bit of overhead there.
UDFs are OK and normal in Pig. Everything is done with them. Don't be afraid of UDFs :).
There's some pain with the compile cycle (edit code in Java, test, compile, jar, register...). That's where inline Python UDFs come in handy!
D
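
As a rough sketch of the inline Python route, the function below implements the range-based cases Eli described earlier in the thread. The function name `classify`, the labels it returns, and the file name `cases.py` are all made up for illustration. The `outputSchema` decorator is supplied by Pig when the script is registered as a Jython UDF; the fallback definition is only there so the file also runs under plain Python for testing.

```python
# cases.py -- hypothetical file name; a sketch, not a tested Pig deployment.
# In a Pig script it would be registered roughly like this:
#   REGISTER 'cases.py' USING jython AS cases;
#   out = FOREACH rel GENERATE $0, cases.classify($1);

try:
    outputSchema  # injected by Pig's Jython engine when registered as a UDF
except NameError:
    def outputSchema(schema):
        """No-op stand-in so the module also runs outside Pig."""
        def decorator(func):
            return func
        return decorator


@outputSchema("label:chararray")
def classify(number):
    """Return the label of the first matching case, like a SQL CASE."""
    if number is None:
        return None  # pass nulls through untouched
    if number == 1:
        return "one"
    if 2 <= number <= 4:
        return "two_to_four"
    if number > 5:
        return "over_five"
    return "other"  # e.g. 5, which none of the cases above cover
```

Because the function returns at the first match, the cases are exclusive by construction, and adding a new case is a one-line edit with no compile or jar step.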