-
Does pig support in clause? yonghu 2012-06-25, 09:50
Dear all,

SQL has an IN clause that checks whether a value is in a set. Does Pig have an equivalent? For example:

B = filter A by A1 in C;

Here A, B, and C are relation names and A1 is a column name of A.

Thanks!

Yong
-
Re: Does pig support in clause? Hien Luu 2012-06-25, 14:56
No, Pig currently has no support for an IN clause. I had the same question the other day. The alternative is to use a UDF. I wonder if it makes sense to add this support in a future version of Pig.

Hien
-
Re: Does pig support in clause? Alan Gates 2012-06-25, 16:39
This type of in is really a semi-join. So you could rewrite this as:

B1 = join A by A1, C by A1;
B2 = filter B1 by SIZE(C) > 0;
B = foreach B2 flatten(A);

Alan.
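
To make the semi-join semantics concrete, here is a minimal sketch in Python rather than Pig. The helper names `inner_join` and `semi_join` and the sample rows are assumptions made up for illustration; they are not part of Pig. The sketch shows that a semi-join returns each row of A at most once when its key occurs in C, whereas a plain inner join repeats a row of A for every duplicate key in C.

```python
def inner_join(a_rows, c_rows, key):
    """Plain inner join: one output row per matching (a, c) pair."""
    return [(a, c) for a in a_rows for c in c_rows if a[key] == c[key]]


def semi_join(a_rows, c_rows, key):
    """Semi-join: keep each row of A whose key appears in C, exactly once."""
    c_keys = {c[key] for c in c_rows}  # one entry per distinct key is enough
    return [a for a in a_rows if a[key] in c_keys]


# Hypothetical sample data; C deliberately contains the key 2 twice.
A = [{"A1": 1, "name": "x"}, {"A1": 2, "name": "y"}, {"A1": 5, "name": "z"}]
C = [{"A1": 2}, {"A1": 2}, {"A1": 3}]

joined = inner_join(A, C, "A1")  # the row with A1 == 2 appears twice
kept = semi_join(A, C, "A1")     # the row with A1 == 2 appears once
print(len(joined), kept)
```

This is only a model of the intended result (the SQL-style `A1 in C`), not a claim about how Pig would execute it.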
-
Re: Does pig support in clause? Russell Jurney 2012-06-25, 17:17
This could be a cool rewrite feature like CUBE/SAMPLE.

Russell Jurney http://datasyndrome.com
-
Re: Does pig support in clause? Alan Gates 2012-06-25, 17:50
Agreed. And with some optimization we could make semi-join more efficient than this since it only needs to keep one record per key per map instead of all the records for a key.

Alan.
-
Re: Does pig support in clause? Gianmarco De Francisci Mo... 2012-06-26, 05:56
Bloom filters would help efficiency here.
A bloom join or semi-join would be a nice addition to Pig.

Cheers,
--
Gianmarco
-
Re: Does pig support in clause? Alan Gates 2012-06-26, 16:56
As of 0.10 there are UDFs for building bloom filters. Those could be used to construct a bloom join.

Alan.
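
For readers unfamiliar with the idea, the following is a rough Python sketch of how a Bloom filter can pre-filter one side of a semi-join. It does not use Pig's Bloom filter UDFs; the `BloomFilter` class, its parameters (`num_bits`, `num_hashes`), the function `bloom_semi_join`, and the sample keys are all assumptions chosen for illustration. A Bloom filter can return false positives but never false negatives, so an exact check is still applied to the surviving candidates.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


def bloom_semi_join(a_keys, c_keys):
    """Keep the keys of A that also occur in C, using the filter as a first pass."""
    bloom = BloomFilter()
    for k in c_keys:
        bloom.add(k)
    candidates = [k for k in a_keys if bloom.might_contain(k)]  # cheap pre-filter
    exact = set(c_keys)  # exact check removes any false positives
    return [k for k in candidates if k in exact]


# Hypothetical sample keys.
result = bloom_semi_join([12, 2, 13, 1, 10, 9, 11, 8], range(1, 10))
print(result)  # [2, 1, 9, 8]
```

In a single process the exact `set` makes the filter redundant; the sketch only illustrates the two-stage shape. In a distributed job the compact filter would presumably be shipped to every map task so that most non-matching records are dropped before the shuffle, and the join itself would serve as the exact check.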
-
Re: Does pig support in clause? Johannes Schwenk 2012-07-04, 12:42
Hi Alan,

I'd like to use this method to exclude records from my output that are already present in previously computed data. So I tried to use your suggestion like this:

grunt> cat in.dat
1
2
3
4
5
6
7
8
9
grunt> C = LOAD 'in.dat' AS (A1); -- previously generated data
grunt> cat in2.dat
12
2
13
1
10
9
11
8
grunt> A = LOAD 'in2.dat' AS (A1); -- new data
grunt> B1 = join A by A1, C by A1;
grunt> B2 = filter B1 by SIZE(C) == 0;

Which gives me this error:

2012-07-04 14:36:16,768 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 14, column 23> Invalid scalar projection: C : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /home/schwenk/pig-0.10.0/pig_1341403702015.log

The relevant pig stack trace from the logfile can be found at

http://pastebin.com/MxPfduWS

What am I doing wrong?

Greetings,
Johannes

Johannes Schwenk
--
Software Developer (Reporting)
________________________________________________________
ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg
http://www.adition.com
T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION
(landline rate 14 ct/min; mobile rates at most 42 ct/min)
Registered at the Amtsgericht Düsseldorf under HRB 54076
Executive board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the supervisory board: Daniel Raimer, attorney
VAT ID: DE 218 858 434
-
Re: Does pig support in clause? Ruslan Al-Fakikh 2012-07-04, 13:53
Hi Johannes,

Try this:

C = LOAD 'in.dat' AS (A1);
A = LOAD 'in2.dat' AS (A1);

joined = JOIN A BY A1 LEFT OUTER, C BY A1;

DESCRIBE joined;

newEntries = FILTER joined BY C::A1 IS NULL;

DUMP newEntries;

Ruslan
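
The left-outer-join-plus-null-filter pattern above is an anti-join: it keeps the rows of A that have no partner in C. A minimal Python model of that logic, using the sample data from the `in.dat` and `in2.dat` listings in this thread, might look as follows. The function name `left_outer_join` is a made-up helper, and the sketch models only the result, not Pig's execution.

```python
def left_outer_join(a_keys, c_keys):
    """Pair every key of A with each equal key of C, or with None if there is none."""
    rows = []
    for a in a_keys:
        matches = [c for c in c_keys if c == a]
        if matches:
            rows.extend((a, c) for c in matches)
        else:
            rows.append((a, None))  # unmatched rows get a null on the C side
    return rows


C = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # in.dat: previously generated data
A = [12, 2, 13, 1, 10, 9, 11, 8]  # in2.dat: new data

joined = left_outer_join(A, C)
# Counterpart of: FILTER joined BY C::A1 IS NULL
new_entries = [a for a, c in joined if c is None]
print(new_entries)  # [12, 13, 10, 11]
```

Two caveats: Pig does not guarantee the output order shown here, and in the Pig version `newEntries` still carries both columns, so a final projection such as `FOREACH newEntries GENERATE A::A1;` would be needed to get only the new keys.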
-
Re: Does pig support in clause? Johannes Schwenk 2012-07-04, 14:01
Thank you very much Ruslan! That works well!

Greetings,
Johannes