|
|
-
Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Florian Zumkeller-Quast 2012-07-25, 09:48
Hello, I got the following code: A = LOAD '§file1' USING AvroStorage(); B = LOAD '$file2' USING AvroStorage(); C = JOIN A BY id LEFT OUTER, B BY id; SPLIT C INTO D IF B::id IS NULL, E OTHERWISE; DESCRIBE shows the following data structure D: {A::id: long,A::time: int,B::id: long,B::time: int} E: {A::id: long,A::time: int,B::id: long,B::time: int} But i can't store D and E using AvroStorage because the filed names contain "::" which is not an allowed character. I need structure like F: {id: long,time: int} where id = E::A::id and time = E::A::time. The problem is: The number, name and type of fields may vary. So E might looks like E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int} Thus I can't use F = FOREACH … GENERATE …; because i don't want to write code for each filetype as long as I don't really need to. Can someone give me an advice how to get the result I need? Thanks! With kind regards Florian Zumkeller-Quast -- Developer ________________________________________________________ ADITION technologies AG Schwarzwaldstraße 78b 79117 Freiburg http://www.adition.com T +49 / (0)761 / 88147 - 30 F +49 / (0)761 / 88147 - 77 SUPPORT +49 / (0)1805 - ADITION (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer UStIDNr.: DE 218 858 434
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Alan Gates 2012-07-25, 14:22
Basically you need to transform the schema, not the data. The easiest way I can think of to do that is to use a UDF that has an outputSchema function that renames columns. The exec call can then be a simple pass through. If you wanted to you could have it consolidate the join keys. You imply you would like to consolidate other columns as well (A::E::time in your example), but that is not valid. Since time is not a join key it will not necessarily be the same in A and E. Alan. On Jul 25, 2012, at 2:48 AM, Florian Zumkeller-Quast wrote: > Hello, > I got the following code: > > A = LOAD '§file1' USING AvroStorage(); > B = LOAD '$file2' USING AvroStorage(); > C = JOIN A BY id LEFT OUTER, B BY id; > SPLIT C INTO D IF B::id IS NULL, E OTHERWISE; > > DESCRIBE shows the following data structure > > D: {A::id: long,A::time: int,B::id: long,B::time: int} > E: {A::id: long,A::time: int,B::id: long,B::time: int} > > But i can't store D and E using AvroStorage because the filed names contain > "::" which is not an allowed character. > > I need structure like > F: {id: long,time: int} > where id = E::A::id and time = E::A::time. > > The problem is: The number, name and type of fields may vary. > > So E might looks like > E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int} > > Thus I can't use > > F = FOREACH … GENERATE …; > > because i don't want to write code for each filetype as long as I don't really > need to. > > Can someone give me an advice how to get the result I need? > > Thanks! > > With kind regards > Florian Zumkeller-Quast > -- > Developer > ________________________________________________________ > > ADITION technologies AG > Schwarzwaldstraße 78b > 79117 Freiburg > > http://www.adition.com> > T +49 / (0)761 / 88147 - 30 > F +49 / (0)761 / 88147 - 77 > SUPPORT +49 / (0)1805 - ADITION > > (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) > > Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 > Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter > Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer > UStIDNr.: DE 218 858 434
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Alex Rovner 2012-07-26, 13:54
I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". Thoughts? Sent from my iPhone On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <[EMAIL PROTECTED]> wrote: > Hello, > I got the following code: > > A = LOAD '§file1' USING AvroStorage(); > B = LOAD '$file2' USING AvroStorage(); > C = JOIN A BY id LEFT OUTER, B BY id; > SPLIT C INTO D IF B::id IS NULL, E OTHERWISE; > > DESCRIBE shows the following data structure > > D: {A::id: long,A::time: int,B::id: long,B::time: int} > E: {A::id: long,A::time: int,B::id: long,B::time: int} > > But i can't store D and E using AvroStorage because the filed names contain > "::" which is not an allowed character. > > I need structure like > F: {id: long,time: int} > where id = E::A::id and time = E::A::time. > > The problem is: The number, name and type of fields may vary. > > So E might looks like > E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int} > > Thus I can't use > > F = FOREACH … GENERATE …; > > because i don't want to write code for each filetype as long as I don't really > need to. > > Can someone give me an advice how to get the result I need? > > Thanks! > > With kind regards > Florian Zumkeller-Quast > -- > Developer > ________________________________________________________ > > ADITION technologies AG > Schwarzwaldstraße 78b > 79117 Freiburg > > http://www.adition.com> > T +49 / (0)761 / 88147 - 30 > F +49 / (0)761 / 88147 - 77 > SUPPORT +49 / (0)1805 - ADITION > > (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) > > Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 > Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter > Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer > UStIDNr.: DE 218 858 434
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Alan Gates 2012-07-26, 15:01
How will you handle ambiguities when there is an A::b and B::b? Alan. On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote: > I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". > > Thoughts? > > Sent from my iPhone > > On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <[EMAIL PROTECTED]> wrote: > >> Hello, >> I got the following code: >> >> A = LOAD '§file1' USING AvroStorage(); >> B = LOAD '$file2' USING AvroStorage(); >> C = JOIN A BY id LEFT OUTER, B BY id; >> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE; >> >> DESCRIBE shows the following data structure >> >> D: {A::id: long,A::time: int,B::id: long,B::time: int} >> E: {A::id: long,A::time: int,B::id: long,B::time: int} >> >> But i can't store D and E using AvroStorage because the filed names contain >> "::" which is not an allowed character. >> >> I need structure like >> F: {id: long,time: int} >> where id = E::A::id and time = E::A::time. >> >> The problem is: The number, name and type of fields may vary. >> >> So E might looks like >> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int} >> >> Thus I can't use >> >> F = FOREACH … GENERATE …; >> >> because i don't want to write code for each filetype as long as I don't really >> need to. >> >> Can someone give me an advice how to get the result I need? >> >> Thanks! >> >> With kind regards >> Florian Zumkeller-Quast >> -- >> Developer >> ________________________________________________________ >> >> ADITION technologies AG >> Schwarzwaldstraße 78b >> 79117 Freiburg >> >> http://www.adition.com>> >> T +49 / (0)761 / 88147 - 30 >> F +49 / (0)761 / 88147 - 77 >> SUPPORT +49 / (0)1805 - ADITION >> >> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) >> >> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 >> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter >> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer >> UStIDNr.: DE 218 858 434
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Russell Jurney 2012-07-26, 15:13
I transform my schemas to not have the Avro invalid character, ':' in them, before I store. For example: >>> D: {A::id: long,A::time: int,B::id: long,B::time: int} D = foreach D generate A::id as a_id, A::time as a_time, B::id as b_id, B::time as b_time; You might try creating tuples for A and B, then you could access the field names as A.id, A.time, B.id, B.time. For example: D = foreach D generate ToTuple(A::id, A::time) as A, ToTuple(B::id, B::time) as B; That will store too, and should be scriptable with a macro? Russell Jurney http://datasyndrome.comOn Jul 26, 2012, at 8:01 AM, Alan Gates <[EMAIL PROTECTED]> wrote: > How will you handle ambiguities when there is an A::b and B::b? > > Alan. > > On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote: > >> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". >> >> Thoughts? >> >> Sent from my iPhone >> >> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <[EMAIL PROTECTED]> wrote: >> >>> Hello, >>> I got the following code: >>> >>> A = LOAD '§file1' USING AvroStorage(); >>> B = LOAD '$file2' USING AvroStorage(); >>> C = JOIN A BY id LEFT OUTER, B BY id; >>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE; >>> >>> DESCRIBE shows the following data structure >>> >>> D: {A::id: long,A::time: int,B::id: long,B::time: int} >>> E: {A::id: long,A::time: int,B::id: long,B::time: int} >>> >>> But i can't store D and E using AvroStorage because the filed names contain >>> "::" which is not an allowed character. >>> >>> I need structure like >>> F: {id: long,time: int} >>> where id = E::A::id and time = E::A::time. >>> >>> The problem is: The number, name and type of fields may vary. >>> >>> So E might looks like >>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int} >>> >>> Thus I can't use >>> >>> F = FOREACH … GENERATE …; >>> >>> because i don't want to write code for each filetype as long as I don't really >>> need to. >>> >>> Can someone give me an advice how to get the result I need? >>> >>> Thanks! >>> >>> With kind regards >>> Florian Zumkeller-Quast >>> -- >>> Developer >>> ________________________________________________________ >>> >>> ADITION technologies AG >>> Schwarzwaldstraße 78b >>> 79117 Freiburg >>> >>> http://www.adition.com>>> >>> T +49 / (0)761 / 88147 - 30 >>> F +49 / (0)761 / 88147 - 77 >>> SUPPORT +49 / (0)1805 - ADITION >>> >>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) >>> >>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 >>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter >>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer >>> UStIDNr.: DE 218 858 434 >
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Florian Zumkeller-Quast 2012-07-26, 15:55
Am Mittwoch, 25. Juli 2012, 16:22:01 schrieb Alan Gates: > Basically you need to transform the schema, not the data. The easiest way > I can think of to do that is to use a UDF that has an outputSchema > function that renames columns. The exec call can then be a simple pass > through. How do I pass the schema to the UDF? It is clear how the UDF code could look like. But how does the corresponding pig script code has to look like? A = LOAD … B = LOAD … C= JOIN A, B … D = myUDF( C::A); does obviously not work because there is no alias C::A D = myUDF( C, 'A'); doesn't work either (Syntax error) The last shown variant shown here is the first i thought of. It think it would be the most intuitive way D = FOREACH C GENERATE myUDF( C, 'A'); But this is also invalid: Invalid scalar projection: d : A column needs to be projected from a relation for it to be used as a scalar. So, my question ist: How do I pass the schema/data to my UDF if i don't know the number or names of the parameters? With kind regards Florian Zumkeller-Quast -- Developer ________________________________________________________ ADITION technologies AG Schwarzwaldstraße 78b 79117 Freiburg http://www.adition.com T +49 / (0)761 / 88147 - 30 F +49 / (0)761 / 88147 - 77 SUPPORT +49 / (0)1805 - ADITION (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer UStIDNr.: DE 218 858 434
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Alex Rovner 2012-07-27, 14:01
We have two options here: 1. Fail in this case with appropriate error message 2. Or simply replace "::" with an allowable character. This way the relation is preserved and everyone is happy. Sent from my iPhone On Jul 26, 2012, at 11:01 AM, Alan Gates <[EMAIL PROTECTED]> wrote: > How will you handle ambiguities when there is an A::b and B::b? > > Alan. > > On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote: > >> I am proposing to patch avrostorage to have an option of storing field names without their relation name. A::b will be saved as "b". >> >> Thoughts? >> >> Sent from my iPhone >> >> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" <[EMAIL PROTECTED]> wrote: >> >>> Hello, >>> I got the following code: >>> >>> A = LOAD '§file1' USING AvroStorage(); >>> B = LOAD '$file2' USING AvroStorage(); >>> C = JOIN A BY id LEFT OUTER, B BY id; >>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE; >>> >>> DESCRIBE shows the following data structure >>> >>> D: {A::id: long,A::time: int,B::id: long,B::time: int} >>> E: {A::id: long,A::time: int,B::id: long,B::time: int} >>> >>> But i can't store D and E using AvroStorage because the filed names contain >>> "::" which is not an allowed character. >>> >>> I need structure like >>> F: {id: long,time: int} >>> where id = E::A::id and time = E::A::time. >>> >>> The problem is: The number, name and type of fields may vary. >>> >>> So E might looks like >>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 int} >>> >>> Thus I can't use >>> >>> F = FOREACH … GENERATE …; >>> >>> because i don't want to write code for each filetype as long as I don't really >>> need to. >>> >>> Can someone give me an advice how to get the result I need? >>> >>> Thanks! >>> >>> With kind regards >>> Florian Zumkeller-Quast >>> -- >>> Developer >>> ________________________________________________________ >>> >>> ADITION technologies AG >>> Schwarzwaldstraße 78b >>> 79117 Freiburg >>> >>> http://www.adition.com>>> >>> T +49 / (0)761 / 88147 - 30 >>> F +49 / (0)761 / 88147 - 77 >>> SUPPORT +49 / (0)1805 - ADITION >>> >>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) >>> >>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 >>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter >>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer >>> UStIDNr.: DE 218 858 434 >
-
Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix
Florian Zumkeller-Quast 2012-07-27, 15:34
Am Donnerstag, 26. Juli 2012, 17:55:48 schrieb Florian Zumkeller-Quast: > Am Mittwoch, 25. Juli 2012, 16:22:01 schrieb Alan Gates: > > Basically you need to transform the schema, not the data. The easiest > > way I can think of to do that is to use a UDF that has an outputSchema > > function that renames columns. The exec call can then be a simple pass > > through. > > How do I pass the schema to the UDF? The solution: D = FOREACH C GENERATE myUDF(*); This passes an unknown amount of uknown field names to the UDF as Tuple. I just have to return it as Tuple. If i try to store the return Tuple, I get an avro schema like: But because I need the schema to be the same as the input schema (and this one differs because of the "null" and because of the data stored as tuple per record. ["null",{"type":"record","name":"TUPLE_0","fields":[{"name":"id","type": ["null","long"],"doc":"autogenerated from Pig Field Schema"}, {"name":"time","type":["null","int"],"doc":"autogenerated from Pig Field Schema"}]}] Dumped with Pig: ((id, time, …)) My idea was to flatten it. D = FOREACH C GENERATE flatten(myUDF(*)); This doesn't work either. When I am trying to store it, it get the old error again: ERROR 2999: Unexpected internal error. Illegal character in: null::id So I'm back at the beginning - with the same but slightly different error. Can someone give me an advice how to trip the "null" namespace? The modified Storage was already proposed in this thread. Because i got no duplicate names inside the tuple, that might be an solution. But is it the only one? Has anybody here an advice for me? How do I get the inner tuple data as record with the schema like it already is? Without knowing names and number of fields! With king regards Florian Zumkeller-Quast -- Developer ________________________________________________________ ADITION technologies AG Schwarzwaldstraße 78b 79117 Freiburg http://www.adition.com T +49 / (0)761 / 88147 - 30 F +49 / (0)761 / 88147 - 77 SUPPORT +49 / (0)1805 - ADITION (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min) Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076 Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer UStIDNr.: DE 218 858 434
|
|