-
Merge join
Ankur Jain 2011-07-19, 21:28
Hi all,

I'm trying to do a map-side-only merge join [1] in Pig using Zebra's TableLoader. (My data allows a merge join.) However, I am unable to use the TableLoader. Even a simple script that loads a table and just stores it back doesn't work:

----
A = load 'my_input' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
store A into 'my_output';
----

'my_input' is an input directory containing a single file with just one column:

---
1
2
3
---

The error I get is:

"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Failed to find deleted column groupsjava.io.IOException: BT Schema file doesn't exist: *file:/......./my_input/.btschema*"

I have tried specifying the schema using the 'AS' clause and the DESCRIBE statement as well, but I get the same error. Is the .btschema file required? Is there any documentation available on its format? (I tried comma-separated column names with and without type info.)

I am also willing to work with any other loader that satisfies the merge join constraints. Thanks in anticipation.

Regards,
Ankur

[1] http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins
-
Re: Merge join
Ashutosh Chauhan 2011-07-20, 17:22
Hey Ankur,

Zebra's TableLoader only works with data that was written out using Zebra's TableStorer. So you need to write the data with Zebra first, then load it using TableLoader and do the merge join.

Ashutosh
-
Re: Merge join
Ankur Jain 2011-07-20, 19:13
Thanks Ashutosh! Right, I realized that too yesterday. So, is there any other loader that implements the CollectableLoadFunc interface required by the merge join?

Thanks,
Ankur
-
Re: Merge join
Tomas Svarovsky 2011-07-20, 19:54
Not sure if this is helpful, but the docs say that the default PigStorage does implement that. I guess your data needs to be already sorted if you do not want to go through the reduce phase during the join.

T
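
To make the point about sorted input concrete, here is a minimal Python sketch of the idea behind an inner merge join. It is not Pig code and not how Pig implements it; the function name `merge_join` and the sample rows are made up for illustration. It assumes both inputs are lists of (key, value) pairs already sorted by key, which is what lets a single forward pass replace the shuffle and reduce step.

```python
def merge_join(left, right):
    """Inner join of two lists of (key, value) pairs, both sorted by key.

    Illustrative only: walks both inputs once, front to back.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # no partner on the right for this left row
        elif left_key > right_key:
            j += 1  # no partner on the left for this right row
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row of this key with every right row of it.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    out.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


if __name__ == "__main__":
    a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    b = [(2, "x"), (3, "y"), (4, "z")]
    print(merge_join(a, b))
```

If either input were unsorted, the single pass would silently miss matches, which is why the join otherwise has to fall back to grouping the keys in a reduce phase.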
-
Re: Merge join
Ashutosh Chauhan 2011-07-20, 20:15
It depends on whether you want to do an inner or an outer (also called co-group) merge join. If you are doing an inner merge join on two tables, PigStorage satisfies all the criteria and can be used. If you want to do an outer merge join (or an inner merge join on more than two tables), then you need CollectableLoadFunc, which PigStorage doesn't implement and only Zebra's TableLoader does.

Hope it helps,
Ashutosh
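
The "outer (also called co-group)" wording can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not Pig's implementation: the names `grouped`, `merge_cogroup` and `full_outer_join` are hypothetical, and both inputs are assumed to be lists of (key, value) pairs sorted by key. The sketch first collects all values per key on each side, then merges the two grouped streams, keeping keys that appear on only one side.

```python
from itertools import groupby
from operator import itemgetter


def grouped(rows):
    """Collapse sorted (key, value) rows into [(key, [values...]), ...]."""
    return [(k, [v for _, v in grp]) for k, grp in groupby(rows, key=itemgetter(0))]


def merge_cogroup(left, right):
    """Return (key, left_values, right_values) for every key in either input."""
    lg, rg = grouped(left), grouped(right)
    out = []
    i = j = 0
    while i < len(lg) or j < len(rg):
        if j == len(rg) or (i < len(lg) and lg[i][0] < rg[j][0]):
            out.append((lg[i][0], lg[i][1], []))  # key only on the left
            i += 1
        elif i == len(lg) or rg[j][0] < lg[i][0]:
            out.append((rg[j][0], [], rg[j][1]))  # key only on the right
            j += 1
        else:
            out.append((lg[i][0], lg[i][1], rg[j][1]))  # key on both sides
            i += 1
            j += 1
    return out


def full_outer_join(left, right):
    """Flatten the co-group into full-outer-join rows, padding with None."""
    rows = []
    for key, lvals, rvals in merge_cogroup(left, right):
        for lv in (lvals or [None]):
            for rv in (rvals or [None]):
                rows.append((key, lv, rv))
    return rows


if __name__ == "__main__":
    a = [(1, "a"), (2, "b")]
    b = [(2, "x"), (3, "y")]
    print(full_outer_join(a, b))
```

Note that the co-group step needs every row of a key on hand before it can emit that key. In this sketch that is trivially true because both inputs sit in memory; in a distributed job it plausibly becomes a constraint on how the loader cuts its input, which would fit the extra loader requirement described above.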
-
Re: Merge join
Ankur Jain 2011-07-20, 21:16
Yeah, I need a (full) outer join, which has this constraint on the loader.

Thanks.
-
Re: Merge join
Ashutosh Chauhan 2011-07-20, 21:21
If you control the generation of the data that needs to be joined, then you can store it with Zebra and then do the joins. If not, then you either need to rewrite the data using Zebra or implement another loader that implements CollectableLoadFunc.

Ashutosh
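
As a rough picture of what a custom loader would have to arrange, the Python sketch below cuts sorted rows into chunks without ever separating rows that share a key. This rests on an assumption that is not stated in the thread: that the contract of CollectableLoadFunc amounts to keeping all instances of a key in the same input split. The function name `split_on_key_boundaries` and the `target_size` parameter are hypothetical, and a real loader would be written in Java against Pig's interfaces rather than like this.

```python
def split_on_key_boundaries(rows, target_size):
    """Cut sorted (key, value) rows into chunks of roughly target_size rows,
    never separating rows that share a key (assumed loader contract)."""
    splits, current = [], []
    for row in rows:
        # Only start a new chunk when the current one is full enough
        # AND the key changes, so a key never straddles two chunks.
        if len(current) >= target_size and row[0] != current[-1][0]:
            splits.append(current)
            current = []
        current.append(row)
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    data = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (3, "e")]
    for chunk in split_on_key_boundaries(data, target_size=2):
        print(chunk)
```

The first chunk grows past `target_size` because all three rows for key 1 must stay together; that trade-off between even chunk sizes and key alignment is the kind of decision such a loader would have to make.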
-
Re: Merge join
Ankur Jain 2011-07-20, 22:48
Thanks Ashutosh. Let me reconsider the options available to me.

-Ankur