-
Partition keys in LoadMetadata is broken in 0.10? Stan Rosenberg 2011-12-07, 17:24
Hi,
I am trying to implement a loader which is partition-aware. As prescribed, my loader implements LoadMetadata, however, getPartitionKeys is never invoked. The script is of this form:

    X = LOAD 'input' USING MyLoader();
    X = FILTER X BY partition_col == 'some_string';

and the schema returned by MyLoader.getSchema includes the column 'partition_col' which is of type 'chararray'.

After debugging pig, I have found what appears to be a bug in the new code (version 0.10 snapshot and also in 0.9.1). The reason MyLoader.getPartitionKeys is never invoked is due to the wrongfully inserted 'foreach' after the 'load' and before the 'filter'. The code in TypeCastInserterTransformer.check used to return 'false' if the schemas matched or all fields were of type 'bytearray'; cf. pig version 0.8.1. Effectively, the above script gets transformed into:

    X = LOAD 'input' USING MyLoader();
    X = FOREACH X GENERATE ...;
    X = FILTER X BY partition_col == 'some_string';

Subsequently, PartitionFilterPushDownTransformer.check observes that the immediate successor of 'load' is _not_ 'filter', whence getPartitionKeys is never invoked.

Any suggestions?

Thanks,

stan

P.S. While in the above case the 'foreach' can be avoided, in general typecasting may need to be performed if the user-provided schema does not match the one returned by the loader. I think the general case needs to be handled correctly, perhaps by ignoring all synthetic operators after the 'load'. (This is just a wild guess.)
-
Re: Partition keys in LoadMetadata is broken in 0.10? Daniel Dai 2012-01-01, 02:42
Hi, Stan,
Foreach is inserted only if you have "as" in the "load" statement. This is to ensure that the loaded data conforms to the "as" clause. At one point there was a bug in the implementation; this should be fixed by PIG-2346, and the fix will be included in all subsequent releases.

Thanks,
Daniel

On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
> Howdy All,
>
> I am resurrecting my previous message sent to the list on Dec. 7. Let me first summarize. In a nutshell, as far as I can tell, partition-aware loading is broken in pig, and the culprit is PIG-1188 wherein the final decision was to introduce project & cast, i.e., foreach, after load. There are two problems with that approach.
>
> First, as indicated in my original message, 'getPartitionKeys' is never invoked because instead of the expected instruction sequence 'load; filter', PIG-1188 changed it to 'load; foreach; filter'. Second, if a loader already happens to project & cast in order to adhere the data to the schema, then the foreach synthesized by pig is a waste of time.
>
> Essentially, we had to undo the patch in 'PIG-1188' in order to get partition filters to work; this enabled us to implement a HiveLoader very much like HCatLoader which incidentally is also broken for the very same reason. This is obviously a hack and a real solution is needed. If the decision made in PIG-1188 cannot be re-considered, then I suggest that we revisit the logic which is used to pass partition filters to partition-aware loaders.
>
> Many thanks!
>
> stan
-
Re: Partition keys in LoadMetadata is broken in 0.10? Stan Rosenberg 2012-01-01, 03:34
Hi Daniel,
Thanks for pointing out PIG-2346. However, what happens if the user decides to rename some of the fields using the 'as' statement? We have the same problem, i.e., 'foreach' is generated. As a heuristic, perhaps synthesized operators should be marked as such. This way, pig can skip synthesized operators when trying to match the sequence 'load; filter'.

Another alternative is to create a new keyword, say 'where', to be used for specifying partitions. E.g.,

    A = load 'daily_activity' from HiveLoader where date_partition > 20110101 and date_partition <= 20110201;

stan
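The heuristic proposed in this message (mark synthesized operators and skip them when matching 'load; filter') can be illustrated with a minimal sketch. Everything here is a toy model: the `Op` class, the `synthetic` flag, and both matcher methods are hypothetical names for illustration and are not Pig's actual operator or optimizer classes.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of a linear logical plan; these are NOT Pig's real classes. */
final class SyntheticSkipSketch {

    /** Hypothetical operator: a kind plus a flag saying whether the planner inserted it. */
    static final class Op {
        final String kind;        // "load", "foreach", "filter", ...
        final boolean synthetic;  // true if inserted by the planner, not written by the user

        Op(String kind, boolean synthetic) {
            this.kind = kind;
            this.synthetic = synthetic;
        }
    }

    /** Strict match: the operator directly after the load must be a filter. */
    static boolean strictMatch(List<Op> plan) {
        return plan.size() >= 2
                && plan.get(0).kind.equals("load")
                && plan.get(1).kind.equals("filter");
    }

    /** Relaxed match: skip synthetic operators after the load, then look for a filter. */
    static boolean matchSkippingSynthetic(List<Op> plan) {
        if (plan.isEmpty() || !plan.get(0).kind.equals("load")) {
            return false;
        }
        int i = 1;
        while (i < plan.size() && plan.get(i).synthetic) {
            i++;
        }
        return i < plan.size() && plan.get(i).kind.equals("filter");
    }

    public static void main(String[] args) {
        List<Op> plan = Arrays.asList(
                new Op("load", false),
                new Op("foreach", true),   // cast/project inserted by the planner
                new Op("filter", false));
        System.out.println("strict:  " + strictMatch(plan));
        System.out.println("relaxed: " + matchSkippingSynthetic(plan));
    }
}
```

The sketch deliberately ignores one complication: if the skipped 'foreach' renames columns, the filter's column names would have to be mapped back to the loader's names before the filter could be handed to the loader.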
-
Re: Partition keys in LoadMetadata is broken in 0.10? Stan Rosenberg 2012-01-01, 03:37
Just to be clear, the concrete syntax had a typo; should have been:
A = load 'daily_activity' USING HiveLoader WHERE date_partition >20110101 and date_partition <= 20110201;
-
Re: Partition keys in LoadMetadata is broken in 0.10? Daniel Dai 2012-01-01, 08:36
Hi, Stan,
I missed one point in my previous mail. We apply the PushUpFilter rule first, so in most cases the filter will be pushed in front of the added ForEach. There was also a bug here earlier (see PIG-2339), but the current code should be fixed. So even if you use an "as" clause to change the names, the partition filter should still apply.

Daniel
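Why rule ordering matters here can be shown with a small sketch. It is a toy model under stated assumptions: the plan is a flat list of operator names, and `pushUpFilter` and `partitionFilterApplies` are hypothetical stand-ins, not the actual PushUpFilter or partition filter rules in Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of optimizer rule ordering; these are NOT Pig's real rules. */
final class RuleOrderSketch {

    /**
     * Stand-in for a filter push-up rule: in one pass, swap each filter with a
     * foreach that directly precedes it. A real rule would first check that the
     * filter can still be evaluated before the foreach (renames, casts).
     */
    static List<String> pushUpFilter(List<String> plan) {
        List<String> out = new ArrayList<>(plan);
        for (int i = 1; i < out.size(); i++) {
            if (out.get(i).equals("filter") && out.get(i - 1).equals("foreach")) {
                Collections.swap(out, i - 1, i);
            }
        }
        return out;
    }

    /** Stand-in for the partition filter check: load must be directly followed by filter. */
    static boolean partitionFilterApplies(List<String> plan) {
        return plan.size() >= 2
                && plan.get(0).equals("load")
                && plan.get(1).equals("filter");
    }

    public static void main(String[] args) {
        // Plan after the cast/project foreach has been inserted behind the load.
        List<String> plan = Arrays.asList("load", "foreach", "filter");
        System.out.println("before push-up: " + partitionFilterApplies(plan));
        List<String> pushed = pushUpFilter(plan);
        System.out.println("plan after push-up: " + pushed);
        System.out.println("after push-up:  " + partitionFilterApplies(pushed));
    }
}
```

In this model the partition check fails on the original plan and succeeds once the push-up pass has run, which matches the "in most cases" behaviour described in the message above.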
-
Re: Partition keys in LoadMetadata is broken in 0.10? Dmitriy Ryaboy 2012-01-02, 01:34
That getAll() call destroyed our lazy deserialization optimizations, btw...
it's unfortunate that even if my loader constructs optimized tuples, they immediately get turned into object-bloated regular tuples :(.

D
-
Re: Partition keys in LoadMetadata is broken in 0.10? Daniel Dai 2012-01-02, 02:09
Which getAll() call do you mean?
-
Re: Partition keys in LoadMetadata is broken in 0.10? Dmitriy Ryaboy 2012-01-02, 02:16
Lol, the one we weren't talking about :). Sorry, thought it was related.
This, in PigGenericMapBase:

    for (PhysicalOperator root : roots) {
        if (inIllustrator) {
            if (root != null) {
                root.attachInput(inpTuple);
            }
        } else {
            root.attachInput(tf.newTupleNoCopy(inpTuple.getAll()));  // <-- this call
        }
    }
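The effect described in this sub-thread, where a getAll() call defeats lazy deserialization, can be sketched as follows. This is only an illustration: `LazyTuple` and its `decodeCount` field are hypothetical, and parsing an integer stands in for whatever deserialization a real loader's tuple would perform.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a lazily deserialized tuple; NOT Pig's Tuple interface. */
final class LazyTupleSketch {

    /** Hypothetical lazy tuple: fields stay as raw strings until they are requested. */
    static final class LazyTuple {
        private final String[] raw;
        private final Object[] decoded;
        int decodeCount = 0;  // number of fields deserialized so far

        LazyTuple(String... raw) {
            this.raw = raw;
            this.decoded = new Object[raw.length];
        }

        /** Decode a single field on first access and cache the result. */
        Object get(int i) {
            if (decoded[i] == null) {
                decoded[i] = Integer.valueOf(raw[i]);  // stand-in for real deserialization
                decodeCount++;
            }
            return decoded[i];
        }

        /** Materialize every field into a plain list, which forces all decoding. */
        List<Object> getAll() {
            List<Object> all = new ArrayList<>(raw.length);
            for (int i = 0; i < raw.length; i++) {
                all.add(get(i));
            }
            return all;
        }
    }

    public static void main(String[] args) {
        LazyTuple t = new LazyTuple("1", "2", "3");
        t.get(0);
        System.out.println("decoded after get(0):   " + t.decodeCount);
        t.getAll();
        System.out.println("decoded after getAll(): " + t.decodeCount);
    }
}
```

Reading a single field decodes only that field, whereas copying the tuple through getAll() decodes all of them and leaves the caller with a plain list, so any per-field savings the loader arranged are lost.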