-
Rewire and multi-query load/store optimization
Santhosh Srinivasan 2009-06-12, 21:19
With the implementation of rewire as part of the optimizer infrastructure, a bug was exposed in the load/store optimization in the multi-query feature. Below, I will articulate the bug and the ramifications of a few possible solutions.
Load/store optimization in the multi-query feature?
---------------------------------------------------
If a script has an explicit store and a corresponding load which loads the output of the store, the store-load combination can be optimized. An example will illustrate the concept.
Pre-conditions:
1. The store location and the load location should match
2. The store format and the load format should be compatible
{code}
A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';
{code}
In the script above, the output of the first store serves as the input of the second load (C). In addition, the store and the load both use PigStorage() as the store/load mechanism. In the logical plan, this combination is optimized by removing the load (C) and splitting B so that it feeds both the store and D.
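To make the effect concrete, the optimized plan behaves roughly as if the script had been written without the intermediate load. The script below is only a sketch of that equivalent form, not the optimizer's actual output:

{code}
A = load 'input';
B = group A by $0;
-- B is split: one branch goes to the store, the other feeds D directly
store B into 'output';
D = group B by $0;
store D into 'some_other_output';
{code}

The relation C no longer appears, so anything that used to refer to C (here, the grouping in D) has to refer to B instead.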
Bug
---
When the load in the store/load combination was removed, the inner plans of the load's successors (in this case D) were not updated correctly. As a result, the projections in the inner plans still held references to the removed load operator, which no longer exists in the plan.
Consequence of the bug fix
---------------------------
During the map-reduce (M/R) compilation, the split operator is compiled into a store and a load. Prior to multi-query, each M/R boundary resulted in a temporary store using BinStorage. The subsequent load could infer the types because BinStorage returns typed records, i.e., non-bytearray records.
With multi-query and the load/store optimization, the temporary BinStorage data is not generated. Instead, the subsequent load uses the output of the previous store as its input. Here, the load can return typed or untyped records, depending on the loader. As a result, the operators in the map phase that rely on the type information (inferred from the logical plan) will fail due to a type mismatch.
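As an illustration of where the mismatch can come from, consider a variant of the earlier script in which the first load declares a schema. The field names and the filter are made up for this example, and the comments describe the expected behaviour under the assumptions above rather than a verified trace:

{code}
A = load 'input' as (name:chararray, age:int);
B = filter A by age > 18;
-- PigStorage writes plain delimited text; no type information is stored
store B into 'output';
-- no schema is declared here, so PigStorage returns bytearray fields
C = load 'output';
D = group C by $1;
store D into 'some_other_output';
{code}

Once the optimization removes C and feeds D from B, the logical plan says the grouping key is an int. If the compiled load then reads 'output' through PigStorage, the records arrive as bytearrays, and the map-phase operators that expect an int would see the wrong type.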
Possible Solutions
------------------
Solution 1
==========
Switch off the load/store optimization. Users were primarily storing intermediate data within the same script to overcome a limitation in Pig, i.e., the absence of the multi-query feature. Going forward, with multi-query turned on, users who store and reload intermediate data will not enjoy all the benefits of the optimization.
Solution 2
==========
After the M/R compilation is completed, during the final pass of the plan, fix the types of the projections to reflect typed/untyped data. In other words, if the loader is returning typed data, then retain the types; otherwise, change the types to bytearray. In order to make this decision, loaders should support an interface to indicate whether the records are typed or untyped.

Thanks,
Santhosh
-
Re: Rewire and multi-query load/store optimization
Alan Gates 2009-06-16, 17:31
+1 on option one. The use of store->load was only to overcome a temporary problem in Pig. We've fixed the problem, so let's not propagate it. We will need to document this very clearly (maybe even to the point of issuing warnings in the parser when we see this combo) so users understand that this is now a hindrance rather than a help.
Alan.