Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # dev - Rewire and multi-query load/store optimization


Copy link to this message
-
Rewire and multi-query load/store optimization
Santhosh Srinivasan 2009-06-12, 21:19
With the implementation of rewire as part of the optimizer
infrastructure, a bug was exposed in the load/store optimization in the
multi-query feature. Below, I will articulate the bug and the
ramifications of a few possible solutions.

Load/store optimization in the multi-query feature?
---------------------------------------------------

If a script has an explicit store and a corresponding load which loads
the output of the store, the store-load combination can be optimized. An
example will illustrate the concept.

Pre-conditions:

1. The store location and the load location should match
2. The store format and the load format should be compatible

{code}

A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';

{code}

In the script above, the output of the first store serves as input of
the second load (C). In addition, the store and load use PigStorage() as
the store/load mechanism. In the logical plan this combination by
splitting B into the store and D.

Bug
---

When the load in the store/load combination was removed, the inner plans
of the load's successors (in this case D), were not updated correctly.
As a result, the projections in the inner plans still held references to
non-existing operators.

Consequence of the bug fix
---------------------------

During the map-reduce (M/R) compilation the split operator is compiled
into a store and a load. Prior to multi-query, for each M/R boundary
resulted in a temporary store using BinStorage. The subsequent load
could infer the type as BinStorage returns typed records, i.e., non-byte
array records.

With multi-query and the load/store optimization, the temporary
BinStorage data is not generated. Instead, the subsequent load uses the
output of the previous store as its input. Here, the loader can get
typed or untyped records based on the loader. As a result, the operators
in the map phase that rely on the type information (inferred from the
logical plan) will fail due to type mismatch.

Possible Solutions
------------------

Solution 1
=========Switch the load/store optimization. Users were primarily storing
intermediate data within the same script to overcome Pig's limitation,
i.e., absence of the multi-query feature. Going forward, with
multi-query turned on, users who store intermediate data will not enjoy
all the benefits of the optimization.

Solution 2
=========After the M/R compilation is completed, during the final pass of the
plan, fix the types of the projections to reflect typed/untyped data. In
other words, if the loader is returning typed data then retain the types
else change the types to bytearray. In order to make this decision,
loaders should support an interface to indicate if the records are typed
or untyped.
Thanks,
Santhosh