Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # dev >> Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization


Copy link to this message
-
Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 275
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line275>
> >
> >     Does this imply CLSRSOperator cannot have more than one child operator in any case. If so, can you please add comments stating that along with small description why is that?
>
> Yin Huai wrote:
>     I thought that, in the original plan, a ReduceSinkOperator can only have 1 child. Because a CLSRSOperator replaces a ReduceSinkOperator, it also has a single child. Is my understanding correct?
>
> Ashutosh Chauhan wrote:
>     I think you are correct. Since its a terminal operator, RSOperator can only have 1 child. It will be good to add this as comment there.

done
> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java, line 1451
> > <https://reviews.apache.org/r/7126/diff/11/?file=221831#file221831line1451>
> >
> >     I don't get the reason of introducing rowNumber in Operator. It doesn't look its required for optimization to take place correctly. Can you elaborate the need of this variable and different associated methods which are introduced along with it?

If a query plan is optimized by correlation optimizer, multiple operation paths can be merged to a single Map phase. CorrelationCompositeOperator is used to evaluate results from different paths and forward a single row to ReduceSinkOperator. CorrelationCompositeOperator will first buffer rows forwarded to it from different paths. rowNumber is introduced to let CorrelationCompositeOperator know when a row has been processed by all paths. When a new rowNumber (a new row will come) is set through setRowNumber, CorrelationCompositeOperator will evaluate its current buffer and tag which operation paths this current row belongs to. I will add comment to explain it.
> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 134
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line134>
> >
> >     Can you add comments clarifying whats the reason for this?

done
> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 270
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line270>
> >
> >     If I am reading this correctly than, map-side groupbys are converted to reduce-side groupbys. I don't see any fundamental reason why correlation optimizer cannot work with map-side groupbys. Is this just the limitation of current implementation or is there any fundamental reason for it. Please add comments whatever the case is.

For example, if an MR job for an aggregation operator shares the input table with an MR job for a join operator, with Reduce-side aggregation (RS-GBY pattern), the ReduceSinkOperator for the merged Map phase only needs to emit a single row for both the aggregation operator and the join operator. Thus, I decide to convert map-side groupbys to reduce-side groupbys.
> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java, line 94
> > <https://reviews.apache.org/r/7126/diff/11/?file=221838#file221838line94>
> >
> >     Can you add comments why it is not compatible with skew-join and groupby-skew optimizations?

seems skew optimizations will split a single MR job to 2 jobs. I have not carefully thought how to apply the correlation optimization on those plans optimized by skew optimizations. So, I'd suggest we evaluate this issue in a follow-up jira.
> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java, line 393
> > <https://reviews.apache.org/r/7126/diff/11/?file=221845#file221845line393>
> >
> >     Explain annotation should have worked, no ? Having this working will be very useful for debugging.

Enabled. We need to re-generate some test results (probably all results involving "EXPLAIN"?).

i am working on it.

Yes, I agree. One question. Is there any convenient way to clone a ParseContext or an entire plan tree?
- Yin
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote: