Hive, mail # user - Multi-group-by select always scans entire table
Re: Multi-group-by select always scans entire table
Mark Grover 2012-06-08, 04:10
Hi Jan,
I did some testing for this on Apache Hive 0.9 and I have boiled it down to
the following:
Predicate pushdown seems to work for single-insert queries using LATERAL
VIEW. It also seems to work for multi-insert queries NOT using LATERAL
VIEW. However, it doesn't work for multi-insert queries using LATERAL VIEW.

Here are some examples. In the examples below, I rely on the fact that a
query with no partition filtering fails when run under
"hive.mapred.mode=strict".

For example,
--Table creation and population
DROP TABLE IF EXISTS partition_test;
CREATE TABLE partition_test (col1 array<int>, col2 int) PARTITIONED BY
(part_col int);
INSERT OVERWRITE TABLE partition_test PARTITION (part_col=1) SELECT
array(1,2), count(*) FROM partition_test;
INSERT OVERWRITE TABLE partition_test PARTITION (part_col=2) SELECT
array(2,4,6), count(*) FROM partition_test;
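Before the queries below, it may help to see strict mode rejecting an
unfiltered select on its own. This is just a sketch against the
partition_test table; the exact error text may differ between Hive versions:

```sql
-- Strict mode refuses to run a query on a partitioned table
-- unless it filters on a partition column.
set hive.mapred.mode=strict;
SELECT col1 FROM partition_test;
-- fails with a "No partition predicate found" semantic error;
-- adding WHERE part_col=2 makes it pass.
```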

-- Query 1
-- This succeeds (using LATERAL VIEW with single insert)
set hive.mapred.mode=strict;
FROM partition_test
LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1'
    SELECT exp_col1
    WHERE (part_col=2);

-- Query 2
-- This succeeds (NOT using LATERAL VIEW with multi-insert)
set hive.mapred.mode=strict;
FROM partition_test
INSERT OVERWRITE DIRECTORY '/test/1'
    SELECT col1
    WHERE (part_col=2)
INSERT OVERWRITE DIRECTORY '/test/2'
    SELECT col1
    WHERE (part_col=2);

-- Query 3
-- This fails (using LATERAL VIEW with multi-insert)
set hive.mapred.mode=strict;
FROM partition_test
LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1'
    SELECT exp_col1
    WHERE (part_col=2)
INSERT OVERWRITE DIRECTORY '/test/2'
    SELECT exp_col1
    WHERE (part_col=2);

I have created a JIRA for this:
https://issues.apache.org/jira/browse/HIVE-3104
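In the meantime, a possible workaround (just a sketch, I haven't verified it
across versions) is to apply the partition filter in a subquery before the
LATERAL VIEW, so pruning happens on the subquery rather than relying on
pushdown through the explode:

```sql
set hive.mapred.mode=strict;
FROM (
    -- filter on the partition column before exploding, so the
    -- pruning no longer depends on predicate pushdown
    SELECT col1 FROM partition_test WHERE part_col=2
) t
LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1'
    SELECT exp_col1
INSERT OVERWRITE DIRECTORY '/test/2'
    SELECT exp_col1;
```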

Now, in your previous email, you said that the explain plan for a query
just like Query 2 didn't change based on the value of hive.optimize.ppd. It
seems to me that predicate pushdown was implemented in phases, and the
initial phases didn't support multi-insert queries; that support was added
in a later version of Hive. It looks like my version (Apache Hive 0.9) has
it while yours (Hive 0.7.1, as shipped by Cloudera) doesn't, hence the
confusion.

Anyways, hope this clears things up. If you have any further
questions/comments, please let me know. Thanks again for all your input.

Mark

On Thu, Jun 7, 2012 at 10:03 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote:

> On 6/7/12, Mark Grover <[EMAIL PROTECTED]> wrote:
> > Can you please check if predicate push down enabled changes the explain
> > plan on a simple inner join query like:
> >
> > select a.* from a inner join b on(a.key=b.key) where a.some_col=blah;
>
> No problem, I ran the following as you suggested (INNER JOIN didn't work
> for me, so I used just JOIN):
>
> create table a (key int, some_col string);
> create table b (key int, some_col string);
>
> set hive.optimize.ppd=true;
> explain select a.* from a join b on(a.key=b.key) where a.some_col='blah';
>
> set hive.optimize.ppd=false;
> explain select a.* from a join b on(a.key=b.key) where a.some_col='blah';
>
> There is a difference in the explains: the first one has a Filter
> operator on some_col quite high in the tree. So I guess it is working
> here, although I still see another Filter operator deeper down in the
> reduce phase of both plans. I'm not sure whether that is correct, but I
> believe the filter should only be executed once. I put the results on
> pastebin so you can see for yourself: http://pastebin.com/gquMksqE and
> http://pastebin.com/0FPx7KKG.
>
> Jan
>