|
Jan Dolinár
2012-05-25, 10:03
Jan Dolinár
2012-05-29, 05:57
Mark Grover
2012-06-04, 17:20
Jan Dolinár
2012-06-05, 05:19
Mark Grover
2012-06-06, 01:20
Jan Dolinár
2012-06-06, 07:15
Mark Grover
2012-06-07, 12:26
Jan Dolinár
2012-06-07, 14:03
Mark Grover
2012-06-08, 04:10
Jan Dolinár
2012-06-08, 05:42
|
-
Multi-group-by select always scans entire tableJan Dolinár 2012-05-25, 10:03
Hello,
I've encountered a weird issue with hive and I'm not sure if I'm doing something wrong or if it is a bug. I'm trying to do a multi-group-by select statement on a partitioned table. I wan't only data from one partition, therefore all the WHERE statements are exactly the same and contain only the partition columns. Now, I would expect that the optimization kicks in and hadoop will process only files from the given partition, but in fact, it reads all the data in the table. Which makes it practicaly unusable for big tables... Simple testcase: -- create a simple partitioned table DROP TABLE IF EXISTS partition_test; CREATE TABLE partition_test (col1 array<bigint>) PARTITIONED BY (part_col bigint); -- add some partitions filled with some random data INSERT OVERWRITE TABLE partition_test PARTITION (part_col=1) SELECT array(1,count(*)) FROM partition_test LIMIT 1; INSERT OVERWRITE TABLE partition_test PARTITION (part_col=2) SELECT array(2,count(*)) FROM partition_test LIMIT 2; -- see what happens when you try to perform multi-group-by query on one of the partitions EXPLAIN EXTENDED FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 WHERE (part_col=2) INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1 WHERE (part_col=2); -- result: it wants to scan all partitions :-( I also tried to trick Hive by using a view (CREATE VIEW part_view AS SELECT * FROM partition_test WHERE(part_col=2);) and than running the select on that, but the behavior is still the same... The version of Hive I tested this on is 0.7.1 (Cloudera distribution) if that matters. I would be very grateful if you could point me to some other way how to get the data without reading entire table... And in case this is a bug, where should I report it? Best regards, J. Dolinar
-
Re: Multi-group-by select always scans entire tableJan Dolinár 2012-05-29, 05:57
On Fri, May 25, 2012 at 12:03 PM, Jan Dolinár <[EMAIL PROTECTED]> wrote:
> > -- see what happens when you try to perform multi-group-by query on one of > the partitions > EXPLAIN EXTENDED > FROM partition_test > LATERAL VIEW explode(col1) tmp AS exp_col1 > INSERT OVERWRITE DIRECTORY '/test/1' > SELECT exp_col1 > WHERE (part_col=2) > INSERT OVERWRITE DIRECTORY '/test/2' > SELECT exp_col1 > WHERE (part_col=2); > -- result: it wants to scan all partitions :-( > Since nobody else did, let me answer myself... In the end I found out that the correct partition pruning can be achieved using subquery. Continuing the example from my last post, the query would be: FROM ( SELECT * FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 WHERE part_col=2 ) t INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1; I still think the pruning should work correctly no matter how the query is written, but for now I'm happy with this solution. J. Dolinar
-
Re: Multi-group-by select always scans entire tableMark Grover 2012-06-04, 17:20
Hi Jan,
Glad you found something workable. What version of Hive are you using? Could you also please check what the value of the property hive.optimize.ppd is for you? Thanks, Mark ----- Original Message ----- From: "Jan Dolinár" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, May 29, 2012 1:57:25 AM Subject: Re: Multi-group-by select always scans entire table On Fri, May 25, 2012 at 12:03 PM, Jan Dolinár < [EMAIL PROTECTED] > wrote: -- see what happens when you try to perform multi-group-by query on one of the partitions EXPLAIN EXTENDED FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 WHERE (part_col=2) INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1 WHERE (part_col=2); -- result: it wants to scan all partitions :-( Since nobody else did, let me answer myself... In the end I found out that the correct partition pruning can be achieved using subquery. Continuing the example from my last post, the query would be: FROM ( SELECT * FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 WHERE part_col=2 ) t INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1; I still think the pruning should work correctly no matter how the query is written, but for now I'm happy with this solution. J. Dolinar
-
Re: Multi-group-by select always scans entire tableJan Dolinár 2012-06-05, 05:19
On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover <[EMAIL PROTECTED]> wrote:
> Hi Jan, > Glad you found something workable. > > What version of Hive are you using? Could you also please check what the > value of the property hive.optimize.ppd is for you? > > Thanks, > Mark > > Hi Mark, Thanks for reply. I'm using hive 0.7.1 distributed from Cloudera as cdh3u4. The property hive.optimize.ppd is set true, but I have tried to turn it off and it doesn't effect the behavior of the problematic query at all. Any other ideas? :-) Also could some of you good guys try to check this on hadoop 0.8 or newer? It would be nice to know if it is worth to go through all the hassle of upgrading or if it won't help. Also, if it is not fixed already, it might be good idea to report it as a bug. Jan
-
Re: Multi-group-by select always scans entire tableMark Grover 2012-06-06, 01:20
Hi Jan,
The quick answer is I don't know but may be someone else on the mailing list does:-) Looking at the wiki page for Lateral view( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView), there was a problem related to predicate pushdown on UDTF's ( https://issues.apache.org/jira/browse/HIVE-1056). However, that seemed to have been fixed in Hive 0.6.0 so it shouldn't have any impact on you. The fix for above ticket introduced a unit test (at ql/src/test/results/clientpositive/lateral_view_ppd.q) that tests predicate pushdown on UDTF's. Now, all the subsequent releases should have had that test pass (otherwise they wouldn't have been released, I hope). The test checks for a non-partition column for predicate pushdown. I wonder if it makes a difference with a partition column being used. Can you verify if your query with predicate pushdown enabled works as expected with a non-partition column in the where clause? In that case, the explain/explain extended output should be different from when predicate pushdown is disabled. If predicate pushdown works for non-partition columns but not for partition columns, please create a JIRA stating that predicate pushdown on UDTF's doesn't work with partition columns. If it doesn't work for both partition and non-partition columns, then obviously Hive-1056 is not working for you. We can take it up on the mailing list from there. Thanks for your input, Jan. Mark On Tue, Jun 5, 2012 at 1:19 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > > > On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover <[EMAIL PROTECTED]> wrote: > >> Hi Jan, >> Glad you found something workable. >> >> What version of Hive are you using? Could you also please check what the >> value of the property hive.optimize.ppd is for you? >> >> Thanks, >> Mark >> >> > Hi Mark, > > Thanks for reply. I'm using hive 0.7.1 distributed from Cloudera as > cdh3u4. The property hive.optimize.ppd is set true, but I have tried to > turn it off and it doesn't effect the behavior of the problematic query at > all. Any other ideas? :-) > > Also could some of you good guys try to check this on hadoop 0.8 or newer? > It would be nice to know if it is worth to go through all the hassle of > upgrading or if it won't help. Also, if it is not fixed already, it might > be good idea to report it as a bug. > > Jan >
-
Re: Multi-group-by select always scans entire tableJan Dolinár 2012-06-06, 07:15
Hi Mark,
Thanks for all your help. I tried to run a series of test with various settings of hive.optimize.ppd and various queries ( see it here http://pastebin.com/E89p9Ubx ) and now I'm even more confused than before. In all cases, regardless if the WHERE clause asks about partitioned or regular column, the result with ppd=true and ppd=false differ only in file paths, but the structure is the same. Even if I run the query without the LATERAL VIEW... Either there is something terribly wrong with hive and/or my setup and/or I'm completely dumb. Do I understand it right that ppd should push common criteria from where clauses into earlier stage, so that there is less data and processing in the following stages? I'm quite convinced it doesn't really happen here... Jan On 6/6/12, Mark Grover <[EMAIL PROTECTED]> wrote: > Hi Jan, > The quick answer is I don't know but may be someone else on the mailing > list does:-) > > Looking at the wiki page for Lateral view( > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView), > there was a problem related to predicate pushdown on UDTF's ( > https://issues.apache.org/jira/browse/HIVE-1056). However, that seemed to > have been fixed in Hive 0.6.0 so it shouldn't have any impact on you. > > The fix for above ticket introduced a unit test (at > ql/src/test/results/clientpositive/lateral_view_ppd.q) that tests predicate > pushdown on UDTF's. Now, all the subsequent releases should have had that > test pass (otherwise they wouldn't have been released, I hope). The test > checks for a non-partition column for predicate pushdown. I wonder if it > makes a difference with a partition column being used. > > Can you verify if your query with predicate pushdown enabled works as > expected with a non-partition column in the where clause? In that case, the > explain/explain extended output should be different from when predicate > pushdown is disabled. If predicate pushdown works for non-partition columns > but not for partition columns, please create a JIRA stating that predicate > pushdown on UDTF's doesn't work with partition columns. > > If it doesn't work for both partition and non-partition columns, then > obviously Hive-1056 is not working for you. We can take it up on the > mailing list from there. > > Thanks for your input, Jan. > > Mark > > On Tue, Jun 5, 2012 at 1:19 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > >> >> >> On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover <[EMAIL PROTECTED]> wrote: >> >>> Hi Jan, >>> Glad you found something workable. >>> >>> What version of Hive are you using? Could you also please check what the >>> value of the property hive.optimize.ppd is for you? >>> >>> Thanks, >>> Mark >>> >>> >> Hi Mark, >> >> Thanks for reply. I'm using hive 0.7.1 distributed from Cloudera as >> cdh3u4. The property hive.optimize.ppd is set true, but I have tried to >> turn it off and it doesn't effect the behavior of the problematic query >> at >> all. Any other ideas? :-) >> >> Also could some of you good guys try to check this on hadoop 0.8 or >> newer? >> It would be nice to know if it is worth to go through all the hassle of >> upgrading or if it won't help. Also, if it is not fixed already, it might >> be good idea to report it as a bug. >> >> Jan >> >
-
Re: Multi-group-by select always scans entire tableMark Grover 2012-06-07, 12:26
Hi Jan,
Thanks for the analysis. Yes, it's true that optimize ppd will push predicates to be evaluated earlier. The only catch there is that predicates cannot be pushed across constructs that change the data in the query. An example of this is having a predicate (say of the form 'where Col is not NULL') on the right table in left outer join query. Such a predicate will lead to different results when executed (on the right table) before or after the left join. Therefore, predicate push down wouldn't make sense in such a case. However, if you have an inner join on the same tables, predicates can be pushed down since executing the predicate before or after the inner join leads to the same results. Can you please check if predicate push down enabled changes the explain plan on a simple inner join query like: select a.* from a inner join b on(a.key=b.key) where a.some_col=blah; Thank you for your input, Jan! Mark On Jun 6, 2012 3:15 AM, "Jan Dolinár" <[EMAIL PROTECTED]> wrote: > Hi Mark, > > Thanks for all your help. I tried to run a series of test with various > settings of hive.optimize.ppd and various queries ( see it here > http://pastebin.com/E89p9Ubx ) and now I'm even more confused than > before. In all cases, regardless if the WHERE clause asks about > partitioned or regular column, the result with ppd=true and ppd=false > differ only in file paths, but the structure is the same. Even if I > run the query without the LATERAL VIEW... > > Either there is something terribly wrong with hive and/or my setup > and/or I'm completely dumb. Do I understand it right that ppd should > push common criteria from where clauses into earlier stage, so that > there is less data and processing in the following stages? I'm quite > convinced it doesn't really happen here... > > Jan > > On 6/6/12, Mark Grover <[EMAIL PROTECTED]> wrote: > > Hi Jan, > > The quick answer is I don't know but may be someone else on the mailing > > list does:-) > > > > Looking at the wiki page for Lateral view( > > > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView > ), > > there was a problem related to predicate pushdown on UDTF's ( > > https://issues.apache.org/jira/browse/HIVE-1056). However, that seemed > to > > have been fixed in Hive 0.6.0 so it shouldn't have any impact on you. > > > > The fix for above ticket introduced a unit test (at > > ql/src/test/results/clientpositive/lateral_view_ppd.q) that tests > predicate > > pushdown on UDTF's. Now, all the subsequent releases should have had that > > test pass (otherwise they wouldn't have been released, I hope). The test > > checks for a non-partition column for predicate pushdown. I wonder if it > > makes a difference with a partition column being used. > > > > Can you verify if your query with predicate pushdown enabled works as > > expected with a non-partition column in the where clause? In that case, > the > > explain/explain extended output should be different from when predicate > > pushdown is disabled. If predicate pushdown works for non-partition > columns > > but not for partition columns, please create a JIRA stating that > predicate > > pushdown on UDTF's doesn't work with partition columns. > > > > If it doesn't work for both partition and non-partition columns, then > > obviously Hive-1056 is not working for you. We can take it up on the > > mailing list from there. > > > > Thanks for your input, Jan. > > > > Mark > > > > On Tue, Jun 5, 2012 at 1:19 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > > > >> > >> > >> On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover <[EMAIL PROTECTED]> wrote: > >> > >>> Hi Jan, > >>> Glad you found something workable. > >>> > >>> What version of Hive are you using? Could you also please check what > the > >>> value of the property hive.optimize.ppd is for you? > >>> > >>> Thanks, > >>> Mark > >>> > >>> > >> Hi Mark, > >> > >> Thanks for reply. I'm using hive 0.7.1 distributed from Cloudera as > >> cdh3u4. The property hive.optimize.ppd is set true, but I have tried to
-
Re: Multi-group-by select always scans entire tableJan Dolinár 2012-06-07, 14:03
On 6/7/12, Mark Grover <[EMAIL PROTECTED]> wrote:
> Can you please check if predicate push down enabled changes the explain > plan on a simple inner join query like: > > select a.* from a inner join b on(a.key=b.key) where a.some_col=blah; No problem, I ran following as you suggested (INNER JOIN didn't work for me, so I used just JOIN): create table a (key int, some_col string); create table b (key int, some_col string); set hive.optimize.ppd=true; explain select a.* from a join b on(a.key=b.key) where a.some_col='blah'; set hive.optimize.ppd=false; explain select a.* from a join b on(a.key=b.key) where a.some_col='blah'; There is a difference in the explains, the first one has a Filter operator on some_col, quite high in the tree. So I guess here it is working, although I still see another Filter operator in reduce deeper down in both, I'm not sure if that is correct or not, but I believe that it should be only executed once. I put the results at pastebin so you can see yourself: http://pastebin.com/gquMksqE and http://pastebin.com/0FPx7KKG. Jan
-
Re: Multi-group-by select always scans entire tableMark Grover 2012-06-08, 04:10
Hi Jan,
I did some testing for this on Apache Hive 0.9 and I have boiled it down the following: Predicate pushdown seems to work for single-insert queries using LATERAL VIEW. It also seems to work for multi-insert queries NOT using LATERAL VIEW. However, it doesn't work for multi-insert queries using LATERAL VIEW. Here are some examples. In the below examples, I make use of the fact that a query with no partition filtering when run under "hive.mapred.mode=strict" fails. For example, --Table creation and population DROP TABLE IF EXISTS test; CREATE TABLE test (col1 array<int>, col2 int) PARTITIONED BY (part_col int); INSERT OVERWRITE TABLE test PARTITION (part_col=1) SELECT array(1,2), count(*) FROM test; INSERT OVERWRITE TABLE test PARTITION (part_col=2) SELECT array(2,4,6), count(*) FROM test; -- Query 1 -- This succeeds (using LATERAL VIEW with single insert) set hive.mapred.mode=strict; FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 WHERE (part_col=2); -- Query 2 -- This succeeds (NOT using LATERAL VIEW with multi-insert) set hive.mapred.mode=strict; FROM partition_test INSERT OVERWRITE DIRECTORY '/test/1' SELECT col1 WHERE (part_col=2) INSERT OVERWRITE DIRECTORY '/test/2' SELECT col1 WHERE (part_col=2); -- Query 3 -- This fails (using LATERAL VIEW with multi-insert) set hive.mapred.mode=strict; FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 INSERT OVERWRITE DIRECTORY '/test/1' SELECT exp_col1 WHERE (part_col=2) INSERT OVERWRITE DIRECTORY '/test/2' SELECT exp_col1 WHERE (part_col=2); I have created a JIRA for this: https://issues.apache.org/jira/browse/HIVE-3104 Now, in your previous email, you said that the explain plan for a query just like Query 2 didn't change based on the value of hive.optimize.ppd. It seems to me that predicate pushdown was implemented in various phases and some initial phases didn't have support for multi-insert queries. However, in a later version of Hive, it was added. Seems like my version (Apache Hive 9.0) has that support while yours (Cloudera 0.7.1) doesn't. Hence the cause of confusion. Anyways, hope this clears things up. If you have any further questions/comments, please let me know. Thanks again for all your input. Mark On Thu, Jun 7, 2012 at 10:03 AM, Jan Dolinár <[EMAIL PROTECTED]> wrote: > On 6/7/12, Mark Grover <[EMAIL PROTECTED]> wrote: > > Can you please check if predicate push down enabled changes the explain > > plan on a simple inner join query like: > > > > select a.* from a inner join b on(a.key=b.key) where a.some_col=blah; > > No problem, I ran following as you suggested (INNER JOIN didn't work > for me, so I used just JOIN): > > create table a (key int, some_col string); > create table b (key int, some_col string); > > set hive.optimize.ppd=true; > explain select a.* from a join b on(a.key=b.key) where a.some_col='blah'; > > set hive.optimize.ppd=false; > explain select a.* from a join b on(a.key=b.key) where a.some_col='blah'; > > There is a difference in the explains, the first one has a Filter > operator on some_col, quite high in the tree. So I guess here it is > working, although I still see another Filter operator in reduce deeper > down in both, I'm not sure if that is correct or not, but I believe > that it should be only executed once. I put the results at pastebin so > you can see yourself: http://pastebin.com/gquMksqE and > http://pastebin.com/0FPx7KKG. > > Jan >
-
Re: Multi-group-by select always scans entire tableJan Dolinár 2012-06-08, 05:42
Thank you very much Mark for your investigation and explanations.
I'm well aware of the fact that hadoop 0.7.1 is quite an old code and that newer version might perform better - that is the main reason I discussed it here instead of reporting it as a bug. For now it doesn't bother me, as I have the workaround, but I will keep an eye on that JIRA. Jan |