Hive, mail # user - Filtering


Peter Marron 2013-05-15, 10:38
Owen OMalley 2013-05-15, 17:35
RE: Filtering
Peter Marron 2013-05-16, 14:08
>>On Wed, May 15, 2013 at 3:38 AM, Peter Marron <[EMAIL PROTECTED]> wrote:

>I've started doing similar work for the ORC reader.

I guess that I’m glad that I’m not completely alone here.

>>
>>Firstly although that page mentions InputFormat there doesn’t seem to be any way (that I can find)
>>to perform filter passing to InputFormats and so I gave up on that approach.
>>
>There is. You just need to set hive.optimize.index.filter to true. See https://issues.apache.org/jira/browse/HIVE-4242.

This is a little confusing. When I search the code for uses of this configuration property,
I see that it’s effectively used in two places.
Firstly, it’s used on line 55 of PhysicalOptimizer.java to add an “IndexWhereResolver”.
Secondly, it’s used on line 766 of OpProcFactory.java to set a filter expression.

But I don’t see any point where the predicate is passed to the InputFormat class.
I guess that you’re saying that there’s some way that the InputFormat can retrieve the
predicate once it’s been stored. But it’s not clear to me how I do that.
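
A minimal sketch of what "retrieving the stored predicate" might look like from the reader side. It assumes that Hive copies the pushed-down filter into the job configuration under the keys `hive.io.filter.text` and `hive.io.filter.expr.serialized` (the constants I believe are defined in `TableScanDesc`); that should be checked against the Hive version in use. A plain `Map` stands in for the `JobConf` so the example runs without Hadoop on the classpath, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class PushedFilterLookup {
    // Assumed key names (believed to match TableScanDesc); verify for your Hive version.
    static final String FILTER_TEXT_KEY = "hive.io.filter.text";
    static final String FILTER_EXPR_KEY = "hive.io.filter.expr.serialized";

    /** Returns the human-readable filter text if one was stored, otherwise empty. */
    static Optional<String> pushedFilterText(Map<String, String> jobConf) {
        String text = jobConf.get(FILTER_TEXT_KEY);
        return (text == null || text.isEmpty()) ? Optional.empty() : Optional.of(text);
    }

    public static void main(String[] args) {
        // Stand-in for the JobConf that an InputFormat receives in getSplits/getRecordReader.
        Map<String, String> conf = new HashMap<>();
        conf.put(FILTER_TEXT_KEY, "(age > 30)");
        System.out.println(pushedFilterText(conf).orElse("<no filter pushed>"));
    }
}
```

In a real InputFormat the same lookup would go against the `JobConf` argument, and the serialized form under the second key would need Hive's own deserialization to turn it back into an expression tree.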

>>
>>That said, we really need to create a better interface that allows inputformats to negotiate what parts of the predicate they can process.

Ah, yes, sorry. I really want to be able to remove part of the predicate and subsume the filtering into the InputFormat class.
There’s little point in me going down this route if I can’t do that.
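
To make the idea concrete, here is a rough sketch of the kind of negotiation being discussed: the reader accepts the AND-ed terms it can evaluate itself and hands the rest back for Hive to apply. It is loosely modelled on the pushed/residual split that storage handlers get through `HiveStoragePredicateHandler`, but every type and method name below is hypothetical, and a real implementation would work on Hive's expression tree, not on strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PredicateNegotiation {
    /** One AND-ed term of a WHERE clause, e.g. column "age", text "age > 30". */
    static final class Conjunct {
        final String column;
        final String text;
        Conjunct(String column, String text) { this.column = column; this.text = text; }
    }

    /** Outcome of the negotiation: terms the reader takes, and terms Hive must still evaluate. */
    static final class Decomposed {
        final List<Conjunct> pushed = new ArrayList<>();
        final List<Conjunct> residual = new ArrayList<>();
    }

    /** Splits the conjuncts by whether the reader can filter on the referenced column. */
    static Decomposed decompose(List<Conjunct> conjuncts, Set<String> filterableColumns) {
        Decomposed d = new Decomposed();
        for (Conjunct c : conjuncts) {
            (filterableColumns.contains(c.column) ? d.pushed : d.residual).add(c);
        }
        return d;
    }

    public static void main(String[] args) {
        List<Conjunct> where = Arrays.asList(
                new Conjunct("age", "age > 30"),
                new Conjunct("name", "name like 'A%'"));
        Decomposed d = decompose(where, new HashSet<>(Arrays.asList("age")));
        System.out.println("pushed=" + d.pushed.size() + " residual=" + d.residual.size());
    }
}
```

If the residual list comes back empty, the whole predicate has been subsumed by the reader, which is the case of interest here.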

>>
>>-- Owen
>>

Thanks for prodding me into looking at the code, because now I see a big problem.

To recap, what I really want is to be able to apply filtering in the case where I run a
                select * from table;
query. This is the only query that I’m interested in, because it seems to run without any
Map/Reduce overhead (either locally or in the cluster). It’s effectively just performing
some HDFS calls, and that’s what I want.

What I really want to be able to do is to issue a query like this:
                select * from table where <predicate>
where I strip out the predicate and do the filtering in the InputFormat, so that Hive
effectively sees the query
                select * from table;
and runs it directly (no Map/Reduce) and I’m a happy bunny.
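
The reader-side half of that idea can be sketched as a wrapper that silently drops non-matching rows, so the caller only ever sees what looks like an unfiltered scan. This is an illustration under stated assumptions and not Hive's RecordReader API: `FilteringReader` is a hypothetical name, and a plain `Iterator` stands in for the underlying reader.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class FilteringReader<T> implements Iterator<T> {
    private final Iterator<T> rows;     // stand-in for the underlying record reader
    private final Predicate<T> filter;  // the predicate taken over from the query
    private T next;
    private boolean hasNext;

    FilteringReader(Iterator<T> rows, Predicate<T> filter) {
        this.rows = rows;
        this.filter = filter;
        advance();
    }

    /** Reads ahead to the next row that passes the filter, if there is one. */
    private void advance() {
        hasNext = false;
        while (rows.hasNext()) {
            T candidate = rows.next();
            if (filter.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() { return hasNext; }

    @Override
    public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> raw = Arrays.asList(1, 2, 3, 4, 5, 6).iterator();
        FilteringReader<Integer> reader = new FilteringReader<>(raw, x -> x % 2 == 0);
        while (reader.hasNext()) System.out.println(reader.next());
    }
}
```

The open question in this thread is unchanged by the sketch: something still has to remove the predicate from the plan so that Hive treats the query as a plain scan.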

Now, as I say, I can’t see any way to effect this in the InputFormat directly.
If I use a storage handler then I am in “non-native table” territory and I
can’t LOAD my tables with data.

However I have just noticed that line 111 of file IndexWhereProcessor.java
seems to suggest that indexes are only ever used when the query is going
to run Map/Reduce. Is this so? So I seem to be in the position where I
can’t use InputFormat, StorageHandler or Indexes. What can I do?

Is there any way to filter the query without having to run Map/Reduce?

Any suggestions welcomed.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]
Peter Marron 2013-05-19, 22:11
Owen OMalley 2013-05-20, 04:36