Hive >> mail # user >> Filtering


Peter Marron 2013-05-15, 10:38
Owen OMalley 2013-05-15, 17:35
>>On Wed, May 15, 2013 at 3:38 AM, Peter Marron <[EMAIL PROTECTED]> wrote:

>I've started doing similar work for the ORC reader.

I guess that I’m glad that I’m not completely alone here.

>>
>>Firstly although that page mentions InputFormat there doesn’t seem to be any way (that I can find)
>>to perform filter passing to InputFormats and so I gave up on that approach.
>>
>There is. You just need to set hive.optimize.index.filter to true. See https://issues.apache.org/jira/browse/HIVE-4242.

This is a little confusing. When I look through the code for uses of this configuration
property, I see that it’s effectively used in two places.
Firstly, it’s used on line 55 of PhysicalOptimizer.java to add an “IndexWhereResolver”.
Secondly, it’s used on line 766 of OpProcFactory.java to set a filter expression.

But I don’t see any point where the predicate is passed to the InputFormat class.
I guess that you’re saying that there’s some way that the InputFormat can retrieve the
predicate once it’s been stored. But it’s not clear to me how I do that.
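
If the stored predicate ends up in the job configuration, I imagine the InputFormat
would fetch it along these lines. This is only a sketch in plain Java: a Map stands in
for the Hadoop JobConf, the class and method names are made up, and the property name
"hive.io.filter.text" is my assumption about where Hive puts the filter text, so it
needs checking against TableScanDesc in the Hive version in use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: a plain Map plays the role of the Hadoop JobConf.
class PushedFilterLookup {
    // ASSUMPTION: property under which Hive stores the pushed filter text.
    // Verify against TableScanDesc in your Hive version before relying on it.
    static final String FILTER_TEXT_KEY = "hive.io.filter.text";

    // Returns the pushed-down filter text, or empty if nothing was pushed.
    static Optional<String> pushedFilterText(Map<String, String> jobConf) {
        String text = jobConf.get(FILTER_TEXT_KEY);
        if (text == null || text.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(text);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // No filter stored yet, so the reader would fall back to a full scan.
        System.out.println(pushedFilterText(conf).orElse("<no filter pushed>"));

        // Simulate the optimizer having stored a predicate for the table scan.
        conf.put(FILTER_TEXT_KEY, "(age > 30)");
        System.out.println(pushedFilterText(conf).orElse("<no filter pushed>"));
    }
}
```

Even if this works, it only tells the InputFormat what the predicate is. It does not
stop Hive from applying the same filter again afterwards.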

>>
>>That said, we really need to create a better interface that allows inputformats to negotiate what parts of the predicate they can process.

Ah, yes, sorry. I really want to be able to remove part of the predicate and subsume the filtering into the InputFormat class.
There’s little point in me going down this route if I can’t do that.

>>
>>-- Owen
>>

Thanks for prodding me into looking at the code, because now I see a big problem.

To recap, what I really want is to be able to apply filtering in the case where I run a
                select * from table;
query. This is the only query that I’m interested in because it seems to run without any
Map/Reduce overhead (either locally or in the cluster). It’s effectively just performing
some HDFS calls, and that’s what I want.

What I really want to be able to do is to issue a query like this:
                select * from table where <predicate>
where I strip out the predicate and do the filtering in the InputFormat, so that Hive
effectively sees the query
                select * from table;
and runs it directly (no Map/Reduce) and I’m a happy bunny.
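
To make the idea concrete, here is a rough sketch of the filtering I have in mind, in
plain Java with no Hive or Hadoop classes. FilteringReader and readAll are hypothetical
names; a real version would wrap the RecordReader that the InputFormat returns, and the
predicate would come from the query rather than being hard-coded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical wrapper: hands on only the rows that satisfy the predicate,
// so the consumer behaves as if it had been asked for "select * from table".
class FilteringReader<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T pending;
    private boolean hasPending;

    FilteringReader(Iterator<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
        advance();
    }

    // Skip ahead to the next row that passes the predicate, if there is one.
    private void advance() {
        hasPending = false;
        pending = null;
        while (source.hasNext()) {
            T candidate = source.next();
            if (keep.test(candidate)) {
                pending = candidate;
                hasPending = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasPending;
    }

    @Override
    public T next() {
        if (!hasPending) {
            throw new NoSuchElementException();
        }
        T result = pending;
        advance();
        return result;
    }

    // Convenience helper that drains a filtered source into a list.
    static <E> List<E> readAll(Iterator<E> source, Predicate<E> keep) {
        List<E> rows = new ArrayList<>();
        FilteringReader<E> reader = new FilteringReader<>(source, keep);
        while (reader.hasNext()) {
            rows.add(reader.next());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> table = Arrays.asList(5, 12, 7, 30);
        // Stand-in for "where <predicate>": keep values greater than 10.
        System.out.println(readAll(table.iterator(), v -> v > 10));
    }
}
```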

Now, as I say, I can’t see any way to effect this in the InputFormat directly.
If I use a storage handler then I am in “non-native table” territory and I
can’t LOAD my tables with data.

However I have just noticed that line 111 of file IndexWhereProcessor.java
seems to suggest that indexes are only ever used when the query is going
to run Map/Reduce. Is this so? So I seem to be in the position where I
can’t use InputFormat, StorageHandler or Indexes. What can I do?

Is there any way to filter the query without having to run Map/Reduce?

Any suggestions welcomed.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]
Peter Marron 2013-05-19, 22:11
Owen OMalley 2013-05-20, 04:36