Hive >> mail # user >> Filtering


Hi Owen,

Firstly I want to say a huge thank you. You have really helped me enormously.
I realize that you have been busy with other things (like release 0.11.0) and
so I can understand that it must have been a pain to take time out to help me.

>The critical piece is in OpProcFactory where the setFilterExpression is called.
>
>OpProcFactory.pushFilterToStorageHandler
>  calls tableScanDesc.setFilterExpr
>    passes to TableScanDesc.getFilterExpr
>    which is called by HiveInputFormat.pushFilters
>
>HiveInputFormat.pushFilters uses Utilities.serializeExpression to put it into the configuration.
>
>Unless something is screwing it up, it looks like it hangs together.

OK. I think that I get it now. In my custom InputFormat I can read the config settings

jobConf.get("hive.io.filter.text");
jobConf.get("hive.io.filter.expr.serialized");

And so I can then find the predicate that I need to do the filtering.
In particular I can set the input splits so that it just reads the right records.
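
For what it's worth, the shape of what I plan to do in getSplits is roughly the sketch below. (To keep the snippet self-contained I've stood a plain Map in for the JobConf; in the real InputFormat these values come from JobConf.get, and the key names are the two configuration properties above. The predicate string is invented.)

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch: a plain Map stands in for the JobConf that the real
// getSplits receives; the two keys are the ones Hive populates via pushFilters.
public class FilterLookup {
    static final String FILTER_TEXT = "hive.io.filter.text";
    static final String FILTER_EXPR = "hive.io.filter.expr.serialized";

    // Returns the human-readable filter text, or null when no predicate was
    // pushed down (in which case the InputFormat must read everything).
    static String pushedFilter(Map<String, String> conf) {
        return conf.get(FILTER_TEXT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(FILTER_TEXT, "(key = 42)"); // hypothetical pushed-down predicate
        String filter = pushedFilter(conf);
        // Real code would use the filter here to build only the matching splits.
        System.out.println(filter);
    }
}
```

When the key is absent I fall back to generating splits for the whole table, so the change is safe for unfiltered queries.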

>Really? With ORC, allowing the reader to skip over rows that don't matter is very important. Keeping Hive from rechecking the predicate is a nice to have.

Of course, you’re right. It doesn’t matter if the predicate is applied again to the records that are
already filtered. I meant that I couldn’t afford to leave the filter in place as it would mean that
a Map/Reduce would occur. But…

>There has been some work to add additional queries (https://issues.apache.org/jira/browse/HIVE-2925),
> but if what you want is to run locally without MR, yeah, getting the predicate into the RecordReader isn't enough.
>I haven't looked through HIVE-2925 to see what is supported, but that is where I'd start.
>-- Owen

You’re right! HIVE-2925 is exactly what I want, and now that I have found out how to make it work with
set hive.fetch.task.conversion=more;
I am really in good shape. Thanks.

There are a couple of quick questions that I would like to know the answers to, though.
1)      I didn’t know about HIVE-2925 and I would never have thought that suppressing the
Map/Reduce would be controlled by something called “hive.fetch.task.conversion”,
so maybe I’m missing a trick. How should I have found out about HIVE-2925?

2)      I would like to parse the filter.expr.serialized XML and I assume that there’s some
SAX, DOM or even XSLT support already in Hive. Could you give me a pointer to which classes
are used (JAXP, Xerces, Xalan?) or where they are being used? Not important,
I’m just being lazy.
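
In case it helps frame the question: the JDK already ships a JAXP DOM parser, so even without knowing which parser Hive bundles I can at least peek at the serialized expression generically. A minimal sketch (the sample XML here is invented — the real shape of hive.io.filter.expr.serialized depends on Hive’s serializer):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Generic JAXP/DOM sketch: parse an XML string from the configuration and
// inspect its root element. No Hive classes needed for this first look.
public class ExprXmlPeek {
    static String rootName(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample only; the real serialized expression differs.
        String sample = "<java class=\"java.beans.XMLDecoder\">"
                      + "<object class=\"example.ExprNode\"/></java>";
        System.out.println(rootName(sample));
    }
}
```

Walking the child nodes from the document element would then let me rebuild the predicate tree, whatever the actual element names turn out to be.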

3)      I really want to do my filtering in the getSplits of my custom InputFormat. However
I have found that my getSplits is not being called. (I asked about this on the list
before.) I have found that if I do
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
then my method is invoked. It seems to be something to do with avoiding
the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
However I don’t know whether there are any other bad things that will happen
if I make this change, as I don’t really know what I’m doing.
Is this a safe thing to do?
There are some other (less important) problems which I will ask about under separate cover.

However I would like to say thanks again. If we ever meet in the real world
I’ll stand you a beer (or equivalent).

Congratulations on version 0.11.0.

Z
aka
Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]