Peter Marron 2013-05-19, 22:11
Firstly I want to say a huge thank you. You have really helped me enormously.
I realize that you have been busy with other things (like release 0.11.0) and
so I can understand that it must have been a pain to take time out to help me.
> The critical piece is in OpProcFactory where the setFilterExpression is called.
> calls tableScanDesc.setFilterExpr
> passes to TableScanDesc.getFilterExpr
> which is called by HiveInputFormat.pushFilters
> HiveInputFormat.pushFilters uses Utilities.serializeExpression to put it into the configuration.
> Unless something is screwing it up, it looks like it hangs together.
OK, I think I get it now. In my custom InputFormat I can read the config settings,
and so I can find the predicate that I need to do the filtering.
In particular, I can set the input splits so that only the right records are read.
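To make sure I understand the pattern, here is a minimal, Hive-free sketch of what I mean by "filtering in getSplits". The property key is the real one (TableScanDesc's "hive.io.filter.expr.serialized"), but the Map stands in for the JobConf and the prefix predicate stands in for a deserialized expression, so treat everything else as an assumption on my part:

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hive-free sketch: a getSplits() that reads a pushed-down filter out of the
// job "configuration" and keeps only the splits that can match it.
public class SplitFilterSketch {
    // Stand-in for JobConf lookup; the real key is TableScanDesc.FILTER_EXPR_CONF_STR.
    static final String FILTER_KEY = "hive.io.filter.expr.serialized";

    // Each "split" here is just a range tag; real splits carry paths and offsets.
    static List<String> getSplits(Map<String, String> conf, List<String> allSplits) {
        String filter = conf.get(FILTER_KEY);
        if (filter == null) {
            return allSplits; // no predicate pushed down: read everything
        }
        // In Hive you would deserialize an ExprNodeDesc here; a simple
        // prefix test keeps the sketch self-contained and runnable.
        Predicate<String> keep = s -> s.startsWith(filter);
        return allSplits.stream().filter(keep).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(FILTER_KEY, "2013");
        List<String> splits = Arrays.asList("2012-a", "2013-a", "2013-b");
        System.out.println(getSplits(conf, splits)); // [2013-a, 2013-b]
    }
}
```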
> Really? With ORC, allowing the reader to skip over rows that don't matter is very important. Keeping Hive from rechecking the predicate is a nice to have.
Of course, you’re right. It doesn’t matter if the predicate is applied again to records that have
already been filtered. What I meant is that I couldn’t afford to leave the filter in place, as that would mean
a Map/Reduce job would run. But…
> There has been some work to add additional queries (https://issues.apache.org/jira/browse/HIVE-2925),
> but if what you want is to run locally without MR, yeah, getting the predicate into the RecordReader isn't enough.
> I haven't looked through HIVE-2925 to see what is supported, but that is where I'd start.
You’re right! HIVE-2925 is exactly what I want, and now that I have found out how to make it work
I am in really good shape. Thanks.
There are a couple of quick questions that I would like to ask, though.
1) I didn’t know about HIVE-2925, and I would never have guessed that suppressing the
Map/Reduce job would be controlled by something called “hive.fetch.task.conversion”.
So maybe I’m missing a trick: how should I have found out about HIVE-2925?
2) I would like to parse the filter.expr.serialized XML, and I assume that there’s some
SAX, DOM or even XSLT machinery already in Hive. Could you give me a pointer to which classes
are used (JAXP, Xerces, Xalan?) or where they are used? Not important,
I’m just being lazy.
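In case it is relevant: poking around, I suspect the expression is written with java.beans.XMLEncoder (an assumption on my part, but it would explain the XML plan format), in which case the stdlib java.beans.XMLDecoder is all that is needed to read it back and no Xerces/Xalan is involved. A minimal round-trip sketch, with a plain String standing in for the real ExprNodeDesc bean:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Round-trip an object through java.beans.XMLEncoder / XMLDecoder, which I
// believe (unverified) is what Utilities.serializeExpression uses internally.
public class XmlRoundTrip {
    public static String encode(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(bean);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static Object decode(String xml) {
        try (XMLDecoder dec = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            return dec.readObject();
        }
    }

    public static void main(String[] args) {
        String xml = encode("hello");
        System.out.println(xml);            // XMLDecoder-readable document
        System.out.println(decode(xml));    // hello
    }
}
```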
3) I really want to do my filtering in the getSplits of my custom InputFormat. However,
I have found that my getSplits is not being called (I asked about this on the list
before). I have found that if I do this,
then my method is invoked. It seems to have something to do with avoiding
the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
However, I don’t know whether anything else bad will happen
if I make this change, as I don’t really know what I’m doing.
Is this a safe thing to do?
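For the archives, these are the two session settings in play for me (values from memory, so please double-check them against your Hive version):

```sql
-- HIVE-2925: allow simple SELECT ... WHERE queries to run as a local
-- fetch task instead of launching a full Map/Reduce job.
set hive.fetch.task.conversion=more;

-- Avoid CombineHiveInputFormat so that a custom InputFormat's
-- getSplits() is actually invoked.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```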
There are some other (less important) problems which I will ask about under separate cover.
However I would like to say thanks again. If we ever meet in the real world
I’ll stand you a beer (or equivalent).
Congratulations on version 0.11.0.
Trillium Software UK Limited
Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]