Hive >> mail # user >> Filtering


Hi Owen,

Firstly I want to say a huge thank you. You have really helped me enormously.
I realize that you have been busy with other things (like release 0.11.0) and
so I can understand that it must have been a pain to take time out to help me.

>The critical piece is in OpProcFactory where the setFilterExpression is called.
>
>OpProcFactory.pushFilterToStorageHandler
>  calls tableScanDesc.setFilterExpr
>    passes to TableScanDesc.getFilterExpr
>    which is called by HiveInputFormat.pushFilters
>
>HiveInputFormat.pushFilters uses Utilities.serializeExpression to put it into the configuration.
>
>Unless something is screwing it up, it looks like it hangs together.

OK. I think that I get it now. In my custom InputFormat I can read the config settings

JobConf.get("hive.io.filter.text");
JobConf.get("hive.io.filter.expr.serialized");

And so I can then find the predicate that I need to do the filtering.
In particular I can set the input splits so that it just reads the right records.
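Concretely, I'm planning something like the sketch below in my getSplits. The class name and the pruning logic are mine and untested; I'm assuming (from reading the 0.11 source) that the TableScanDesc constants and Utilities.deserializeExpression are the right entry points, so please shout if I've got that wrong.

```java
// Sketch only: a custom InputFormat reading the pushed-down predicate
// in getSplits. Requires a Hive/Hadoop runtime on the classpath.
import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.ql.plan.TableScanDesc;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public abstract class FilteringInputFormat<K, V> extends FileInputFormat<K, V> {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Human-readable form of the predicate, e.g. "(key < 100)".
    String filterText = job.get(TableScanDesc.FILTER_TEXT_CONF_STR);

    // The serialized ExprNodeDesc placed there by HiveInputFormat.pushFilters.
    String serialized = job.get(TableScanDesc.FILTER_EXPR_CONF_STR);
    if (serialized != null) {
      ExprNodeDesc filterExpr = Utilities.deserializeExpression(serialized, job);
      // ... walk filterExpr here and build only the splits that can match ...
    }
    return super.getSplits(job, numSplits);
  }
}
```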

>Really? With ORC, allowing the reader to skip over rows that don't matter is very important. Keeping Hive from rechecking the predicate is a nice to have.

Of course, you’re right. It doesn’t matter if the predicate is applied again to the records that are
already filtered. I meant that I couldn’t afford to leave the filter in place as it would mean that
a Map/Reduce would occur. But…

>There has been some work to add additional queries (https://issues.apache.org/jira/browse/HIVE-2925),
> but if what you want is to run locally without MR, yeah, getting the predicate into the RecordReader isn't enough.
>I haven't looked through HIVE-2925 to see what is supported, but that is where I'd start.
>-- Owen

You’re right! HIVE-2925 is exactly what I want, and now that I have found out how to make it work with
set hive.fetch.task.conversion=more;
I am really in good shape. Thanks.

There are a couple of quick questions that I would like to ask, though.
1)      I didn’t know about HIVE-2925, and I would never have thought that suppressing the
Map/Reduce would be controlled by something called “hive.fetch.task.conversion”.
So maybe I’m missing a trick. How should I have found out about HIVE-2925?

2)      I would like to parse the filter.expr.serialized XML, and I assume that there’s some
SAX, DOM or even XSLT machinery already in Hive. Could you give me a pointer to which classes
are used (JAXP, Xerces, Xalan?) or where they are being used? Not important,
I’m just being lazy.
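(For anyone searching the archives later: poking around the source, it looks to me as though Utilities.serializeExpression just uses java.beans.XMLEncoder, so the matching JDK XMLDecoder reads it back without any extra parser. Here's a self-contained round trip with a plain String standing in for the real ExprNodeDesc; I may be wrong about the encoder, so treat this as a guess.)

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ExprXmlRoundTrip {

    // Serialize an object graph to java.beans XML, which I believe is the
    // same format Hive's Utilities.serializeExpression produces.
    static String toXml(Object o) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(o);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Read it back with the matching JDK decoder -- no Xerces/Xalan needed.
    static Object fromXml(String xml) {
        try (XMLDecoder dec = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            return dec.readObject();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a serialized predicate such as "(key < 100)".
        String xml = toXml("(key < 100)");
        System.out.println(fromXml(xml));
    }
}
```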

3)      I really want to do my filtering in the getSplits of my custom InputFormat. However,
I have found that my getSplits is not being called. (I asked about this on the list
before.) I have found that if I do
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
then my method is invoked. It seems to have something to do with avoiding
the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
However, I don’t know whether there are any other bad things that will happen
if I make this change, as I don’t really know what I’m doing.
Is this a safe thing to do?
There are some other (less important) problems which I will ask about under separate cover.

However I would like to say thanks again. If we ever meet in the real world
I’ll stand you a beer (or equivalent).

Congratulations on version 0.11.0.

Z
aka
Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [EMAIL PROTECTED]
