Hive >> mail # user >> Filtering


On Sun, May 19, 2013 at 3:11 PM, Peter Marron <[EMAIL PROTECTED]> wrote:

> Hi Owen,
>
> Firstly I want to say a huge thank you. You have really helped me
> enormously.
>

You're welcome.

> OK. I think that I get it now. In my custom InputFormat I can read the
> config settings
>
> JobConf.get(“"hive.io.filter.text"”);
> JobConf.get(“"hive.io.filter.expr.serialized"”);
>

well, you don't need double quotes, but yes.
>
> And so I can then find the predicate that I need to do the filtering.
>
> In particular I can set the input splits so that it just reads the right
> records.
>

Right. You want the serialized one, because there is an API to convert it
back to a data structure.
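
To illustrate the lookup, here is a minimal sketch. It assumes a plain Map standing in for JobConf so that it runs without Hadoop on the classpath; the class and method names are hypothetical. In a real InputFormat the same keys would be read with JobConf.get(String), which returns null when the key is unset, so the sketch treats a missing value as "no filter pushed down".

```java
import java.util.Map;
import java.util.Optional;

// Sketch only: a Map stands in for org.apache.hadoop.mapred.JobConf.
final class FilterConfigSketch {
    // Key names as quoted in this thread.
    static final String FILTER_TEXT = "hive.io.filter.text";
    static final String FILTER_EXPR = "hive.io.filter.expr.serialized";

    // Returns the serialized predicate, or empty when no filter was pushed down.
    static Optional<String> serializedFilter(Map<String, String> conf) {
        String value = conf.get(FILTER_EXPR);
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        // In real code this string would be handed to
        // Utilities.deserializeExpression; check its exact signature
        // in the Hive version you build against.
        return Optional.of(value);
    }
}
```

When the result is empty, getSplits would simply return every split unfiltered.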
>
> 1) I didn’t know about HIVE-2925 and I would never have thought that
> suppressing the Map/Reduce would be controlled by something called
> “hive.fetch.task.conversion”.
>
> So maybe I’m missing a trick. How should I have found out about HIVE-2925?
>
There isn't a "trick" other than being willing to ask on the user lists and
use your favorite search engine. As Hive developers, we absolutely need to
make more things happen automatically and reduce the need to know specific
magic incantations. Or at least document the magic incantations. *smile*

> 2) I would like to parse the filter.expr.serialized XML and I assume that
> there’s some SAX, DOM or even XSLT already in Hive. Could you give me a
> pointer to which classes are used (JAXP, Xerces, Xalan?) or where they are
> being used? Not important, I’m just being lazy.
>
If you look at pushFilters, it is using Utilities.serializeExpression, so
Utilities.deserializeExpression will reverse it.

> 3) I really want to do my filtering in the getSplits of my custom
> InputFormat. However I have found that my getSplits is not being called.
> (And I asked about this on the list before.) I have found that if I do this
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
> then my method is invoked. It seems to be something to do with avoiding
> the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
> However I don’t know whether there are any other bad things that will
> happen if I make this change as I don’t really know what I’m doing.
> Is this a safe thing to do?
>
Yes, that is a fine thing to do. It does mean that you'll need to ensure
you don't have too many maps, but other than that you should be ok. The
primary purpose of CombineHiveInputFormat is to allow Mappers to read from
multiple files.
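
To make the "too many maps" concern concrete, here is a rough sketch of how one might estimate the effect of grouping small files. It is not the algorithm CombineHiveInputFormat uses; the class name, method name and target size are all hypothetical. It greedily packs file lengths into groups of at most a target number of bytes, so that the number of groups approximates the number of map tasks you would want.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: greedy packing of file lengths, not Hive's own logic.
final class SplitPackingSketch {
    // Groups file lengths (in bytes) in input order. A group is closed as soon
    // as adding the next file would exceed targetBytes; a single file larger
    // than the target ends up in a group of its own.
    static List<List<Long>> pack(List<Long> fileLengths, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long len : fileLengths) {
            if (!current.isEmpty() && currentBytes + len > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(len);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Comparing the number of input files with the number of groups gives a quick feel for how many extra mappers you take on by having one split per file.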

> However I would like to say thanks again. If we ever meet in the real
> world I’ll stand you a beer (or equivalent).
>

Sounds good, although I'll take the equivalent, since I don't enjoy alcohol.
>
> Congratulations on version 0.11.0.
>

Thanks!

-- Owen