Hive >> mail # dev >> LIKE filter pushdown for tables and partitions


Re: LIKE filter pushdown for tables and partitions
Thanks Sergey for that.  Good stuff.

I can't speak for everybody here obviously but Hive partition elimination
is critical - it's gotta happen somehow.  However, if the JDOQL method isn't
robust around the edges I'm fine with finding something better.

So if I get you right, you're saying that by removing the "optimized path"
(getPartitionsByFilter/JDOQL) the partition elimination logic will default
to the "normal path", which is some other kind of filtering.  To that I
guess I'd have to say: what's the risk? It's a little slower?
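
For anyone following along, here is a rough sketch of what the "normal path"
amounts to as Sergey describes it: fetch the partition names, then apply the
LIKE filter locally. This is illustrative Python, not Hive's actual code. The
helper names (like_to_regex, parse_partition_name, prune_partitions) are made
up for the example, and it assumes partition values appear unescaped in the
names and that the LIKE pattern has no ESCAPE clause.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a Python regex.
    '%' matches any run of characters, '_' matches exactly one;
    every other character is escaped so it matches literally."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def parse_partition_name(name):
    """Split a name such as 'ds=2013-08-27/country=US' into a dict."""
    return dict(kv.split("=", 1) for kv in name.split("/"))

def prune_partitions(partition_names, column, like_pattern):
    """Keep only partitions whose value for `column` matches the pattern."""
    regex = like_to_regex(like_pattern)
    return [
        name for name in partition_names
        if regex.fullmatch(parse_partition_name(name).get(column, ""))
    ]

# Hypothetical partition names, as a metastore might list them.
names = [
    "ds=2013-08-26/country=US",
    "ds=2013-08-27/country=US",
    "ds=2013-09-01/country=CA",
]
print(prune_partitions(names, "ds", "2013-08-%"))
```

The cost of this approach is that every partition name has to be fetched
before any filtering happens, which is presumably where the "a little slower"
comes from. The upside is that the matching rules live in one place instead
of depending on how each datastore handles regexes.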

Thanks for your patience, Sergey!
On Tue, Aug 27, 2013 at 10:35 AM, Sergey Shelukhin
<[EMAIL PROTECTED]> wrote:

> This method is used to prune partitions for the job (separately from
> actually processing data).
> There are a few ways to get partitions from Hive for a query (to avoid
> reading all partitions when filtering involves partition columns)  -
> get-by-filter that I want to modify is one of them. Hive itself uses it as
> a perf optimization; the normal path gets all partition column values (via
> partition names) and applies the filter locally, whereas the optimized path
> converts the filter to JDOQL for DataNucleus (that Hive metastore uses
> internally), which converts it to SQL queries for e.g. MySQL. This normally
> happens before the MR job is even run.
>
> Hive uses the latter (JDOQL pushdown) path for a restricted set of filters.
> These are enforced in Hive metastore client, not server; the server
> supports a wider set of filters, but Hive itself doesn't use them. While
> trying to enable Hive to use a wider set I noticed that the LIKE filter
> doesn't work properly - both regex and indexOf/... functions in DN seem to
> have some weird edge cases. It may be sending some things directly to
> the datastore which would not actually work.
> However, they would work for simple regexes (the definition of "simple" is
> not clear and may not be the same for all datastores).
>
> Given that there's a normal path to filter partitions in the Hive client and
> the pre-job perf optimization for LIKE is not that important, I want to
> remove this for Hive.
> I assume that other products using this path must apply filtering on the
> client too sometimes (because getPartitionsByFilter doesn't support all
> filters even on the server, e.g. such operators as not, between, etc.).
>
> On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <[EMAIL PROTECTED]>
> wrote:
>
> > sorry to be dumb-ass but what does that translate into in the HSQL
> > dialect?
> >
> > Judging from the name you use, getPartitionsByFilter, you're saying you
> > want to remove the use case of using like clause on a partition column?
> >
> > if so, um, yeah, i would think that's surely used.
> >
> >
> >
> > On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Adding user list. Any objections to removing LIKE support from
> > > getPartitionsByFilter?
> > >
> > > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Couple of questions:
> > > >
> > > > 1. What about the LIKE operator for Hive itself? Will that continue to
> > > > work (presumably because there is an alternative path for that)?
> > > > 2. This will nonetheless break other direct consumers of the metastore
> > > > client api (like HCatalog).
> > > >
> > > > I see your point that we have a buggy implementation, so what's out
> > > > there is not safe to use. The question then really is: shall we remove
> > > > this code, thereby breaking people for whom the current buggy
> > > > implementation is good enough (or, you could say, saving them from
> > > > breaking in the future)? Or shall we try to fix it now?
> > > > My take is that if there are no users of this anyway, then there is no
> > > > point fixing it for non-existent users, but if there are we probably
> > > > have to. I would suggest you send an email to users@hive to ask if
> > > > there are users for this.
> > > >
> > > > Thanks,
> > > > Ashutosh