Re: LIKE filter pushdown for tables and partitions
Stephen Sprague 2013-08-27, 19:50
Thanks Sergey for that. Good stuff.
I can't speak for everybody here, obviously, but Hive partition elimination
is critical; it's gotta happen somehow. However, if the JDOQL method isn't
robust around the edges, I'm fine with finding something better.
So if I get you right, you're saying that by removing the "optimized path"
(getPartitionsByFilter/JDOQL) the partition elimination logic will default
to the "normal path", which is some other kind of filtering. To that I
guess I'd have to say: what's the risk? It's a little slower?
Thanks for your patience, Sergey!
On Tue, Aug 27, 2013 at 10:35 AM, Sergey Shelukhin
> This method is used to prune partitions for the job (separately from
> actually processing data).
> There are a few ways to get partitions from Hive for a query (to avoid
> reading all partitions when filtering involves partition columns);
> get-by-filter, which I want to modify, is one of them. Hive itself uses it as
> a perf optimization: the normal path gets all partition column values (via
> partition names) and applies the filter locally, whereas the optimized path
> converts the filter to JDOQL for DataNucleus (which the Hive metastore uses
> internally), which in turn converts it to SQL queries for e.g. MySQL. This
> normally happens before the MR job is even run.
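
A minimal sketch of what the "normal path" described above amounts to, written in Python purely for illustration (it is not Hive's code). The function names `parse_partition_name` and `prune_locally` are hypothetical, and the sketch assumes partition names have the form `col=value/col=value` with percent-encoded values.

```python
from typing import Callable, Dict, List
from urllib.parse import unquote


def parse_partition_name(name: str) -> Dict[str, str]:
    """Split a name like 'ds=2013-08-27/region=us' into column values."""
    values: Dict[str, str] = {}
    for part in name.split("/"):
        key, _, value = part.partition("=")
        # Assumption: special characters in values are percent-encoded.
        values[key] = unquote(value)
    return values


def prune_locally(
    partition_names: List[str],
    predicate: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Fetch-all-then-filter: evaluate the predicate on the client side."""
    return [n for n in partition_names if predicate(parse_partition_name(n))]


names = [
    "ds=2013-08-26/region=us",
    "ds=2013-08-27/region=us",
    "ds=2013-08-27/region=eu",
]
kept = prune_locally(
    names, lambda p: p["ds"] == "2013-08-27" and p["region"] != "eu"
)
print(kept)
```

The cost of this path is that every partition name has to be fetched before any filtering happens, which is why the pushdown path exists as an optimization.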
> Hive uses the latter (JDOQL pushdown) path for a restricted set of filters.
> These restrictions are enforced in the Hive metastore client, not the
> server; the server supports a wider set of filters, but Hive itself doesn't
> use them. While trying to enable Hive to use a wider set I noticed that the
> LIKE filter doesn't work properly: both regex and indexOf/... functions in
> DN seem to have some weird edge cases. It may be sending some things
> directly to the datastore that would not actually work.
> However, they would work for simple regexes (the definition of simple is not
> clear and may not be the same for all datastores).
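
To illustrate the kind of edge case that can appear when a LIKE pattern is turned into a regex, here is a hedged Python sketch of a client-side conversion. It is not the DataNucleus or Hive implementation; `like_to_regex` and `like_matches` are hypothetical names, and LIKE escape characters (such as `\%`) are deliberately not handled.

```python
import re


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern into a compiled regex.

    '%' matches any run of characters and '_' matches exactly one;
    every other character is escaped so it is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like_matches(value: str, pattern: str) -> bool:
    """LIKE must match the whole value, hence fullmatch."""
    return like_to_regex(pattern).fullmatch(value) is not None


print(like_matches("2013-08-27", "2013-08-%"))  # prefix match on a date
print(like_matches("axb", "a.b"))  # '.' is a literal in LIKE, not a wildcard
```

A conversion that skips the escaping step would treat `.`, `+`, `(` and similar characters as regex operators, so a pattern like `a.b` would wrongly match `axb`. Whether a given datastore behaves this way is exactly the part that is unclear above.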
> Given that there's a normal path to filter partitions in the Hive client and
> a pre-job perf optimization for LIKE is not that important, I want to remove
> this for Hive.
> I assume that other products using this path must also apply filtering on
> the client sometimes (because getPartitionsByFilter doesn't support all
> filters even on the server, e.g. such operators as NOT, BETWEEN, etc.).
> On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <[EMAIL PROTECTED]>
> > sorry to be dumb-ass but what does that translate into in the HSQL
> > Judging from the name you use, getPartitionsByFilter, you're saying you
> > want to remove the use case of using a LIKE clause on a partition column?
> > If so, um, yeah, I would think that's surely used.
> > On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:
> > > Adding user list. Any objections to removing LIKE support from
> > > getPartitionsByFilter?
> > >
> > > > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
> > >
> > > > Couple of questions:
> > > >
> > > > > 1. What about the LIKE operator for Hive itself? Will that continue to
> > > > > work (presumably because there is an alternative path for that)?
> > > > > 2. This will nonetheless break other direct consumers of the metastore
> > > > > client API (like HCatalog).
> > > >
> > > > > I see your point that we have a buggy implementation, so what's out
> > > > > there is not safe to use. The question then really is: shall we remove
> > > > > this code, thereby breaking people for whom the current buggy
> > > > > implementation is good enough (or, you could say, saving them from
> > > > > breaking in the future)? Or shall we try to fix it now?
> > > > > My take is that if there are no users of this anyway, then there is no
> > > > > point fixing it for non-existent users, but if there are, we probably
> > > > > have to. I suggest you send an email to users@hive to ask if there
> > > > > are users for this.
> > > >
> > > > Thanks,
> > > > Ashutosh