Hive, mail # user - View Partition Pruning not Occurring during transform


John Omernik 2012-10-10, 19:04
shrikanth shankar 2012-10-10, 20:24
John Omernik 2012-10-11, 01:08
Edward Capriolo 2012-10-11, 13:32
Re: View Partition Pruning not Occurring during transform
John Omernik 2012-10-11, 19:04
I did try nesting. The problem is that I am trying to do it in a view, and I
think something gets lost in translation...

On Thu, Oct 11, 2012 at 8:32 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> Have you considered rewriting the query using nested FROM clauses?
> Generally, if Hive is not 'pushing down' as you would expect, nesting FROMs
> makes the query execute in a specific way.
>
>
> On Wednesday, October 10, 2012, John Omernik <[EMAIL PROTECTED]> wrote:
> > Agreed. That's the conclusion we came to as well. So it's less of a bug
> > and more of a feature request. I think one of the main advantages of Hive
> > is the flexibility in allowing non-technical users to run basic queries
> > without having to think about the transform stuff (i.e., we in the IT shop
> > can set up the transform). I like the annotation idea that somehow the
> > partition specs can be pushed through (identified in some other way, etc.).
> > I am new to the Apache/JIRA world; what would you recommend for getting
> > this into a feature request for consideration? I am not a Java programmer,
> > so my idea may need to be paired with a champion to help implement it :)
> >
> >
> > On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar <[EMAIL PROTECTED]> wrote:
> >>
> >> I assume the reason for this is that the Hive compiler has no way of
> >> determining that the 'day' that is input into the transform script is the
> >> same 'day' that is output from the transform script. Even if it did, it's
> >> unclear if pushing down would be legal without knowing the semantics of the
> >> transformation. Any optimization to be done here will likely need an
> >> annotation somewhere to say that certain columns in the output of a
> >> transform refer to specific columns in the input of a transform for
> >> predicate pushdown purposes (and that such pushdown is legal for this
> >> transformation).
> >>
> >> thanks,
> >> Shrikanth
> >> On Oct 10, 2012, at 12:04 PM, John Omernik wrote:
> >>
> >> > Greetings all, I am trying to incorporate a TRANSFORM into a view (so
> >> > we can abstract the transform script away from the user).
> >> >
> >> >
> >> >
> >> > As a test, I have a table partitioned on day (in YYYY-MM-DD format)
> >> > with lots of partitions,
> >> >
> >> > and I tried this:
> >> >
> >> > CREATE VIEW view_transform as
> >> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;
> >> >
> >> > The reason I used 'cat' in my test is that, if this works, I will
> >> > distribute my transform scripts to each node manually. I know each node
> >> > has cat, so it works as a test.
> >> >
> >> > When I run
> >> >
> >> > SELECT * from view_transform where day = '2012-10-08';
> >> >
> >> > 10,432 map tasks get spun up.
> >> >
> >> > If I rewrite the view to be
> >> >
> >> > CREATE VIEW view_transform as
> >> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table
> >> > where day = '2012-10-08';
> >> >
> >> > Then only 16 map tasks get spun up (the desired behavior, but the
> >> > pruning is happening in the view, not in the query).
> >> >
> >> > Thus I wanted input on whether this should be considered a bug, i.e.,
> >> > should we be able to define a partition spec in a view that uses a
> >> > transform that allows normal pruning to occur even though the partition
> >> > spec will be passed to the transform script? I think we should, and it's
> >> > likely doable somehow. This would be awesome for a number of situations
> >> > where you may want to expose "transformed" data to analysts without the
> >> > mess of having them format their script for transform.
> >> >
> >> >
> >>
> >
> >
>
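
To illustrate Shrikanth's point about why pushing a predicate through a transform is not automatically legal, here is a minimal Python sketch. It is a toy model, not Hive code: rows are plain (day, ip) tuples, and the function names (cat, relabel_day, filter_after, filter_before) are hypothetical stand-ins. It compares "transform everything, then filter" (what the view query means) with "filter first, then transform" (what pushdown would do).

```python
# Toy model only; nothing here is a Hive API and all names are hypothetical.
# A "transform" is any function that takes a list of (day, ip) rows.

def cat(rows):
    """Identity transform, standing in for USING 'cat'."""
    return list(rows)

def relabel_day(rows):
    """A transform that rewrites the day column of every row."""
    return [("2012-10-08", ip) for _day, ip in rows]

def filter_after(transform, rows, day):
    """Semantics of filtering the view: transform all rows, then filter."""
    return [r for r in transform(rows) if r[0] == day]

def filter_before(transform, rows, day):
    """What pushdown would do: prune the input first, then transform."""
    return transform([r for r in rows if r[0] == day])

source = [("2012-10-07", "10.0.0.1"), ("2012-10-08", "10.0.0.2")]
target = "2012-10-08"

# Identity transform: both orders give the same rows, so pushdown is safe.
same_for_cat = filter_after(cat, source, target) == filter_before(cat, source, target)

# Transform that changes 'day': the two orders disagree, so pushdown is unsafe.
same_for_relabel = (
    filter_after(relabel_day, source, target)
    == filter_before(relabel_day, source, target)
)

print(same_for_cat, same_for_relabel)
```

With 'cat' the two orders agree, which is why pruning looks like it ought to work in the test above. A script that changes 'day' gives different results depending on the order, and the compiler cannot tell the two kinds of script apart without something like the annotation Shrikanth describes.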