you dont really need subqueries to join the tables which have common
columns. Its an additional overhead
best way to filter your data and speed up your data processing is how you
layout your data
When you have larger table I will use partitioning and bucketing to trim
down the data and improve the performances over joins
distribute by is mainly used when you have your custom map reduce scripts
and you want to use transform functionality in hive. I have not used it a
lot so not sure on that part. also its helpful to write where clauses in
join statements to reduce the dataset you want to join.
On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <[EMAIL PROTECTED]> wrote:
> Hi all,
> I have two tables I need to join and then summarize.
> They are both huge (about 1B rows each, in the relevant partitions) and
> the query runs for over 2 hours creating 5T intermediate data.
> The current query looks like this:
> select t1.b,t1.c,t2.d,t2.e, count(*)
> from (select a,b,c from table baseTB1 where ... ) t1 -- filter by
> partition as well
> (select a,d,e from baseTB2 where ...) t2 -- filter by partition
> as well
> on t1.a=t2.a
> group by t1.b,t1.c,t2.d,t2.e
> two questions:
> 1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be
> better in any way?
> (I know subqueries cause a lot of writes of the intermediate
> data but we also understand it's best to filter down the data that is being
> joined, which is "more" correct?)
> 2. can I use 'distribute by ' and/or 'sort by' in some way that would help
> this? my understanding at the moment is that the problem lies in the fact
> that the reduces are on column a while the group by is on column b ...
> Any thoughts would be appreciated.