Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Huge join performance issue


Copy link to this message
-
Re: Huge join performance issue
In order for us to understand the performance and identify the bottlenecks,
could you do two things:

1) run the EXPLAIN command and share with us the output
2) share with us the hadoop job histories generated by the query. They can
be collected following
http://www.cs.duke.edu/starfish/tutorial/job_history.html

Jie
On Mon, Apr 8, 2013 at 11:39 AM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:

> Did you verify that all your available mappers are running (and reducers
> too)? If you have a small number of partitions with huge files, you might
> me underutilizing mappers (check that the files are being split). Also, it
> might be optimal to have a single "wave" of reducers by setting the number
> of reduce tasks appropriately.
>
> You might also consider optimizing a simpler query first:
>
> select t1.a, count(*)
> from (select a from table baseTB1 where ... ) t1  -- filter by partition
> as well
>   join
>         (select a  from baseTB2 where ...) t2    -- filter by partition as
> well
> on t1.a=t2.a
> group by t1.a
>
> just to give you an idea how much overhead the extra columns are adding.
> If the columns are pretty big they could be causing the slowdown.
>
> igor
> decide.com
>
>
> On Sat, Apr 6, 2013 at 2:30 PM, Gabi D <[EMAIL PROTECTED]> wrote:
>
>> Thank you for your answer Nitin.
>> Does anyone have additional insight into this? will be
>> greatly appreciated.
>>
>>
>> On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>
>>> you dont really need subqueries to join the tables which have common
>>> columns. Its an additional overhead
>>> best way to filter your data and speed up your data processing is how
>>> you layout your data
>>> When you have larger table I will use partitioning and bucketing to trim
>>> down the data and improve the performances over joins
>>>
>>> distribute by is mainly used when you have your custom map reduce
>>> scripts and you want to use transform functionality in hive. I have not
>>> used it a lot so not sure on that part. also its helpful to write where
>>> clauses in join statements to reduce the dataset you want to join.
>>>
>>>
>>>
>>> On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi all,
>>>> I have two tables I need to join and then summarize.
>>>> They are both huge (about 1B rows each, in the relevant partitions) and
>>>> the query runs for over 2 hours creating 5T intermediate data.
>>>>
>>>> The current query looks like this:
>>>>
>>>> select t1.b,t1.c,t2.d,t2.e, count(*)
>>>> from (select a,b,c    from table baseTB1 where ... ) t1  -- filter by
>>>> partition as well
>>>>   join
>>>>         (select a,d,e from baseTB2 where ...) t2    -- filter by
>>>> partition as well
>>>> on t1.a=t2.a
>>>> group by t1.b,t1.c,t2.d,t2.e
>>>>
>>>>
>>>> two questions:
>>>> 1. would joining baseTB1 and baseTB2 directly (instead of subqueries)
>>>> be better in any way?
>>>>           (I know subqueries cause a lot of writes of the intermediate
>>>> data but we also understand it's best to filter down the data that is being
>>>> joined, which is "more" correct?)
>>>> 2. can I use 'distribute by ' and/or 'sort by' in some way that would
>>>> help this? my understanding at the moment is that the problem lies in the
>>>> fact
>>>> that the reduces are on column a while the group by is on column b ...
>>>>
>>>> Any thoughts would be appreciated.
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>