Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Huge join performance issue


+
Gabi D 2013-04-04, 12:23
+
Nitin Pawar 2013-04-04, 12:39
+
Gabi D 2013-04-06, 21:30
+
Igor Tatarinov 2013-04-08, 18:39
Copy link to this message
-
Re: Huge join performance issue
In order for us to understand the performance and identify the bottlenecks,
could you do two things:

1) run the EXPLAIN command and share with us the output
2) share with us the hadoop job histories generated by the query. They can
be collected following
http://www.cs.duke.edu/starfish/tutorial/job_history.html

Jie
On Mon, Apr 8, 2013 at 11:39 AM, Igor Tatarinov <[EMAIL PROTECTED]> wrote:

> Did you verify that all your available mappers are running (and reducers
> too)? If you have a small number of partitions with huge files, you might
> me underutilizing mappers (check that the files are being split). Also, it
> might be optimal to have a single "wave" of reducers by setting the number
> of reduce tasks appropriately.
>
> You might also consider optimizing a simpler query first:
>
> select t1.a, count(*)
> from (select a from table baseTB1 where ... ) t1  -- filter by partition
> as well
>   join
>         (select a  from baseTB2 where ...) t2    -- filter by partition as
> well
> on t1.a=t2.a
> group by t1.a
>
> just to give you an idea how much overhead the extra columns are adding.
> If the columns are pretty big they could be causing the slowdown.
>
> igor
> decide.com
>
>
> On Sat, Apr 6, 2013 at 2:30 PM, Gabi D <[EMAIL PROTECTED]> wrote:
>
>> Thank you for your answer Nitin.
>> Does anyone have additional insight into this? will be
>> greatly appreciated.
>>
>>
>> On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>>
>>> you dont really need subqueries to join the tables which have common
>>> columns. Its an additional overhead
>>> best way to filter your data and speed up your data processing is how
>>> you layout your data
>>> When you have larger table I will use partitioning and bucketing to trim
>>> down the data and improve the performances over joins
>>>
>>> distribute by is mainly used when you have your custom map reduce
>>> scripts and you want to use transform functionality in hive. I have not
>>> used it a lot so not sure on that part. also its helpful to write where
>>> clauses in join statements to reduce the dataset you want to join.
>>>
>>>
>>>
>>> On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi all,
>>>> I have two tables I need to join and then summarize.
>>>> They are both huge (about 1B rows each, in the relevant partitions) and
>>>> the query runs for over 2 hours creating 5T intermediate data.
>>>>
>>>> The current query looks like this:
>>>>
>>>> select t1.b,t1.c,t2.d,t2.e, count(*)
>>>> from (select a,b,c    from table baseTB1 where ... ) t1  -- filter by
>>>> partition as well
>>>>   join
>>>>         (select a,d,e from baseTB2 where ...) t2    -- filter by
>>>> partition as well
>>>> on t1.a=t2.a
>>>> group by t1.b,t1.c,t2.d,t2.e
>>>>
>>>>
>>>> two questions:
>>>> 1. would joining baseTB1 and baseTB2 directly (instead of subqueries)
>>>> be better in any way?
>>>>           (I know subqueries cause a lot of writes of the intermediate
>>>> data but we also understand it's best to filter down the data that is being
>>>> joined, which is "more" correct?)
>>>> 2. can I use 'distribute by ' and/or 'sort by' in some way that would
>>>> help this? my understanding at the moment is that the problem lies in the
>>>> fact
>>>> that the reduces are on column a while the group by is on column b ...
>>>>
>>>> Any thoughts would be appreciated.
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB