Hive >> mail # user >> map side join with group by

Chen Song 2012-12-12, 23:32
Mark Grover 2012-12-13, 01:41
Nitin Pawar 2012-12-13, 05:30
Chen Song 2012-12-13, 14:56
Nitin Pawar 2012-12-13, 16:04
Chen Song 2012-12-13, 18:24
Nitin Pawar 2012-12-13, 18:42
Chen Song 2012-12-13, 19:12
Re: map side join with group by
To improve the speed of the job, map-only joins were created so that all the
records associated with a key go to a single mapper; reducers slow it down. If
there is more work for a reducer to do, Hive launches another job.

Bear in mind that when we say map-only join, we can only be sure that speed
will increase when the data in one of the tables is in the few-hundred-MB
range. If a reduce phase has to be handled as well, the processing logic
changes completely and it also slows down.

Launching a new job for the group by is a neat way to measure how much time you
spent on just the join and how much on the group by, so you can easily see the
two separately.

There is no way you can ask a mapjoin to launch a reducer, as it is not
supposed to do so.

If you have such a case (maybe you think it will improve performance),
please feel free to raise a JIRA and get it reviewed. If it's valid, I think
people will provide more ideas.
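
For illustration only, the single-job plan being asked about in this thread (join in the map phase, group by in the reduce phase) can be sketched as a toy simulation in plain Python. This is not how Hive executes anything; the table contents and the function names map_phase, shuffle and reduce_phase are all made up for the example.

```python
from collections import defaultdict

# Hypothetical data: a small table held in memory and a larger fact table.
small_table = {1: "US", 2: "CA"}                 # join key -> country
fact_rows = [(1, 10), (2, 5), (1, 7), (3, 4)]    # (join key, amount)

def map_phase(rows, lookup):
    """Map-side join: probe the in-memory table, emit (group key, value)."""
    for key, amount in rows:
        if key in lookup:                # inner join: drop unmatched rows
            yield lookup[key], amount    # re-key output by the group by column

def shuffle(pairs):
    """Collect map output by group by key, as a shuffle step would."""
    groups = defaultdict(list)
    for group_key, value in pairs:
        groups[group_key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group (here: SUM of amount)."""
    return {group_key: sum(values) for group_key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(fact_rows, small_table)))
print(result)  # {'US': 17, 'CA': 5}
```

The point of the sketch is that the join key (the numeric id) and the group by key (the country) differ, but since the join is finished inside the mapper, the map output can be keyed by the group by column before the shuffle.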
On Fri, Dec 14, 2012 at 12:42 AM, Chen Song <[EMAIL PROTECTED]> wrote:

> Nitin
> Yeah. My original question is whether there is a way to force Hive (or
> rather, whether it is possible) to execute the map side join in the mapper
> phase and the group by in the reduce phase. So instead of launching a map
> only job (join) and a map reduce job (group by), do it all together in a
> single MR job. This is obviously not what Hive does but I am wondering if
> it would be a nice feature to have.
> The point you made (different keys in join and group by) only matters in
> the reduce phase, right? As the map side join takes care of the join in the
> mapper phase, it sounds natural to me that the group by can be done in the
> reduce phase of the same job. The only hassle that I can think of is that
> the map output has to be re-sorted (based on the group by keys).
> Chen
> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>> Chen, in a map side join there are no reducers .. it's a MAP ONLY job
>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <[EMAIL PROTECTED]> wrote:
>>> Understood the fact that it is impossible in the same MR job if both
>>> join and group by are gonna happen in the reduce phase (because the join
>>> keys and group by keys are different). But for map side join, the joins
>>> would be complete by the end of the map phase, and outputs should be ready
>>> to be distributed to reducers based on group by keys.
>>> Chen
>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>> That's because the first job uses the join keys and the second job
>>>> uses the group by keys; you just can't assume the join keys and group
>>>> by keys will be the same, so they are two different jobs
>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <[EMAIL PROTECTED]> wrote:
>>>>> Yeah, my abridged version of query might be a little broken but my
>>>>> point is that when a query has a map join and group by, even in its
>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>> map join and group by cannot be accomplished in one MR job.
>>>>> Best,
>>>>> Chen
>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>>>> I think Chen wanted to know why this is a two-phase query, if I
>>>>>> understood it correctly
>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>> after that to execute the group by part it launches the second job.
>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>> queries
>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <[EMAIL PROTECTED]> wrote:
>>>>>>> Hi Chen,
>>>>>>> I think we would need some more information.
>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>>> there is no such table in the query. Moreover, Map joins only make
>>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
Nitin Pawar
Chen Song 2012-12-13, 19:50