Hive >> mail # user >> map side join with group by


Chen Song 2012-12-12, 23:32
Mark Grover 2012-12-13, 01:41
Nitin Pawar 2012-12-13, 05:30
Chen Song 2012-12-13, 14:56
Nitin Pawar 2012-12-13, 16:04
Chen Song 2012-12-13, 18:24
Nitin Pawar 2012-12-13, 18:42
Chen Song 2012-12-13, 19:12
Nitin Pawar 2012-12-13, 19:30
Re: map side join with group by
Thanks Nitin. That is all I wanted to clarify :)

Chen
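
The single-job plan Chen asks about in the quoted messages below (join in the map phase, group by in the reduce phase) can be illustrated with a small simulation. This is only a toy sketch in plain Python, not Hive's planner or execution engine; the tables `facts` and `d`, their columns, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: a large fact table and a small table "d" that fits in memory.
facts = [
    {"dim_id": 1, "amount": 10},
    {"dim_id": 2, "amount": 5},
    {"dim_id": 1, "amount": 7},
    {"dim_id": 3, "amount": 2},  # no match in d, so the inner join drops it
]
d = [
    {"id": 1, "region": "east"},
    {"id": 2, "region": "west"},
]


def build_hash_table(small_rows):
    """Load the small table into memory, keyed by the join key."""
    return {row["id"]: row for row in small_rows}


def mapper(fact_row, hash_table):
    """Map phase: join against the in-memory table, emit (group key, value)."""
    match = hash_table.get(fact_row["dim_id"])
    if match is not None:
        yield match["region"], fact_row["amount"]


def shuffle(pairs):
    """Shuffle: route map output to reducers by the group by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: aggregate one group."""
    return key, sum(values)


def run_single_job(fact_rows, small_rows):
    """Simulate one job: map-side join, then shuffle and reduce for the group by."""
    hash_table = build_hash_table(small_rows)
    map_output = [pair for row in fact_rows for pair in mapper(row, hash_table)]
    return dict(reducer(k, v) for k, v in shuffle(map_output).items())


result = run_single_job(facts, d)
print(result)
```

The point of the sketch is that the shuffle key is the group by key (`region`), not the join key (`dim_id`), because the join is already finished when the mapper emits its output. As the thread notes, Hive at the time of this discussion ran the same work as two separate jobs.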

On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:

> To improve the speed of the job, they created map-only joins so that all
> the records associated with a key go to one mapper; reducers slow it down.
> If the reducer has more work to do, they launch another job.
>
> Bear in mind that we can only be sure a map-only join speeds things up
> when the data in one of the tables is in the range of a few hundred MB.
> If a reduce phase has to be handled as well, the processing logic changes
> completely and the job also slows down.
>
> Launching a separate job for the group by is also a neat way to measure
> how much time is spent on the join and how much on the group by, so you
> can easily see the two separately.
>
> There is no way to ask a mapjoin to launch a reducer, as it is not
> supposed to do that.
>
> If you have such a case (maybe you think it will improve performance),
> please feel free to raise a JIRA and get it reviewed. If it's valid, I
> think people will provide more ideas.
>
>
> On Fri, Dec 14, 2012 at 12:42 AM, Chen Song <[EMAIL PROTECTED]> wrote:
>
>> Nitin
>>
>> Yeah. My original question was whether there is a way to force Hive (or
>> rather, whether it is possible) to execute the map-side join in the map
>> phase and the group by in the reduce phase. So instead of launching a
>> map-only job (join) and a MapReduce job (group by), do it all in a single
>> MR job. This is obviously not what Hive does, but I am wondering whether
>> it would be a nice feature to have.
>>
>> The point you made (different keys in join and group by) only matters
>> when the join runs in the reduce phase, right? As a map-side join takes
>> care of the join in the map phase, it sounds natural to me that the group
>> by can be done in the reduce phase of the same job. The only hassle I can
>> think of is that the map output has to be re-sorted (by the group by keys).
>>
>> Chen
>>
>> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>
>>> Chen, in a map-side join there are no reducers; it's a MAP ONLY job.
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <[EMAIL PROTECTED]> wrote:
>>>
>>>> Understood that it is impossible in the same MR job if both the join
>>>> and the group by are going to happen in the reduce phase (because the
>>>> join keys and group by keys are different). But for a map-side join, the
>>>> join would be complete by the end of the map phase, and the output should
>>>> be ready to be distributed to reducers based on the group by keys.
>>>>
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> That's because the first job works on the join keys and the second job
>>>>> on the group by keys, and these are different. You just can't assume the
>>>>> join keys and group by keys will be the same, so they are two different jobs.
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Yeah, the abridged version of my query might be a little broken, but my
>>>>>> point is that when a query has a map join and a group by, even in its
>>>>>> simplified form, it will launch two jobs. I was just wondering why the
>>>>>> map join and the group by cannot be accomplished in one MR job.
>>>>>>
>>>>>> Best,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> If I understood correctly, I think Chen wanted to know why this is a
>>>>>>> two-phase query.
>>>>>>>
>>>>>>> When you run a map-side join, it just performs the join. After that,
>>>>>>> it launches a second job to execute the group by part.
>>>>>>> I may be wrong, but this is what I saw whenever I executed group by
>>>>>>> queries.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>> I think we would need some more information.
>>>>>>>>
>>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but

Chen Song