Hive, mail # user - Map side join
Re: Map side join
Souvik Banerjee 2012-12-27, 23:05
Hi,

To conclude this thread I am summarizing my experiences. Correct me if
you think or have observed otherwise.

1) For a map-side join you need to set hive.auto.convert.join=true.
Map-side joins work well with multiple tables and multiple join conditions.
2) You can change the size limit for the small table according to the RAM
available.
3) If you observe huge volume expansion during the join operation, the
mappers will take a long time. I observed that mappers don't always report
status, so set the task timeout to a high value so that the framework
doesn't kill the ongoing tasks. The mappers eventually complete and the job
ends successfully.
4) Reducing the HDFS block size does launch more mappers and is very
helpful in cases where you observe real volume expansion during the join.
But it might cause problems for other queries / Hadoop jobs.
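
The four points above can be sketched as Hive session settings. This is a
minimal sketch, assuming Hive on Hadoop 1.x property names. Only
hive.auto.convert.join and the two split-size properties are named in this
thread; hive.mapjoin.smalltable.filesize and mapred.task.timeout are
assumptions about which properties correspond to the "small table size" and
"timeout" mentioned above, and every value shown is an illustrative
placeholder, not a recommendation.

```sql
-- Point 1: let Hive convert eligible joins into map-side joins.
SET hive.auto.convert.join=true;

-- Point 2 (assumed property): size threshold in bytes below which a table
-- counts as "small". 50000000 is a placeholder; size it to the available RAM.
SET hive.mapjoin.smalltable.filesize=50000000;

-- Point 3 (assumed property): task timeout in milliseconds. 3600000 (1 hour)
-- is a placeholder for "a high value".
SET mapred.task.timeout=3600000;

-- Point 4 alternative, from Bejoy's reply quoted below: limit the split size
-- for this session (bytes; 67108864 = 64 MB) instead of changing the
-- cluster-wide HDFS block size.
SET mapred.min.split.size=67108864;
SET mapred.max.split.size=67108864;
```

Because SET applies only to the current Hive session, session-level split
sizes like these should not affect other queries or Hadoop jobs, which is
the concern raised in point 4.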

Thanks and regards,
Souvik.

On Thu, Dec 13, 2012 at 12:36 PM, Souvik Banerjee
<[EMAIL PROTECTED]> wrote:

> Thanks for the help.
> What I did earlier was change the configuration in HDFS and create the
> table. I expected the block size of the new table to be 32 MB.
> But I found that when using Cloudera Manager you need to deploy the
> configuration change for both HDFS and MapReduce. (I did it only for HDFS.)
> Now I have deleted the old table and recreated it, and I can launch
> more mappers.
> Thanks a lot once again. I will post what happens with more mappers.
>
> Thanks and regards,
> Souvik.
>
>
> On Thu, Dec 13, 2012 at 12:06 PM, <[EMAIL PROTECTED]> wrote:
>
>> Hi Souvik
>>
>> For the new HDFS block size to take effect on already existing files,
>> you need to re-copy them into HDFS.
>>
>> To play with the number of mappers you can set a smaller value, like
>> 64 MB, for the min and max split size:
>>
>> mapred.min.split.size and mapred.max.split.size
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ------------------------------
>> From: Souvik Banerjee <[EMAIL PROTECTED]>
>> Date: Thu, 13 Dec 2012 12:00:16 -0600
>> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
>> Subject: Re: Map side join
>>
>> Hi Bejoy,
>>
>> The input files are uncompressed text files.
>> There are enough free slots in the cluster.
>>
>> Can you please let me know how I can increase the number of mappers?
>> I tried reducing the HDFS block size from 128 MB to 32 MB. I was
>> expecting to get more mappers, but it is still launching the same number
>> of mappers as it did when the HDFS block size was 128 MB. I have enough
>> map slots available, but I am not able to utilize them.
>>
>>
>> Thanks and regards,
>> Souvik.
>>
>>
>> On Thu, Dec 13, 2012 at 11:12 AM, <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Souvik
>>>
>>> Are your input files compressed using a non-splittable compression
>>> codec?
>>>
>>> Do you have enough free slots while this job is running?
>>>
>>> Make sure that the job is not running locally.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> From: Souvik Banerjee <[EMAIL PROTECTED]>
>>> Date: Wed, 12 Dec 2012 14:27:27 -0600
>>> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
>>> ReplyTo: [EMAIL PROTECTED]
>>> Subject: Re: Map side join
>>>
>>> Hi Bejoy,
>>>
>>> Yes, I ran the pi example. It was fine.
>>> Regarding the Hive job, I found that it took 4 hours for the first
>>> map job to complete.
>>> Those map tasks were doing their job and only reported status after
>>> completion. It is indeed taking too long to finish. I could not find
>>> anything relevant in the logs.
>>>
>>> Thanks and regards,
>>> Souvik.
>>>
>>> On Wed, Dec 12, 2012 at 8:04 AM, <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Souvik
>>>>
>>>> Apart from Hive jobs, are normal MapReduce jobs like wordcount
>>>> running fine on your cluster?
>>>>
>>>> If they are, do you see anything suspicious for the Hive jobs in the
>>>> task, TaskTracker or JobTracker logs?