Hive, mail # user - Strange behavior during Hive queries


RE: Strange behavior during Hive queries
Ashish Thusoo 2009-09-14, 18:29
How is your data stored: SequenceFiles, text files, compressed? And what is the value of mapred.min.split.size? Hive does not usually decide the number of mappers, but it does try to estimate the number of reducers to use. Also, if you could send out the plan, that would be great.

Ashish
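
For readers following the split-size question above, here is a rough Python sketch of how the old-API FileInputFormat in Hadoop picks a split size, max(minSize, min(goalSize, blockSize)), and what that implies for the total number of map tasks. The function names are hypothetical, the 64 MB block size is an assumption (the thread does not state it), and the estimate ignores Hadoop's small slop factor on the last split. Note that this concerns the total number of map tasks in the job, not how many run concurrently on each TaskTracker.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def compute_split_size(goal_size, min_size, block_size):
    """Split-size rule of the old-API FileInputFormat."""
    return max(min_size, min(goal_size, block_size))


def estimate_map_tasks(file_sizes, block_size, min_split_size=1,
                       requested_maps=1, splittable=True):
    """Approximate the number of input splits (one map task per split).

    file_sizes      -- sizes of the input files in bytes
    block_size      -- HDFS block size in bytes (assumed, not from the thread)
    min_split_size  -- value of mapred.min.split.size
    requested_maps  -- value of mapred.map.tasks, which is only a hint
    splittable      -- False for formats such as gzip-compressed text files
    """
    if not splittable:
        # A non-splittable file is read by exactly one mapper.
        return len(file_sizes)
    goal_size = sum(file_sizes) // max(requested_maps, 1)
    split_size = compute_split_size(goal_size, min_split_size, block_size)
    # Ceiling division per file; splits never span two files.
    return sum(-(-size // split_size) for size in file_sizes)


if __name__ == "__main__":
    files = [2 * GB] * 436  # 436 files of ~2 GB, as in the original post
    print(estimate_map_tasks(files, block_size=64 * MB))
    print(estimate_map_tasks(files, block_size=64 * MB, min_split_size=2 * GB))
    print(estimate_map_tasks(files, block_size=64 * MB, splittable=False))
```

Under these assumptions, raising mapred.min.split.size above the block size, or storing the data in a non-splittable compressed format, sharply reduces the number of map tasks, which is why the storage format and split size are worth checking first.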

________________________________
From: Brad Heintz [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 13, 2009 9:36 AM
To: [EMAIL PROTECTED]
Subject: Re: Strange behavior during Hive queries

Edward -

Yeah, I figured Hive had some decisions it made internally about how many mappers & reducers it used, but this is acting on almost 1TB of data - I don't see why it would use fewer mappers.  Also, this isn't a sort (which would of course use only 1 reducer) - it's a straight count.

Thanks,
- Brad

On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[EMAIL PROTECTED]> wrote:
>>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -
>> though clearly, it has to be something Hive is doing.
>>
>> Thanks,
>> - Brad
>>
>>
>> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi Brad,
>>>
>>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>> TaskTracker when it starts up. It cannot be changed per-job.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <[EMAIL PROTECTED]> wrote:
>>>>
>>>> TIA if anyone can point me in the right direction on this.
>>>>
>>>> I'm running a simple Hive query (a count on an external table comprising
>>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>>> mappers spawned on each worker.
>>>>
>>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> worker.
>>>>
>>>> When I do "set -v;" from the Hive command line, I see
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The job.xml for the Hive query shows
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The only lead I have is that the default for
>>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>>> docs & mailing list, but haven't run across the answer.
>>>>
>>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>>> it to use my cluster at full power?
>>>>
>>>> Many thanks in advance,

Hive does adjust some map/reduce settings based on the job size. Some
tasks like a sort might only require one map/reduce to work as well.

Brad Heintz
[EMAIL PROTECTED]