Hive, mail # user - Strange behavior during Hive queries


RE: Strange behavior during Hive queries
Ashish Thusoo 2009-09-15, 23:23
Can't seem to make head or tail of this. How many mappers does the job spawn? The explain plan seems to be fine. Can you also do a

describe extended

on both the input and the output tables.

Also, what is the block size, and how many HDFS nodes is this data spread over?
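For reference, a rough sketch of what those checks look like from the Hive CLI and the Hadoop shell; the table names and warehouse path below are placeholders, not the ones from this thread:

    -- in the Hive CLI: full metadata (location, input format, etc.) for both tables
    DESCRIBE EXTENDED input_table;
    DESCRIBE EXTENDED output_table;

    # from the Hadoop shell: block sizes, block counts, and which datanodes hold the files
    hadoop fsck /user/hive/warehouse/input_table -files -blocks -locations
    # cluster-wide view of how many datanodes are live
    hadoop dfsadmin -report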

Ashish
________________________________
From: Brad Heintz [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 14, 2009 1:23 PM
To: [EMAIL PROTECTED]
Subject: Re: Strange behavior during Hive queries

436 files, each about 2GB.
On Mon, Sep 14, 2009 at 4:02 PM, Namit Jain <[EMAIL PROTECTED]> wrote:

Currently, Hive uses 1 mapper per file - does your table have lots of small files? If yes, it might be a good idea to concatenate them into fewer files.
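A rough sketch of one way to do that concatenation from the Hadoop shell; the table path is a placeholder, and the table should not be queried while its files are being rewritten:

    # pull the table's part files down as a single merged local file
    hadoop fs -getmerge /user/hive/warehouse/small_files_table merged.txt
    # clear the old part files and push the merged copy back
    hadoop fs -rmr /user/hive/warehouse/small_files_table
    hadoop fs -mkdir /user/hive/warehouse/small_files_table
    hadoop fs -put merged.txt /user/hive/warehouse/small_files_table/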

From: Ravi Jagannathan [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 14, 2009 12:17 PM
To: Brad Heintz; [EMAIL PROTECTED]
Subject: RE: Strange behavior during Hive queries

http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers

Related issue: Hive used too many mappers for a very small table.
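One common way to cut the mapper count is to raise the minimum split size so fewer, larger splits are created; whether Hive's input format honors this for a given file format depends on the version. A sketch, using an arbitrary 256 MB value:

    -- in the Hive session, before running the query; 268435456 bytes = 256 MB
    set mapred.min.split.size=268435456;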

________________________________

From: Brad Heintz [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 14, 2009 11:51 AM
To: [EMAIL PROTECTED]
Subject: Re: Strange behavior during Hive queries

Ashish -

mapred.min.split.size is set to 0 (according to the job.xml).  The data are stored as uncompressed text files.

Plan is attached.  I've been over it and didn't find anything useful, but I'm also new to Hive and don't claim to understand everything I'm looking at.  If you have any insight, I'd be most grateful.

Many thanks,
- Brad

On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:

How is your data stored - sequencefiles, textfiles, compressed? And what is the value of mapred.min.split.size? Hive does not usually make a decision on the number of mappers, but it does try to estimate the number of reducers to use. Also, if you send out the plan, that would be great.
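For reference, the reducer estimate mentioned here is driven by a bytes-per-reducer setting and can also be overridden outright; both are standard Hive session settings, and the values below are purely illustrative:

    -- rough number of bytes each reducer should handle; Hive divides total input size by this
    set hive.exec.reducers.bytes.per.reducer=1000000000;
    -- or pin the reducer count explicitly and bypass the estimate
    set mapred.reduce.tasks=32;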

Ashish

________________________________

From: Brad Heintz [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 13, 2009 9:36 AM
To: [EMAIL PROTECTED]
Subject: Re: Strange behavior during Hive queries

Edward -

Yeah, I figured Hive had some decisions it made internally about how many mappers & reducers it used, but this is acting on almost 1TB of data - I don't see why it would use fewer mappers.  Also, this isn't a sort (which would of course use only 1 reducer) - it's a straight count.
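For reference, a "straight count" here would be a query of roughly the following shape (the table name is a placeholder; older Hive versions used COUNT(1) rather than COUNT(*)):

    SELECT COUNT(1) FROM big_table;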

Thanks,
- Brad

On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <[EMAIL PROTECTED]> wrote:
>>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -

Hive does adjust some map/reduce settings based on the job size. Some tasks, like a sort, might also only require a single map/reduce task to work.
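Brad mentions above that he already diffed the job.xml files; for reference, a quick way to make that comparison concrete and to check for the HADOOP-5170 parameter Todd mentions (the file names below are placeholders for the two jobs' job.xml files):

    # check whether mapred.max.maps.per.node is present at all
    grep -A1 'mapred.max.maps.per.node' hive_job.xml
    # dump all mapred.* settings from each job and diff them
    grep -A1 '<name>mapred\.' hive_job.xml     > hive_params.txt
    grep -A1 '<name>mapred\.' plain_mr_job.xml > mr_params.txt
    diff hive_params.txt mr_params.txt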

Brad Heintz
[EMAIL PROTECTED]