-
Strange behavior during Hive queries
Brad Heintz 2009-09-11, 21:06
TIA if anyone can point me in the right direction on this.

I'm running a simple Hive query (a count on an external table comprising 436 files, each of ~2GB). The cluster's mapred-site.xml specifies mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker node. When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7 mappers spawned on each worker.

The problem: When I run my Hive query, I see 2 mappers spawned per worker.

When I do "set -v;" from the Hive command line, I see mapred.tasktracker.map.tasks.maximum = 7.

The job.xml for the Hive query shows mapred.tasktracker.map.tasks.maximum = 7.

The only lead I have is that the default for mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden in the cluster's mapred-site.xml, I've tried redundantly overriding this variable everyplace I can think of (Hive command line with "-hiveconf", using set from the Hive prompt, et al) and nothing works. I've combed the docs & mailing list, but haven't run across the answer.

Does anyone have any ideas what (if anything) I'm missing? Is this some quirk of Hive, where it decides that 2 mappers per tasktracker is enough, and I should just leave it alone? Or is there some knob I can fiddle to get it to use my cluster at full power?

Many thanks in advance,
- Brad

--
Brad Heintz
[EMAIL PROTECTED]
-
Re: Strange behavior during Hive queries
Todd Lipcon 2009-09-11, 21:16
Hi Brad,

mapred.tasktracker.map.tasks.maximum is a parameter read by the TaskTracker when it starts up. It cannot be changed per-job.

Hope that helps
-Todd
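To make Todd's point concrete, here is a toy Python model. It is only a sketch: `ToyTaskTracker` and `slots_for` are hypothetical names invented for illustration and are not part of Hadoop. The default of 2 is the one mentioned in the original post.

```python
# Toy model only: these names are hypothetical and are not Hadoop's API.
SLOT_KEY = "mapred.tasktracker.map.tasks.maximum"


class ToyTaskTracker:
    def __init__(self, daemon_conf):
        # The slot count is read once, when the daemon starts up.
        # 2 is the default value quoted in the original post.
        self.map_slots = int(daemon_conf.get(SLOT_KEY, 2))

    def slots_for(self, job_conf):
        # The job's own copy of the setting is deliberately ignored.
        return self.map_slots


tracker = ToyTaskTracker({SLOT_KEY: "7"})
job_conf = {SLOT_KEY: "12"}  # a per-job override, e.g. "set" at the Hive prompt
print(tracker.slots_for(job_conf))  # prints 7
```

In this model the value carried in the job configuration is visible to the job but has no effect on the tracker, which is consistent with the job.xml showing 7 regardless of what actually runs.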
-
Re: Strange behavior during Hive queries
Brad Heintz 2009-09-11, 21:20
Todd -

Of course; it makes sense that it would be that way. But I'm still left wondering why, then, my Hive queries are only using 2 mappers per task tracker when other jobs use 7. I've gone so far as to diff the job.xml files from a regular job and a Hive query, and didn't turn up anything - though clearly, it has to be something Hive is doing.

Thanks,
- Brad
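A plain text diff of two job.xml files can be noisy because property order may differ. The following Python sketch compares them by property name instead. It assumes the usual Hadoop configuration layout of `<property>` elements with `<name>` and `<value>` children; the sample documents and the `example.*` property names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value", default="")
    return props


def diff_properties(left, right):
    """Return {name: (left_value, right_value)} for settings that differ."""
    diff = {}
    for name in sorted(set(left) | set(right)):
        if left.get(name) != right.get(name):
            diff[name] = (left.get(name), right.get(name))
    return diff


# Made-up samples; the "example.*" names are hypothetical placeholders.
regular_job = """<configuration>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>7</value></property>
  <property><name>example.shared.setting</name><value>a</value></property>
</configuration>"""

hive_job = """<configuration>
  <property><name>example.shared.setting</name><value>b</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>7</value></property>
  <property><name>example.hive.only.setting</name><value>x</value></property>
</configuration>"""

changes = diff_properties(load_properties(regular_job), load_properties(hive_job))
for name, (old, new) in changes.items():
    print(name, old, new)
```

Settings present on only one side show up with `None` for the other, which makes job-specific additions easy to spot.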
-
Re: Strange behavior during Hive queries
Todd Lipcon 2009-09-11, 21:28
Hrm... sorry, I didn't read your original query closely enough.

I'm not sure what could be causing this. The map.tasks.maximum parameter shouldn't affect it at all - it only affects the number of slots on the trackers.

By any chance do you have mapred.max.maps.per.node set? This is a configuration parameter added by HADOOP-5170 - it's not in trunk or the vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this parameter could cause the behavior you're seeing. However, it would certainly not default to 2, so I'd be surprised if that were it.

-Todd
-
Re: Strange behavior during Hive queries
Edward Capriolo 2009-09-11, 21:30
Hive does adjust some map/reduce settings based on the job size. Some tasks like a sort might only require one map/reduce to work as well.
-
Re: Strange behavior during Hive queries
Brad Heintz 2009-09-13, 16:32
No, I'm using vanilla 0.20.0. Other, non-Hive jobs are also running with more mappers, so I don't think it'd be that setting even if I had it available.
-
Re: Strange behavior during Hive queries
Brad Heintz 2009-09-13, 16:35
Edward -

Yeah, I figured Hive had some decisions it made internally about how many mappers & reducers it used, but this is acting on almost 1TB of data - I don't see why it would use fewer mappers. Also, this isn't a sort (which would of course use only 1 reducer) - it's a straight count.

Thanks,
- Brad
-
RE: Strange behavior during Hive queries
Ashish Thusoo 2009-09-14, 18:29
How is your data stored - sequencefiles, textfiles, compressed? And what is the value of mapred.min.split.size? Hive does not usually decide the number of mappers, but it does try to estimate the number of reducers to use. Also, if you send out the plan, that would be great.

Ashish
-
Re: Strange behavior during Hive queries
Brad Heintz 2009-09-14, 18:50
Ashish -

mapred.min.split.size is set to 0 (according to the job.xml). The data are stored as uncompressed text files.

Plan is attached. I've been over it and didn't find anything useful, but I'm also new to Hive and don't claim to understand everything I'm looking at. If you have any insight, I'd be most grateful.

Many thanks,
- Brad
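For a rough sense of scale, the Python sketch below estimates how many map tasks these inputs would produce under the split rule commonly described for Hadoop's file-based input formats, split size = max(min size, min(goal size, block size)). The 64 MB block size and the exact 2 GB file size are assumptions, since the thread gives neither precisely, and whether Hive's own input format follows this rule is not established here.

```python
MB = 1024 * 1024


def split_size(min_size, goal_size, block_size):
    # Rule assumed for this sketch: max(min, min(goal, block)).
    return max(min_size, min(goal_size, block_size))


def estimate_map_tasks(file_sizes, min_size, block_size, requested_maps=1):
    """Estimate map tasks, one per split, with splits never spanning files."""
    total = sum(file_sizes)
    goal = total // max(requested_maps, 1)
    size = max(split_size(min_size, goal, block_size), 1)
    # Integer ceiling division: each file contributes ceil(file / size) splits.
    return sum(-(-f // size) for f in file_sizes)


# 436 files of ~2GB come from the thread; the 64 MB block size is an assumption.
files = [2048 * MB] * 436
print(estimate_map_tasks(files, min_size=0, block_size=64 * MB))  # prints 13952
```

Under these assumptions there would be far more pending map tasks than slots, so the number of splits alone would not explain only 2 concurrent mappers per worker.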
-
RE: Strange behavior during Hive queries
Ravi Jagannathan 2009-09-14, 19:16
http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers

Related issue: Hive used too many mappers for a very small table.
-
RE: Strange behavior during Hive queries
Namit Jain 2009-09-14, 20:02
Currently, Hive uses 1 mapper per file. Does your table have lots of small files? If so, it might be a good idea to concatenate them into fewer files.
-
Re: Strange behavior during Hive queries Brad Heintz 2009-09-14, 20:23
436 files, each about 2GB.
Brad Heintz [EMAIL PROTECTED]
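For reference, the two readings of the mapper count discussed here (one mapper per file versus one mapper per block-sized split) give very different task totals for this table. The following is only a rough sketch: the 64 MB block size is an assumption (the thread says no more than "default"), the function names are illustrative, and the model ignores input-format details such as the minimum split size.

```python
import math

# Figures reported in the thread.
NUM_FILES = 436
FILE_SIZE_BYTES = 2 * 1024**3       # "about 2GB" per file

# Assumption: 64 MB HDFS blocks; the actual cluster value is not stated.
BLOCK_SIZE_BYTES = 64 * 1024**2


def splits_per_file(file_size: int, split_size: int) -> int:
    """Splits produced when a splittable file is cut into split_size pieces."""
    return max(1, math.ceil(file_size / split_size))


def total_map_tasks(num_files: int, file_size: int, split_size: int,
                    one_mapper_per_file: bool = False) -> int:
    """Estimate the total number of map tasks under either reading."""
    if one_mapper_per_file:
        return num_files
    return num_files * splits_per_file(file_size, split_size)


per_file = total_map_tasks(NUM_FILES, FILE_SIZE_BYTES, BLOCK_SIZE_BYTES,
                           one_mapper_per_file=True)
per_block = total_map_tasks(NUM_FILES, FILE_SIZE_BYTES, BLOCK_SIZE_BYTES)

print(per_file)    # 436
print(per_block)   # 13952
```

Either way the job has far more map tasks than the cluster has slots, so the total task count alone would not explain seeing only 2 mappers per node.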
-
RE: Strange behavior during Hive queries Ashish Thusoo 2009-09-15, 23:23
I can't seem to make head or tail of this. How many mappers does the job spawn? The explain plan seems to be fine. Can you also do a

describe extended

on both the input and the output table?

Also, what is the block size, and how many HDFS nodes is this data spread over?

Ashish
-
Re: Strange behavior during Hive queries Brad Heintz 2009-09-16, 13:50
There are 14 mappers spawned when I do a Hive query, spread over 7 nodes (2 per node). Other jobs spawn 7 mappers per node (total of 49), rather than 2.

Block size is default.

I'll try the "describe extended" as soon as I get a chance.

Thanks,
- Brad

Brad Heintz [EMAIL PROTECTED]
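The numbers in this message are per-node slot counts multiplied by the node count, which is concurrency rather than the total number of map tasks in the job. A minimal sketch of that arithmetic follows. The total of 436 tasks is a hypothetical figure (one task per file) used only to illustrate how many scheduling waves each concurrency level would need, and the function names are illustrative.

```python
import math

NODES = 7                  # worker nodes reported in the thread
SLOTS_CONFIGURED = 7       # mapred.tasktracker.map.tasks.maximum on the cluster
SLOTS_OBSERVED_HIVE = 2    # mappers per node seen during the Hive query

# Hypothetical total: one map task per input file (436 files).
TOTAL_TASKS = 436


def concurrent_mappers(nodes: int, slots_per_node: int) -> int:
    """Maximum number of map tasks that can run at the same time."""
    return nodes * slots_per_node


def waves(total_tasks: int, concurrent: int) -> int:
    """Rounds of scheduling needed to run every task at a given concurrency."""
    return math.ceil(total_tasks / concurrent)


hive_concurrent = concurrent_mappers(NODES, SLOTS_OBSERVED_HIVE)
full_concurrent = concurrent_mappers(NODES, SLOTS_CONFIGURED)

print(hive_concurrent, waves(TOTAL_TASKS, hive_concurrent))   # 14 32
print(full_concurrent, waves(TOTAL_TASKS, full_concurrent))   # 49 9
```

Under these assumptions the Hive query would need roughly three and a half times as many waves as a job that fills all 49 slots.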
-
Re: Strange behavior during Hive queries Zheng Shao 2009-09-17, 04:56
You mean 14 mappers running concurrently, correct?
How many mappers are there in total for the Hive query?

Zheng

Yours, Zheng
-
Re: Strange behavior during Hive queries Brad Heintz 2009-09-17, 14:36
No - 2 mappers per node, 7 nodes = 14 mappers total. Most jobs use 7 per
node (49 total).

Brad Heintz [EMAIL PROTECTED] |